[datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

Tue May 20 21:27:52 CEST 2014

Hello everyone,

With the latest commit (#1266), the extra functionality offered by rbind (use.names and fill) is now also available in rbindlist. In addition, the implementation has been moved entirely to C and is therefore tremendously fast, especially when binding with use.names=TRUE and/or fill=TRUE. I’ll try to put out a benchmark comparing the speed with the older implementation ASAP.

Note that this change comes at a very low cost to the default speed of rbindlist (use.names=FALSE and fill=FALSE). As an example, binding 10,000 data.tables with 20 columns each took 0.107 seconds with the new version, whereas the older version took 0.095 seconds.
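
For anyone who wants to time the default path on their own machine, here is a rough sketch. The helper make_tables and the choice of 5 rows per table are assumptions for illustration only (the row count used for the figures above is not stated), so the timings will not match exactly.

```r
library(data.table)

# Hypothetical helper (not part of data.table): builds n_tables data.tables,
# each with n_cols integer columns and n_rows rows.
make_tables <- function(n_tables = 10000L, n_cols = 20L, n_rows = 5L) {
  lapply(seq_len(n_tables), function(i) {
    cols <- lapply(seq_len(n_cols),
                   function(j) sample.int(100L, n_rows, replace = TRUE))
    as.data.table(setNames(cols, paste0("V", seq_len(n_cols))))
  })
}

tables <- make_tables()

# Default path: bind by position, no filling.
timing <- system.time(
  res <- rbindlist(tables, use.names = FALSE, fill = FALSE)
)
print(timing[["elapsed"]])
```

Running the same script against the older and the newer build gives a like-for-like comparison of the default path.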

The documentation for ?rbindlist has also been improved (#5158 from Alexander). Here’s the change log from NEWS:

  o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249
         -> use.names by default is FALSE for backwards compatibility (doesn't bind by names by default)
         -> rbind(...) now just calls rbindlist() internally, except that 'use.names' is TRUE by default,  
            for compatibility with base (and backwards compatibility).
         -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
         -> At least one item of the input list has to have non-null column names.
         -> Duplicate columns are bound in the order of occurrence, like base.
         -> Attributes that might exist in individual items would be lost in the bound result.
         -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
         -> And incredibly fast ;).
         -> Documentation updated in much detail. Closes DR #5158.
     Eddi's (excellent) work on finding factor levels, type coercion of columns etc. are all retained.

Please try it and write back if things aren’t working as they were before. The tests that had to be fixed cover extremely rare cases, so I expect minimal issues, if any, with this version. I also find that the changes bring consistency to the function.
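
To make the change log above a bit more concrete, here is a minimal sketch of the new arguments. The table names and values are made up for illustration.

```r
library(data.table)

dt_a <- data.table(x = 1:2, y = c("a", "b"))
dt_b <- data.table(y = c("c", "d"), x = 3:4)  # same columns, different order
dt_c <- data.table(x = 5L, z = 2.5)           # no 'y', extra column 'z'

# Bind by column name rather than by position.
by_name <- rbindlist(list(dt_a, dt_b), use.names = TRUE)

# Fill columns missing from an item with NA (fill = TRUE needs use.names = TRUE).
filled <- rbindlist(list(dt_a, dt_c), use.names = TRUE, fill = TRUE)

# Columns of different types are coerced to the higher type:
# integer and double become double.
coerced <- rbindlist(list(data.table(v = 1L), data.table(v = 2.5)))

print(by_name)
print(filled)
print(class(coerced$v))
```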

One (very rarely used) feature that is no longer available with this implementation is the ability to recycle:

dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
lst1 <- list(x=4, y=5, z=as.list(1:3))

rbind(dt1, lst1)
#    x y       z
# 1: 1 4     1,2
# 2: 2 5   1,2,3
# 3: 3 6 1,2,3,4
# 4: 4 5       1
# 5: 4 5       2
# 6: 4 5       3

The 4 and 5 are recycled very nicely here. This is not possible at the moment, because the earlier rbind implementation relied on as.data.table to convert its inputs to data.table, and that takes a copy (very inefficient on huge or many tables). I’d love to add this feature in C as well, as it would help incredibly for use within [.data.table (now that we can fill columns and bind by names faster). Will add a FR.
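
Until then, one possible workaround is to do the recycling explicitly before binding. This is only a sketch: it assumes as.data.table recycles the length-1 items of the list, and it pays for the copy described above.

```r
library(data.table)

dt1  <- data.table(x = 1:3, y = 4:6, z = list(1:2, 1:3, 1:4))
lst1 <- list(x = 4, y = 5, z = as.list(1:3))

# Assumption: as.data.table() recycles the length-1 items x and y to the
# length of z. This is the step that takes a copy.
dt2 <- as.data.table(lst1)

# Both items are now data.tables with matching names, so no recycling is
# needed in rbindlist itself.
res <- rbindlist(list(dt1, dt2), use.names = TRUE)
print(res)
```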

In summary, I think there should be minimal issues, if any, and rbind should be much faster. Please write back with what you think if you happen to try it out.

Arun