[datatable-help] := construct doesn't seem to work in lists of data.tables

Bacou, Melanie mel at mbacou.com
Thu Aug 16 15:48:43 CEST 2012


Hi Matthew,

Sorry for not filing earlier -- the behavior is not a major annoyance as my 
data.tables are rather small this time around.

The reason I'm using data.tables in a list, though that might seem odd, is that I'm 
harvesting quantities of external data files that I eventually want to combine into 
one data.table. Before I can rbind() everything, I run lots of validation 
and cleaning tasks on the harvested files using lapply() and some indexing magic. The 
combination of data.table() and lapply() makes the syntax really efficient.
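
As a rough sketch of that workflow (the table contents, the names `harvested`, 
`cleaned` and `combined`, and the cleaning rule are all invented for illustration 
and are not the actual harvested files):

```r
library(data.table)

# Hypothetical stand-ins for tables read from external files
harvested <- list(
  data.table(id = 1:3, value = c(1.5, NA, 3.0)),
  data.table(id = 4:6, value = c(4.2, 5.1, NA))
)

# Validation/cleaning step applied to every table: here, drop rows with a
# missing value. Each call returns a new table, so nothing relies on
# modifying a list element by reference.
cleaned <- lapply(harvested, function(dt) dt[!is.na(value)])

# Combine the cleaned tables into one data.table
combined <- rbindlist(cleaned)
```

Because lapply() returns the cleaned tables, this pattern does not depend on := 
updating a list element in place.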

I'm afraid I can't provide further input into a possible workaround, as the 
alternatives you listed below all sound good to me! Hopefully others on the list can 
contribute.
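
For anyone following along, the two restructurings suggested in the quoted message 
(one data.table with an extra first column, or an environment of data.tables) might 
look roughly like the sketch below. The column name `src` and the object names 
`all_dt` and `tables` are made up here, and this is only one way to set it up:

```r
library(data.table)

dt1 <- data.table(a = 1:3, b = 4:6, c = 7:9)
dt2 <- data.table(a = 10:12, b = 13:15, c = 16:18)

# Alternative 1: a single data.table with an extra first column that
# records which table each row came from ("src" is a made-up name).
all_dt <- rbindlist(list(dt1 = dt1, dt2 = dt2), idcol = "src")
all_dt[src == "dt1", d := 4:6]   # adds d; rows from dt2 get NA

# Alternative 2: an environment of data.tables. Assigning into an
# environment stores a reference rather than a copy, so dt1 itself
# also gains the new column here.
tables <- new.env()
tables$dt1 <- dt1
tables$dt2 <- dt2
tables$dt1[, d := 4:6]           # add column by reference
```

In the first variant the lapply()-style work would become grouped operations 
(by = src) on the combined table instead.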

Best, --Mel.


On 8/15/2012 4:30 AM, Matthew Dowle wrote:
> Hi,
>
> That's interesting, thanks. I'm delighted the warning came up and that no
> crash happened. This is just what .internal.selfref was designed to catch.
>
> list() itself appears to be copying its NAM(2)-ed inputs. If you run the
> following, you should see the pointer addresses show that.
>
>      X=data.table(a=1:3)
>      .Internal(inspect(X))
>      .Internal(inspect(list(X)))   # list() copies X
>
> The problem isn't just the copy, but that when R does that copy it
> collapses the over-allocated vector of column vector pointers (that
> data.table carefully created) down to just the columns used. That causes
> := a problem if it's then asked to add a column by reference (no free slots).
>
> Three possible dev solutions spring to mind :
>
> 1. Try again to return data.table as NAM(0) not NAM(2) [there's already a
> FR for that]. This assumes that list() only copies NAM(2) inputs.
>
> 2. Add a new function to data.table (reflist()?) that doesn't copy
> data.table inputs but works the same as base::list otherwise.
>
> 3. Get even more fancy inside [.data.table to inspect its caller. If
> that's L[[i]] then update L's pointer to the (new) re-over-allocated
> column pointer vector. The copy by list() would still happen but at least
> the column would be added. The next add column by reference after that
> would then work without warning.
>
> Please file a bug report, with a link to this thread. That way you'll get
> automatic updates when the status changes. Option 2 is most likely.
>
> Is a list() of data.tables really needed? Could it be one data.table with an
> extra first column, or an environment of data.tables perhaps?
>
> The more significant problem is that a list column containing data.tables
> is then likely copying all those data.tables, regardless of whether or
> not := is used to add a column by reference to those embedded tables.
>
> Matthew
>
>
>> Hello,
>>
>> I just noticed an odd behavior with lists of data.tables:
>>
>> dt1 <- data.table(a=1:3, b=4:6, c=7:9)
>> dt2 <- data.table(a=10:12, b=13:15, c=16:18)
>>
>> # Combine in a list
>> myList <- list(dt1, dt2)
>>
>> # Adding a new column to first data.table -- this doesn't work
>> myList[[1]][, d := 4:6]
>> #    a b c d
>> # 1: 1 4 7 4
>> # 2: 2 5 8 5
>> # 3: 3 6 9 6
>> # Warning message:
>> # In `[.data.table`(myList[[1]], , `:=`(d, 4:6)) :
>> #   Invalid .internal.selfref detected and fixed by taking a copy of the
>> #   whole table, so that := can add this new column by reference. At an
>> #   earlier point, this data.table has been copied by R. Avoid key<-,
>> #   names<- and attr<- which in R currently (and oddly) all copy the whole
>> #   data.table. Use set* syntax instead to avoid copying: setkey(),
>> #   setnames() and setattr(). If this message doesn't help, please report to
>> #   datatable-help so the root cause can be fixed.
>>
>> myList[[1]]
>> #    a b c
>> # 1: 1 4 7
>> # 2: 2 5 8
>> # 3: 3 6 9
>>
>> # I need to reassign -- this works
>> myList[[1]] <- myList[[1]][, d := 4:6]
>>
>> myList[[1]]
>> #    a b c d
>> # 1: 1 4 7 4
>> # 2: 2 5 8 5
>> # 3: 3 6 9 6
>>
>> # But on the other hand this works no problem
>> setcolorder(myList[[1]], 4:1)
>> myList[[1]]
>> #    d c b a
>> # 1: 4 7 4 1
>> # 2: 5 8 5 2
>> # 3: 6 9 6 3
>>
>> Is this normal behavior? It seems a bit odd to me.
>>
>> Here is my session:
>>
>>   > sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  utils     datasets  grDevices methods base
>>
>> other attached packages:
>> [1] foreign_0.8-50      RJDBC_0.2-0         DBI_0.2-5
>> [4] XLConnect_0.2-0     XLConnectJars_0.2-0 rJava_0.9-3
>> [7] data.table_1.8.2    rj_1.1.0-4
>>
>> loaded via a namespace (and not attached):
>> [1] rj.gd_1.1.0-1 tools_2.15.1
>>
>>
>> Thanks very much for this fantastic package!
>>
>> --Mel.
>>
>> Melanie BACOU
>> International Food Policy Research Institute
>> Agricultural Economist, HarvestChoice
>> E-mail mel at mbacou.com
>> Visit harvestchoice.org <http://www.harvestchoice.org/>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
