[datatable-help] := construct doesn't seem to work in lists of data.tables

Matthew Dowle mdowle at mdowle.plus.com
Thu Aug 16 17:13:58 CEST 2012


Hi. Ok, thanks. Btw, just checking that you saw the new rbindlist() function.
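
For a workflow like the one quoted below (clean each harvested table with lapply(), then combine), rbindlist() might be used roughly as follows. This is a minimal sketch only: the toy tables and the cleaning step (dropping rows where a is 1) are made up for illustration.

```r
library(data.table)

# Toy stand-ins for the harvested files (made-up data).
harvested <- list(
  data.table(a = 1:3,   b = 4:6),
  data.table(a = 10:12, b = 13:15)
)

# Hypothetical cleaning step applied to every table: drop rows where a is 1.
cleaned <- lapply(harvested, function(dt) dt[a != 1L])

# Stack the cleaned tables into one data.table.
combined <- rbindlist(cleaned)
```

The result is a single data.table holding the rows of all cleaned tables, in list order.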


> Hi Matthew,
>
> Sorry for not filing earlier -- the behavior is not a major annoyance, as
> my data.tables are rather small this time around.
>
> The reason I'm using data.tables in a list, though that might seem odd,
> is that I'm harvesting quantities of external data files that I
> eventually want to combine into one data.table. Before I can rbind()
> everything, I'm running lots of validation and cleaning tasks on the
> harvested files using lapply() and some indexing magic. The combination
> of data.table() and lapply() makes the syntax /really/ efficient.
>
> I'm afraid I can't provide further input into a possible workaround, as
> the alternatives you listed below all sound good to me! Hopefully others
> on the list can contribute.
>
> Best, --Mel.
>
>
> On 8/15/2012 4:30 AM, Matthew Dowle wrote:
>> Hi,
>>
>> That's interesting, thanks. I'm delighted the warning came up and that
>> no crash happened. This is just what .internal.selfref was designed to
>> catch.
>>
>> list() itself appears to be copying its NAM(2)-ed inputs. If you run the
>> following, you should see the pointer addresses show that.
>>
>>      X=data.table(a=1:3)
>>      .Internal(inspect(X))
>>      .Internal(inspect(list(X)))   # list() copies X
>>
>> The problem isn't just the copy, but that when R does that copy it
>> collapses the over-allocated vector of column vector pointers (that
>> data.table carefully created) down to just the columns used. That
>> causes := a problem if it's then asked to add a column by reference
>> (no free slots).
>>
>> Three possible dev solutions spring to mind :
>>
>> 1. Try again to return data.table as NAM(0) not NAM(2) [there's already
>> a FR for that]. This assumes that list() only copies NAM(2) inputs.
>>
>> 2. Add a new function to data.table (reflist()?) that doesn't copy
>> data.table inputs but works the same as base::list otherwise.
>>
>> 3. Get even more fancy inside [.data.table to inspect its caller. If
>> that's L[[i]] then update L's pointer to the (new) re-over-allocated
>> column pointer vector. The copy by list() would still happen but at
>> least the column would be added. The next add column by reference after
>> that would then work without warning.
>>
>> Please file a bug report, with a link to this thread. That way you'll
>> get automatic updates when the status changes. Option 2 is most likely.
>>
>> Is list() of data.table really needed? Could it be one data.table with
>> an extra first column, or an environment of data.tables perhaps?
>>
>> The more significant problem is that a list column containing
>> data.tables is likely copying all those data.tables, then, regardless of
>> whether or not := is then used to add a column by reference to those
>> embedded tables.
>>
>> Matthew
>>
>>
>>> Hello,
>>>
>>> I just noticed an odd behavior with lists of data.tables:
>>>
>>> dt1 <- data.table(a=1:3, b=4:6, c=7:9)
>>> dt2 <- data.table(a=10:12, b=13:15, c=16:18)
>>>
>>> # Combine in a list
>>> myList <- list(dt1, dt2)
>>>
>>> # Adding a new column to first data.table -- this doesn't work
>>> myList[[1]][, d := 4:6]
>>> #    a b c d
>>> # 1: 1 4 7 4
>>> # 2: 2 5 8 5
>>> # 3: 3 6 9 6
>>> # Warning message:
>>> # In `[.data.table`(myList[[1]], , `:=`(d, 4:6)) :
>>> #   Invalid .internal.selfref detected and fixed by taking a copy of the
>>> #   whole table, so that := can add this new column by reference. At an
>>> #   earlier point, this data.table has been copied by R. Avoid key<-,
>>> #   names<- and attr<- which in R currently (and oddly) all copy the
>>> #   whole data.table. Use set* syntax instead to avoid copying: setkey(),
>>> #   setnames() and setattr(). If this message doesn't help, please report
>>> #   to datatable-help so the root cause can be fixed.
>>>
>>> myList[[1]]
>>> #    a b c
>>> # 1: 1 4 7
>>> # 2: 2 5 8
>>> # 3: 3 6 9
>>>
>>> # I need to reassign -- this works
>>> myList[[1]] <- myList[[1]][, d := 4:6]
>>>
>>> myList[[1]]
>>> #    a b c d
>>> # 1: 1 4 7 4
>>> # 2: 2 5 8 5
>>> # 3: 3 6 9 6
>>>
>>> # But on the other hand this works no problem
>>> setcolorder(myList[[1]], 4:1)
>>> myList[[1]]
>>> #    d c b a
>>> # 1: 4 7 4 1
>>> # 2: 5 8 5 2
>>> # 3: 6 9 6 3
>>>
>>> Is this normal behavior? It seems a bit odd to me.
>>>
>>> Here is my session:
>>>
>>>   > sessionInfo()
>>> R version 2.15.1 (2012-06-22)
>>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats     graphics  utils     datasets  grDevices methods base
>>>
>>> other attached packages:
>>> [1] foreign_0.8-50      RJDBC_0.2-0         DBI_0.2-5
>>> [4] XLConnect_0.2-0     XLConnectJars_0.2-0 rJava_0.9-3
>>> [7] data.table_1.8.2    rj_1.1.0-4
>>>
>>> loaded via a namespace (and not attached):
>>> [1] rj.gd_1.1.0-1 tools_2.15.1
>>>
>>>
>>> Thanks very much for this fantastic package!
>>>
>>> --Mel.
>>>
>>> Melanie BACOU
>>> International Food Policy Research Institute
>>> Agricultural Economist, HarvestChoice
>>> E-mail mel at mbacou.com
>>> Visit http://www.harvestchoice.org/
>>>
>>
>
>
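
The two alternatives to a list() of data.tables mentioned in the quoted message above (one data.table with an extra first column, or an environment of data.tables) might look roughly like this. It is a sketch under assumptions: the column name src and the environment name tables are invented for the example, and the data are the toy tables from the original report.

```r
library(data.table)

# Alternative A: one data.table with an extra first column identifying
# the source table. The column name `src` is hypothetical.
all_dt <- data.table(
  src = rep(c("dt1", "dt2"), each = 3L),
  a   = c(1:3, 10:12),
  b   = c(4:6, 13:15),
  c   = c(7:9, 16:18)
)
# Add column d by reference, filling it only for rows from the first
# source; rows from the other source get NA.
all_dt[src == "dt1", d := 4:6]

# Alternative B: an environment of data.tables. The environment name
# `tables` is hypothetical. Environments hold references, so the tables
# are not passed through list().
tables <- new.env()
tables$dt1 <- data.table(a = 1:3,   b = 4:6,   c = 7:9)
tables$dt2 <- data.table(a = 10:12, b = 13:15, c = 16:18)
# Add column d by reference to the table stored in the environment.
tables$dt1[, d := 4:6]
```

In both cases := is applied to an object that has not been through list(), so the new column should be added by reference without any reassignment.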
