[datatable-help] data.table segfaulting, need help verifying the reason

Matthew Dowle mdowle at mdowle.plus.com
Tue Sep 10 22:06:12 CEST 2013


Yes, it looks like the k1 and k2 columns themselves carry a names attribute, and its length (2) doesn't match the column length (326).

lapply(a,names)  should reveal the "hidden" names
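
A slightly more detailed version of that check, as a rough sketch only. The helper name_lengths() and the toy list are made up for illustration and are not part of data.table; I haven't run this against the crashing object, so treat it as an assumption that it is safe there.

```r
# Hypothetical helper (not part of data.table): for each column of a
# list-like object, report its length and the length of its "names"
# attribute (0 when the column has no names).
name_lengths <- function(x) {
  data.frame(
    column    = names(x),
    len       = vapply(x, length, integer(1), USE.NAMES = FALSE),
    names_len = vapply(
      x,
      function(col) length(attr(col, "names", exact = TRUE)),
      integer(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for `a` (made-up values): the first column
# carries names, the second does not.
toy <- list(k1 = c(k1 = "x", k1 = "y"), v1 = 1:2)
res <- name_lengths(toy)
print(res)

# Columns that carry names at all, i.e. the ones worth a closer look.
res$column[res$names_len > 0]
```

On the real object, any column where names_len is non-zero and differs from len would be a candidate for the problem.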

To remove them :

for (i in seq_len(ncol(a))) setattr(a[[i]],"names",NULL)

Then lapply(a,names) should return NULL for every column.

Then try again the things that segfaulted before.

If this fixes it,  we'll need to establish how the erroneous names got 
in there.
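
One candidate worth checking, offered as a guess and not something established in this thread: apply() passes each row to the function as a named vector, so x[1] still carries the column name, whereas indexing the data.frame by position gives a bare value. The toy data.frame below is made up, and this alone would not explain why the names end up a different length from the column.

```r
# Made-up two-row data.frame standing in for `d`.
d <- data.frame(k1 = c("a", "b"), k2 = c("c", "d"), stringsAsFactors = FALSE)

# apply() hands each row to the function as a named character vector,
# so x[1] keeps the name "k1".
via_apply <- apply(d, 1, function(x) names(x[1]))

# Indexing the data.frame by position returns a value with no names.
via_index <- lapply(seq_len(nrow(d)), function(i) names(d[i, 1]))

print(via_apply)
print(via_index)
```

If func() puts x[1] straight into a column, that name could travel along with it into the result.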


On 10/09/13 19:51, Chris Neff wrote:
>
>
>
> On Tue, Sep 10, 2013 at 2:02 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>
>     Nothing springs to mind.  Latest version v1.8.10 from CRAN right? 
>     Or v1.8.11 on R-Forge?
>
>
> Both. And 1.8.8.
>
>
>     On this bit :
>
>     > So somewhere these key columns think they are different lengths
>     than they really are, and
>     > when I try to access it I go into memory I shouldn't so I
>     segfault.  How can I verify this? Is
>     > there something about the DT I can check to see what DT thinks
>     these columns are?
>
>     .Internal(inspect(DT)) reveals the internal structure including
>     length and truelength on the column pointer vector as well as each
>     column.
>
>     But it's a really odd way of using data.table. Iterating by row is
>     going to kill performance; data.table likes by column.
>
>
> Trust me I know this, this isn't my code :) I'm just the data.table 
> guy who helps debug. I am helping him with better ways, but I think we 
> can agree that it should at least not segfault.
>
>
> I ran inspect on two versions of the data.table: the one that crashes, 
> made with rbindlist(apply(d,1,...)), and the one that doesn't, made 
> with rbindlist(lapply(1:nrow(d),...)). I changed the variable names 
> and censored the values.
>
> First the one that fails (accessing either a$k1 or a$k2 will segfault):
>
> > .Internal(inspect(a))
> @2cc5be0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100)
>   @3b643d0 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0)
>     @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
>     @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     ...
>   ATTRIB:
>     @ac6c20 02 LISTSXP g1c0 [MARK]
>       TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
>       @3ba6ad8 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
>         @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
>         @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
>   @3b64e30 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0)
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     ...
>   ATTRIB:
>     @ac6cc8 02 LISTSXP g1c0 [MARK]
>       TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
>       @3ba6a68 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
>         @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
>         @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
>   @3b65890 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     ...
>   @1ff5850 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,...
>   @1fc6600 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,...
>   ...
> ATTRIB:
>   @21f6d48 02 LISTSXP g0c0 []
>     TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
>     @3efc1f0 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100)
>       @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
>       @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
>       @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1"
>       @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2"
>       @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3"
>       ...
>     TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names"
>     @2556908 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326
>     TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class"
>     @2701b38 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
>       @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table"
>       @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame"
>     TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref"
>     @21f6e28 22 EXTPTRSXP g0c0 []
>
> Secondly, the one that works (all values can be accessed fine):
>
> > .Internal(inspect(a))
> @45b4850 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100)
>   @33a53a0 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
>     @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
>     @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     ...
>   @33a5e00 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     ...
>   @33a6860 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     ...
>   @1ff10f0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,...
>   @3a6d0d0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,...
>   ...
> ATTRIB:
>   @276c360 02 LISTSXP g0c0 []
>     TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
>     @1fe5670 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100)
>       @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
>       @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
>       @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1"
>       @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2"
>       @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3"
>       ...
>     TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names"
>     @29cbf38 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326
>     TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class"
>     @2d539a0 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
>       @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table"
>       @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame"
>     TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref"
>     @276c440 22 EXTPTRSXP g0c0 []
>
> The difference seems to be the ATTRIB entries attached to k1 and k2 in 
> the first case?  I can't really parse this as well as you can.
>
>     If it really has to be by row  then   DT[, fun(.SD,...),
>     by=1:nrow(DT)]  should be better than apply().
>
>     Matthew
>
>
>     On 10/09/13 18:47, Chris Neff wrote:
>>     Narrowing it down further,
>>
>>     a$x
>>
>>     segfaults and
>>
>>     a[,x]
>>
>>     segfaults but
>>
>>     a[,"x", with=FALSE]
>>
>>     doesn't.
>>
>>
>>     On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff <caneff at gmail.com> wrote:
>>
>>         I'm pretty sure it is some issue of a column that thinks it
>>         is bigger than it actually is.  I have tried, so far in vain,
>>         to make a reproducible example that I can share.  I have one,
>>         but can't share it.
>>
>>         What happens is this:
>>
>>         A data.frame is made:
>>
>>         > d = data.frame(...)
>>
>>         Then I call apply over every row, calling a different
>>         function that takes in a DT as well:
>>
>>         l = apply(d, 1, function(x) func(x[1], x[2], DT))
>>
>>         Each call returns a data.frame, so l is a list of
>>         data.frames.  If I rbindlist this:
>>
>>         a = rbindlist(l)
>>
>>         I can print a just fine, and it shows all the data as
>>         normal.  But if I try to just do
>>
>>         a$x
>>
>>         where x is one of the columns that was a key in DT, it
>>         segfaults.  If I ask for a column that was made by "func" and
>>         wasn't a column in DT, it works fine.  If I ask for only the
>>         first 10 rows and then ask for x:
>>
>>         a[1:10]$x
>>
>>         it works fine.
>>
>>         So somewhere these key columns think they are different
>>         lengths than they really are, and when I try to access it I
>>         go into memory I shouldn't so I segfault.  How can I verify
>>         this? Is there something about the DT I can check to see what
>>         DT thinks these columns are?
>>
>>
>>         Also, if instead of apply when making the list, I do
>>
>>         l = lapply(1:nrow(d), function(i) func(d[i,1],d[i,2],DT))
>>
>>         and rbindlist that, it works fine too.
>>
>>
>>
>>
>>     _______________________________________________
>>     datatable-help mailing list
>>     datatable-help at lists.r-forge.r-project.org  <mailto:datatable-help at lists.r-forge.r-project.org>
>>     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
