[datatable-help] data.table segfaulting, need help verifying the reason
Chris Neff
caneff at gmail.com
Tue Sep 10 20:51:35 CEST 2013
On Tue, Sep 10, 2013 at 2:02 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
> Nothing springs to mind. Latest version v1.8.10 from CRAN right? Or
> v1.8.11 on R-Forge?
>
Both. And 1.8.8.
>
> On this bit :
>
> > So somewhere these key columns think they are different lengths than
> > they really are, and when I try to access it I go into memory I
> > shouldn't so I segfault. How can I verify this? Is there something
> > about the DT I can check to see what DT thinks these columns are?
>
> .Internal(inspect(DT)) reveals the internal structure including length and
> truelength on the column pointer vector as well as each column.
>
> But it's a really odd way of using data.table. Iterating by row is going
> to kill performance; data.table likes by column.
>
Trust me, I know this; this isn't my code :) I'm just the data.table guy who
helps debug. I am helping him find better ways, but I think we can agree
that it should at least not segfault.
I ran inspect on the two versions of the data.table: the one that crashes,
made by doing rbindlist(apply(d,1,...)), and the one that doesn't, made by
doing rbindlist(lapply(1:nrow(d),...)). I changed the variable names and
censored out the values.
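One difference between the two constructions that might be relevant (a guess based on base R behaviour, not a confirmed diagnosis of the crash): apply() coerces the data.frame to a matrix and hands each row to the function as a named vector, so x[1] still carries a "names" attribute, whereas d[i, 1] inside lapply() is a bare value. A minimal sketch, with a made-up two-column d standing in for the real one:

```r
# Hypothetical stand-in for d; the real columns are not shown in this thread.
d <- data.frame(k1 = c("a", "b"), k2 = c("x", "y"), stringsAsFactors = FALSE)

# apply() works on as.matrix(d), so each row x is a named character vector
# and x[1] still carries the column name.
nm_apply <- unname(apply(d, 1, function(x) names(x[1])))

# Indexing the data.frame directly returns a plain, unnamed value.
nm_lapply <- lapply(seq_len(nrow(d)), function(i) names(d[i, 1]))

print(nm_apply)   # names seen by the apply() version
print(nm_lapply)  # names seen by the lapply() version
```

If func passes x[1] straight through into a column, the names would travel with it in the apply() version only.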
First the one that fails (accessing either a$k1 or a$k2 will segfault):
> .Internal(inspect(a))
@2cc5be0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100)
@3b643d0 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0)
@253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
@253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
@253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
...
ATTRIB:
@ac6c20 02 LISTSXP g1c0 [MARK]
TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
@3ba6ad8 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
@184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
@184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
@3b64e30 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0)
@253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
...
ATTRIB:
@ac6cc8 02 LISTSXP g1c0 [MARK]
TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
@3ba6a68 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
@bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
@bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
@3b65890 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
@24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
@24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
@24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
@24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
@24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
...
@1ff5850 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,...
@1fc6600 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,...
...
ATTRIB:
@21f6d48 02 LISTSXP g0c0 []
TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
@3efc1f0 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100)
@184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
@bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
@108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1"
@108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2"
@108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3"
...
TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names"
@2556908 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326
TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class"
@2701b38 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
@bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table"
@9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame"
TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref"
@21f6e28 22 EXTPTRSXP g0c0 []
Secondly, the one that works (all values can be accessed fine):
> .Internal(inspect(a))
@45b4850 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100)
@33a53a0 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
@253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
@253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
@253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
...
@33a5e00 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
@253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
@253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
...
@33a6860 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
@24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
@24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
@24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
@24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
@24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
...
@1ff10f0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,...
@3a6d0d0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,...
...
ATTRIB:
@276c360 02 LISTSXP g0c0 []
TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
@1fe5670 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100)
@184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
@bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
@108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1"
@108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2"
@108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3"
...
TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names"
@29cbf38 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326
TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class"
@2d539a0 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
@bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table"
@9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame"
TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref"
@276c440 22 EXTPTRSXP g0c0 []
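It looks to me like the difference is in the attributes: in the first case
the k1 and k2 columns are flagged ATT and carry a length-2 "names" attribute,
which they don't have in the second case? I can't really parse this as well
as you can.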
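To compare the two tables without reading the whole dump, something like the following might help. It lists which attributes each column carries. column_attrs is a hypothetical helper, and the list below only simulates a table whose k1 column has a stray attribute; on the real broken table, touching the column may itself crash.

```r
# Hypothetical helper: names of the attributes attached to each column.
column_attrs <- function(x) lapply(x, function(col) names(attributes(col)))

# Simulated table: a plain list whose k1 column carries a stray attribute.
demo <- list(
  k1 = structure(c("a", "b"), note = "stray"),
  v1 = 1:2
)

report <- column_attrs(demo)
print(report)

# Columns that carry any attribute at all are the ones worth a closer look.
suspect <- names(Filter(Negate(is.null), report))
print(suspect)
```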
> If it really has to be by row then DT[, fun(.SD,...), by=1:nrow(DT)]
> should be better than apply().
>
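For reference, a minimal sketch of the by-row form suggested here, assuming the data.table package is installed. DT and row_fun are made up for illustration and are not the real table or function from this thread.

```r
library(data.table)

# Made-up table and per-row function, purely for illustration.
DT <- data.table(k1 = c("a", "b", "c"), v1 = 1:3)
row_fun <- function(sd) list(label = paste0(sd$k1, sd$v1))

# One group per row; .SD holds that single row.
res <- DT[, row_fun(.SD), by = 1:nrow(DT)]
print(res)
```

This keeps everything inside data.table, so no intermediate list of data.frames needs to go through rbindlist.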
> Matthew
>
>
> On 10/09/13 18:47, Chris Neff wrote:
>
> Narrowing it down further,
>
> a$x
>
> segfaults and
>
> a[,x]
>
> segfaults but
>
> a[,"x", with=FALSE]
>
> doesn't.
>
>
> On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff <caneff at gmail.com> wrote:
>
>> I'm pretty sure it is some issue of a column that thinks it is bigger
>> than it actually is. I have tried, so far in vain, to make a reproducible
>> example that I can share. I have one, but can't share it.
>>
>> What happens is this:
>>
>> A data.frame is made:
>>
>> > d = data.frame(...)
>>
>> Then I call apply over every row, calling a different function that
>> takes in a DT as well:
>>
>> l = apply(d, 1, function(x) func(x[1], x[2], DT))
>>
>> This returns a data.frame. If I rbindlist this:
>>
>> a = rbindlist(l)
>>
>> I can print a just fine, and it will show me all data like normal. But
>> if I try to just do
>>
>> a$x
>>
>> where x is one of the columns that was a key in DT, it segfaults. If I
>> ask for a column that was made by "func" and wasn't a column in DT, it
>> works fine. If I ask for only the first 10 rows and then ask for x:
>>
>> a[1:10]$x
>>
>> it works fine.
>>
>> So somewhere these key columns think they are different lengths than
>> they really are, and when I try to access it I go into memory I shouldn't
>> so I segfault. How can I verify this? Is there something about the DT I
>> can check to see what DT thinks these columns are?
>>
>>
>> Also, if instead of apply when making the list, I do
>>
>> l = lapply(1:nrow(d), function(i) func(d[i,1],d[i,2],DT))
>>
>> and rbindlist that, it works fine too.
>>
>>