[datatable-help] data.table heisenbug - large DT

DM tb2usd at gmail.com
Fri Jan 13 05:33:49 CET 2012


Happy New Year to all.

I suspect this is the same problem that others have reported, and I can add
some detail: something is causing data.table to crash or corrupt R when
using myDT[, newCol := oldCol] or similar syntax.

For context: I am using R 2.13.2, and have found this issue with data.table
versions 1.7.6, 1.7.7, and 1.7.8.  This is 64-bit R, on 64-bit Ubuntu
(Amazon EC2 m2.2xlarge and m2.4xlarge instances).

Some observations:

1. The objects are large: about 10 columns and about 10M rows.

2. The corruption shows up as errors that mention type 12.  For instance, I
tried saving the workspace frequently, and R crashed during a save (i.e. not
even during a data.table operation), with the message:
"WriteItem: unknown type 12".

3. I typically have this code executing inside a function, and I either
call the function from the R console or execute a script from the command
line.  Since I've been debugging, I've done more of the former (i.e. from
the console).  What's bizarre is that if I do not "debug()" my function, it
is likely to crash.  If, however, I run "debug(myFunction)" first, then it
almost never crashes.  Just setting "debug" seems to make a difference, but
that difference does not extend to running from a script.  I don't know why.

4. The objects tend to have NAs in some columns - that's okay, I put them
there.  Nearly all of the problems arise with one data table.  It is
constructed through merges of other data tables and operations on its own
columns.  At first, no problems appeared with the other data tables, and
because the program crashed during a specific operation on this one DT, I
believed that that one operation was the problem.  Then I added the
checkpointing code mentioned in the 2nd note, and R crashed at an earlier
stage - while saving other data tables.  In other words, no data table
operation before that one particular operation reports a problem, but this
may be like the game hot potato: the problem (e.g. some kind of memory
corruption?) could have begun earlier, and either "save()" or
"newCol := oldCol" is just the unlucky statement that was last attempted
before R crashed.
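
For anyone who wants to try the same kind of checkpointing, here is a rough
sketch of the idea, not my actual code.  It saves each object to its own file,
so that a failing save() can be attributed to one object instead of the whole
workspace.  The function name checkpoint_objects and the small data frames are
made up for illustration, and tryCatch only catches R-level errors: if R
itself goes down, the files written so far at least show how far it got.

```r
# Hypothetical helper (illustration only): save every object in a named
# list to its own .RData file and report which saves succeeded.
checkpoint_objects <- function(objs, dir = tempdir()) {
  vapply(names(objs), function(nm) {
    path <- file.path(dir, paste(nm, ".RData", sep = ""))
    tryCatch({
      env <- new.env()
      assign(nm, objs[[nm]], envir = env)   # save() needs a named object
      save(list = nm, file = path, envir = env)
      TRUE
    }, error = function(e) {
      message("save failed for ", nm, ": ", conditionMessage(e))
      FALSE
    })
  }, logical(1))
}

# Stand-in objects; the real ones would be the data tables in question.
tables <- list(a = data.frame(x = 1:3), b = data.frame(y = c(NA, 2)))
status <- checkpoint_objects(tables)
print(status)
```

Calling this after each major step gives one file per table per step, which
narrows down where things first go wrong.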

Unfortunately, I don't have code to reproduce this in a simple manner.  I
can give lots of error reports.

For instance, here is output from "traceback" when this one "newCol :=
oldCol"-esque operation occurred:

        tracemem[0x16f09c40 -> 0x1158fab8]: match unique intersect [.data.table [ myFunc
        Error in data.table(i, j, Result) : unimplemented type (12) in 'duplicate'
            6: data.table(i, j, Result)
            5: eval(expr, envir, enclos)
            4: eval(jsub, envir = x, enclos = parent.frame())
            3: `[.data.table`(myDT, , list(i, j, Result))
            2: myDT[, list(i, j, Result)]
            1: myFunc(arguments)

I will try other workarounds, e.g. not use ":=", but it's not clear that
that will solve the problem, since the problem potentially begins before my
biggest snafu.   I'd prefer not to update R just yet, since I have no idea
how that will impact other packages, but that option isn't entirely off the
table.
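
To be concrete about the kind of workaround I mean, here is a minimal sketch.
It uses a small data.frame as a stand-in (an assumption; I have not confirmed
that this avoids the crash on the real data.table), replaces ":=" with a plain
copying assignment, and then runs a sanity check on the columns.  The helper
check_columns is hypothetical, and I don't know whether such a check would
itself trip over an already-corrupted object.

```r
# Stand-in data; the real object is a data.table with ~10M rows.
myDT <- data.frame(oldCol = c(1.5, NA, 3.0))

# Copy-based column assignment, used here instead of `:=`.
myDT$newCol <- myDT$oldCol

# Hypothetical sanity check: every column should report a familiar atomic
# type and have exactly one entry per row.
check_columns <- function(d) {
  types <- vapply(d, typeof, character(1))
  lens  <- vapply(d, length, integer(1))
  known <- c("logical", "integer", "double", "character")
  all(types %in% known) && all(lens == nrow(d))
}

stopifnot(check_columns(myDT))
```

Running the check after each merge or column operation might show whether the
object is already in a bad state before the statement that finally crashes.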

Any thoughts or recommendations?

Thanks!