[datatable-help] "Chris crash" - followup questions

Matthew Dowle mdowle at mdowle.plus.com
Fri Jan 13 19:39:28 CET 2012


Hi,

Thankfully, I'm pretty sure I know what's up, and if so, a fix is imminent.

More comments inline below ...

Btw, thanks for taking the time to provide all this info, much appreciated.

Matthew


> I've managed to get the code to proceed correctly by replacing
> "myDT[, newCol := oldCol]" with:
>
>     tmpCol = myDT$oldCol
>     myDT$newCol = tmpCol
>
> This has avoided the issue.

Yes, makes sense. Is myDT the result of a merge()? Or have you changed
column names before that point using names(myDT)<-, colnames(myDT)<- or
similar?

> Forgive me if I am using datatable incorrectly or abusing it, but this
> seems to do the trick for now,

Assigning with $ copies the whole table, and always will. Once the Chris
crash bug is fixed, please change back to :=.  Or, just adding a line
"myDT = copy(myDT)" right after the merge should fix it too. Then := should
be ok on myDT afterwards. That way you just have to delete the copy line
later, which is a bit easier to do.
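
A minimal sketch of that workaround, for illustration only. The tables A and
B and their columns are made up; only myDT, oldCol and newCol come from your
description, and I'm assuming myDT comes from a merge().

```r
library(data.table)

# Hypothetical inputs standing in for the real tables
A <- data.table(id = 1:3, oldCol = c(10, 20, 30), key = "id")
B <- data.table(id = 1:3, other = c("x", "y", "z"), key = "id")

myDT <- merge(A, B)

# Temporary workaround: take a fresh copy right after the merge.
# Delete this one line once the bug is fixed.
myDT <- copy(myDT)

# := then adds the column by reference, without copying the table
myDT[, newCol := oldCol]
```

Unlike the tmpCol/$ version, only the copy() line needs to go later; the :=
line stays as it is.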

> and I had several
> other questions that arose from all of this inspection.
>
> 1. As instructed in an earlier post, to a different user, I tried
> "gcinfo(TRUE)" and "options(datatable.verbose=TRUE)".  The former didn't
> give any information that could be helpful, but the latter was quite
> interesting.  I noticed that the following messages occurred frequently:
>
>   - setkey changed the type of column 'i' from numeric to integer, no
> fractional data present.
>   - First column i failed radixorder1, reverting to regularorder1
>   - setkey incurred a copy of the whole table, due to the coercion(s)
> above.
>   - Non-first column 2 failed radixorder1, reverting to regularorder1

That's all normal.  radixorder doesn't work if the range of integers is
greater than 100,000, for example, so the coding style is to try() it and,
if it fails, revert to regular (slower) ordering. See ?sort.list in base.
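
The try-then-fall-back pattern looks roughly like this in plain base R. This
is a sketch of the idea, not the actual internals: order_with_fallback is a
made-up name, and newer versions of R may accept a wider range for
method = "radix", in which case the fallback branch simply never runs.

```r
# Hypothetical helper: try the fast radix ordering first,
# fall back to the default ordering if it errors.
order_with_fallback <- function(x) {
  o <- try(sort.list(x, method = "radix"), silent = TRUE)
  if (inherits(o, "try-error")) {
    # Radix ordering refused this input; use the regular (slower) method
    o <- sort.list(x)
  }
  o
}

x <- c(3L, 1L, 2L)
ord <- order_with_fallback(x)
x[ord]   # x in increasing order
```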

>
> 1A: Would I benefit from changing the types to integers pre-emptively, so
> that setkey doesn't have to do these coercions? (See Q 2 - how do I do
> that?)

Maybe, but it's not related to the Chris crash.  If you provide some
system.time()s we could certainly discuss in a new thread.

> 1B: Why is the whole table copied if one column is coerced?  That may get
> to be problematic for larger tables, or multiple copies (due to multiple
> keys that are not yet coerced to integers).

Hm, good point. Basically I didn't know how to coerce a column by
reference when setkey was written.  := is new, and setkey should be using
it!  FR#1744 created.   Once (and if) the segfault problems are resolved,
I'll be more confident in depending more on :=.

> 1C: What can I make of the 'failed radixorder1' messages?

They're ok, as above. Might be worth another look to see if it can be sped
up. There might be an FR on that.

> 2: My data table objects are created from several different sources, and
> several have matching columns.  However, the types are different in the
> different objects - some are numeric, some integer.  Integer is a
> perfectly fine universal type for these particular columns.  However, it
> seems that data.table only makes this coercion when "setkey()" is
> executed, rather than at the creation of the datatables.

Yes, that's how it works currently. We'll be coercing less in future, when
character cols are left as character and allowed in keys,  similarly for
numeric.

> How can I make this coercion?
> Solely via DT[, selCol := as.integer(selCol)] ?  This would seem to speed
> up all of that coercion & copying.

Oh, yes.  So if you do the coercion yourself using := first,  as setkey
should be doing,  then yes that'll speed it up.  Comment added to FR#1744
to revisit all coercions in [.data.table and elsewhere. Thanks.
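
As a rough sketch of doing the coercion yourself before setting the key (DT,
its values and the val column are made up; selCol is the name from your
question):

```r
library(data.table)

# Hypothetical table whose key column arrived as numeric (double)
DT <- data.table(selCol = c(3, 1, 2), val = c("c", "a", "b"))

# Coerce the column to integer by reference, before setkey
DT[, selCol := as.integer(selCol)]

# The key column is already integer, so setkey has nothing to coerce
setkey(DT, selCol)
```

With options(datatable.verbose=TRUE) still on, you can check whether the
coercion and whole-table-copy messages have gone away.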

>
> 3: In the datatable vignette, p. 12 (at least in my version), there is a
> statement that NA is type logical in R.  I don't know if this is causing
> issues, but that's not true.  NA as logical is the default (I think), but
> one can have an NA in a numeric - e.g. `x <- c(pi, NA); str(x)`.

c() coerces the NA (logical) to match the type of pi (numeric).  So, using
NA_real_ might be (a little bit) faster than NA,  for example inside a
loop if that coercion of a unit length vector is happening a lot. R
doesn't like lots of very small vectors, since gc() has more to do.
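
To see the types directly, here is a small base R check (the variable names
are just for illustration):

```r
# Plain NA is a logical constant; the typed NA constants are not
na_plain <- typeof(NA)          # "logical"
na_real  <- typeof(NA_real_)    # "double"
na_int   <- typeof(NA_integer_) # "integer"

# c() promotes the logical NA to the type of the other element
x <- c(pi, NA)
x_type <- typeof(x)             # "double"

# Supplying NA_real_ gives the same vector without the promotion step
y <- c(pi, NA_real_)
same <- identical(x, y)         # TRUE
```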

>  3A: I have numeric NAs in my data tables - could this be related to
> issues observed (i.e. type complaints and segfaults)?

Not related to segfaults at all.  Type complaints, maybe. If it's still a
problem after the Chris crash fix, then ping back, but I doubt it.

>  3B: Some columns are entirely (numeric) NA in some of the data tables.
> Is this setting me up for heartache?  :)

Nope, should be fine. However, there have been several fixes for (as yet
unreported) problems with character (not factor) and list columns. All of
those should already be in the latest svn revision of 1.7.8.

> These are combined with other
> datatables via rbind and merge.

Makes sense. If I'm right then the Chris crash would be triggered by
merge, or names<-.

>
> Thanks!