[datatable-help] Auto-convert characters to factors when settings keys?

Steve Lianoglou mailinglist.honeypot at gmail.com
Tue May 25 15:33:24 CEST 2010


Hi,

On Tue, May 25, 2010 at 9:15 AM, Short, Tom <TShort at epri.com> wrote:

>> Any preferences on the following options ? :
>>
>> 1. Change as.data.table to use data.table. It already does
>> when keep.rownames=TRUE but not when FALSE.  If a user really
>> wants a raw class change they can use class(x)="data.table"
>> directly. No change to data.table or setkey.  Since
>> ?as.data.table is an alias to ?data.table this would be consistent.
>>
>> 2. Change data.table and setkey. Only convert character to
>> factor at the point of setkey.  That may prevent radix being
>> used for an ad hoc by on character columns that are not in
>> the key. Would we then want to do auto-conversion in ad hoc
>> by too?  No change to as.data.table.
>>
>> 3. Steve's suggestion. Change setkey. Catch character columns
>> in setkey and auto convert them to factor at that point. No
>> change to data.table or as.data.table.
>>
>> 4. Change ?as.data.table to say it's a class change only, and to use
>> data.table() if checks and auto-conversion of character to
>> factor are required. No code changes.
>>
>> 5. Another solution.
>>
>
> I lean towards #4 and also maybe #3. It's nice to be able to "raw"
> convert back and forth between data tables and data frames, and
> as.data.table seems useful for that. A direct class assignment is okay,
> but a data frame also needs a row.names attribute. I tend not to like
> autoconversions.
>
> A couple of utility functions to do in-place raw conversions would be
> useful:
>
> setdf(d) # changes class to "data.frame", creates the "row.names" attribute,
>          # possibly removes the "sorted" attribute
> setdt(d) # changes class to "data.table", possibly deletes "row.names"
>
> This avoids a copy. I haven't needed them enough to write them, yet.
> This might be something to consider if we're making a change related to
> conversions.

I guess I don't understand why you'd want to make setdf and setdt
instead of using the as.data.frame/as.data.table functions?

Isn't the as.* more idiomatic S3-OOized R?
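
For comparison, here is a rough base-R sketch of what such "raw" conversions would involve. The names raw_to_df and raw_to_dt are hypothetical, and unlike the proposed setdf/setdt these return a modified copy rather than working in place, so this only illustrates the attribute changes, not the copy avoidance:

```r
# Hypothetical helpers, not part of data.table. They only touch the
# class and the "row.names"/"sorted" attributes; no columns are converted.
raw_to_df <- function(x) {
  n <- length(x[[1]])
  x <- unclass(x)
  attr(x, "sorted") <- NULL            # drop the key marker, if any
  attr(x, "row.names") <- seq_len(n)   # a data frame needs row names
  class(x) <- "data.frame"
  x
}

raw_to_dt <- function(x) {
  x <- unclass(x)
  attr(x, "row.names") <- NULL         # row names are not needed here
  class(x) <- "data.table"
  x
}

d <- raw_to_df(list(a = LETTERS[1:3], b = 1:3))
t <- raw_to_dt(d)
```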

Also, if we're taking a vote, I think I'd still go with #3 because I
don't think I always want to convert my strings to factors w/o my say
so. I think it's ok for this to happen:

* explicitly: by replacing the character column with its `factor(...)`
* implicitly: when asking to use the column as a key (still
firing off a warning here might be useful -- if not annoying. Perhaps
it could be turned off w/ a no.warn=TRUE argument, or something).
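
To make the two cases concrete, here is a minimal sketch in base R. The helper chars_to_factor_for_key and its no.warn argument are made up for illustration; this is not how setkey is implemented, just the behavior I have in mind:

```r
# Hypothetical helper: convert character key columns to factor,
# warning about each conversion unless no.warn = TRUE.
chars_to_factor_for_key <- function(x, cols, no.warn = FALSE) {
  for (col in cols) {
    if (is.character(x[[col]])) {
      if (!no.warn) {
        warning("converting character column '", col, "' to factor")
      }
      x[[col]] <- factor(x[[col]])
    }
  }
  x
}

df <- data.frame(a = LETTERS[1:10], b = 1:10, stringsAsFactors = FALSE)

# Explicit: the user replaces the column with its factor(...)
df_explicit <- df
df_explicit$a <- factor(df_explicit$a)

# Implicit: what a setkey-like function might do before keying
df_implicit <- chars_to_factor_for_key(df, "a", no.warn = TRUE)
```

Columns that are not named as keys (b here) are left untouched, which is the point: nothing gets converted without my say so.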

I might vote for #2 also, but I can't appreciate any real differences
between #2 and #3 -- maybe because I don't use data.table enough.

Also, you suggested to try data.table(df) instead of
as.data.table(df), but this doesn't change anything with respect to
the behavior of setting key on a character column:

R> df <- data.frame(a=LETTERS[1:10], b=1:10, stringsAsFactors=FALSE)
R> dt <- data.table(df)
R> key(dt) <- 'a'
  All keyed columns must be storage mode integer

Or

R> dt <- data.table(df, key='a')
Error in setkey(value, a) :
  All keyed columns must be storage mode integer
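
The error message refers to storage mode, which can be checked in base R. A small sketch (the variable names are only for illustration) showing that the character column is the one that fails the check, while its factor version is stored as integer:

```r
df <- data.frame(a = LETTERS[1:10], b = 1:10, stringsAsFactors = FALSE)

mode_chr <- storage.mode(df$a)          # "character"
mode_fac <- storage.mode(factor(df$a))  # "integer"
mode_int <- storage.mode(df$b)          # "integer"
```

So I would expect converting the column to a factor first to satisfy the integer check, though I haven't shown that against data.table itself here.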

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

