[datatable-help] Efficient conversion of data table column to vector

Matthew Dowle mdowle at mdowle.plus.com
Tue Aug 31 09:35:11 CEST 2010


Nicolas,

Welcome to the list.

Where the documentation mentions 'quoted', it means using the quote()
function to create an expression, not a character string. Alternatively,
you can use [[ in the usual way, since a data.table is a list.

> colexp = quote(y)   # rather than "y"
> a[,eval(colexp)]
 [1] "2010-01-01 GMT" "2010-01-02 GMT" "2010-01-03 GMT" "2010-01-04 GMT"
 [5] "2010-01-05 GMT" "2010-01-06 GMT" "2010-01-07 GMT" "2010-01-08 GMT"
 [9] "2010-01-09 GMT" "2010-01-10 GMT" "2010-01-11 GMT"
 
or

> colname = "y"
> a[[colname]]
 [1] "2010-01-01 GMT" "2010-01-02 GMT" "2010-01-03 GMT" "2010-01-04 GMT"
 [5] "2010-01-05 GMT" "2010-01-06 GMT" "2010-01-07 GMT" "2010-01-08 GMT"
 [9] "2010-01-09 GMT" "2010-01-10 GMT" "2010-01-11 GMT"
> 
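To see why eval(colname) only gave back "y", here is a minimal base R
sketch. It assumes a plain list standing in for the data.table (so it runs
without the package) and uses as.name(), which is not mentioned above, as
one way to turn a stored column name into an expression. Treat it as an
illustration of the idea rather than the recommended recipe.

```r
# Sketch only: a plain list stands in for the data.table (assumption).
a <- list(x = seq(1, 2, by = 0.1),
          y = seq(as.POSIXct("2010-01-01", tz = "GMT"),
                  as.POSIXct("2010-01-11", tz = "GMT"), length.out = 11))
colname <- "y"

# A character string evaluates to itself, so no column lookup happens.
s <- eval(colname, a)

# as.name() converts the string to a symbol; eval() then looks it up in 'a'.
colexp <- as.name(colname)
v1 <- eval(colexp, a)

# [[ extracts the same element directly by name.
v2 <- a[[colname]]

# Both routes give the same vector, with its date-time class intact.
same <- identical(v1, v2)
```

Either route keeps the POSIXct class, so no round trip through character is
needed.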

A single column name is a special case of an expression, so although this
can make for a steeper learning curve, it gives more power and
flexibility later.
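As a rough illustration of that flexibility, the base R sketch below
stores three expressions and evaluates each against the same data. The
list 'd' and the names e1, e2 and e3 are made up for this example, and a
plain list is assumed in place of a data.table so that it runs on its own.
Inside a data.table, a[, eval(e)] would be expected to behave similarly.

```r
# Sketch only: 'd' is a hypothetical stand-in for a data.table.
d <- list(x = seq(1, 2, by = 0.1))

e1 <- quote(x)        # a single column name
e2 <- quote(x * 2)    # a computation on that column
e3 <- quote(sum(x))   # an aggregate of that column

# The same eval() call handles all three forms.
r1 <- eval(e1, d)
r2 <- eval(e2, d)
r3 <- eval(e3, d)
```

A character string could only ever name a column, whereas a quoted
expression can be swapped for any computation without changing the code
that evaluates it.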

Suggestions on how to improve documentation so that 'quoting' is clearer
are very welcome. I've added an item to the list so we don't forget.

Matthew


On Mon, 2010-08-30 at 23:59 -0400, Nicolas Chapados wrote:
> Dear data.table friends and maintainers,
> 
> 
> First, thanks to the authors for this excellent package: it really
> fills a void in the R world.  However, I have a question: I'm looking
> to have an efficient conversion of a data table object to a vector (of
> the correct type) when querying a single column whose name is stored
> in a variable.  As per the vignette and the FAQ, I use the syntax
> 
> 
>     my.data.table[, colname, with=FALSE]
> 
> 
> (where colname is a variable containing my desired column name) but
> this returns another data table, not a vector.  Moreover, the eval
> syntax suggested in the FAQ simply does not work:
> 
> 
>     my.data.table[, eval(colname)]
> 
> 
> See example below.  I could use as.matrix on the result, but this
> carries out undesirable type conversion in the case of columns
> containing dates: see below.
> 
> 
> Here is an example to reproduce this problem:
> 
> 
> > require(data.table)
> Loading required package: data.table
> > a <- data.table(x=seq(1, 2, by=0.1), y=seq(as.POSIXct("2010-01-01"),
> +     as.POSIXct("2010-01-11"), length.out=11))
> > a
>         x          y
>  [1,] 1.0 2010-01-01
>  [2,] 1.1 2010-01-02
>  [3,] 1.2 2010-01-03
>  [4,] 1.3 2010-01-04
>  [5,] 1.4 2010-01-05
>  [6,] 1.5 2010-01-06
>  [7,] 1.6 2010-01-07
>  [8,] 1.7 2010-01-08
>  [9,] 1.8 2010-01-09
> [10,] 1.9 2010-01-10
> [11,] 2.0 2010-01-11
> > colname <- "y"
> 
> 
> ## The following returns a data table.  How can I get a vector, and
> ## still preserve type information?
> > a[, colname, with=FALSE]
>                y
>  [1,] 2010-01-01
>  [2,] 2010-01-02
>  [3,] 2010-01-03
>  [4,] 2010-01-04
>  [5,] 2010-01-05
>  [6,] 2010-01-06
>  [7,] 2010-01-07
>  [8,] 2010-01-08
>  [9,] 2010-01-09
> [10,] 2010-01-10
> [11,] 2010-01-11
> 
> 
> ## The eval recipe suggested in the FAQ does not work.
> > a[, eval(colname)]
> [1] "y"
> 
> 
> ## as.vector does not convert away from data.table
> > as.vector(a[, colname, with=FALSE])
>                y
>  [1,] 2010-01-01
>  [2,] 2010-01-02
>  [3,] 2010-01-03
>  [4,] 2010-01-04
>  [5,] 2010-01-05
>  [6,] 2010-01-06
>  [7,] 2010-01-07
>  [8,] 2010-01-08
>  [9,] 2010-01-09
> [10,] 2010-01-10
> [11,] 2010-01-11
> > class(as.vector(a[, colname, with=FALSE]))
> [1] "data.table"
> 
> 
> ## as.matrix loses type information (NOTE: in my case it is not
> ## acceptable to convert this character vector back to a POSIXct, due to
> ## the loss of important timezone information. Furthermore, this would
> ## be very inefficient.)
> > as.matrix(a[, colname, with=FALSE])
>       y           
>  [1,] "2010-01-01"
>  [2,] "2010-01-02"
>  [3,] "2010-01-03"
>  [4,] "2010-01-04"
>  [5,] "2010-01-05"
>  [6,] "2010-01-06"
>  [7,] "2010-01-07"
>  [8,] "2010-01-08"
>  [9,] "2010-01-09"
> [10,] "2010-01-10"
> [11,] "2010-01-11"
> > mode(as.matrix(a[, colname, with=FALSE]))
> [1] "character"
> 
> 
> ## Finally, one could go through a data.frame, but this is inefficient
> ## and it sort of defeats the purpose of using data.table...
> > as.data.frame(a[, colname, with=FALSE])[, colname]
>  [1] "2010-01-01 EST" "2010-01-02 EST" "2010-01-03 EST" "2010-01-04 EST"
>  [5] "2010-01-05 EST" "2010-01-06 EST" "2010-01-07 EST" "2010-01-08 EST"
>  [9] "2010-01-09 EST" "2010-01-10 EST" "2010-01-11 EST"
> 
> 
> 
> 
> So at this point, my imagination is running out and I'm turning to
> this list for suggestions. This would seem to be a fairly frequent
> use-case, and I'm surprised it does not appear to have previously been
> addressed.
> 
> 
> For the record, here is my sessionInfo()
> 
> 
> > sessionInfo()
> R version 2.9.2 (2009-08-24) 
> x86_64-pc-linux-gnu 
> 
> 
> locale:
> C
> 
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>     
> 
> 
> other attached packages:
> [1] data.table_1.4.1
> 
> 
> 
> 
> Thanks in advance for any help!
> + Nicolas Chapados