[datatable-help] Efficient conversion of data table column to vector
Matthew Dowle
mdowle at mdowle.plus.com
Wed Sep 1 03:07:47 CEST 2010
Thanks David. It seems I was remembering emails or posts about
quote()-ing; that doesn't actually appear in the documentation.
Apologies to Nicolas, who was misled by FAQ 1.5.
I've added FAQ 1.6 and added use of [[ into FAQ 1.5, closing FR #693 and
#1038. I'm using your "quote()-ed" style too now - that's neat.
Background: the last sentence of FAQ 1.5 used to be correct, in that
mycol="x";DT[,eval(mycol)] *did* return the column data. That works in
1.4.1 on CRAN. However, FAQ 1.1 and 1.2 are not true with 1.4.1, and
fixing that for consistency (see NEWS and posts) meant DT[,eval(mycol)]
no longer returns the column data. Hopefully we're there now.
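
To make the string-versus-quote() distinction concrete, here is a minimal
sketch. It assumes a plain data.frame as a stand-in for a data.table so
that it runs without the package installed; the res_* names are only for
illustration and do not come from the FAQ.

```r
# Stand-in for the data.table `a` used in this thread (base R only).
a <- data.frame(
  x = seq(1, 2, by = 0.1),
  y = seq(as.POSIXct("2010-01-01", tz = "GMT"),
          as.POSIXct("2010-01-11", tz = "GMT"),
          length.out = 11)
)

colname <- "y"        # a character string
colexp  <- quote(y)   # an unevaluated expression (a symbol)

# eval() of a character string just returns the string itself.
res_string <- eval(colname, a)

# eval() of a quoted expression looks `y` up inside `a`.
res_quoted <- eval(colexp, a)

# [[ extracts the column by name, because a data.frame is a list.
res_bracket <- a[[colname]]

# When only the name is available, as.name() turns it into a symbol.
res_symbol <- eval(as.name(colname), a)
```

With a data.table the corresponding forms from this thread are
DT[, eval(colexp)] and DT[[colname]]; both return the column as a vector
of its original class.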
Latest committed vignettes are now on the homepage (after one hour to
publish) rather than links to the CRAN ones. If those changes to FAQ 1.5
and 1.6 aren't fully ok please just shout.
Thanks.
On Tue, 2010-08-31 at 16:17 -0400, David Winsemius wrote:
> I sent this to Matthew offlist but he wants it "on the record", so
> here is what I sent:
> On Aug 31, 2010, at 11:56 AM, David Winsemius wrote:
>
> >
> > On Aug 31, 2010, at 3:35 AM, Matthew Dowle wrote:
> >
> >>
> >> Nicolas,
> >>
> >> Welcome to the list.
> >>
> >> Where the documentation mentions 'quoted' it means using the quote()
> >> function to create an expression, not quoting as in a character string.
> >
>
> Matthew;
>
> I think you really should look at FAQ 1.5. It says nothing about
> "quoted". It does appear to imply that if someone had executed:
>
> colname="x"
>
> ... that both DT[, colname, with=FALSE] and DT[, eval(colname)]
> should "work". Now you are saying that isn't so, that only the first
> will return anything like the expected result.
>
> --
> David
> >
> >> Alternatively you
> >> can use [[ in the usual way since a data.table is a list.
> >>
> >>> colexp = quote(y) # rather than "y"
> >>> a[,eval(colexp)]
> >> [1] "2010-01-01 GMT" "2010-01-02 GMT" "2010-01-03 GMT" "2010-01-04 GMT"
> >> [5] "2010-01-05 GMT" "2010-01-06 GMT" "2010-01-07 GMT" "2010-01-08 GMT"
> >> [9] "2010-01-09 GMT" "2010-01-10 GMT" "2010-01-11 GMT"
> >>
> >> or
> >>
> >>> colname = "y"
> >>> a[[colname]]
> >> [1] "2010-01-01 GMT" "2010-01-02 GMT" "2010-01-03 GMT" "2010-01-04 GMT"
> >> [5] "2010-01-05 GMT" "2010-01-06 GMT" "2010-01-07 GMT" "2010-01-08 GMT"
> >> [9] "2010-01-09 GMT" "2010-01-10 GMT" "2010-01-11 GMT"
> >>>
> >>
> >> A single column name is a special case of an expression, so although
> >> this can create a steeper learning curve, it results in more power and
> >> flexibility later.
> >>
> >> Suggestions on how to improve the documentation so that 'quoting' is
> >> clearer are very welcome. I've added an item to the list so we don't
> >> forget.
> >>
> >> Matthew
> >>
> >>
> >> On Mon, 2010-08-30 at 23:59 -0400, Nicolas Chapados wrote:
> >>> Dear data.table friends and maintainers,
> >>>
> >>>
> >>> First, thanks to the authors for this excellent package: it really
> >>> fills a void in the R world. However, I have a question: I'm looking
> >>> for an efficient conversion of a data table object to a vector (of
> >>> the correct type) when querying a single column whose name is stored
> >>> in a variable. As per the vignette and the FAQ, I use the syntax
> >>>
> >>>
> >>> my.data.table[, colname, with=FALSE]
> >>>
> >>>
> >>> (where colname is a variable containing my desired column name) but
> >>> this returns another data table, not a vector. Moreover, the eval
> >>> syntax suggested in the FAQ simply does not work:
> >>>
> >>>
> >>> my.data.table[, eval(colname)]
> >>>
> >>>
> >>> See example below. I could use as.matrix on the result, but this
> >>> carries out undesirable type conversion in the case of columns
> >>> containing dates: see below.
> >>>
> >>>
> >>> Here is an example to reproduce this problem:
> >>>
> >>>
> >>>> require(data.table)
> >>> Loading required package: data.table
> >>>> a <- data.table(x=seq(1, 2, by=0.1), y=seq(as.POSIXct("2010-01-01"), as.POSIXct("2010-01-11"), length.out=11))
> >>>> a
> >>> x y
> >>> [1,] 1.0 2010-01-01
> >>> [2,] 1.1 2010-01-02
> >>> [3,] 1.2 2010-01-03
> >>> [4,] 1.3 2010-01-04
> >>> [5,] 1.4 2010-01-05
> >>> [6,] 1.5 2010-01-06
> >>> [7,] 1.6 2010-01-07
> >>> [8,] 1.7 2010-01-08
> >>> [9,] 1.8 2010-01-09
> >>> [10,] 1.9 2010-01-10
> >>> [11,] 2.0 2010-01-11
> >>>> colname <- "y"
> >>>
> >>>
> >>> ## The following returns a data table. How can I get a vector, and
> >>> ## still preserve type information?
> >>>> a[, colname, with=FALSE]
> >>> y
> >>> [1,] 2010-01-01
> >>> [2,] 2010-01-02
> >>> [3,] 2010-01-03
> >>> [4,] 2010-01-04
> >>> [5,] 2010-01-05
> >>> [6,] 2010-01-06
> >>> [7,] 2010-01-07
> >>> [8,] 2010-01-08
> >>> [9,] 2010-01-09
> >>> [10,] 2010-01-10
> >>> [11,] 2010-01-11
> >>>
> >>>
> >>> ## The eval recipe suggested in the FAQ does not work.
> >>>> a[, eval(colname)]
> >>> [1] "y"
> >>>
> >>>
> >>> ## as.vector does not convert away from data.table
> >>>> as.vector(a[, colname, with=FALSE])
> >>> y
> >>> [1,] 2010-01-01
> >>> [2,] 2010-01-02
> >>> [3,] 2010-01-03
> >>> [4,] 2010-01-04
> >>> [5,] 2010-01-05
> >>> [6,] 2010-01-06
> >>> [7,] 2010-01-07
> >>> [8,] 2010-01-08
> >>> [9,] 2010-01-09
> >>> [10,] 2010-01-10
> >>> [11,] 2010-01-11
> >>>> class(as.vector(a[, colname, with=FALSE]))
> >>> [1] "data.table"
> >>>
> >>>
> >>> ## as.matrix loses type information (NOTE: in my case it is not
> >>> ## acceptable to convert this character vector back to a POSIXct, due
> >>> ## to the loss of important timezone information. Furthermore, this
> >>> ## would be very inefficient.)
> >>>> as.matrix(a[, colname, with=FALSE])
> >>> y
> >>> [1,] "2010-01-01"
> >>> [2,] "2010-01-02"
> >>> [3,] "2010-01-03"
> >>> [4,] "2010-01-04"
> >>> [5,] "2010-01-05"
> >>> [6,] "2010-01-06"
> >>> [7,] "2010-01-07"
> >>> [8,] "2010-01-08"
> >>> [9,] "2010-01-09"
> >>> [10,] "2010-01-10"
> >>> [11,] "2010-01-11"
> >>>> mode(as.matrix(a[, colname, with=FALSE]))
> >>> [1] "character"
> >>>
> >>>
> >>> ## Finally, one could go through a data.frame, but this is inefficient
> >>> ## and it sort of defeats the purpose of using data.table...
> >>>> as.data.frame(a[, colname, with=FALSE])[, colname]
> >>> [1] "2010-01-01 EST" "2010-01-02 EST" "2010-01-03 EST" "2010-01-04 EST"
> >>> [5] "2010-01-05 EST" "2010-01-06 EST" "2010-01-07 EST" "2010-01-08 EST"
> >>> [9] "2010-01-09 EST" "2010-01-10 EST" "2010-01-11 EST"
> >>>
> >>>
> >>>
> >>>
> >>> So at this point, my imagination is running out and I'm turning to
> >>> this list for suggestions. This would seem to be a fairly frequent
> >>> use case, and I'm surprised it does not appear to have been addressed
> >>> previously.
> >>>
> >>>
> >>> For the record, here is my sessionInfo()
> >>>
> >>>
> >>>> sessionInfo()
> >>> R version 2.9.2 (2009-08-24)
> >>> x86_64-pc-linux-gnu
> >>>
> >>>
> >>> locale:
> >>> C
> >>>
> >>>
> >>> attached base packages:
> >>> [1] stats graphics grDevices utils datasets methods base
> >>>
> >>>
> >>>
> >>> other attached packages:
> >>> [1] data.table_1.4.1
> >>>
> >>>
> >>>
> >>>
> >>> Thanks in advance for any help!
> >>> + Nicolas Chapados
> >>> _______________________________________________
> >>> datatable-help mailing list
> >>> datatable-help at lists.r-forge.r-project.org
> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >>
> >
> > David Winsemius, MD
> > West Hartford, CT
> >
>
> David Winsemius, MD
> West Hartford, CT
>