[datatable-help] number of rows selected in .SD subset
Arunkumar Srinivasan
aragorn168b at gmail.com
Thu Jan 22 20:11:32 CET 2015
Ben,
Great to hear that you're going through the vignette.
To get the last row, you can similarly do:
DT[, tail(.SD, 1L), by=month] # ~ as you say
DT[, .SD[.N], by=month] # ~ since .N contains the number of observations in this group
DT[, .SD[(.N-1L):.N], by=month] # ~ last two rows per group
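As a small sketch of the three forms above (the toy table `DT` and its `month`/`delay` columns are made up for illustration, they are not the flights data from the vignette):

```r
library(data.table)

# Hypothetical toy data: three groups of unequal size
DT = data.table(month = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
                delay = c(5, 7, 9, 2, 4, 1, 3, 6, 8))

last_tail = DT[, tail(.SD, 1L), by=month]     # last row per group via tail()
last_N    = DT[, .SD[.N], by=month]           # same rows, indexing .SD with .N
last_two  = DT[, .SD[(.N-1L):.N], by=month]   # last two rows per group

print(last_tail)
print(last_N)
print(last_two)
```

The first two results should contain the same rows; the third should have two rows for each month here, since every group in the toy data has at least two rows.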
However, `.SD[...]` per group is slightly slower (especially with many
groups) as each call has to go through `[.data.table` (the S3 method that
`[` dispatches to; the dispatch and the method's own overhead are paid once
per group, which can get noticeable when there are many groups), and not
all cases are optimised.
You can also use `.I` (which is deliberately not mentioned in the vignette
to keep things smooth and straightforward). Using it you could do:
idx = DT[, .I[1L], by=month][, V1]
DT[idx]
`.I` contains the row numbers in `x`, i.e. in `DT` itself (they don't reset
per group). So we can get the row index of the first element of each group,
and then simply subset. We hope to improve this subset in the future (so
that this optimisation is taken care of internally).
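To see what `.I` holds, here is a small sketch on a made-up table (the names `DT`, `month` and `delay` are illustrative assumptions, and the rows of a given month are deliberately not contiguous):

```r
library(data.table)

# Hypothetical toy data; months are interleaved on purpose
DT = data.table(month = c(1L, 2L, 1L, 3L, 2L, 1L),
                delay = c(5, 2, 7, 1, 4, 9))

# .I gives row numbers in DT itself, so they do not restart at 1 per group
rows = DT[, .(row = .I), by=month]
print(rows)

# First row number of each group, followed by one plain subset of DT
idx = DT[, .I[1L], by=month][, V1]
first = DT[idx]
print(first)
```

Note that `DT[idx]` keeps the columns in their original order, whereas the `by=` forms put the grouping column first.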
Similarly:
idx = DT[, .I[.N], by=month][, V1]
DT[idx]
will get the last element for each group.
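If you want to check the difference on your own machine, a rough timing sketch could look like this. The data are simulated (the `id` and `val` columns and the sizes are arbitrary assumptions), and the timings will depend on the machine and on the data.table version, so no numbers are shown; later releases may well have narrowed the gap.

```r
library(data.table)

set.seed(1L)
n = 100000L
# Hypothetical data with many small groups (up to 20,000 distinct ids)
DT = data.table(id = sample(20000L, n, replace=TRUE), val = runif(n))

# Last row per group through .SD
t_sd = system.time(ans_sd <- DT[, .SD[.N], by=id])

# Last row per group through .I, followed by a single subset
t_i = system.time({
  idx = DT[, .I[.N], by=id][, V1]
  ans_i = DT[idx]
})

print(t_sd)
print(t_i)

# Both routes should pick the same rows
print(all.equal(ans_sd, ans_i))
```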
Otherwise, how do you find the vignette so far?
HTH,
Arun
On Thu, Jan 22, 2015 at 6:11 PM, Ben Tupper <btupper at bigelow.org> wrote:
> Hello,
>
> I have been learning to use data.table and studying the vignette located
> here...
>
>
> https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro-vignette.html
>
> Section 2f. shows how to subset a data.table to select an arbitrary number
> of rows in each .SD. That's really handy.
>
> 2. Aggregations
> f. Subset .SD for each group: ans <- flights[, head(.SD, 2), by=month]
>
> In a similar way, I can get the last row of the .SD using either tail,
> nrow or dim (I don't think it matters much, but dim seems to be faster*).
>
> ans <- flights[,.SD[dim(.SD)[1]], by=month]
>
> I got to wondering if the number of rows in .SD might be exposed in each
> grouping iteration. Is there an equivalent to .N for the subset
> data.table, .SD? Something like .SDN or the like?
>
> Thanks for data.table!
>
> Ben
>
> * After reading this discussion
> http://r.789695.n4.nabble.com/What-is-the-fastest-way-to-determine-that-data-table-is-empty-td4638348.html#a4638451
> I tried out a couple of methods for getting the last element of a grouping
> using nrow(), tail() and dim().
>
> # using tail
> > microbenchmark( last1 <- flights[, tail(.SD, 1), by=month] )
> Unit: milliseconds
>                                          expr      min       lq     mean   median       uq      max neval
>  last1 <- flights[, tail(.SD, 1), by = month] 16.65898 16.89704 18.26415 17.37007 19.20147 40.12966   100
>
> # using dim
> > microbenchmark( last2 <- flights[,.SD[dim(.SD)[1]], by=month] )
> Unit: milliseconds
>                                              expr      min       lq     mean   median       uq      max neval
>  last2 <- flights[, .SD[dim(.SD)[1]], by = month] 15.51243 15.87788 17.40978 16.19426 17.83308 59.22429   100
>
> # using nrow
> > microbenchmark( last3 <- flights[,.SD[nrow(.SD)], by=month] )
> Unit: milliseconds
>                                            expr      min       lq     mean   median       uq      max neval
>  last3 <- flights[, .SD[nrow(.SD)], by = month] 15.63919 15.92073 17.28836 16.52588 18.33867 24.92624   100
>
> > identical(last1, last2)
> [1] TRUE
> > identical(last1, last3)
> [1] TRUE
>
> Ben Tupper
> Bigelow Laboratory for Ocean Sciences
> 60 Bigelow Drive, P.O. Box 380
> East Boothbay, Maine 04544
> http://www.bigelow.org