[datatable-help] using sample() in data.table

Paulo van Breugel p.vanbreugel at gmail.com
Sat Jun 23 08:19:49 CEST 2012


Hi Matthew,

Thanks for the suggestions. The tapply in the code below transforms the
table from long format to wide format, with wdpaint as columns and pnvid
as rows. The main reason for using it is that the result includes all
combinations of the two variables, including those with 0 observations.
The code you are suggesting indeed seems to be equivalent to ordering the
table.
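
To illustrate the difference, here is a minimal sketch with made-up counts
(the values and the plain data.frame standing in for b are assumptions, not
my real data). Note that tapply marks combinations that never occur as NA,
which can then be set to 0:

```r
# Toy long-format counts, standing in for b <- a3[, .N, by = list(V1, V2)].
# The values are made up for illustration.
b <- data.frame(V1 = c(1L, 2L, 2L, 3L),   # shuffled wdpaint
                V2 = c(1L, 1L, 2L, 2L),   # pnvid
                N  = c(5L, 2L, 7L, 1L))

# Long format, ordered: one row per observed combination only.
long <- b[order(b$V2, b$V1), ]

# Wide format: pnvid as rows, wdpaint as columns, every combination present.
wide <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)

# Combinations that never occur come back as NA; set them to 0.
wide[is.na(wide)] <- 0
```

The ordered long table has 4 rows here, while the wide matrix always has
one cell per pnvid/wdpaint pair, which is what I need for stacking the
permutations.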

Cheers,

Paulo



On Fri, Jun 22, 2012 at 4:55 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
> Great. Thanks for keeping the list updated.
>
> One thing I don't quite see, instead of :
>
> for (i in 1:12) {
>     a3 <- a1[,V1:=sample(a2,replace=F)]
>     b <- a3[,.N,by=list(V1,V2)]
>     c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
> }
>
> why not :
>
> for (i in 1:12) {
>     a3 <- a1[,V1:=sample(a2,replace=F)]
>     b <- a3[,.N,by=list(V1,V2)]
>     # name the summed column so it cannot be confused with the grouping column V1
>     b2 <- b[,list(N=sum(N)),by=list(V2,V1)]
>     c[[i]] <- b2$N
> }
>
> Idea being to save the tapply and the 2 as.factor. Further, I'm not sure
> that sum() will be summing anything will it?  Isn't b2 the same as
> b[order(V2,V1)], and if so that will be faster still?
>
> Matthew
>
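
On the point that sum() has nothing to sum: b already has one row per
(V1, V2) pair, so a grouped sum just returns the same counts. A small
sketch with made-up counts, using base R's aggregate() as a stand-in for
the data.table grouping (an assumption, to keep it free of package
dependencies):

```r
# Made-up counts: one row per (V1, V2) combination, in arbitrary order.
b <- data.frame(V1 = c(2L, 1L, 3L, 2L),
                V2 = c(2L, 1L, 2L, 1L),
                N  = c(7L, 5L, 1L, 2L))

# Grouped sum over (V2, V1); each group holds exactly one row.
b2 <- aggregate(N ~ V2 + V1, data = b, FUN = sum)
b2 <- b2[order(b2$V2, b2$V1), ]

# Plain reordering of the original counts.
ordered <- b[order(b$V2, b$V1), ]

# Both routes give the same counts in the same order.
stopifnot(all(b2$N == ordered$N))
```
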
> > I got some very useful further feedback from Matthew. Let me summarize
> > some key points from his suggestions concerning the code below:
> >
> > The following code is still fairly slow (although faster than using
> > table or tapply):
> >
> >    a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
> >
> >    b <- a[,.N,by=list(V1,V2)]
> >
> >    c <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
> >
> >    for(i in 1:11){
> >
> >      a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
> >
> >      b <- a[,.N,by=list(V1,V2)]
> >
> >      c <- rbind(c,tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum))
> >
> >   }
> >
> > As pointed out by Matthew, the rbind at the end of the loop keeps
> > growing the result, which increases memory use and is generally
> > inefficient. How badly it impacts performance will depend on the data
> > size, though. So step 1 is to get it outside the loop (a useful link he
> > provided is
> > http://stackoverflow.com/questions/10452249/divide-et-impera-on-a-data-frame-in-r).
> > Based on a hint in R-inferno
> > (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) I adapted the code
> > as follows:
> >
> > c <- vector('list', 12)
> >
> > a1 <- data.table(SPFn$wdpaint,SPFn$pnvid)
> >
> > a2 <- SPFn$wdpaint
> >
> > for(i in 1:12){
> >
> >      a3 <- a1[,V1:=sample(a2,replace=F)]
> >
> >      b <- a3[,.N,by=list(V1,V2)]
> >
> >      c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
> >
> > }
> >
> > c <- do.call('rbind', c)
> >
> > This did improve the run time, but only a little (16.0 instead of
> > 16.4 seconds). The next step was to profile the code to see which part
> > takes the most time. This can be done with Rprof(). The results showed
> > that ordernumtol, a data.table function which sorts numeric ('double'
> > floating point) columns, was taking a lot of time. As it turns out,
> > SPFn$wdpaint and SPFn$pnvid were both stored as numeric. Changing them
> > to integer speeds up the code a lot:
> >
> > c <- vector('list', 12)
> >
> > a1 <- data.table(as.integer(SPFn$wdpaint),as.integer(SPFn$pnvid))
> >
> > a2 <- as.integer(SPFn$wdpaint)
> >
> > for(i in 1:12){
> >
> >      a3 <- a1[,V1:=sample(a2,replace=F)]
> >
> >      b <- a3[,.N,by=list(V1,V2)]
> >
> >      c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
> >
> > }
> >
> > c <- do.call('rbind', c)
> >
> > The second version took 16.0 seconds; the last attempt took only 2.4
> > seconds! That is a serious (> 6x) improvement. And it shows I really
> > need to be much more careful about my variable types...
> > I checked, and the conversion also makes a smaller but still very
> > significant difference when using table (3x) or tapply (2x).
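
A quick way to check how a column is stored before converting it, sketched
in base R. The vector below is a hypothetical stand-in for a column such as
SPFn$wdpaint, and the guard against losing information is my own addition:

```r
# Hypothetical stand-in for a column of whole numbers stored as double.
wdpaint <- c(3, 1, 10, 7, 3)
typeof(wdpaint)              # "double"

# Convert only if every value survives the conversion unchanged.
stopifnot(all(wdpaint == as.integer(wdpaint)))
wdpaint_int <- as.integer(wdpaint)
typeof(wdpaint_int)          # "integer"
```
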
> >
> > Big thanks to Matthew Dowle for all his help.. and any further
> > suggestions for improvements are obviously welcome.
> >
> > Cheers,
> >
> > Paulo
> >
> >
> >
> > On 06/19/2012 04:24 PM, Matthew Dowle wrote:
> >> The shuffling can form a different number of groups can't it?
> > YES, obvious.. I was half asleep I guess
> >>
> >> table(c(1,1,2,2), c(3,3,4,4))   # 2 groups
> >> table(c(2,2,1,1), c(3,3,4,4))   # 2 groups
> >> table(c(2,1,2,1), c(3,3,4,4))   # 4 groups
> >>
> >>
> >>> Thanks Matthew
> >>>
> >>> I am not sure I understand the code (actually, I am sure I do not :-( .
> >>> More specifically, I would expect the two expressions below to yield
> >>> tables of the same dimension (basically all combinations of wdpaint
> >>> and pnvid):
> >>>
> >>> aa <- SPFdt[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
> >>> dim(aa)
> >>>> 254  3
> >>> bb <- SPFdt[, .N, by=list(wdpaint,pnvid)]
> >>> dim(bb)
> >>>> 170 3
> >>> What I am looking for is creating a cross table of pnvid and wdpaint,
> >>> i.e., the frequency or number of occurrences of each combination of
> >>> pnvid and wdpaint. Shuffling wdpaint should in that case give a
> >>> different frequency distribution, like in the example below:
> >>>
> >>> table(c(1,1,2,2), c(3,3,4,4))
> >>> table(c(2,2,1,1), c(3,3,4,4))
> >>>
> >>> Basically what I want to do is run X permutations on a data set which I
> >>> will then use to create a confidence interval on the frequency
> >>> distribution of sample points over wdpaint and pnvid.
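
A rough sketch of that permutation idea in base R, on made-up data. The
sample size, the seed, the number of permutations and the 95% interval are
all assumptions for illustration, and table() stands in for the faster
data.table counting discussed above:

```r
set.seed(1)  # only to make the sketch reproducible

# Made-up data standing in for the real sample points.
pnvid   <- factor(sample(1:3,  200, replace = TRUE), levels = 1:3)
wdpaint <- factor(sample(1:10, 200, replace = TRUE), levels = 1:10)

n_perm <- 99  # hypothetical number of permutations

# One 3 x 10 cross table per shuffle of wdpaint, stacked into a
# 3 x 10 x n_perm array. Fixed factor levels keep empty cells as 0.
perm <- replicate(n_perm, table(pnvid, sample(wdpaint)))

# Per-cell 2.5% and 97.5% quantiles across the permutations.
ci <- apply(perm, c(1, 2), quantile, probs = c(0.025, 0.975))
dim(ci)
```

Because both variables are factors with fixed levels, every permutation
yields a table of the same dimension, so the results can be stacked
without the tapply step.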
> >>>
> >>> Cheers,
> >>>
> >>> Paulo
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Jun 19, 2012 at 3:30 PM, Matthew Dowle
> >>> <mdowle at mdowle.plus.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Welcome to the list.
> >>>>
> >>>> Rather than picking a column and calling length() on it, .N is a
> >>>> little more convenient (and faster if that column isn't otherwise
> >>>> used, as in this example). Search ?data.table for the string ".N" to
> >>>> find out more.
> >>>>
> >>>> And to group by expressions of column names, wrap with list().  So,
> >>>>
> >>>>     SPF[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
> >>>>
> >>>> But that won't calculate any different statistics, just return the
> >>>> groups in a different order. Seems like just an example, rather than
> >>>> the real task, iiuc, which is fine of course.
> >>>>
> >>>> Matthew
> >>>>
> >>>>
> >>>>> Hi, I am new to this package and not sure how to implement the
> >>>>> sample() function with data.table.
> >>>>>
> >>>>> I have a data frame SPF with three columns: cat, pnvid and wdpaint.
> >>>>> The pnvid variable has values 1:3, and wdpaint has values 1:10. I am
> >>>>> interested in the count of all combinations of wdpaint and pnvid in
> >>>>> my data set, which can be calculated using table or tapply (I use the
> >>>>> latter in the example code below).
> >>>>>
> >>>>> Normally I would use something like:
> >>>>>
> >>>>> *c <- tapply(SPF$cat, list(as.factor(SPF$pnvid),
> >>>>> as.factor(SPF$wdpaint)), function(x) length(x))*
> >>>>>
> >>>>> If I understand correctly, I would use the below when working with
> >>>>> data tables:
> >>>>>
> >>>>> *f <- SPF[,length(cat),by="wdpaint,pnvid"]*
> >>>>>
> >>>>> But what if I want to reshuffle the column wdpaint first? When using
> >>>>> tapply, it would be something along the lines of:
> >>>>>
> >>>>> *a <- list(as.factor(SPF$pnvid), as.factor(sample(SPF$wdpaint,
> >>>>> replace=F)))
> >>>>> c <- tapply(SPF$cat, a, function(x) length(x))*
> >>>>>
> >>>>>
> >>>>> But how to do this with data.table?
> >>>>>
> >>>>> Paulo
> >>>>> _______________________________________________
> >>>>> datatable-help mailing list
> >>>>> datatable-help at lists.r-forge.r-project.org
> >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help