[datatable-help] using sample() in data.table

Paulo van Breugel p.vanbreugel at gmail.com
Fri Jun 22 15:14:04 CEST 2012


I got some very useful further feedback from Matthew. Let me summarize 
some key points from his suggestions concerning the code below:

The following code is still fairly slow (although faster than using 
table or tapply):

   # first shuffle starts the result table
   a <- data.table(sample(SPFn$wdpaint, replace=FALSE), SPFn$pnvid)
   b <- a[, .N, by=list(V1,V2)]
   c <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)

   # 11 more shuffles, each appended with rbind (c grows on every iteration)
   for(i in 1:11){
     a <- data.table(sample(SPFn$wdpaint, replace=FALSE), SPFn$pnvid)
     b <- a[, .N, by=list(V1,V2)]
     c <- rbind(c, tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum))
   }

As pointed out by Matthew, the rbind at the end of the loop makes memory 
use grow on every iteration and is generally inefficient. How badly it 
affects performance depends on the data size, though. So step 1 is to 
get the rbind outside the loop (a useful link he provided is 
http://stackoverflow.com/questions/10452249/divide-et-impera-on-a-data-frame-in-r). 
Based on a hint in R-inferno 
(http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) I adapted the code 
as follows:

c <- vector('list', 12)    # one slot per permutation, allocated up front
a1 <- data.table(SPFn$wdpaint, SPFn$pnvid)
a2 <- SPFn$wdpaint

for(i in 1:12){
     a3 <- a1[, V1:=sample(a2, replace=FALSE)]   # overwrite V1 with a new shuffle
     b <- a3[, .N, by=list(V1,V2)]
     c[[i]] <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)
}

c <- do.call('rbind', c)   # bind once, after the loop

This did improve the run time, but only slightly (16.0 instead of 16.4 
seconds). The next step was to profile the code with Rprof() to see 
which part takes the most time. The results showed that ordernumtol, a 
data.table function that sorts numeric ('double' floating point) 
columns, was taking a lot of time. As it turns out, SPFn$wdpaint and 
SPFn$pnvid were both stored as numeric (double). Changing them to 
integer speeds up the code a lot:

c <- vector('list', 12)    # one slot per permutation, allocated up front
a1 <- data.table(as.integer(SPFn$wdpaint), as.integer(SPFn$pnvid))   # integer columns
a2 <- as.integer(SPFn$wdpaint)

for(i in 1:12){
     a3 <- a1[, V1:=sample(a2, replace=FALSE)]   # overwrite V1 with a new shuffle
     b <- a3[, .N, by=list(V1,V2)]
     c[[i]] <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)
}

c <- do.call('rbind', c)   # bind once, after the loop
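
For anyone who wants to repeat the profiling step, something along these 
lines should work. It is only a minimal sketch in base R: the vectors 
wdpaint and pnvid are synthetic stand-ins for the SPFn columns (sizes and 
values are assumptions), and permute_counts() is a hypothetical helper 
name, not part of the code above.

```r
# Synthetic stand-ins for SPFn$wdpaint and SPFn$pnvid (assumed sizes/values).
set.seed(1)
n <- 1000000
wdpaint <- sample(1:10, n, replace = TRUE)
pnvid   <- sample(1:3,  n, replace = TRUE)

# Hypothetical helper: shuffle wdpaint 'reps' times and cross-tabulate
# each shuffle against pnvid, collecting the tables in a preallocated list.
permute_counts <- function(reps) {
  out <- vector("list", reps)
  for (i in seq_len(reps)) {
    shuffled <- sample(wdpaint, replace = FALSE)
    out[[i]] <- tapply(shuffled,
                       list(as.factor(pnvid), as.factor(shuffled)),
                       length)
  }
  do.call(rbind, out)
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)            # start the sampling profiler
res <- permute_counts(12)
Rprof(NULL)                 # stop profiling

# Functions ranked by the time spent in them (excluding their callees).
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

The by.self table is where a function such as ordernumtol would show up 
if it dominates the run time.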

The second code took 16.0 seconds. The last attempt 2.4 seconds only! 
That is a serious (> 6x) improvement. And it shows I really need to be 
much more careful about my variables...
I checked and it also makes a smaller, but still very significant 
difference when using table (3x) or tapply (2x).
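
A quick way to check for this kind of type problem is sketched below. It 
uses synthetic data in place of SPFn (an assumption), and it only 
compares the same grouping on double and integer columns. How large the 
gap is will depend on the data and on the data.table version, so no 
timings are claimed here.

```r
library(data.table)

set.seed(42)
n <- 200000
# Whole numbers stored as double, as wdpaint and pnvid were in SPFn.
wdpaint <- as.numeric(sample(1:10, n, replace = TRUE))
pnvid   <- as.numeric(sample(1:3,  n, replace = TRUE))

typeof(wdpaint)   # "double", even though every value is a whole number

dt_dbl <- data.table(V1 = wdpaint, V2 = pnvid)
dt_int <- data.table(V1 = as.integer(wdpaint), V2 = as.integer(pnvid))

# Same grouping on both tables; only the column type differs.
t_dbl <- system.time(n_dbl <- dt_dbl[, .N, by = list(V1, V2)])
t_int <- system.time(n_int <- dt_int[, .N, by = list(V1, V2)])

# Put both results in the same order and confirm the counts agree.
setorder(n_dbl, V1, V2)
setorder(n_int, V1, V2)
all(n_dbl$N == n_int$N)

# Elapsed seconds for each variant.
rbind(double = t_dbl, integer = t_int)[, "elapsed"]
```

Converting with as.integer() is only safe when the values really are 
whole numbers, which all(x == as.integer(x)) can confirm beforehand.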

Big thanks to Matthew Dowle for all his help.. and any further 
suggestions for improvements are obviously welcome.

Cheers,

Paulo



On 06/19/2012 04:24 PM, Matthew Dowle wrote:
> The shuffling can form a different number of groups can't it?
YES, obvious.. I was half asleep I guess
>
> table(c(1,1,2,2), c(3,3,4,4))   # 2 groups
> table(c(2,2,1,1), c(3,3,4,4))   # 2 groups
> table(c(2,1,2,1), c(3,3,4,4))   # 4 groups
>
>
>> Thanks Matthew
>>
>> I am not sure I understand the code (actually, I am sure I do not :-( ).
>> More specifically, I would expect the two expressions below to yield
>> tables of the same dimension (basically all combinations of wdpaint and
>> pnvid):
>>
>> aa <- SPFdt[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
>> dim(aa)
>>> 254  3
>> bb <- SPFdt[, .N, by=list(wdpaint,pnvid)]
>> dim(bb)
>>> 170 3
>> What I am looking for is creating a cross table of pnvid and wdpaint, i.e.,
>> the frequency or number of occurrences of each combination of pnvid and
>> wdpaint. Shuffling wdpaint should in that case give a different frequency
>> distribution, like in the example below:
>>
>> table(c(1,1,2,2), c(3,3,4,4))
>> table(c(2,2,1,1), c(3,3,4,4))
>>
>> Basically what I want to do is run X permutations on a data set, which I
>> will then use to create a confidence interval on the frequency distribution
>> of sample points over wdpaint and pnvid.
>>
>> Cheers,
>>
>> Paulo
>>
>>
>>
>>
>>
>> On Tue, Jun 19, 2012 at 3:30 PM, Matthew Dowle
>> <mdowle at mdowle.plus.com> wrote:
>>
>>> Hi,
>>>
>>> Welcome to the list.
>>>
>>> Rather than picking a column and calling length() on it, .N is a little
>>> more convenient (and faster if that column isn't otherwise used, as in
>>> this example). Search ?data.table for the string ".N" to find out more.
>>>
>>> And to group by expressions of column names, wrap with list().  So,
>>>
>>>     SPF[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
>>>
>>> But that won't calculate any different statistics, just return the groups
>>> in a different order. Seems like just an example, rather than the real
>>> task, iiuc, which is fine of course.
>>>
>>> Matthew
>>>
>>>
>>>> Hi, I am new to this package and not sure how to implement the sample()
>>>> function with data.table.
>>>>
>>>> I have a data frame SPF with three columns: cat, pnvid and wdpaint. The
>>>> pnvid variable has values 1:3, and wdpaint has values 1:10. I am
>>>> interested in the count of all combinations of wdpaint and pnvid in my
>>>> data set, which can be calculated using table or tapply (I use the latter
>>>> in the example code below).
>>>>
>>>> Normally I would use something like:
>>>>
>>>> c <- tapply(SPF$cat, list(as.factor(SPF$pnvid), as.factor(SPF$wdpaint)),
>>>>             function(x) length(x))
>>>>
>>>> If I understand correctly, I would use the below when working with data
>>>> tables:
>>>>
>>>> f <- SPF[,length(cat),by="wdpaint,pnvid"]
>>>>
>>>> But what if I want to reshuffle the column wdpaint first? When using
>>>> tapply, it would be something along the lines of:
>>>>
>>>> a <- list(as.factor(SPF$pnvid), as.factor(sample(SPF$wdpaint,
>>>>           replace=FALSE)))
>>>> c <- tapply(SPF$cat, a, function(x) length(x))
>>>>
>>>>
>>>> But how to do this with data.table?
>>>>
>>>> Paulo
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>>
>>>
>
>
>

