[datatable-help] using sample() in data.table

Fri Jun 22 16:55:06 CEST 2012

Great. Thanks for keeping the list updated.

One thing I don't quite see, instead of :

for (i in 1:12) {
    a3 <- a1[,V1:=sample(a2,replace=F)]
    b <- a3[,.N,by=list(V1,V2)]
    c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
}

why not :

for (i in 1:12){
    a3 <- a1[,V1:=sample(a2,replace=F)]
    b <- a3[,.N,by=list(V1,V2)]
    # name the sum: left unnamed it would be called V1 and clash with the group column V1
    b2 <- b[,list(N=sum(N)),by=list(V2,V1)]
    c[[i]] <- b2$N
}

The idea is to save the tapply and the two as.factor calls. Further, I'm
not sure that sum() will be summing anything, will it? b already has one
row per (V1,V2) group, so isn't b2 the same as b[order(V2,V1)], and if so
won't that be faster still?
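
A small check of that last point, as a sketch only. The toy columns below
are made up and merely stand in for the real wdpaint and pnvid data, and
the data.table package is assumed to be installed. After grouping by V1
and V2 with .N, each (V1,V2) pair occurs once, so a second sum(N) over the
same groups should leave the counts unchanged and only reorder the rows:

```r
library(data.table)

set.seed(1)
# Toy stand-ins for the real columns (hypothetical data)
a1 <- data.table(V1 = sample(1:10, 200, replace = TRUE),
                 V2 = sample(1:3, 200, replace = TRUE))

# One row per (V1, V2) combination, with its count
b <- a1[, .N, by = list(V1, V2)]

# Re-aggregating over the same groups: every group holds exactly one row
b2 <- b[, list(N = sum(N)), by = list(V2, V1)]

# Ordering b by (V2, V1) gives the same counts without a second grouping
b3 <- b[order(V2, V1)]

# Each (V1, V2) pair appears only once in b
no_duplicate_groups <- anyDuplicated(paste(b$V1, b$V2)) == 0L
# Once both are sorted the same way, the counts agree
same_counts <- identical(b2[order(V2, V1)]$N, b3$N)
```

If both flags come out TRUE on the real data as well, the ordered version
could replace the second grouping.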

Matthew

> I got some very useful further feed back from Matthew. Let me summarize
> some key points from his suggestions concerning the code below:
>
> The following code is still fairly slow (although faster than using
> table or tapply):
>
>    a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
>
>    b <- a[,.N,by=list(V1,V2)]
>
>    c <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
>
>    for(i in 1:11){
>
>      a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)
>
>      b <- a[,.N,by=list(V1,V2)]
>
>      c <- rbind(c,tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum))
>
>   }
>
> As pointed out by Matthew, the rbind at the end of the loop grows the
> result on every iteration, which increases memory use and is generally
> inefficient. How badly it affects performance will depend on the data
> size, though. So step 1 is to get that outside the loop (a useful link
> he provided is
> http://stackoverflow.com/questions/10452249/divide-et-impera-on-a-data-frame-in-r).
> Based on a hint in R-inferno
> (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) I adapted the code
> as follows:
>
> c <- vector('list', 12)
>
> a1 <- data.table(SPFn$wdpaint,SPFn$pnvid)
>
> a2 <- SPFn$wdpaint
>
> for(i in 1:12){
>
>      a3 <- a1[,V1:=sample(a2,replace=F)]
>
>      b <- a3[,.N,by=list(V1,V2)]
>
>      c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
>
> }
>
> c <- do.call('rbind', c)
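
A self-contained sketch of the preallocate-then-bind pattern quoted above,
using base R only. The two vectors are hypothetical stand-ins for
SPFn$wdpaint and SPFn$pnvid, and one_perm() is a helper name made up for
this illustration:

```r
set.seed(42)
# Hypothetical stand-ins for SPFn$wdpaint and SPFn$pnvid
wdpaint <- sample(1:10, 500, replace = TRUE)
pnvid   <- sample(1:3, 500, replace = TRUE)

# One permutation: shuffle wdpaint, then cross-tabulate it against pnvid
one_perm <- function() {
  unclass(table(pnvid, sample(wdpaint)))
}

n_perm <- 12

# Preallocate the list, fill it inside the loop, bind once at the end
res <- vector("list", n_perm)
for (i in seq_len(n_perm)) {
  res[[i]] <- one_perm()
}
res <- do.call(rbind, res)

dim(res)
```

The result is only allocated once at its final size, instead of being
copied and extended on every iteration.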
>
> This did improve the run time, but only a little (16.0 instead of
> 16.4 seconds). The next step was to profile the code to see which part
> takes the most time. This can be done with Rprof(). The results showed
> that ordernumtol, a data.table function which sorts numeric ('double'
> floating point) columns, was taking a lot of time. As it turns out,
> SPFn$wdpaint and SPFn$pnvid were both numeric (double). Changing these
> to integer speeds up the code a lot.
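
A minimal sketch of those two steps in base R: checking the storage type
of a column and profiling a loop with Rprof(). The vector is a made-up
stand-in for SPFn$wdpaint, and the profiled loop is only a placeholder for
the real code:

```r
set.seed(1)
# Hypothetical numeric (double) column standing in for SPFn$wdpaint
x_dbl <- as.numeric(sample(1:10, 1e5, replace = TRUE))
x_int <- as.integer(x_dbl)

typeof(x_dbl)  # "double"
typeof(x_int)  # "integer"

# Profile a piece of code: start the profiler, run the code, stop it
prof_file <- tempfile()
Rprof(prof_file)
for (i in 1:20) tab <- table(sample(x_dbl))
Rprof(NULL)

# Time spent per function; the top rows are the candidates to look at
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

Which functions show up at the top will depend on the code being profiled
and on the package versions in use.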
>
> c <- vector('list', 12)
>
> a1 <- data.table(as.integer(SPFn$wdpaint),as.integer(SPFn$pnvid))
>
> a2 <- as.integer(SPFn$wdpaint)
>
> for(i in 1:12){
>
>      a3 <- a1[,V1:=sample(a2,replace=F)]
>
>      b <- a3[,.N,by=list(V1,V2)]
>
>      c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)
>
> }
>
> c <- do.call('rbind', c)
>
> The second version took 16.0 seconds, the last attempt only 2.4 seconds!
> That is a serious (> 6x) improvement. And it shows I really need to be
> much more careful about my variable types...
> I checked, and converting to integer also makes a smaller, but still
> very significant, difference when using table (3x) or tapply (2x).
>
> Big thanks to Matthew Dowle for all his help.. and any further
> suggestions for improvements are obviously welcome.
>
> Cheers,
>
> Paulo
>
>
>
> On 06/19/2012 04:24 PM, Matthew Dowle wrote:
>> The shuffling can form a different number of groups can't it?
> YES, obvious.. I was half asleep I guess
>>
>> table(c(1,1,2,2), c(3,3,4,4))   # 2 groups
>> table(c(2,2,1,1), c(3,3,4,4))   # 2 groups
>> table(c(2,1,2,1), c(3,3,4,4))   # 4 groups
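
The group counts in the comments above can be checked by counting the
non-zero cells of each table. n_groups() is a helper name made up for this
illustration:

```r
# Count the (x, y) combinations that actually occur, i.e. the non-zero cells
n_groups <- function(x, y) sum(table(x, y) > 0)

n_groups(c(1, 1, 2, 2), c(3, 3, 4, 4))  # 2
n_groups(c(2, 2, 1, 1), c(3, 3, 4, 4))  # 2
n_groups(c(2, 1, 2, 1), c(3, 3, 4, 4))  # 4
```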
>>
>>
>>> Thanks Matthew
>>>
>>> I am not sure I understand the code (actually, I am sure I do not :-( ).
>>> More specifically, I would expect the two expressions below to yield
>>> tables of the same dimension (basically all combinations of wdpaint and
>>> pnvid):
>>>
>>> aa <- SPFdt[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
>>> dim(aa)
>>>> 254  3
>>> bb <- SPFdt[, .N, by=list(wdpaint,pnvid)]
>>> dim(bb)
>>>> 170 3
>>> What I am looking for is creating a cross table of pnvid and wdpaint,
>>> i.e., the frequency or number of occurrences of each combination of
>>> pnvid and wdpaint. Shuffling wdpaint should give in that case a
>>> different frequency distribution, like in the example below:
>>>
>>> table(c(1,1,2,2), c(3,3,4,4))
>>> table(c(2,2,1,1), c(3,3,4,4))
>>>
>>> Basically, what I want to do is run X permutations on a data set, which
>>> I will then use to create a confidence interval on the frequency
>>> distribution of sample points over wdpaint and pnvid
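
One possible way to set up such a permutation interval in base R, as a
sketch only. The data are made up, the number of permutations and the
2.5%/97.5% quantiles are arbitrary choices, and fixing the factor levels
up front is an assumption made here so that every permutation yields a
table of the same shape:

```r
set.seed(123)
# Hypothetical data standing in for the pnvid and wdpaint columns
pnvid   <- sample(1:3, 300, replace = TRUE)
wdpaint <- sample(1:10, 300, replace = TRUE)

# Fix the factor levels so every permutation yields a 3 x 10 table,
# including cells whose count happens to be zero
pn_f <- factor(pnvid, levels = 1:3)
wd_f <- factor(wdpaint, levels = 1:10)

n_perm <- 200
# perms[i, j, k]: count for pnvid level i and wdpaint level j in permutation k
perms <- replicate(n_perm, unclass(table(pn_f, sample(wd_f))))

# Cell-wise 2.5% and 97.5% quantiles across the permutations
lower <- apply(perms, c(1, 2), quantile, probs = 0.025)
upper <- apply(perms, c(1, 2), quantile, probs = 0.975)

# Observed (unshuffled) cross table to compare against the interval
observed <- unclass(table(pn_f, wd_f))
```

Cells of observed that fall outside [lower, upper] are the combinations
whose counts differ most from what shuffling alone produces.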
>>>
>>> Cheers,
>>>
>>> Paulo
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 19, 2012 at 3:30 PM, Matthew Dowle
>>> <mdowle at mdowle.plus.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Welcome to the list.
>>>>
>>>> Rather than picking a column and calling length() on it, .N is a
>>>> little more convenient (and faster if that column isn't otherwise
>>>> used, as in this example). Search ?data.table for the string ".N" to
>>>> find out more.
>>>>
>>>> And to group by expressions of column names, wrap with list().  So,
>>>>
>>>>     SPF[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]
>>>>
>>>> But that won't calculate any different statistics, just return the
>>>> groups in a different order. Seems like just an example, rather than
>>>> the real task, iiuc, which is fine of course.
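
A short sketch of the two points above (.N instead of length(), and
list() for grouping by an expression), assuming the data.table package is
installed. The table is made up to match the three columns described in
the thread, and wd_shuffled is just an illustrative column name:

```r
library(data.table)

set.seed(7)
# Hypothetical table with the three columns described in the thread
SPF <- data.table(cat     = seq_len(100),
                  pnvid   = sample(1:3, 100, replace = TRUE),
                  wdpaint = sample(1:10, 100, replace = TRUE))

# Counting by picking a column and calling length() on it
f1 <- SPF[, length(cat), by = "wdpaint,pnvid"]

# The same counts with .N, which needs no column at all
f2 <- SPF[, .N, by = list(wdpaint, pnvid)]

# Grouping by an expression: wrap it in list() and give it a name
f3 <- SPF[, .N, by = list(wd_shuffled = sample(wdpaint), pnvid)]
```

Here f1 and f2 hold the same counts (in column V1 and column N
respectively), while f3 counts the combinations after shuffling wdpaint.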
>>>>
>>>> Matthew
>>>>
>>>>
>>>>> Hi, I am new to this package and not sure how to implement the
>>>>> sample() function with data.table.
>>>>>
>>>>> I have a data frame SPF with three columns: cat, pnvid and wdpaint.
>>>>> The pnvid variable has values 1:3, and wdpaint has values 1:10. I am
>>>>> interested in the count of all combinations of wdpaint and pnvid in
>>>>> my data set, which can be calculated using table or tapply (I use the
>>>>> latter in the example code below).
>>>>>
>>>>> Normally I would use something like:
>>>>>
>>>>> *c <- tapply(SPF$cat, list(as.factor(SPF$pnvid),
>>>>> as.factor(SPF$wdpaint)), function(x) length(x))*
>>>>>
>>>>> If I understand correctly, I would use the below when working with
>>>>> data tables:
>>>>>
>>>>> *f <- SPF[,length(cat),by="wdpaint,pnvid"]*
>>>>>
>>>>> But what if I want to reshuffle the column wdpaint first? When using
>>>>> tapply, it would be something along the lines of:
>>>>>
>>>>> *a <- list(as.factor(SPF$pnvid), as.factor(sample(SPF$wdpaint,
>>>>> replace=F)))
>>>>> c <- tapply(SPF$cat, a, function(x) length(x))*
>>>>>
>>>>>
>>>>> But how to do this with data.table?
>>>>>
>>>>> Paulo
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help