Hi Matthew,<br><br>Thanks for the suggestions. The tapply in the code below transforms the table from long format to wide format, with wdpaint as columns and pnvid as rows. The main reason for using it is that the result includes all combinations of the two variables, including those with 0 observations. The code you suggest does indeed seem to be equivalent to ordering the table.<br>
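<br>To illustrate with a small made-up table (a sketch only; the values are invented and a plain data frame stands in for the data.table), the wide result keeps a cell for the combination that never occurs, and that cell shows up as NA rather than 0:<br>

```r
# Made-up long-format counts; the combination V1 = 2, V2 = 1 never occurs.
b <- data.frame(V1 = c(1L, 1L, 2L), V2 = c(1L, 2L, 2L), N = c(5L, 3L, 4L))

# Same call as in the loop: rows are V2 (pnvid), columns are V1 (wdpaint).
wide <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)

print(dim(wide))  # one cell per combination of the two variables
print(wide)       # the combination that never occurs is kept as an NA cell
```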

<br>Cheers,<br><br>Paulo<br><br><br><br><div class="gmail_quote">On Fri, Jun 22, 2012 at 4:55 PM, Matthew Dowle <span dir="ltr"><<a href="mailto:mdowle@mdowle.plus.com" target="_blank">mdowle@mdowle.plus.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

Great. Thanks for keeping the list updated.<br>

<br>

One thing I don't quite see, instead of :<br>

<br>

for (i in 1:12) {<br>

<div class="im">    a3 <- a1[,V1:=sample(a2,replace=F)]<br>

    b <- a3[,.N,by=list(V1,V2)]<br>

    c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)<br>

}<br>

<br>

</div>why not :<br>

<br>

for (i in 1:12){<br>

<div class="im">    a3 <- a1[,V1:=sample(a2,replace=F)]<br>

    b <- a3[,.N,by=list(V1,V2)]<br>

</div>    b2 <- b[,sum(N),by=list(V2,V1)]<br>

    c[[i]] <- b2$V1<br>

}<br>

<br>

Idea being to save the tapply and the 2 as.factor. Further, I'm not sure<br>

that sum() will be summing anything will it?  Isn't b2 the same as<br>

b[order(V2,V1)], and if so that will be faster still?<br>
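<br>A quick way to check this on made-up data (a sketch only: the sample data and the column name total are invented for illustration; keyby is used because, in current data.table versions, plain by keeps groups in order of first appearance rather than sorting them):<br>

```r
library(data.table)

set.seed(1)  # made-up data standing in for SPFn
a1 <- data.table(V1 = sample(1:10, 200, replace = TRUE),
                 V2 = sample(1:3, 200, replace = TRUE))

# One row per observed (V1, V2) combination.
b <- a1[, .N, by = list(V1, V2)]

# Grouping b again: every group holds exactly one row, so sum(N) is just N.
b2 <- b[, list(total = sum(N)), keyby = list(V2, V1)]

# Plain reordering of b, without any grouping.
b3 <- b[order(V2, V1)]

print(all(b2$total == b3$N))  # compare the counts row by row
```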

<span class="HOEnZb"><font color="#888888"><br>

Matthew<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

> I got some very useful further feed back from Matthew. Let me summarize<br>

> some key points from his suggestions concerning the code below:<br>

><br>

> The following code is still fairly slow (although faster than using<br>

> table or tapply):<br>

><br>

>    a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)<br>

><br>

>    b <- a[,.N,by=list(V1,V2)]<br>

><br>

>    c <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)<br>

><br>

>    for(i in 1:11){<br>

><br>

>      a <- data.table(sample(SPFn$wdpaint,replace=F),SPFn$pnvid)<br>

><br>

>      b <- a[,.N,by=list(V1,V2)]<br>

><br>

>      c <- rbind(c,tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum))<br>

><br>

>   }<br>

><br>

> As pointed out by Matthew, the rbind at the end of the loop will be<br>

> growing memory use and is generally inefficient. How badly it is<br>

> impacting performance will depend on the data size though. So step 1 is<br>

> to get that outside the loop (a useful link he provided is<br>

> <a href="http://stackoverflow.com/questions/10452249/divide-et-impera-on-a-data-frame-in-r" target="_blank">http://stackoverflow.com/questions/10452249/divide-et-impera-on-a-data-frame-in-r</a>).<br>

> Based on a hint in R-inferno<br>

> (<a href="http://www.burns-stat.com/pages/Tutor/R_inferno.pdf" target="_blank">http://www.burns-stat.com/pages/Tutor/R_inferno.pdf</a>) I adapted the code<br>

> as follows:<br>

><br>

> c <- vector('list', 12)<br>

><br>

> a1 <- data.table(SPFn$wdpaint,SPFn$pnvid)<br>

><br>

> a2 <- SPFn$wdpaint<br>

><br>

> for(i in 1:12){<br>

><br>

>      a3 <- a1[,V1:=sample(a2,replace=F)]<br>

><br>

>      b <- a3[,.N,by=list(V1,V2)]<br>

><br>

>      c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)<br>

><br>

> }<br>

><br>

> c <- do.call('rbind', c)<br>

><br>

> This did improve the run time, but only a little bit (16.0 instead of<br>

> 16.4 seconds). Next step was to profile the code, to see what part is<br>

> taking most time. This can be done with Rprof(). The results showed that<br>

> ordernumtol, a data.table function which sorts numeric ('double'<br>

> floating point) columns was taking a lot of time. As it turns out, the<br>

> SPFn$wdpaint and SPFn$pnvid were both numerical. Changing these to<br>

> integer does speed up the code a lot.<br>

><br>

> c <- vector('list', 12)<br>

><br>

> a1 <- data.table(as.integer(SPFn$wdpaint),as.integer(SPFn$pnvid))<br>

><br>

> a2 <- as.integer(SPFn$wdpaint)<br>

><br>

> for(i in 1:12){<br>

><br>

>      a3 <- a1[,V1:=sample(a2,replace=F)]<br>

><br>

>      b <- a3[,.N,by=list(V1,V2)]<br>

><br>

>      c[[i]] <- tapply(b$N,list(as.factor(b$V2), as.factor(b$V1)), sum)<br>

><br>

> }<br>

><br>

> c <- do.call('rbind', c)<br>

><br>


> The second version took 16.0 seconds. The last attempt took only 2.4 seconds!<br>

> That is a serious (> 6x) improvement. And it shows I really need to be<br>

> much more careful about my variables...<br>

> I checked and it also makes a smaller, but still very significant<br>

> difference when using table (3x) or tapply (2x).<br>

><br>

> Big thanks to Matthew Dowle for all his help.. and any further<br>

> suggestions for improvements are obviously welcome.<br>

><br>

> Cheers,<br>

><br>

> Paulo<br>

><br>

><br>

><br>

> On 06/19/2012 04:24 PM, Matthew Dowle wrote:<br>

>> The shuffling can form a different number of groups can't it?<br>

> YES, obvious.. I was half asleep I guess<br>

>><br>

>> table(c(1,1,2,2), c(3,3,4,4))   # 2 groups<br>

>> table(c(2,2,1,1), c(3,3,4,4))   # 2 groups<br>

>> table(c(2,1,2,1), c(3,3,4,4))   # 4 groups<br>

>><br>

>><br>

>>> Thanks Matthew<br>

>>><br>

>>> I am not sure I understand the code (actually, I am sure I do not :-( ).<br>

>>> More specifically, I would expect the two expressions below to yield<br>

>>> tables<br>

>>> of the same dimension (basically all combinations of wdpaint and<br>

>>> pnvid):<br>

>>><br>

>>> aa <- SPFdt[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]<br>

>>> dim(aa)<br>

>>>> 254  3<br>

>>> bb <- SPFdt[, .N, by=list(wdpaint,pnvid)]<br>

>>> dim(bb)<br>

>>>> 170 3<br>

>>> What I am looking for is creating a cross table of pnvid and wdpaint,<br>

>>> i.e.,<br>

>>> the frequency or number of occurrences of each combination of pnvid and<br>

>>> wdpaint. Shuffling wdpaint should give in that case a different<br>

>>> frequency<br>

>>> distribution, like in the example below:<br>

>>><br>

>>> table(c(1,1,2,2), c(3,3,4,4))<br>

>>> table(c(2,2,1,1), c(3,3,4,4))<br>

>>><br>

>>> Basically what I want to do is run X permutations on a data set which I<br>

>>> will then use to create a confidence interval on the frequency<br>

>>> distribution<br>

>>> of sample points over wdpaint and pnvid<br>

>>><br>

>>> Cheers,<br>

>>><br>

>>> Paulo<br>

>>><br>

>>><br>

>>><br>

>>><br>

>>><br>

>>> On Tue, Jun 19, 2012 at 3:30 PM, Matthew Dowle<br>

>>> <<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>>wrote:<br>

>>><br>

>>>> Hi,<br>

>>>><br>

>>>> Welcome to the list.<br>

>>>><br>

>>>> Rather than picking a column and calling length() on it, .N is a<br>

>>>> little<br>

>>>> more convenient (and faster if that column isn't otherwise used, as in<br>

>>>> this example). Search ?data.table for the string ".N" to find out<br>

>>>> more.<br>

>>>><br>

>>>> And to group by expressions of column names, wrap with list().  So,<br>

>>>><br>

>>>>     SPF[, .N, by=list(sample(wdpaint,replace=FALSE),pnvid)]<br>

>>>><br>

>>>> But that won't calculate any different statistics, just return the<br>

>>>> groups<br>

>>>> in a different order. Seems like just an example, rather than the real<br>

>>>> task, iiuc, which is fine of course.<br>

>>>><br>

>>>> Matthew<br>

>>>><br>

>>>><br>

>>>>> Hi, I am new to this package and not sure how to implement the<br>

>>>> sample()<br>

>>>>> function with data.table.<br>

>>>>><br>

>>>>> I have a data frame SPF with three columns cat, pnvid and wdpaint.<br>

>>>>> The<br>

>>>>> pnvid variable has values 1:3, the wdpaint has values 1:10. I am<br>

>>>>> interested in the count of all combinations of wdpaint and pnvid in<br>

>>>>> my<br>

>>>>> data<br>

>>>>> set, which can be calculated using table or tapply (I use the latter<br>

>>>> in<br>

>>>>> the<br>

>>>>> example code below).<br>

>>>>><br>

>>>>> Normally I would use something like:<br>

>>>>><br>

>>>>> *c <- tapply(SPF$cat, list(as.factor(SPF$pnvid),<br>

>>>> as.factor(SPF$wdpaint)),<br>

>>>>> function(x) length(x))*<br>

>>>>><br>

>>>>> If I understand correctly, I would use the below when working with<br>

>>>> data<br>

>>>>> tables:<br>

>>>>><br>

>>>>> *f <- SPF[,length(cat),by="wdpaint,pnvid"]*<br>

>>>>><br>

>>>>> But what if I want to reshuffle the column wdpaint first? When using<br>

>>>>> tapply, it would be something along the lines of:<br>

>>>>><br>

>>>>> *a <- list(as.factor(SPF$pnvid), as.factor(sample(SPF$wdpaint,<br>

>>>>> replace=F)))<br>

>>>>> c <- tapply(SPF$cat, a, function(x) length(x))*<br>

>>>>><br>

>>>>><br>

>>>>> But how to do this with data.table?<br>

>>>>><br>

>>>>> Paulo<br>

>>>>> _______________________________________________<br>

>>>>> datatable-help mailing list<br>

>>>>> <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>

>>>>><br>

>>>> <a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>

>>>><br>

>>>><br>

>>>><br>

>><br>

>><br>

>><br>

><br>

><br>

><br>

<br>

<br>

</div></div></blockquote></div><br>