[datatable-help] Shuffle row-wise, column independently

Nicolas Paris niparisco at gmail.com
Fri Jan 6 01:09:01 CET 2017


Hey,
Thanks for the suggestion, but this didn't work.

Method 1: use of data.table / sample
> set.seed(1); size <- 100000000; dt <-
data.table::data.table("a"=c(1:size),"b"=rep(letters[1:10],size/10));head(dt);system.time(
dt[,c("a","b"):=list(sample(a),sample(b))]
);head(dt)
   a b
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
   user  system elapsed
 10.190   0.252  10.456
          a b
1: 26550867 a
2: 37212390 b
3: 57285336 c
4: 90820777 e
5: 20168193 a
6: 89838965 h


Method 2: use of factor / data.table / sample
> set.seed(1); size <- 100000000; dt <-
data.table::data.table("a"=c(1:size),"b"=as.factor(rep(letters[1:10],size/10)));head(dt);system.time(
   dt[,c("a","b"):=list(sample(a),sample(b))]
);head(dt)
   a b
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
   user  system elapsed
  9.271   0.276   9.559
          a b
1: 26550867 a
2: 37212390 b
3: 57285336 c
4: 90820777 e
5: 20168193 a
6: 89838965 h

Method 3: use of .Internal / data.table / factor
> set.seed(1); size <- 100000000; dt <-
data.table::data.table("a"=c(1:size),"b"=as.factor(rep(letters[1:10],size/10)));head(dt);system.time(
    dt[,c("a","b"):=list(a[.Internal(sample(size, size, FALSE,
NULL))],b[.Internal(sample(size, size, FALSE, NULL))])]
);head(dt)
   a b
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
   user  system elapsed
  8.786   0.137   8.935
          a b
1: 26550867 a
2: 37212390 b
3: 57285336 c
4: 90820777 e
5: 20168193 a
6: 89838965 h
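
A side note on Method 3: as far as I know, .Internal() is not meant to be called from user code, and sample.int() is the documented way to get the same index permutation. The sketch below uses a much smaller size than the runs above and plain vectors instead of a data.table (both are my assumptions, just to keep it quick and free of dependencies). It checks that sample(a) and a[sample.int(length(a))] agree when started from the same seed.

```r
# Small size so the sketch runs quickly; the runs in this post use 1e8 rows.
size <- 1e5
a <- seq_len(size)

# What Methods 1 and 2 call on each column.
set.seed(1)
shuffled_sample <- sample(a)

# Documented alternative to .Internal(sample(...)): draw a permutation
# of the indices 1..size, then index the vector with it.
set.seed(1)
shuffled_int <- a[sample.int(size)]

# Same seed, same permutation: the two results should be identical.
print(identical(shuffled_sample, shuffled_int))
```

If that holds, the small gain seen with Method 3 would come only from skipping the R-level wrapper, not from a different shuffling algorithm.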

Method 4 (thanks for pointing it out, banded): set / factor / sample
> set.seed(1); size <- 100000000; dt <-
data.table::data.table("a"=c(1:size),"b"=as.factor(rep(letters[1:10],size/10)));head(dt);system.time({
data.table::set(dt,j="a",value=sample(dt$a));
data.table::set(dt,j="b",value=sample(dt$b))}
);head(dt);
   a b
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
   user  system elapsed
  8.790   0.204   9.006
          a b
1: 26550867 a
2: 37212390 b
3: 57285336 c
4: 90820777 e
5: 20168193 a
6: 89838965 h

Method 5: use of a data.frame
> set.seed(1); size <- 100000000; dt <-
data.frame("a"=c(1:size),"b"=as.factor(rep(letters[1:10],size/10)));head(dt);system.time({
dt$a <- sample(dt$a);dt$b <- sample(dt$b)
});head(dt);
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
   user  system elapsed
  8.755   0.152   8.921
         a b
1 26550867 a
2 37212390 b
3 57285336 c
4 90820777 e
5 20168193 a
6 89838965 h


Sadly, data.table does not improve the timings: all five methods take roughly 9 to 10 seconds, so sample() is the bottleneck.
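
One way to check this claim is to time the sample() calls separately from the assignment back into the table. The following is a minimal sketch in base R, on a data.frame with 1e6 rows rather than the 1e8 used above. The smaller size and the names new_a, new_b, t_sample and t_assign are mine, chosen for illustration. Timings will differ by machine, so no expected numbers are given.

```r
# Smaller table than in the runs above, so this finishes in well under a second.
size <- 1e6
df <- data.frame(a = seq_len(size),
                 b = factor(rep(letters[1:10], size / 10)))

# Step 1: time only the shuffling of each column.
set.seed(1)
t_sample <- system.time({
  new_a <- sample(df$a)
  new_b <- sample(df$b)
})

# Step 2: time only writing the shuffled columns back.
t_assign <- system.time({
  df$a <- new_a
  df$b <- new_b
})

cat("sample():   ", t_sample[["elapsed"]], "s elapsed\n")
cat("assignment: ", t_assign[["elapsed"]], "s elapsed\n")
```

If the first number dominates, then switching the container (data.table, set(), data.frame) cannot help much, which would match the timings above.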


2017-01-05 14:20 GMT+01:00 banded08 <david.awam.jansen at gmail.com>:

> Maybe not the fastest or most efficient, but this should work
>
> for(ii in 1:dim(dt1)[1]) set(dt1, ii, 1:dim(dt1)[2] ,sample(dt1[ii]))