[datatable-help] Create matrix of columns from grouped rows

Eric Archer Eric.Archer at noaa.gov
Wed Nov 23 19:02:41 CET 2016


I am trying to find the most efficient (fastest) way of manipulating a
data.table object that contains genetic data. The format is in the following
toy example, where rows represent alleles for individuals and columns are
separate loci. Each individual will be represented in one or more rows
depending on what the ploidy of the loci are. In this example, there are
three tetraploid loci (4 alleles per locus) genotyped for three individuals
(1:3). In my real data, all loci will always have the same ploidy.

> library(data.table)
> dt <- data.table(
+     id = as.character(rep(1:3, each = 4)),
+     loc1 = factor(sample(c("C", "T"), 12, rep = T)),
+     loc2 = factor(sample(c("C", "T"), 12, rep = T)),
+     loc3 = factor(sample(c("C", "T"), 12, rep = T)), 
+     key = "id"
+ )
> dt
    id loc1 loc2 loc3
 1:  1    T    T    T
 2:  1    C    T    C
 3:  1    T    C    T
 4:  1    C    C    T
 5:  2    T    C    C
 6:  2    T    T    T
 7:  2    C    T    T
 8:  2    C    T    C
 9:  3    C    T    T
10:  3    T    T    T
11:  3    T    C    T
12:  3    T    T    C

What I'm looking for is the fastest way to convert this data.table to a
matrix where each row has the entire genotypes for one individual with the
alleles for a locus in sequential columns. The code I currently have for
this follows. 

> ids <- dt[, unique(id)]
> .cbindColFunc <- function(x) {
+     do.call(cbind, as.list(as.character(x)))
+ }
> mat <- do.call(rbind, lapply(ids, function(i) {
+     dt[i, do.call(cbind, lapply(.SD, .cbindColFunc)), .SDcols = !"id"]
+ }))
> num.alleles <- ncol(mat) / (ncol(dt) - 1)
> colnames(mat) <- paste(rep(colnames(dt)[-1], each = num.alleles),
> 1:num.alleles, sep = ".")
> mat <- cbind(id = ids, mat)
> mat
     id  loc1.1 loc1.2 loc1.3 loc1.4 loc2.1 loc2.2 loc2.3 loc2.4 loc3.1
loc3.2 loc3.3 loc3.4
[1,] "1" "T"    "C"    "T"    "C"    "T"    "T"    "C"    "C"    "T"    "C"   
"T"    "T"   
[2,] "2" "T"    "T"    "C"    "C"    "C"    "T"    "T"    "T"    "C"    "T"   
"T"    "C"   
[3,] "3" "C"    "T"    "T"    "T"    "T"    "T"    "C"    "T"    "T"    "T"   
"T"    "C"   


Is there a faster, more data.table friendly way to do it?
Thanks in advance!
Eric



--
View this message in context: http://r.789695.n4.nabble.com/Create-matrix-of-columns-from-grouped-rows-tp4726777.html
Sent from the datatable-help mailing list archive at Nabble.com.


More information about the datatable-help mailing list