[datatable-help] [R] Using plyr::dply more (memory) efficiently?

Short, Tom TShort at epri.com
Fri Apr 30 20:52:00 CEST 2010


Interesting issue. Thanks, Steve.

I'd prefer a check or force reordering in setkey rather than in
data.table or [.data.table. 

I'd rather not forbid out-of-order levels for non-key columns.
Out-of-order levels are sometimes nice to get legends and panels in the
order I like when plotting with lattice. 

Grouping with 'by' seems to work okay with out-of-order levels:

> a = data.table(a = rep(1:5, 2), b = factor(letters[rep(1:5, each =
2)], levels = letters[5:1]), key = "b")
> a[J("b")] # the problem
      a    b
[1,] NA <NA>
> a[, b, by = "a"]
      a b
 [1,] 1 c
 [2,] 1 a
 [3,] 2 d
 [4,] 2 a
 [5,] 3 d
 [6,] 3 b
 [7,] 4 e
 [8,] 4 b
 [9,] 5 e
[10,] 5 c
> a[, a, by = "b"]
      b a
 [1,] e 4
 [2,] e 5
 [3,] d 2
 [4,] d 3
 [5,] c 5
 [6,] c 1
 [7,] b 3
 [8,] b 4
 [9,] a 1
[10,] a 2
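
A check limited to key columns could look roughly like the sketch below. It is base R only, and unsorted_key_factors is a hypothetical helper name, not something data.table provides. It flags only the requested key columns whose factor levels are out of order, so non-key factors keep whatever level order they have.

```r
# Hypothetical helper, not part of data.table: return the names of the
# requested key columns that are factors with out-of-order levels.
unsorted_key_factors <- function(df, key_cols) {
  is_bad <- vapply(key_cols, function(col) {
    x <- df[[col]]
    is.factor(x) && is.unsorted(levels(x))
  }, logical(1))
  key_cols[is_bad]
}

# Same shape as the example above, built as a plain data.frame.
df <- data.frame(
  a = rep(1:5, 2),
  b = factor(letters[rep(1:5, each = 2)], levels = letters[5:1])
)

unsorted_key_factors(df, "b")  # flags "b": its levels run e, d, c, b, a
unsorted_key_factors(df, "a")  # nothing flagged: "a" is not a factor
```

A setkey-style function could call something like this first and then either stop with a message or reorder the levels of the flagged columns.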

- Tom

-----Original Message-----
From: datatable-help-bounces at lists.r-forge.r-project.org
[mailto:datatable-help-bounces at lists.r-forge.r-project.org] On Behalf Of
Matthew Dowle
Sent: Friday, April 30, 2010 2:16 PM
To: datatable-help at lists.r-forge.r-project.org
Cc: lianos at cbio.mskcc.org
Subject: Re: [datatable-help] [R] Using plyr::dply more (memory)
efficiently?


Looks like Steve found a bug, see below. [ He gave ok to forward to the
list. ]  Thanks Steve.

If a data.frame df has a factor column x whose levels are not
sorted, perhaps because it was created somewhere else or by other
code, then dt=data.table(df) doesn't sort those levels.
setkey(dt,x) then doesn't sort them either, and lookup doesn't work.
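
As a toy illustration of why unsorted levels can break a lookup: the sketch below is NOT data.table's actual code. It assumes, purely for illustration, that the lookup relies on a binary search that presumes sorted input. binary_lookup is a hypothetical name.

```r
# Toy binary search over a character vector; it assumes `vals` is sorted.
# Returns the position of `target`, or NA_integer_ if it is not found.
binary_lookup <- function(vals, target) {
  lo <- 1L
  hi <- length(vals)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (vals[mid] == target) return(mid)
    if (vals[mid] < target) lo <- mid + 1L else hi <- mid - 1L
  }
  NA_integer_
}

binary_lookup(letters[1:5], "b")  # sorted levels: found at position 2
binary_lookup(letters[5:1], "b")  # reversed levels: NA, although "b" is present
```

With reversed levels the search goes the wrong way at the first comparison and never reaches "b", the same kind of NA result as in the report.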

Change could be in data.table (to make sure all factor columns have
sorted levels), or just in setkey for those columns in the key only.

Not sure if ad hoc 'by' works ok on factors with out-of-order
levels, so the change might need to be in data.table().

Or something else I didn't think of. Any views?

In the meantime, one workaround to sort the levels :

check.cds$symbol = factor(as.character(check.cds$symbol))
key(check.cds) = NULL   # to clear the key if it's already there
setkey(check.cds,symbol)
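
The first line of the workaround relies on factor() sorting the unique values when no levels are given. A minimal base R sketch, using a made-up three-element vector rather than the real check.cds data:

```r
# Illustrative data only: three symbols with levels in order of appearance.
symbol <- factor(c("OR4F5", "SAMD11", "NOC2L"),
                 levels = c("OR4F5", "SAMD11", "NOC2L"))
is.unsorted(levels(symbol))    # TRUE: "NOC2L" sorts before "OR4F5"

# Round-tripping through character rebuilds the levels in sorted order.
resorted <- factor(as.character(symbol))
levels(resorted)
is.unsorted(levels(resorted))  # FALSE

# The values are unchanged; only the level order (and codes) differ.
identical(as.character(symbol), as.character(resorted))
```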


Matthew


On Thu, 2010-04-29 at 12:46 -0400, Steve Lianoglou wrote:
> Actually, the keys aren't working for me as I expect. Witness that the
> "symbol" column is defined as a key in the `check.cds` object:
> 
> R> tables()
>      NAME          NROW MB COLS                                   KEY
> [1,] check.cds   18,829 3  transcript,symbol,counts,exon.width symbol
> [2,] intron      18,532 3  transcript,symbol,counts,exon.width
> [3,] x           18,829 3  transcript,symbol,counts,exon.width
> 
> R> head(check.cds)
>      transcript symbol counts exon.width
> [1,]      OR4F5  OR4F5      0        125
> [2,]     OR4F16 OR4F16      0          0
> [3,]     OR4F29 OR4F29      0          0
> [4,]      OR4F3  OR4F3      0          0
> [5,]     SAMD11 SAMD11      3       2040
> [6,]      NOC2L  NOC2L     12       1772
> 
> R> check.cds["NOC2L",]
>      transcript symbol counts exon.width
> [1,]       <NA>   <NA>     NA         NA
> 
> R> check.cds[symbol == "NOC2L",]
>      transcript symbol counts exon.width
> [1,]      NOC2L  NOC2L     12       1772
> 
> Am I doing something wrong?
> 
> I'm using R 2.11 and data.table_1.4
> 
> -steve
> 


_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

