[datatable-help] data.table - grouping character values

Thu May 12 09:35:32 CEST 2011

Hi,

As with the other recent question, pasting values together with commas
to make lots of very long strings is poor practice in any language,
especially when the values are numbers, which should stay numeric rather
than be converted to character. However, of course there shouldn't be an error, and we
should look into it... There are some limits (I think) in R's global
CHARSXP cache and/or perhaps R's cache is re-organised, de-fragmented,
or gc'd after some size threshold. You might hit this when creating a
lot of new strings, and very long ones at that. If you could either send
your dataset to me off-list, or provide an obfuscated/randomly generated
dataset with the same properties, that would be great.

However, *must* the annots by group be collapsed into one row? Is it
just for writing out to PED file or similar?  Otherwise, can the
subsequent operations be done with the annots left as-is in 'long' table
form? Some examples of what you do with concatenated annots would help
as I'm not familiar with it.

If it has to be, remember data.table columns may be type list(); i.e.,
each item of each list() column may itself be a vector.

> dt=data.table(grp=1:3)
> dt$annot = list("a",c("b","c"),c("a","d","e"))
> dt
     grp   annot
[1,]   1       a
[2,]   2    b, c  # prints with commas, but is a vector
[3,]   3 a, d, e
> dt[,sapply(annot,length),by=grp]  # cells are vectors
     grp V1
[1,]   1  1
[2,]   2  2
[3,]   3  3
> 

However, there is no construct yet to create a list() column when
grouping. A new feature could work like base R's I() (or perhaps
under a new name such as V() for vector), as follows:

> dt[, V(annot), by=reads] 
                    reads  annot
[1,]  1279_1000_530_F3-ad  a,b   # prints with commas, but is a vector
[2,]  1279_1000_940_F3-ad  b,c,e
[3,] 1279_1018_1051_F3-ad  c
[4,]   1279_1019_49_F3-ad  f,e,g
[5,]  1279_1019_571_F3-ad  a,b,cot,d,e,f,r,t
[6,]  1279_1024_555_F3-ad  j,i,k

This would not thrash R's global string cache as no new strings are
created, and it's easier to work with; e.g., you could extract elements
from the annot vector without having to strsplit it.
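
Until something like V() exists, a rough base R sketch of the same idea
is split(), which gives one vector per group without pasting anything
together. The column names reads and annot follow the example above; the
data values and the object names (long, by_read, etc.) are made up for
illustration only.

```r
# Toy data in 'long' form: one row per (read, annotation) pair.
# The values are invented for illustration only.
long <- data.frame(
  reads = c("r1", "r1", "r2", "r2", "r2", "r3"),
  annot = c("a", "b", "b", "c", "e", "c"),
  stringsAsFactors = FALSE
)

# One character vector per read; no new (long) strings are created.
by_read <- split(long$annot, long$reads)

# Elements can be picked out directly ...
first_annot <- sapply(by_read, function(x) x[1])
n_annot     <- sapply(by_read, length)

# ... whereas the pasted form has to be split apart again first.
pasted   <- sapply(by_read, paste, collapse = ",")
unpasted <- strsplit(pasted, ",", fixed = TRUE)
```

The last two lines are only there to show the round trip that the
comma-pasted approach forces on you.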

Finally, perhaps you noticed commit 228 a few weeks ago :

   "Added first steps in fast file reader straight into data.table. See
comments in read.R and readfile.c.  The test in read.R is working and
demonstrates a 4 times speedup so far.  See comments regarding columns
11 and 12 of BED format (thanks to the blog post recently). Perhaps,
data.table for genomics."

One feature will be a primary and secondary field separator. Space for
columns, and comma for items within a list() column, such as 11 and 12
of BED. 
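
To illustrate what the two separators would mean, here is a small sketch
in plain base R. It is not the fast reader in readfile.c, and the
three-column layout and values are invented; only the idea (space
between columns, comma within a list() column) comes from the
description above.

```r
# Two made-up lines: columns separated by a space, and the third
# column holding comma-separated items.
lines <- c("chr1 100 10,20,30",
           "chr2 200 5,15")

# Primary separator: split each line into its columns.
fields <- strsplit(lines, " ", fixed = TRUE)
chrom  <- sapply(fields, `[`, 1)
start  <- as.integer(sapply(fields, `[`, 2))

# Secondary separator: split the third column into one integer
# vector per row, i.e. what a list() column would hold.
sizes <- lapply(strsplit(sapply(fields, `[`, 3), ",", fixed = TRUE),
                as.integer)

# Each cell is a vector, so its length can differ from row to row.
n_items <- sapply(sizes, length)
```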

If anyone is motivated to help finish it, or just to comment/advise, that
would be very welcome :)

Matthew

On Wed, 2011-05-11 at 18:00 +0200, Nicolas Servant wrote:
> My original data.table has 6251012 lines and 2 columns.
> After grouping of the reads, I have 223020 lines
> But the error is not reproducible and so not linked to a particular feature.
> I think that this is a memory issue; it mainly depends on the amount of
> data loaded in my session.
> The problem is that with 6251012 lines, this is my smallest dataset ;),
> others can have up to 20 million reads.
> Thanks again
> 
> Best,
> Nicolas
> 
> 
> Steve Lianoglou wrote:
> > On Tue, May 10, 2011 at 1:52 PM, Nicolas Servant
> > <Nicolas.Servant at curie.fr> wrote:
> >   
> >> Indeed, your example also works on my session ... it seems to be linked
> >> to one of my reads features.
> >> Because even with my data
> >>
> >>     
> >>> g$V1
> >>>       
> >>  [1]Error: 'getCharCE' must be called on a CHARSXP
> >>
> >>     
> >>> head(g)$V1
> >>>       
> >> [1] "Simple_repeat,LINE" "snRNA,snRNA"        "Simple_repeat"
> >>
> >>
> >>
> >> Finally I found it but data.table really doesn't like it :)
> >>     
> >>> dt["1335_868_1708_F3-ad"]
> >>>       
> >>  *** caught segfault ***
> >> address (nil), cause 'unknown'
> >>     
> >
> > Ouch ... here's where Matthew will likely have to step in ;-)
> >
> > Out of curiosity, how big is your data.table, ie. how many rows and columns?
> >
> > -steve
> >
> >   
> 
>