[datatable-help] Suggest a cool feature: Use data.table like a sorted / indexed data.list?

Matthew Dowle mdowle at mdowle.plus.com
Mon Sep 20 22:20:28 CEST 2010


Ok yes I admit, it's a (nice) accident via inheritance, probably.

Thanks for further details. I see 'by' not working when j uses a list()
column too :

> dt[,sapply(b,sum),by=a]
Error in `[.data.table`(dt, , sapply(b, sum), by = a) : 
  only integer,double,logical and character vectors are allowed so far.
Type 19 would need to be added.
> 

where type 19 is a vector of pointers (i.e. a list()). The memcpy in
dogroups could copy pointers; it just needs the extra type added, I
think.

Another is returning a list(...,list(),...) from j. I was going to say
that will be very tricky, but thinking about it, maybe it's not too bad
either (would just memcpy the pointers into the result as usual).

FR#1092 added for the above i.e. 'Make 'by' work for list() columns'
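
Until 'by' accepts list() columns, grouping over one can be approximated in
base R. The following is only a sketch under stated assumptions: the data are
made up for illustration (all-integer cells, unlike the "r" cell in this
thread's dt), it uses a plain data.frame rather than a data.table, and it is
not how dogroups works internally.

```r
# Hypothetical data: a plain data.frame with a list() column b.
df <- data.frame(a = c(1L, 1L, 2L, 2L))
df$b <- list(1:2, 3:5, 6L, 7:9)

# Group the row indices by a, then sum every cell of b within each group.
# Indexing by row keeps the list column intact until unlist() flattens it.
group_sums <- tapply(seq_along(df$b), df$a,
                     function(rows) sum(unlist(df$b[rows])))
print(group_sums)
```

Grouping row indices rather than the column itself sidesteps any type check
on b, at the cost of one R function call per group.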

The display problem, though, is to do with *non-vectors* as columns
(FR#202), which is not yet implemented. A data.table is a list but not a
vector, even though a plain list is a vector. It probably seems picky but
it's a big difference internally.
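
The vector / non-vector distinction can be checked in base R. This is a small
sketch using a data.frame as the stand-in non-vector; the variable names are
only for illustration.

```r
# A plain list carries no attributes other than names, so it is a vector.
x <- list(i = 1:6, j = 1, k = "?")
list_is_vector <- is.vector(x)   # TRUE

# A data.frame is stored as a list, but its class and row.names
# attributes mean is.vector() no longer holds for it.
df <- data.frame(a = 1:3)
df_is_list   <- is.list(df)      # TRUE
df_is_vector <- is.vector(df)    # FALSE
```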

If you'd like to implement the improvements then you are very welcome to
join the project and commit the changes. I'm happy to answer questions
about the internals and give ideas and suggestions, as I'm sure Tom is
too.

Matthew


On Mon, 2010-09-20 at 09:24 -0500, Branson Owen wrote:
> Alright, I did it again. I was on data.table 1.4, where data.table
> doesn't inherit from data.frame. I have three versions of R on my
> machine and I meant to keep one of them on the old version for
> comparison and emergencies. My idea and experiment came very quickly
> and I did not pay enough attention to other things. Sorry about that.
> 
> This is great. It looks like 1.5, on purpose or by accident, implements
> a long-time wish on the list!
> 
> I found that most of the syntax works as I expect, but not this one
> (following Tom's example):
> 
> > dt[,paste(b),by = list(a)]
> Error in `[.data.table`(dt, , paste(b), by = list(a)) :
>  only integer,double,logical and character vectors are allowed so
> far. Type 19 would need to be added.
> 
> However, the alternative of grouping with an i expression (faster?) works:
> > dt[J(unique(a)),paste(b)]
> [1] "1:2" "3:5" "r"   "6:9"
> 
> Is there any performance issue? It looks like data.table can already
> support non-vector columns (probably because it now inherits from
> data.frame), but the "by" syntax automatically blocks the non-vector
> type. Do you think we could disable the automatic type check, since
> it should work for 1.5? (not sure though)
> 
> Another example that works for data.frame but not data.table. (This
> seems to be only a display problem. Other data.table features, except
> for the "by" syntax, still work if we are careful about the data type.)
> 
> > dt$c = dt
> > dt
> Error in rep("", ncol(xi)) : invalid 'times' argument
> 
> > class(dt)
> [1] "data.table" "data.frame"
> > class(dt) = "data.frame"
> > dt
>  a          b c.a        c.b
> 1 1       1, 2   1       1, 2
> 2 2    3, 4, 5   2    3, 4, 5
> 3 3          r   3          r
> 4 4 6, 7, 8, 9   4 6, 7, 8, 9
> 
> I am sorry I was not aware of the existing issue and was careless
> about the version. :-P Thanks a lot for the correction and references.
> 
> Best regards,
> 
> 2010/9/17 Matthew Dowle <mdowle at mdowle.plus.com>:
> > This seems to work :
> >
> >> dt = data.table(a=1:4)
> >> dt$b = list(1:2,3:5,"r",6:9)
> >> dt
> >     a          b
> > [1,] 1       1, 2
> > [2,] 2    3, 4, 5
> > [3,] 3          r
> > [4,] 4 6, 7, 8, 9
> >> setkey(dt,a)
> >> dt[J(2)]
> >     a       b
> > [1,] 2 3, 4, 5
> >> dt[J(2),b]
> > [[1]]
> > [1] 3 4 5
> >>
> >
> > Branson - what didn't you get to work? Thanks.
> >
> > Yes FR#202 was about the $C in the example (being picky about it) i.e.
> > allowing non-vectors as a column (such as matrix, data.table and
> > data.frame which themselves have columns) but the $D is slightly
> > different since a list is a vector, too :
> >
> >> is.vector(list(i=1:6, j = 1,k="?"))
> > [1] TRUE
> >>
> >
> > Matthew
> >
> >
> > On Fri, 2010-09-17 at 17:00 -0400, Tom Short wrote:
> >> Branson, that's been on the wishlist for a while:
> >>
> >> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=202&group_id=240&atid=978
> >>
> >> It hasn't been an urgent enough need for anyone to dig into it. You
> >> can always use one data table to index a list. It may take more memory
> >> and a bit more bookkeeping for the user, but it's not that hard.
> >>
> >> - Tom
> >>
> >> On Fri, Sep 17, 2010 at 1:47 PM, Branson Owen <branson.owen at gmail.com> wrote:
> >> > I believe you are already aware of what I know. I'm just adding some
> >> > suggestions here.
> >> >
> >> > My understanding is that a data.frame is a list of column VECTORs, and
> >> > so is a data.table. What I just learned is that a data.frame can now be
> >> > a list of anything?
> >> >
> >> >> DF = data.frame(A = 1:3, B = rnorm(3))
> >> >> DF$C = data.frame(a=1:3,b=rnorm(3))
> >> >> DF$D = list(i=1:6, j = 1,k="?")
> >> >> print(DF)
> >> >
> >> >  A         B     C.a        C.b                D
> >> > 1 1 -0.949565   1 -0.5815717 1, 2, 3, 4, 5, 6
> >> > 2 2 -1.903233   2 -0.5087712                1
> >> > 3 3  1.559566   3  1.4596933                ?
> >> >
> >> >> class(DF$C)
> >> > [1] "data.frame"
> >> >
> >> >> class(DF$D)
> >> > [1] "list"
> >> >
> >> > This is very cool to me! I can think of many benefits from this feature.
> >> >
> >> > A very common example: D is a function of B but with variable
> >> > output size, and I want to do fast grouping or sorting based on key A.
> >> > Before I knew this, I would have had to save them as separate objects,
> >> > adding complexity to my code. This just adds coding and management
> >> > sugar. No performance benefit yet.
> >> >
> >> > But I think data.table can make a difference here, just as it does
> >> > for data.frame! There is no sorted / indexed list object yet, right?
> >> > If my variable-size outputs number in the millions, any aggregating
> >> > operation on a less structured object like that will be painful.
> >> > Technically, data.table could make it a sorted list that enjoys
> >> > data.table's high performance and syntax.
> >> >
> >> > I did some tests using data.table as a data.list, but most of the
> >> > syntax that works for data.frame doesn't work for data.table.
> >> >
> >> > I would expect this to be an easy feature, since data.frame more or
> >> > less supports it smoothly. Just a suggestion. *^^*
> >> >
> >> > Best regards,
> >> > _______________________________________________
> >> > datatable-help mailing list
> >> > datatable-help at lists.r-forge.r-project.org
> >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> >



