[datatable-help] Unexpected behavior with mult="all"

Short, Tom TShort at epri.com
Mon Aug 2 02:00:07 CEST 2010


I think I agree with Harish (not enough to look into it myself right
now, though :).

- Tom

 

> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org 
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org] 
> On Behalf Of Matthew Dowle
> Sent: Sunday, August 01, 2010 17:06
> To: Harish
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Unexpected behavior with mult="all"
> 
> Thanks for sticking with it. The reason that's happening is 
> that internally the same code that does grouping via 'by' 
> also does grouping via mult='all'. So it's the same reason as 
> the other thread where we were talking about 'by' collapsing away 
> NULL groups.
> 
> But when you present it in that way I see what you mean. I'm 
> almost convinced. If Tom agrees that'll tip me and I'll add 
> it as a bug to fix.  
> 
> As an aside to illustrate, the j gets evaluated for each 
> group with the mult='all', but just once without :
> 
> > x[y,{cat('running j\n');list(d)}]
> running j
> [[1]]
> [1]  1  2 NA NA
> 
> > x[y,{cat('running j\n');list(d)},mult='all']
> running j
> running j
>      a b V1
> [1,] a A  1
> [2,] b A  2
> > 
> 
> 
> 
> On Sat, 2010-07-31 at 19:45 -0700, Harish wrote:
> > Thanks for the detailed explanation.  My question about #1 is resolved.  You certainly gave me a lot to ponder over.
> > 
> > I still am doubtful about my question #2 -- not getting NAs.
> > 
> > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
> > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
> > x[y]                      # Expected: Getting NAs
> > x[y,mult="all"]           # Expected: Getting NAs with mult="all"
> > x[y,list(d)]              # Expected: Getting NAs with j
> > x[y,list(d),mult="all"]   # Unexpected: No NAs with j & mult="all"
> > 
> > > x[y]
> >         a    b  d
> > [1,]    a    A  1
> > [2,]    b    A  2
> > [3,] <NA> <NA> NA
> > [4,] <NA> <NA> NA
> > 
> > > x[y,mult="all"]
> >         a    b  d
> > [1,]    a    A  1
> > [2,]    b    A  2
> > [3,] <NA> <NA> NA
> > [4,] <NA> <NA> NA
> > 
> > > x[y,list(d)]
> >       d
> > [1,]  1
> > [2,]  2
> > [3,] NA
> > [4,] NA
> > 
> > > x[y,list(d),mult="all"]
> >      a b d
> > [1,] a A 1
> > [2,] b A 2
> > 
> > 
> > As you can see, the combination of having both j and mult="all" is not generating the NAs.  Is there a reason for this?
> > 
> > 
> > Regards,
> > Harish
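A minimal base R sketch of the result Harish expects from the last query: one value of d per row of y, with NA where no row of x matches on both key columns. It uses plain data frames rather than data.table, so it does not depend on the behavior of any particular data.table version; the helper name lookup_d is made up for illustration.

```r
# Same data as in the message above, as plain data frames.
x <- data.frame(a = c("a", "b", "d", "e"), b = c("A", "A", "B", "B"),
                d = c(1, 2, 3, 4), stringsAsFactors = FALSE)
y <- data.frame(g = c("a", "b", "c", "d"), h = c("A", "A", "A", "A"),
                stringsAsFactors = FALSE)

# Hypothetical helper: fetch d for one (g, h) pair, or NA when nothing matches.
lookup_d <- function(g, h) {
  hit <- x$d[x$a == g & x$b == h]
  if (length(hit) == 0) NA_real_ else hit
}

# Evaluate once per row of y, keeping the unmatched rows as NA.
res <- mapply(lookup_d, y$g, y$h, USE.NAMES = FALSE)
expected <- data.frame(g = y$g, h = y$h, d = res)
print(expected)
```

The rows for ("c","A") and ("d","A") have no match in x, so they carry NA rather than being dropped.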
> > 
> > 
> > --- On Sat, 7/31/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > 
> > > From: Matthew Dowle <mdowle at mdowle.plus.com>
> > > Subject: Re: [datatable-help] Unexpected behavior with mult="all"
> > > To: "Harish" <harishv_99 at yahoo.com>
> > > Cc: datatable-help at lists.r-forge.r-project.org
> > > Date: Saturday, July 31, 2010, 7:25 AM
> > > 
> > > This is how I think about it currently :
> > > 
> > > [1] The syntax of "x[y,d]" plus knowing how mult's default value is
> > > set ('first' in this case) means that a vector as long as the number
> > > of rows in y is the result, so data.table does the least work it can
> > > and returns just the vector without adding in the data already in y.
> > > Changing mult to "all" however means you'll usually get a varying
> > > number of items back for each row in y, so data.table includes the y
> > > columns as a convenience since if it didn't the result would be
> > > difficult to use (you wouldn't know the correspondence). data.table
> > > tries to do the minimum, most efficient thing. If you want to be less
> > > efficient (e.g. adding columns you already know) then it's for the
> > > user to add them back. This is sort of a principle.
> > > 
> > > [2] nomatch is by default NA so this is the same as [1]. Is that by
> > > any chance a typo and you meant nomatch=0 ?  If so then you might
> > > have a point and perhaps something needs changing there.
> > > 
> > > The other way I think about mult='all' is grouping. The
> > > documentation sometimes mentions 'by without by', or I might be
> > > recalling emails or posts. Remember mult gets automatically set to
> > > 'all' when you match to not all of the columns of x's key. When
> > > mult='all' I think to myself "for each row of y fetch me all the rows
> > > from x that match and eval j for that group, then move on to the next
> > > row in y".  It's kind of like a data specific 'by'. Once you realise
> > > mult='all' is like a 'by', remember that 'by' automatically adds in
> > > the 'by' columns to the result. Hence mult='all' behaves more like a
> > > 'by' with respect to returning data.table rather than vector.
> > > 
> > > Example :
> > > 
> > >   X = data.table(x=1:3, y=1:4, z=rnorm(12), key="x,y")
> > >   Y = data.table(x=1:3)
> > >   X[Y,sum(z)]   # same as X[,sum(z),by=x]
> > > 
> > > Then going further :
> > > 
> > >   X[Y[<having>],sum(z)]   # faster than X[,sum(z),by=x][<having>]
> > > 
> > > Let's say <having> are groups where x>2 (just one group in this
> > > example) :
> > > 
> > >   X[Y[x>2],sum(z)]   # same but faster than X[,sum(z),by=x][x>2]
> > > 
> > > which is the same as
> > > 
> > >   X[J(3),sum(z)]
> > > 
> > > if we knew we wanted group '3' in advance for example.
> > > 
> > > These constructs (e.g. 'by without by') generalise to list() of
> > > expressions and function calls of column variables in the usual way.
> > > 
> > > Sometimes you do want mult='all', and run the j expression on the
> > > result as a whole, not by row of Y.  In that case, assuming Y has
> > > fewer columns than key(X) meaning mult='all' (as it is in this
> > > example) :
> > > 
> > >     X[Y,length(z)]     # j eval'd by row of Y, result 3 rows
> > >     X[Y][,length(z)]   # length 1 vector value 12
> > > 
> > > HTH?
> > > Matthew
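A small base R sketch of the two evaluation modes described above, under the assumption that "by row of Y" means "collect the matching rows of X for each row of Y, then evaluate j on that group". It deliberately avoids data.table so that it does not depend on any particular version's join semantics; the variable names per_row, by_x, per_row_len and whole are illustrative only, and the columns are built with rep() instead of recycling.

```r
set.seed(1)
# 12 rows covering every (x, y) combination, like the X in the example.
X <- data.frame(x = rep(1:3, each = 4), y = rep(1:4, times = 3), z = rnorm(12))
Y <- data.frame(x = 1:3)

# Per-row mode: for each row of Y, fetch the matching rows of X and sum z.
per_row <- sapply(Y$x, function(k) sum(X$z[X$x == k]))

# Ordinary grouping by x gives the same three sums.
by_x <- as.vector(tapply(X$z, X$x, sum))
print(all.equal(per_row, by_x))

# j evaluated per row of Y (three results) versus once on the whole join.
per_row_len <- sapply(Y$x, function(k) length(X$z[X$x == k]))
whole <- length(merge(Y, X, by = "x")$z)
print(per_row_len)
print(whole)
```

The per-row form yields one result for each row of Y, while the whole-join form yields a single value, mirroring the "3 rows" versus "length 1 vector value 12" contrast in the message.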
> > > 
> > > 
> > > On Fri, 2010-07-30 at 19:28 -0700, Harish wrote:
> > > > I am getting some unexpected behavior with mult="all".
> > > > 
> > > > 1) Getting a data table when I expect a vector
> > > > 2) Not getting NA's when I expect them (because of nomatch=NA)
> > > > 
> > > > ==========
> > > > 
> > > > Common code for examples below
> > > > 
> > > > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
> > > > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
> > > > 
> > > > ==========
> > > > 
> > > > Issue #1: Getting a data table when I expect a vector
> > > > 
> > > > I am not following the logic of when a data.table is returned and
> > > > when a vector is returned.  Initially, I thought that if j had only
> > > > one item without a list(), a vector is returned, but I am seeing
> > > > some contrary behavior.
> > > > 
> > > > x[y,d]             # Returns a vector as expected
> > > > x[y,d,mult="all"]  # Returns a data.table.  Why?
> > > > 
> > > > Would someone help me understand why I should not expect a vector
> > > > in the last query?
> > > > 
> > > > ==========
> > > > Issue #2: Not getting NA's when I expect them (because of nomatch=NA)
> > > > 
> > > > x[y,d,nomatch=NA]             # Expected: returns a vector with NAs in it
> > > > x[y,d,nomatch=NA,mult="all"]  # Unexpected: NAs not appearing
> > > > 
> > > > Am I missing something?
> > > > 
> > > > Harish
> > > > 
> > > > 
> > > > 
> > > >       
> > > > _______________________________________________
> > > > datatable-help mailing list
> > > > datatable-help at lists.r-forge.r-project.org
> > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > 
> > > 
> > >
> > 
> > 
> > 
> >       
> 
> 

