[datatable-help] Unexpected behavior with mult="all"

Matthew Dowle mdowle at mdowle.plus.com
Sun Aug 1 23:06:03 CEST 2010


Thanks for sticking with it. The reason that's happening is that
internally the same code that does grouping via 'by' also does grouping
via mult='all'. So it's the same reason as in the other thread where we
were talking about 'by' collapsing away NULL groups.

But when you present it in that way I see what you mean. I'm almost
convinced. If Tom agrees that'll tip me and I'll add it as a bug to
fix.  

As an aside to illustrate, j gets evaluated once for each group with
mult='all', but just once overall without it :

> x[y,{cat('running j\n');list(d)}]
running j
[[1]]
[1]  1  2 NA NA

> x[y,{cat('running j\n');list(d)},mult='all']
running j
running j
     a b V1
[1,] a A  1
[2,] b A  2
> 
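
To make the "collapsing away NULL groups" idea concrete, here is a minimal
base-R sketch of the per-group evaluation described above. It is only a
model under stated assumptions, not data.table's internal code: x and y
are plain data frames, and eval_by_group() is a hypothetical helper name
invented for this illustration.

```r
# Illustrative model only; this is NOT how data.table is implemented.
x <- data.frame(a = c("a", "b", "d", "e"),
                b = c("A", "A", "B", "B"),
                d = c(1, 2, 3, 4),
                stringsAsFactors = FALSE)
y <- data.frame(g = c("a", "b", "c", "d"),
                h = c("A", "A", "A", "A"),
                stringsAsFactors = FALSE)

# eval_by_group() is a hypothetical helper, not a data.table function.
# For each row of y it fetches the matching rows of x (one "group"),
# evaluates j (here simply the column d) on that group, and prepends
# the group columns, as 'by' would.
eval_by_group <- function(x, y) {
  pieces <- lapply(seq_len(nrow(y)), function(i) {
    hit <- x[x$a == y$g[i] & x$b == y$h[i], , drop = FALSE]
    if (nrow(hit) == 0L) return(NULL)   # empty group yields NULL
    data.frame(a = y$g[i], b = y$h[i], d = hit$d,
               stringsAsFactors = FALSE)
  })
  # Dropping the NULL pieces is the step that loses the unmatched rows,
  # so no NA rows appear in the combined result.
  do.call(rbind, Filter(Negate(is.null), pieces))
}

res <- eval_by_group(x, y)
print(res)
```

In this model the rows of y with no match in x produce an empty group, and
the empty group is dropped when the pieces are combined, which is why only
the two matching rows come back.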



On Sat, 2010-07-31 at 19:45 -0700, Harish wrote:
> Thanks for the detailed explanation.  My question about #1 is resolved.  You certainly gave me a lot to ponder over.
> 
> I still am doubtful about my question #2 -- not getting NAs.
> 
> x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
> y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
> x[y]                      # Expected: Getting NAs
> x[y,mult="all"]           # Expected: Getting NAs with mult="all"
> x[y,list(d)]              # Expected: Getting NAs with j
> x[y,list(d),mult="all"]   # Unexpected: No NAs with j & mult="all"
> 
> > x[y]
>         a    b  d
> [1,]    a    A  1
> [2,]    b    A  2
> [3,] <NA> <NA> NA
> [4,] <NA> <NA> NA
> 
> > x[y,mult="all"]
>         a    b  d
> [1,]    a    A  1
> [2,]    b    A  2
> [3,] <NA> <NA> NA
> [4,] <NA> <NA> NA
> 
> > x[y,list(d)]
>       d
> [1,]  1
> [2,]  2
> [3,] NA
> [4,] NA
> 
> > x[y,list(d),mult="all"]
>      a b d
> [1,] a A 1
> [2,] b A 2
> 
> 
> As you can see, the combination of j and mult="all" does not generate the NAs.  Is there a reason for this?
> 
> 
> Regards,
> Harish
> 
> 
> --- On Sat, 7/31/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> 
> > From: Matthew Dowle <mdowle at mdowle.plus.com>
> > Subject: Re: [datatable-help] Unexpected behavior with mult="all"
> > To: "Harish" <harishv_99 at yahoo.com>
> > Cc: datatable-help at lists.r-forge.r-project.org
> > Date: Saturday, July 31, 2010, 7:25 AM
> > This is how I think about it currently :
> > 
> > [1] The syntax of "x[y,d]" plus knowing how mult's default value is
> > set ('first' in this case) means that a vector as long as the number
> > of rows in y is the result so data.table does the least work it can
> > and returns just the vector without adding in the data already in y.
> > Changing mult to "all" however means you'll usually get a varying
> > number of items back for each row in y, so data.table includes the y
> > columns as a convenience since if it didn't the result would be
> > difficult to use (you wouldn't know the correspondence). data.table
> > tries to do the minimum, most efficient thing. If you want to be less
> > efficient (e.g. adding columns you already know) then it's for the
> > user to add them back. This is sort of a principle.
> > 
> > [2] nomatch is by default NA so this is the same as [1]. Is that by
> > any chance a typo and you meant nomatch=0 ?  If so then you might
> > have a point and perhaps something needs changing there.
> > 
> > The other way I think about mult='all' is grouping. The documentation
> > sometimes mentions 'by without by', or I might be recalling emails or
> > posts. Remember mult gets automatically set to 'all' when you match to
> > not all of the columns of x's key. When mult='all' I think to myself
> > "for each row of y fetch me all the rows from x that match and eval j
> > for that group, then move on to the next row in y".  It's kind of like
> > a data specific 'by'. Once you realise mult='all' is like a 'by',
> > remember that 'by' automatically adds in the 'by' columns to the
> > result. Hence mult='all' behaves more like a 'by' with respect to
> > returning a data.table rather than a vector.
> > 
> > Example :
> > 
> >   X = data.table(x=1:3, y=1:4, z=rnorm(12), key="x,y")
> >   Y = data.table(x=1:3)
> >   X[Y,sum(z)] same as X[,sum(z),by=x]
> > 
> > Then going further :
> > 
> >   X[Y[<having>],sum(z)] faster than X[,sum(z),by=x][<having>]
> > 
> > Let's say <having> are groups where x>2 (just one group in this
> > example) :
> > 
> >   X[Y[x>2],sum(z)] same but faster than X[,sum(z),by=x][x>2]
> > 
> > which is the same as 
> > 
> >   X[J(3),sum(z)]
> > 
> > if we knew we wanted group '3' in advance for example.
> > 
> > These constructs (e.g. 'by without by') generalise to list() of
> > expressions and function calls of column variables in the usual way.
> > 
> > Sometimes you do want mult='all', and run the j expression on the
> > result as a whole, not by row of Y.  In that case, assuming Y has
> > fewer columns than key(X), meaning mult='all' (as it is in this
> > example) :
> > 
> >     X[Y,length(z)]     # j eval'd by row of Y, result 3 rows
> >     X[Y][,length(z)]   # length 1 vector, value 12
> > 
> > HTH?
> > Matthew
> > 
> > 
> > On Fri, 2010-07-30 at 19:28 -0700, Harish wrote:
> > > I am getting some unexpected behavior with mult="all".
> > > 
> > > 1) Getting a data table when I expect a vector
> > > 2) Not getting NA's when I expect them (because of nomatch=NA)
> > > 
> > > ==========
> > > 
> > > Common code for examples below
> > > 
> > > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
> > > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
> > > 
> > > ==========
> > > 
> > > Issue #1: Getting a data table when I expect a vector
> > > 
> > > I am not following the logic of when a data.table is returned and when a vector is returned.  Initially, I thought that if j had only one item without a list(), a vector is returned, but I am seeing some contrary behavior.
> > > 
> > > x[y,d]  # Returns a vector as expected
> > > x[y,d,mult="all"]  # Returns a data.table.  Why?
> > > 
> > > Would someone help me understand why I should not expect a vector in the last query?
> > > 
> > > ==========
> > > Issue #2: Not getting NA's when I expect them (because of nomatch=NA)
> > > 
> > > x[y,d,nomatch=NA]  # Expected: returns a vector with NAs in it
> > > x[y,d,nomatch=NA,mult="all"]  # Unexpected: NAs not appearing
> > > 
> > > Am I missing something?
> > > 
> > > Harish
> > > 
> > > 
> > > 
> > >       
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org
> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > 
> > 
> >
> 
> 
> 
>       



