[datatable-help] Unexpected behavior with mult="all"

Matthew Dowle mdowle at mdowle.plus.com
Sat Jul 31 17:13:32 CEST 2010


Just to clarify in a different way :
==============
X[Y,sum(z)] *is* 'by without by' *because* ncol(Y)<length(key(X)) =>
mult='all'
==============
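
To make the "for each row of Y" reading concrete, here is a small sketch in
base R only. It emulates the semantics and is not how data.table does it
internally; the data frames below are made up to mirror the X and Y of the
example further down.

```r
set.seed(1)

# Stand-ins for X (keyed on x,y) and Y (just the x column); plain data frames.
X <- data.frame(x = rep(1:3, each = 4), y = rep(1:4, times = 3), z = rnorm(12))
Y <- data.frame(x = 1:3)

# 'by without by': for each row of Y, fetch all rows of X that match on x
# and evaluate j (here sum(z)) on that group.
by_without_by <- sapply(Y$x, function(k) sum(X$z[X$x == k]))

# Ordinary grouping by x, for comparison.
grouped <- as.vector(tapply(X$z, X$x, sum))

# Both routes give one value per group, and the values agree.
stopifnot(isTRUE(all.equal(by_without_by, grouped)))
```

The point is only that joining to a subset of the key columns and then
evaluating j per matching group gives the same answer as grouping by those
columns.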

We have discussed the default for mult in the past. I changed FAQ 2.2 a
few weeks ago, and again just now, and it now says this :

2.2 Why is the default for mult "first" ? If there are duplicates in the
key, shouldn't it return them all by default?

Possibly, yes. The default might be changed to "all".
In versions up to v1.3, "all" was slower. Internally, "all" was
implemented by joining using "first", then again from scratch using
"last", after which a diff between them was performed to work out the
span of the matches in x for each row in i. Most often we join to single
rows, though, where "first","last" and "all" return the same result. We
prefer maximum performance for the majority of situations so the default
chosen was "first". If you are working with a non-unique key then you
need to specify "all".
From v1.4 the binary search in C branches at the deepest level to find
first and last so there should no longer be a speed disadvantage in
defaulting mult to 'all'.
Note that when i (or i's key if it has one) has fewer columns than x's
key, mult is automatically set to "all". This is why grouping by i
works; e.g., DT[J(id),mean(v)] where key(DT) has 2 or more columns.
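
As a rough illustration of the older first/last/diff approach described in
the FAQ text above, the idea can be sketched in base R. This is an assumption
about the general shape of it, not data.table's actual internals, and x_key
and i_val are made-up vectors for the example.

```r
# Hypothetical sorted key column of x (with duplicates) and values to look up from i.
x_key <- c(1, 1, 2, 3, 3, 3)
i_val <- c(1, 3, 4)

# "first": position of the first matching row in x for each row of i (NA if none).
first <- match(i_val, x_key)

# "last": position of the last matching row, found by matching against the reversed key.
last <- length(x_key) + 1L - match(i_val, rev(x_key))

# Because x_key is sorted, the matches for each row of i are the contiguous
# span first..last, so the difference gives the number of matching rows.
span <- last - first + 1L

print(data.frame(i_val, first, last, span))
```

When the key is unique, first and last coincide and span is 1 for every match,
which is the case where "first", "last" and "all" return the same result.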



On Sat, 2010-07-31 at 15:25 +0100, Matthew Dowle wrote:
> This is how I think about it currently :
> 
> [1] With "x[y,d]" and mult at its default ('first' in this case), the
> result is a vector as long as the number of rows in y, so data.table
> does the least work it can and returns just the vector, without adding
> in the data already in y. Changing mult to "all", however, means you'll
> usually get a varying number of items back for each row in y, so
> data.table includes the y columns as a convenience; if it didn't, the
> result would be difficult to use (you wouldn't know the correspondence).
> data.table tries to do the minimum, most efficient thing. If you want to
> be less efficient (e.g. adding columns you already know) then it's for
> the user to add them back. This is sort of a principle.
> 
> [2] nomatch is by default NA so this is the same as [1]. Is that by any
> chance a typo and you meant nomatch=0 ?  If so then you might have a
> point and perhaps something needs changing there.
> 
> The other way I think about mult='all' is grouping. The documentation
> sometimes mentions 'by without by', or I might be recalling emails or
> posts. Remember mult gets automatically set to 'all' when you match to
> fewer than all of the columns of x's key. When mult='all' I think to
> myself "for each row of y fetch me all the rows from x that match and
> eval j for that group, then move on to the next row in y".  It's kind of
> like a data-specific 'by'. Once you realise mult='all' is like a 'by'
> remember that 'by' automatically adds in the 'by' columns to the result.
> Hence mult='all' behaves more like a 'by' with respect to returning
> data.table rather than vector.
> 
> Example :
> 
>   X = data.table(x=1:3, y=1:4, z=rnorm(12), key="x,y")
>   Y = data.table(x=1:3) 
>   X[Y,sum(z)]   # same as X[,sum(z),by=x]
> 
> Then going further :
> 
>   X[Y[<having>],sum(z)]   # faster than X[,sum(z),by=x][<having>]
> 
> Let's say <having> are groups where x>2 (just one group in this
> example) :
> 
>   X[Y[x>2],sum(z)]   # same as, but faster than, X[,sum(z),by=x][x>2]
> 
> which is the same as 
> 
>   X[J(3),sum(z)]
> 
> if we knew we wanted group '3' in advance for example.
> 
> These constructs (e.g. 'by without by') generalise to list() of
> expressions and function calls of column variables in the usual way.
> 
> Sometimes you do want mult='all', and run the j expression on the result
> as a whole, not by row of Y.  In that case, assuming Y has fewer columns
> than key(X) meaning mult='all' (as it is in this example) :
> 
> 	X[Y,length(z)]	  # j eval'd by row of Y, result 3 rows
> 	X[Y][,length(z)]  # length 1 vector value 12
> 
> HTH?
> Matthew
> 
> 
> On Fri, 2010-07-30 at 19:28 -0700, Harish wrote:
> > I am getting some unexpected behavior with mult="all".
> > 
> > 1) Getting a data table when I expect a vector
> > 2) Not getting NA's when I expect them (because of nomatch=NA)
> > 
> > ==========
> > 
> > Common code for examples below
> > 
> > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
> > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
> > 
> > ==========
> > 
> > Issue #1: Getting a data table when I expect a vector
> > 
> > I am not following the logic of when a data.table is returned and when a vector is returned.  Initially, I thought that if j had only one item without a list(), a vector is returned, but I am seeing some contrary behavior.
> > 
> > x[y,d]  # Returns a vector as expected
> > x[y,d,mult="all"]  # Returns a data.table.  Why?
> > 
> > Would someone help me understand why I should not expect a vector in the last query?
> > 
> > ==========
> > Issue #2: Not getting NA's when I expect them (because of nomatch=NA)
> > 
> > x[y,d,nomatch=NA]  # Expected: returns a vector with NAs in them
> > x[y,d,nomatch=NA,mult="all"]  # Unexpected: NAs not appearing
> > 
> > Am I missing something?
> > 
> > Harish



