[datatable-help] Unexpected behavior with mult="all"

Harish harishv_99 at yahoo.com
Sun Aug 1 04:45:03 CEST 2010


Thanks for the detailed explanation.  My question about #1 is resolved.  You certainly gave me a lot to ponder over.

I still am doubtful about my question #2 -- not getting NAs.

x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
x[y]                      # Expected: Getting NAs
x[y,mult="all"]           # Expected: Getting NAs with mult="all"
x[y,list(d)]              # Expected: Getting NAs with j
x[y,list(d),mult="all"]   # Unexpected: No NAs with j & mult="all"

> x[y]
        a    b  d
[1,]    a    A  1
[2,]    b    A  2
[3,] <NA> <NA> NA
[4,] <NA> <NA> NA

> x[y,mult="all"]
        a    b  d
[1,]    a    A  1
[2,]    b    A  2
[3,] <NA> <NA> NA
[4,] <NA> <NA> NA

> x[y,list(d)]
      d
[1,]  1
[2,]  2
[3,] NA
[4,] NA

> x[y,list(d),mult="all"]
     a b d
[1,] a A 1
[2,] b A 2


As you can see, the combination of having both j and mult="all" is not generating the NAs.  Is there a reason for this?
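
The expected result can be sketched with a base R left outer join. This is only an illustration, under the assumption that nomatch=NA should behave like all.x=TRUE in merge(); it uses data.frame stand-ins for x and y (the name "expected" is made up for this sketch) and says nothing about how data.table implements the join internally.

```r
# Base R only: data.frame stand-ins for the data.tables x and y above.
x <- data.frame(a = c("a", "b", "d", "e"),
                b = c("A", "A", "B", "B"),
                d = c(1, 2, 3, 4),
                stringsAsFactors = FALSE)
y <- data.frame(g = c("a", "b", "c", "d"),
                h = c("A", "A", "A", "A"),
                stringsAsFactors = FALSE)

# Left outer join: keep every row of y and fill d with NA
# wherever x has no matching (a, b) pair.
expected <- merge(y, x, by.x = c("g", "h"), by.y = c("a", "b"), all.x = TRUE)
print(expected)
```

Under that assumption, the rows of y with g equal to "c" and "d" stay in the result with d set to NA, which is the shape x[y,list(d)] returns without mult="all".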


Regards,
Harish


--- On Sat, 7/31/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> From: Matthew Dowle <mdowle at mdowle.plus.com>
> Subject: Re: [datatable-help] Unexpected behavior with mult="all"
> To: "Harish" <harishv_99 at yahoo.com>
> Cc: datatable-help at lists.r-forge.r-project.org
> Date: Saturday, July 31, 2010, 7:25 AM
> This is how I think about it currently:
> 
> [1] The syntax of "x[y,d]" plus knowing how mult's default value is set
> ('first' in this case) means that a vector as long as the number of rows
> in y is the result, so data.table does the least work it can and returns
> just the vector without adding in the data already in y. Changing mult
> to "all" however means you'll usually get a varying number of items back
> for each row in y, so data.table includes the y columns as a convenience
> since if it didn't the result would be difficult to use (you wouldn't
> know the correspondence). data.table tries to do the minimum, most
> efficient thing. If you want to be less efficient (e.g. adding columns
> you already know) then it's for the user to add them back. This is sort
> of a principle.
> 
> [2] nomatch is by default NA so this is the same as [1]. Is that by any
> chance a typo and you meant nomatch=0? If so then you might have a
> point and perhaps something needs changing there.
> 
> The other way I think about mult='all' is grouping. The documentation
> sometimes mentions 'by without by', or I might be recalling emails or
> posts. Remember mult gets automatically set to 'all' when you match on
> only some of the columns of x's key. When mult='all' I think to
> myself "for each row of y fetch me all the rows from x that match and
> eval j for that group, then move on to the next row in y". It's kind of
> like a data specific 'by'. Once you realise mult='all' is like a 'by'
> remember that 'by' automatically adds in the 'by' columns to the result.
> Hence mult='all' behaves more like a 'by' with respect to returning
> data.table rather than vector.
> 
> Example :
> 
>   X = data.table(x=1:3, y=1:4, z=rnorm(12), key="x,y")
>   Y = data.table(x=1:3)
>   X[Y,sum(z)] same as X[,sum(z),by=x]
> 
> Then going further :
> 
>   X[Y[<having>],sum(z)] faster than X[,sum(z),by=x][<having>]
> 
> Let's say <having> are groups where x>2 (just one group in this
> example):
> 
>   X[Y[x>2],sum(z)] same but faster than X[,sum(z),by=x][x>2]
> 
> which is the same as 
> 
>   X[J(3),sum(z)]
> 
> if we knew we wanted group '3' in advance for example.
> 
> These constructs (e.g. 'by without by') generalise to list() of
> expressions and function calls of column variables in the usual way.
> 
> Sometimes you do want mult='all', and run the j expression on the result
> as a whole, not by row of Y. In that case, assuming Y has fewer columns
> than key(X) meaning mult='all' (as it is in this example):
> 
>     X[Y,length(z)]     # j eval'd by row of Y, result 3 rows
>     X[Y][,length(z)]   # length 1 vector value 12
> 
> HTH?
> Matthew
> 
> 
> On Fri, 2010-07-30 at 19:28 -0700, Harish wrote:
> > I am getting some unexpected behavior with mult="all".
> > 
> > 1) Getting a data table when I expect a vector
> > 2) Not getting NA's when I expect them (because of nomatch=NA)
> > 
> > ==========
> > 
> > Common code for examples below
> > 
> > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
> > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
> > 
> > ==========
> > 
> > Issue #1: Getting a data table when I expect a vector
> > 
> > I am not following the logic of when a data.table is returned and when a vector is returned.  Initially, I thought that if j had only one item without a list(), a vector is returned, but I am seeing some contrary behavior.
> > 
> > x[y,d]  # Returns a vector as expected
> > x[y,d,mult="all"]  # Returns a data.table.  Why?
> > 
> > Would someone help me understand why I should not expect a vector in the last query?
> > 
> > ==========
> > Issue #2: Not getting NA's when I expect them (because of nomatch=NA)
> > 
> > x[y,d,nomatch=NA]  # Expected: returns a vector with NAs in them
> > x[y,d,nomatch=NA,mult="all"]  # Unexpected: NAs not appearing
> > 
> > Am I missing something?
> > 
> > Harish
> > 
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help