[datatable-help] Unexpected behavior with mult="all"

Matthew Dowle mdowle at mdowle.plus.com
Sun Aug 22 02:12:34 CEST 2010


Harish,

Just fixed this one, bug #1015 : mult='all' should return NA when
j=list(d)  i.e. 'by without by' now heeds nomatch=NA.

> x[y,list(d),mult="all"]
     a b  d
[1,] a A  1
[2,] b A  2
[3,] c A NA
[4,] d A NA
> 

The old behaviour is returned via nomatch=0.

> x[y,list(d),mult="all",nomatch=0]
     a b d
[1,] a A 1
[2,] b A 2
> 
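
For anyone without a data.table session to hand, the two results above can
be mimicked in base R. This is only a rough sketch under assumptions:
join_each is a made-up helper, not data.table code, and it assumes each key
pair occurs at most once in x, so it models the nomatch handling rather than
the full mult='all' machinery.

```r
# Illustrative model only: base R, no data.table required.
# join_each is a hypothetical helper, not part of any package.
x <- data.frame(a = c("a", "b", "d", "e"),
                b = c("A", "A", "B", "B"),
                d = c(1, 2, 3, 4),
                stringsAsFactors = FALSE)
y <- data.frame(g = c("a", "b", "c", "d"),
                h = c("A", "A", "A", "A"),
                stringsAsFactors = FALSE)

join_each <- function(x, y, nomatch = NA) {
  # For each row of y, find the row of x holding the same (a, b) pair.
  # match() returns `nomatch` (NA or 0 here) where no such row exists.
  idx <- match(paste(y$g, y$h, sep = "\r"),
               paste(x$a, x$b, sep = "\r"),
               nomatch = nomatch)
  # NA indices are kept and yield NA in d; 0 indices are dropped.
  keep <- is.na(idx) | idx > 0
  data.frame(a = y$g[keep], b = y$h[keep], d = x$d[idx[keep]],
             stringsAsFactors = FALSE)
}

with_na <- join_each(x, y)               # keeps groups c and d, d is NA
without <- join_each(x, y, nomatch = 0)  # drops groups c and d
```

Here with_na has four rows with NA in d for the c and d groups, and without
has only the a and b rows, which mirrors the two outputs above.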

We can observe the j running 4 times now :

> x[y,{cat('running j\n');list(d)},mult='all'] 
running j
running j
running j
running j
     a b V1
[1,] a A  1
[2,] b A  2
[3,] c A NA
[4,] d A NA
>
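
The "j runs once per row of y" idea can also be sketched in base R. This is
a hedged model, not the actual implementation: eval_j_by_row and n_calls are
hypothetical names, and handing j a single all-NA row for an unmatched group
is an assumption made for the model.

```r
# Illustrative model only; eval_j_by_row is a hypothetical helper.
x <- data.frame(a = c("a", "b", "d", "e"),
                b = c("A", "A", "B", "B"),
                d = c(1, 2, 3, 4),
                stringsAsFactors = FALSE)
y <- data.frame(g = c("a", "b", "c", "d"),
                h = c("A", "A", "A", "A"),
                stringsAsFactors = FALSE)

eval_j_by_row <- function(x, y, j, nomatch = NA) {
  out <- list()
  for (i in seq_len(nrow(y))) {
    # All rows of x whose key matches row i of y form one group.
    rows <- x[x$a == y$g[i] & x$b == y$h[i], , drop = FALSE]
    if (nrow(rows) == 0) {
      if (!is.na(nomatch)) next               # nomatch = 0: skip, j is not run
      rows <- x[NA_integer_, , drop = FALSE]  # nomatch = NA: one all-NA row
    }
    out[[length(out) + 1]] <- j(rows)
  }
  out
}

# j counts its own invocations and returns column d of the group.
n_calls <- 0
j <- function(rows) { n_calls <<- n_calls + 1; rows$d }

res_na <- eval_j_by_row(x, y, j)              # j runs for a, b, c and d
calls_na <- n_calls
n_calls <- 0
res_0 <- eval_j_by_row(x, y, j, nomatch = 0)  # j runs for a and b only
calls_0 <- n_calls
```

In this model calls_na is 4 and calls_0 is 2, matching the four and two
'running j' lines seen with and without the fix.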

Have added 21 new tests related to this change.

Thanks once again for raising it, it really is much appreciated.

Matthew



On Mon, 2010-08-02 at 10:29 +0100, mdowle at mdowle.plus.com wrote:
> This change shouldn't impact 'by', if I'm thinking correctly. Think of
> 'by' as first finding the unique groups present and then splitting up the
> dataset by group. So there *is* a match for all groups when using 'by';
> every group does have data by construction therefore the 'nomatch'
> argument doesn't come into it. In the other thread the discussion was when
> j returns no rows when run on a group with data, not that there was no
> matching data for the group.
> 
> Using i with mult='all' may be a general form of "key'd by" not just an
> alternative. [I say 'may' because I'm only 80% sure - you've got me
> thinking now.] With i and mult='all' you can :
> 
>  i) restrict to a subset of groups (likely a small subset)
>  ii) include groups which may not be present (=> nomatch relevant)
>  iii) maintain the order of the group results (determined by the order of i)
> 
> The special case X[SJ(unique(grpcol)),...] is equivalent to key'd by,  I
> think.  Also note that X[SJ(grp),...] will be faster than X[J(grp),...]
> which is why 'by' does the former, but you have more control using the
> general form i and mult='all'.
> 
> Does that all make sense?  Regardless, I'll add this change to the list -
> thanks.
> 
> Best, Matthew
> 
> 
> > Tom, thanks for your vote.  :)
> >
> > Matthew, since Tom is on board on the change, I am curious whether this
> > change will impact the 'by' query behavior as well since both use the same
> > code (and are conceptually similar).  So, to get the current behavior, one
> > has to use nomatch=0 with the 'by' query to remove the NAs.  Do I
> > understand this correctly?
> >
> > I also thank you for so patiently listening to my questioning and engaging
> > me in a conversation.
> >
> >
> > Harish
> >
> >
> > --- On Sun, 8/1/10, Short, Tom <TShort at epri.com> wrote:
> >
> >> From: Short, Tom <TShort at epri.com>
> >> Subject: RE: [datatable-help] Unexpected behavior with mult="all"
> >> To: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>
> >> Cc: datatable-help at lists.r-forge.r-project.org
> >> Date: Sunday, August 1, 2010, 5:00 PM
> >> I think I agree with Harish (not enough to look into it myself right
> >> now, though :).
> >>
> >> - Tom
> >>
> >> 
> >>
> >> > -----Original Message-----
> >> > From: datatable-help-bounces at lists.r-forge.r-project.org
> >> > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> > On Behalf Of Matthew Dowle
> >> > Sent: Sunday, August 01, 2010 17:06
> >> > To: Harish
> >> > Cc: datatable-help at lists.r-forge.r-project.org
> >> > Subject: Re: [datatable-help] Unexpected behavior with mult="all"
> >> >
> >> > Thanks for sticking with it. The reason that's happening is that
> >> > internally the same code that does grouping via 'by' also does
> >> > grouping via mult='all'. So it's the same reason as in the other
> >> > thread, where we were talking about 'by' collapsing away NULL groups.
> >> >
> >> > But when you present it in that way I see what you mean. I'm almost
> >> > convinced. If Tom agrees that'll tip me and I'll add it as a bug to
> >> > fix.
> >> >
> >> > As an aside to illustrate, the j gets evaluated for each group with
> >> > mult='all', but just once without :
> >> >
> >> > > x[y,{cat('running j\n');list(d)}]
> >> > running j
> >> > [[1]]
> >> > [1]  1  2 NA NA
> >> >
> >> > > x[y,{cat('running j\n');list(d)},mult='all']
> >> > running j
> >> > running j
> >> >      a b V1
> >> > [1,] a A  1
> >> > [2,] b A  2
> >> > >
> >> >
> >> >
> >> >
> >> > On Sat, 2010-07-31 at 19:45 -0700, Harish wrote:
> >> > > Thanks for the detailed explanation.  My question about #1 is
> >> > > resolved.  You certainly gave me a lot to ponder over.
> >> > >
> >> > > I still am doubtful about my question #2 -- not getting NAs.
> >> > >
> >> > > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4),
> >> > >                 key="a,b")
> >> > > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
> >> > > x[y]                      # Expected: Getting NAs
> >> > > x[y,mult="all"]           # Expected: Getting NAs with mult="all"
> >> > > x[y,list(d)]              # Expected: Getting NAs with i
> >> > > x[y,list(d),mult="all"]   # Unexpected: No NAs with i & mult="all"
> >> > >
> >> > > > x[y]
> >> > >         a    b  d
> >> > > [1,]    a    A  1
> >> > > [2,]    b    A  2
> >> > > [3,] <NA> <NA> NA
> >> > > [4,] <NA> <NA> NA
> >> > >
> >> > > > x[y,mult="all"]
> >> > >         a    b  d
> >> > > [1,]    a    A  1
> >> > > [2,]    b    A  2
> >> > > [3,] <NA> <NA> NA
> >> > > [4,] <NA> <NA> NA
> >> > >
> >> > > > x[y,list(d)]
> >> > >       d
> >> > > [1,]  1
> >> > > [2,]  2
> >> > > [3,] NA
> >> > > [4,] NA
> >> > >
> >> > > > x[y,list(d),mult="all"]
> >> > >      a b d
> >> > > [1,] a A 1
> >> > > [2,] b A 2
> >> > >
> >> > >
> >> > > As you can see, the combination of having both i and mult="all" is
> >> > > not generating the NAs.  Is there a reason for this?
> >> > >
> >> > >
> >> > > Regards,
> >> > > Harish
> >> > >
> >> > >
> >> > > --- On Sat, 7/31/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> >> > >
> >> > > > From: Matthew Dowle <mdowle at mdowle.plus.com>
> >> > > > Subject: Re: [datatable-help] Unexpected behavior with mult="all"
> >> > > > To: "Harish" <harishv_99 at yahoo.com>
> >> > > > Cc: datatable-help at lists.r-forge.r-project.org
> >> > > > Date: Saturday, July 31, 2010, 7:25 AM
> >> > > >
> >> > > > This is how I think about it currently :
> >> > > >
> >> > > > [1] The syntax of "x[y,d]" plus knowing how mult's default value is
> >> > > > set ('first' in this case) means that a vector as long as the number
> >> > > > of rows in y is the result so data.table does the least work it can
> >> > > > and returns just the vector without adding in the data already in y.
> >> > > > Changing mult to "all" however means you'll usually get a varying
> >> > > > number of items back for each row in y, so data.table includes the y
> >> > > > columns as a convenience since if it didn't the result would be
> >> > > > difficult to use (you wouldn't know the correspondence). data.table
> >> > > > tries to do the minimum, most efficient thing. If you want to be less
> >> > > > efficient (e.g. adding columns you already know) then it's for the
> >> > > > user to add them back. This is sort of a principle.
> >> > > >
> >> > > > [2] nomatch is by default NA so this is the same as [1]. Is that by
> >> > > > any chance a typo and you meant nomatch=0 ?  If so then you might
> >> > > > have a point and perhaps something needs changing there.
> >> > > >
> >> > > > The other way I think about mult='all' is grouping. The
> >> > > > documentation sometimes mentions 'by without by', or I might be
> >> > > > recalling emails or posts. Remember mult gets automatically set to
> >> > > > 'all' when you match to not all of the columns of x's key. When
> >> > > > mult='all' I think to myself "for each row of y fetch me all the rows
> >> > > > from x that match and eval j for that group, then move on to the next
> >> > > > row in y".  It's kind of like a data specific 'by'. Once you realise
> >> > > > mult='all' is like a 'by' remember that 'by' automatically adds in
> >> > > > the 'by' columns to the result. Hence mult='all' behaves more like a
> >> > > > 'by' with respect to returning data.table rather than vector.
> >> > > >
> >> > > > Example :
> >> > > >
> >> > > >   X = data.table(x=1:3, y=1:4, z=rnorm(12), key="x,y")
> >> > > >   Y = data.table(x=1:3)
> >> > > >   X[Y,sum(z)]  same as  X[,sum(z),by=x]
> >> > > >
> >> > > > Then going further :
> >> > > >
> >> > > >   X[Y[<having>],sum(z)]  faster than  X[,sum(z),by=x][<having>]
> >> > > >
> >> > > > Let's say <having> are groups where x>2 (just one group in this
> >> > > > example) :
> >> > > >
> >> > > >   X[Y[x>2],sum(z)]  same but faster than  X[,sum(z),by=x][x>2]
> >> > > >
> >> > > > which is the same as
> >> > > >
> >> > > >   X[J(3),sum(z)]
> >> > > >
> >> > > > if we knew we wanted group '3' in advance for example.
> >> > > >
> >> > > > These constructs (e.g. 'by without by') generalise to list() of
> >> > > > expressions and function calls of column variables in the usual way.
> >> > > >
> >> > > > Sometimes you do want mult='all', and run the j expression on the
> >> > > > result as a whole, not by row of Y.  In that case, assuming Y has
> >> > > > fewer columns than key(X) meaning mult='all' (as it is in this
> >> > > > example) :
> >> > > >
> >> > > >     X[Y,length(z)]    # j eval'd by row of Y, result 3 rows
> >> > > >     X[Y][,length(z)]  # length 1 vector value 12
> >> > > >
> >> > > > HTH?
> >> > > > Matthew
> >> > > >
> >> > > >
> >> > > > On Fri, 2010-07-30 at 19:28 -0700, Harish wrote:
> >> > > > > I am getting some unexpected behavior with mult="all".
> >> > > > >
> >> > > > > 1) Getting a data table when I expect a vector
> >> > > > > 2) Not getting NA's when I expect them (because of nomatch=NA)
> >> > > > >
> >> > > > > ==========
> >> > > > >
> >> > > > > Common code for examples below
> >> > > > >
> >> > > > > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4),
> >> > > > >                 key="a,b")
> >> > > > > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
> >> > > > >
> >> > > > > ==========
> >> > > > >
> >> > > > > Issue #1: Getting a data table when I expect a vector
> >> > > > >
> >> > > > > I am not following the logic of when a data.table is returned and
> >> > > > > when a vector is returned.  Initially, I thought that if j had only
> >> > > > > one item without a list(), a vector is returned, but I am seeing
> >> > > > > some contrary behavior.
> >> > > > >
> >> > > > > x[y,d]  # Returns a vector as expected
> >> > > > > x[y,d,mult="all"]  # Returns a data.table.  Why?
> >> > > > >
> >> > > > > Would someone help me understand why I should not expect a vector
> >> > > > > in the last query?
> >> > > > >
> >> > > > > ==========
> >> > > > > Issue #2: Not getting NA's when I expect them (because of
> >> > > > > nomatch=NA)
> >> > > > >
> >> > > > > x[y,d,nomatch=NA]  # Expected: returns a vector with NAs in them
> >> > > > > x[y,d,nomatch=NA,mult="all"]  # Unexpected: NAs not appearing
> >> > > > >
> >> > > > > Am I missing something?
> >> > > > >
> >> > > > > Harish
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >       
> >> > > > >
> >> > > > > _______________________________________________
> >> > > > > datatable-help mailing list
> >> > > > > datatable-help at lists.r-forge.r-project.org
> >> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help