[datatable-help] Unexpected behavior with mult="all"

mdowle at mdowle.plus.com
Mon Aug 2 11:29:51 CEST 2010


This change shouldn't impact 'by', if I'm thinking correctly. Think of
'by' as first finding the unique groups present and then splitting up the
dataset by group. So there *is* a match for all groups when using 'by';
every group has data by construction, so the 'nomatch' argument doesn't
come into it. In the other thread the discussion was about j returning no
rows when run on a group that has data, not about there being no matching
data for the group.
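
To make that distinction concrete, here is a minimal sketch. The table X
and its columns grp and v are made up for illustration (they are not from
this thread), and it assumes a current data.table release, so details may
differ from the version under discussion.

```r
library(data.table)

# Hypothetical data: only groups "a" and "c" are present
X <- data.table(grp = c("a", "a", "c"), v = c(1, 2, 10))

# 'by' discovers the groups from the data itself, so every group it
# visits has rows; there is nothing for 'nomatch' to act on.
counts <- X[, .N, by = grp]

# The case from the other thread: j returns NULL for a group that does
# have data (sum(v) for "a" is 3), and that group is dropped from the result.
big <- X[, if (sum(v) > 5) sum(v), by = grp]

print(counts)
print(big)
```

In the second query group "a" was matched and had data; it is j, not the
matching step, that produced nothing for it.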

Using i with mult='all' may be a general form of "key'd by" not just an
alternative. [I say 'may' because I'm only 80% sure - you've got me
thinking now.] With i and mult='all' you can :

 i) restrict to a subset of groups (likely a small subset)
 ii) include groups which may not be present (=> nomatch relevant)
 iii) maintain the order of the group results (determined by the order of i)

The special case X[SJ(unique(grpcol)),...] is equivalent to key'd by,  I
think.  Also note that X[SJ(grp),...] will be faster than X[J(grp),...]
which is why 'by' does the former, but you have more control using the
general form i and mult='all'.
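
A rough sketch of points i) to iii) and of the special case, again with
made-up data (X, grp and v are hypothetical). One assumption to flag
loudly: in current data.table releases, evaluating j once per row of i
has to be requested explicitly with by = .EACHI, rather than following
from mult='all' as described in this thread, so the sketch uses that
form.

```r
library(data.table)

# Hypothetical data, keyed by the grouping column
X <- data.table(grp = c("a", "a", "b", "c", "c"), v = 1:5, key = "grp")

# i restricts to a subset of groups, may name a group that is not
# present ("z"), and fixes the order of the group results.
# Assumption: by = .EACHI is the current spelling of "by without by".
want <- c("c", "z", "a")
with_na <- X[J(want), sum(v), by = .EACHI]                # nomatch = NA (default)
no_na   <- X[J(want), sum(v), by = .EACHI, nomatch = 0L]  # drop unmatched groups

# Special case: all groups present, sorted, compared with a keyed 'by'
each_i <- X[SJ(unique(X$grp)), sum(v), by = .EACHI]
keyed  <- X[, sum(v), keyby = grp]

print(with_na)
print(no_na)
print(each_i)
print(keyed)
```

The first result keeps a row for "z" with an NA value, the second omits
it, and both follow the order given in i rather than the key order.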

Does that all make sense?  Regardless, I'll add this change to the list -
thanks.

Best, Matthew


> Tom, thanks for your vote.  :)
>
> Matthew, since Tom is on board on the change, I am curious whether this
> change will impact the 'by' query behavior as well since both use the same
> code (and are conceptually similar).  So, to get the current behavior, one
> has to use nomatch=0 with the 'by' query to remove the NAs.  Do I
> understand this correctly?
>
> I also thank you for so patiently listening to my questioning and engaging
> me in a conversation.
>
>
> Harish
>
>
> --- On Sun, 8/1/10, Short, Tom <TShort at epri.com> wrote:
>
>> From: Short, Tom <TShort at epri.com>
>> Subject: RE: [datatable-help] Unexpected behavior with mult="all"
>> To: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>
>> Cc: datatable-help at lists.r-forge.r-project.org
>> Date: Sunday, August 1, 2010, 5:00 PM
>> I think I agree with Harish (not enough to look into it myself right
>> now, though :).
>>
>> - Tom
>>
>> 
>>
>> > -----Original Message-----
>> > From: datatable-help-bounces at lists.r-forge.r-project.org
>> > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> > On Behalf Of Matthew Dowle
>> > Sent: Sunday, August 01, 2010 17:06
>> > To: Harish
>> > Cc: datatable-help at lists.r-forge.r-project.org
>> > Subject: Re: [datatable-help] Unexpected behavior with mult="all"
>> >
>> > Thanks for sticking with it. The reason that's happening is that
>> > internally the same code that does grouping via 'by' also does
>> > grouping via mult='all'. So it's the same reason as in the other
>> > thread where we were talking about 'by' collapsing away NULL groups.
>> >
>> > But when you present it in that way I see what you mean. I'm almost
>> > convinced. If Tom agrees that'll tip me and I'll add it as a bug to
>> > fix.
>> >
>> > As an aside to illustrate, the j gets evaluated for each group with
>> > mult='all', but just once without :
>> >
>> > > x[y,{cat('running j\n');list(d)}]
>> > running j
>> > [[1]]
>> > [1]  1  2 NA NA
>> >
>> > > x[y,{cat('running j\n');list(d)},mult='all']
>> > running j
>> > running j
>> >      a b V1
>> > [1,] a A  1
>> > [2,] b A  2
>> > >
>> >
>> >
>> >
>> > On Sat, 2010-07-31 at 19:45 -0700, Harish wrote:
>> > > Thanks for the detailed explanation.  My question about #1 is
>> > > resolved.  You certainly gave me a lot to ponder over.
>> > >
>> > > I am still doubtful about my question #2 -- not getting NAs.
>> > >
>> > > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
>> > > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
>> > > x[y]                      # Expected: Getting NAs
>> > > x[y,mult="all"]           # Expected: Getting NAs with mult="all"
>> > > x[y,list(d)]              # Expected: Getting NAs with i
>> > > x[y,list(d),mult="all"]   # Unexpected: No NAs with i & mult="all"
>> > >
>> > > > x[y]
>> > >         a    b  d
>> > > [1,]    a    A  1
>> > > [2,]    b    A  2
>> > > [3,] <NA> <NA> NA
>> > > [4,] <NA> <NA> NA
>> > >
>> > > > x[y,mult="all"]
>> > >         a    b  d
>> > > [1,]    a    A  1
>> > > [2,]    b    A  2
>> > > [3,] <NA> <NA> NA
>> > > [4,] <NA> <NA> NA
>> > >
>> > > > x[y,list(d)]
>> > >       d
>> > > [1,]  1
>> > > [2,]  2
>> > > [3,] NA
>> > > [4,] NA
>> > >
>> > > > x[y,list(d),mult="all"]
>> > >      a b d
>> > > [1,] a A 1
>> > > [2,] b A 2
>> > >
>> > >
>> > > As you can see, the combination of having both i and mult="all"
>> > > is not generating the NAs.  Is there a reason for this?
>> > >
>> > >
>> > > Regards,
>> > > Harish
>> > >
>> > >
>> > > --- On Sat, 7/31/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>> > >
>> > > > From: Matthew Dowle <mdowle at mdowle.plus.com>
>> > > > Subject: Re: [datatable-help] Unexpected behavior with mult="all"
>> > > > To: "Harish" <harishv_99 at yahoo.com>
>> > > > Cc: datatable-help at lists.r-forge.r-project.org
>> > > > Date: Saturday, July 31, 2010, 7:25 AM
>> > > >
>> > > > This is how I think about it currently :
>> > > >
>> > > > [1] The syntax of "x[y,d]" plus knowing how mult's default value
>> > > > is set ('first' in this case) means that a vector as long as the
>> > > > number of rows in y is the result, so data.table does the least
>> > > > work it can and returns just the vector without adding in the
>> > > > data already in y. Changing mult to "all" however means you'll
>> > > > usually get a varying number of items back for each row in y, so
>> > > > data.table includes the y columns as a convenience since if it
>> > > > didn't the result would be difficult to use (you wouldn't know
>> > > > the correspondence). data.table tries to do the minimum, most
>> > > > efficient thing. If you want to be less efficient (e.g. adding
>> > > > columns you already know) then it's for the user to add them
>> > > > back. This is sort of a principle.
>> > > >
>> > > > [2] nomatch is by default NA so this is the same as [1]. Is that
>> > > > by any chance a typo and you meant nomatch=0 ?  If so then you
>> > > > might have a point and perhaps something needs changing there.
>> > > >
>> > > > The other way I think about mult='all' is grouping. The
>> > > > documentation sometimes mentions 'by without by', or I might be
>> > > > recalling emails or posts. Remember mult gets automatically set
>> > > > to 'all' when you match to not all of the columns of x's key.
>> > > > When mult='all' I think to myself "for each row of y fetch me all
>> > > > the rows from x that match and eval j for that group, then move
>> > > > on to the next row in y".  It's kind of like a data specific
>> > > > 'by'. Once you realise mult='all' is like a 'by', remember that
>> > > > 'by' automatically adds in the 'by' columns to the result. Hence
>> > > > mult='all' behaves more like a 'by' with respect to returning a
>> > > > data.table rather than a vector.
>> > > >
>> > > > Example :
>> > > >
>> > > >   X = data.table(x=1:3, y=1:4, z=rnorm(12), key="x,y")
>> > > >   Y = data.table(x=1:3)
>> > > >   X[Y,sum(z)] same as X[,sum(z),by=x]
>> > > >
>> > > > Then going further :
>> > > >
>> > > >   X[Y[<having>],sum(z)] faster than X[,sum(z),by=x][<having>]
>> > > >
>> > > > Let's say <having> are groups where x>2 (just one group in this
>> > > > example) :
>> > > >
>> > > >   X[Y[x>2],sum(z)] same but faster than X[,sum(z),by=x][x>2]
>> > > >
>> > > > which is the same as
>> > > >
>> > > >   X[J(3),sum(z)]
>> > > >
>> > > > if we knew we wanted group '3' in advance for example.
>> > > >
>> > > > These constructs (e.g. 'by without by') generalise to list() of
>> > > > expressions and function calls of column variables in the usual
>> > > > way.
>> > > >
>> > > > Sometimes you do want mult='all', and run the j expression on the
>> > > > result as a whole, not by row of Y.  In that case, assuming Y has
>> > > > fewer columns than key(X) meaning mult='all' (as it is in this
>> > > > example) :
>> > > >
>> > > >     X[Y,length(z)]    # j eval'd by row of Y, result 3 rows
>> > > >     X[Y][,length(z)]  # length 1 vector value 12
>> > > >
>> > > > HTH?
>> > > > Matthew
>> > > >
>> > > >
>> > > > On Fri, 2010-07-30 at 19:28 -0700, Harish wrote:
>> > > > > I am getting some unexpected behavior with mult="all".
>> > > > >
>> > > > > 1) Getting a data table when I expect a vector
>> > > > > 2) Not getting NA's when I expect them (because of nomatch=NA)
>> > > > >
>> > > > > ==========
>> > > > >
>> > > > > Common code for examples below
>> > > > >
>> > > > > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
>> > > > > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
>> > > > >
>> > > > > ==========
>> > > > >
>> > > > > Issue #1: Getting a data table when I expect a vector
>> > > > >
>> > > > > I am not following the logic of when a data.table is returned
>> > > > > and when a vector is returned.  Initially, I thought that if j
>> > > > > had only one item without a list(), a vector is returned, but I
>> > > > > am seeing some contrary behavior.
>> > > > >
>> > > > > x[y,d]  # Returns a vector as expected
>> > > > > x[y,d,mult="all"]  # Returns a data.table.  Why?
>> > > > >
>> > > > > Would someone help me understand why I should not expect a
>> > > > > vector in the last query?
>> > > > >
>> > > > > ==========
>> > > > > Issue #2: Not getting NA's when I expect them (because of
>> > > > > nomatch=NA)
>> > > > >
>> > > > > x[y,d,nomatch=NA]  # Expected: returns a vector with NAs in them
>> > > > > x[y,d,nomatch=NA,mult="all"]  # Unexpected: NAs not appearing
>> > > > >
>> > > > > Am I missing something?
>> > > > >
>> > > > > Harish
>> > > > >
>> > > > >
>> > > > >
>> > > > >       
>> > > > >
>> > > > > _______________________________________________
>> > > > > datatable-help mailing list
>> > > > > datatable-help at lists.r-forge.r-project.org
>> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> > > >
>> > > >
>> > > >
>> > >
>> > >
>> > >
>> > >       
>> >
>> >
>> >
>>
>
>
>
>
>



