This change shouldn't impact 'by', if I'm thinking correctly. Think of
'by' as first finding the unique groups present and then splitting up the
dataset by group. So there *is* a match for all groups when using 'by':
every group has data by construction, so the 'nomatch' argument doesn't
come into it. In the other thread the discussion was about j returning no
rows when run on a group that has data, not about a group having no
matching data.
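To make that concrete, here is a minimal sketch on made-up data (the table
X and its columns grp and v are only illustrative, not from this thread).
It just checks that the groups 'by' reports are exactly the ones present in
the data, each with at least one row:

```r
library(data.table)

# Hypothetical toy data; 'grp' and 'v' are illustrative names.
X <- data.table(grp = c("b", "a", "b", "c"), v = 1:4)

# 'by' derives its groups from the data itself, so no group can be empty.
res <- X[, list(n = length(v), total = sum(v)), by = grp]
print(res)

# The groups reported are the unique values present, each with >= 1 row.
same_groups <- setequal(res$grp, unique(X$grp))
all_nonempty <- all(res$n >= 1L)
```

Since no group can be unmatched here, there is nothing for 'nomatch' to act on.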
Using i with mult='all' may be a general form of keyed 'by', not just an
alternative. [I say 'may' because I'm only 80% sure - you've got me
thinking now.] With i and mult='all' you can:
 i) restrict to a subset of groups (likely a small subset)
 ii) include groups which may not be present (=> nomatch relevant)
 iii) maintain the order of the group results (determined by the order of i)
The special case X[SJ(unique(grpcol)),...] is equivalent to key'd by,  I
think.  Also note that X[SJ(grp),...] will be faster than X[J(grp),...]
which is why 'by' does the former, but you have more control using the
general form i and mult='all'.
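As a rough sketch of points i) to iii), again on made-up data (X, grp and v
are illustrative names, not from this thread). One assumption to flag
loudly: this is written for current data.table releases, where j is run per
row of i only if you ask for it with by=.EACHI; in the version discussed in
this thread that happened implicitly with mult='all'.

```r
library(data.table)

# Hypothetical toy data, keyed on the grouping column.
X <- data.table(grp = c("a", "a", "b", "c"), v = 1:4, key = "grp")

# Keyed 'by': the groups come from the data.
by_res <- X[, list(total = sum(v)), by = grp]

# The special case: join to the sorted unique groups and run j per row of i.
# ASSUMPTION: by = .EACHI is the current spelling of "run j for each row of i".
eachi_res <- X[SJ(unique(X$grp)), list(total = sum(v)), by = .EACHI]

# The general form: i names a subset of groups, may include a group that is
# not in X ("z"), and fixes the order of the result.
wanted <- c("c", "z", "a")
sel <- X[J(wanted), list(total = sum(v)), by = .EACHI]                 # nomatch = NA (default)
sel0 <- X[J(wanted), list(total = sum(v)), by = .EACHI, nomatch = 0L]  # drop unmatched groups

print(by_res)
print(eachi_res)
print(sel)
print(sel0)
```

The absent group only shows up in the general form, which is where 'nomatch'
becomes relevant.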
Does that all make sense?  Regardless, I'll add this change to the list -
thanks.
Best, Matthew
> Tom, thanks for your vote.  :)
>
> Matthew, since Tom is on board on the change, I am curious whether this
> change will impact the 'by' query behavior as well since both use the same
> code (and are conceptually similar).  So, to get the current behavior, one
> has to use nomatch=0 with the 'by' query to remove the NAs.  Do I
> understand this correctly?
>
> I also thank you for so patiently listening to my questioning and engaging
> me in a conversation.
>
>
> Harish
>
>
> --- On Sun, 8/1/10, Short, Tom <TShort at epri.com> wrote:
>
>> From: Short, Tom <TShort at epri.com>
>> Subject: RE: [datatable-help] Unexpected behavior with mult="all"
>> To: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>
>> Cc: datatable-help at lists.r-forge.r-project.org
>> Date: Sunday, August 1, 2010, 5:00 PM
>> I think I agree with Harish (not enough to look into it myself right
>> now, though :).
>>
>> - Tom
>>
>> 
>>
>> > -----Original Message-----
>> > From: datatable-help-bounces at lists.r-forge.r-project.org
>> > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> > On Behalf Of Matthew Dowle
>> > Sent: Sunday, August 01, 2010 17:06
>> > To: Harish
>> > Cc: datatable-help at lists.r-forge.r-project.org
>> > Subject: Re: [datatable-help] Unexpected behavior with mult="all"
>> >
>> > Thanks for sticking with it. The reason that's happening is that
>> > internally the same code that does grouping via 'by' also does
>> > grouping via mult='all'. So it's the same reason as the other thread
>> > where we were talking about 'by' collapsing away NULL groups.
>> >
>> > But when you present it in that way I see what you mean. I'm almost
>> > convinced. If Tom agrees that'll tip me and I'll add it as a bug to
>> > fix.
>> >
>> > As an aside to illustrate, the j gets evaluated for each group with
>> > mult='all', but just once without:
>> >
>> > > x[y,{cat('running j\n');list(d)}]
>> > running j
>> > [[1]]
>> > [1]  1  2 NA NA
>> >
>> > > x[y,{cat('running j\n');list(d)},mult='all']
>> > running j
>> > running j
>> >      a b V1
>> > [1,] a A  1
>> > [2,] b A  2
>> > >
>> >
>> >
>> >
>> > On Sat, 2010-07-31 at 19:45 -0700, Harish wrote:
>> > > Thanks for the detailed explanation.  My question about #1 is
>> > > resolved.  You certainly gave me a lot to ponder over.
>> > >
>> > > I still am doubtful about my question #2 -- not getting NAs.
>> > >
>> > > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
>> > > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
>> > > x[y]                      # Expected: Getting NAs
>> > > x[y,mult="all"]           # Expected: Getting NAs with mult="all"
>> > > x[y,list(d)]              # Expected: Getting NAs with i
>> > > x[y,list(d),mult="all"]   # Unexpected: No NAs with i & mult="all"
>> > >
>> > > > x[y]
>> > >         a    b  d
>> > > [1,]    a    A  1
>> > > [2,]    b    A  2
>> > > [3,] <NA> <NA> NA
>> > > [4,] <NA> <NA> NA
>> > >
>> > > > x[y,mult="all"]
>> > >         a    b  d
>> > > [1,]    a    A  1
>> > > [2,]    b    A  2
>> > > [3,] <NA> <NA> NA
>> > > [4,] <NA> <NA> NA
>> > >
>> > > > x[y,list(d)]
>> > >       d
>> > > [1,]  1
>> > > [2,]  2
>> > > [3,] NA
>> > > [4,] NA
>> > >
>> > > > x[y,list(d),mult="all"]
>> > >      a b d
>> > > [1,] a A 1
>> > > [2,] b A 2
>> > >
>> > >
>> > > As you can see, the combination of having both i and mult="all"
>> > > is not generating the NAs.  Is there a reason for this?
>> > >
>> > >
>> > > Regards,
>> > > Harish
>> > >
>> > >
>> > > --- On Sat, 7/31/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>> > >
>> > > > From: Matthew Dowle <mdowle at mdowle.plus.com>
>> > > > Subject: Re: [datatable-help] Unexpected behavior with mult="all"
>> > > > To: "Harish" <harishv_99 at yahoo.com>
>> > > > Cc: datatable-help at lists.r-forge.r-project.org
>> > > > Date: Saturday, July 31, 2010, 7:25 AM
>> > > >
>> > > > This is how I think about it currently :
>> > > >
>> > > > [1] The syntax of "x[y,d]" plus knowing how mult's default value
>> > > > is set ('first' in this case) means that a vector as long as the
>> > > > number of rows in y is the result, so data.table does the least
>> > > > work it can and returns just the vector without adding in the data
>> > > > already in y. Changing mult to "all" however means you'll usually
>> > > > get a varying number of items back for each row in y, so
>> > > > data.table includes the y columns as a convenience since if it
>> > > > didn't the result would be difficult to use (you wouldn't know the
>> > > > correspondence). data.table tries to do the minimum, most
>> > > > efficient thing. If you want to be less efficient (e.g. adding
>> > > > columns you already know) then it's for the user to add them back.
>> > > > This is sort of a principle.
>> > > >
>> > > > [2] nomatch is by default NA so this is the same as [1]. Is that
>> > > > by any chance a typo and you meant nomatch=0 ?  If so then you
>> > > > might have a point and perhaps something needs changing there.
>> > > >
>> > > > The other way I think about mult='all' is grouping. The
>> > > > documentation sometimes mentions 'by without by', or I might be
>> > > > recalling emails or posts. Remember mult gets automatically set to
>> > > > 'all' when you match to fewer than all of the columns of x's key.
>> > > > When mult='all' I think to myself "for each row of y fetch me all
>> > > > the rows from x that match and eval j for that group, then move on
>> > > > to the next row in y".  It's kind of like a data specific 'by'.
>> > > > Once you realise mult='all' is like a 'by' remember that 'by'
>> > > > automatically adds in the 'by' columns to the result. Hence
>> > > > mult='all' behaves more like a 'by' with respect to returning
>> > > > data.table rather than vector.
>> > > >
>> > > > Example :
>> > > >
>> > > >   X = data.table(x=1:3, y=1:4, z=rnorm(12), key="x,y")
>> > > >   Y = data.table(x=1:3)
>> > > >   X[Y,sum(z)] same as X[,sum(z),by=x]
>> > > >
>> > > > Then going further :
>> > > >
>> > > >   X[Y[<having>],sum(z)] faster than X[,sum(z),by=x][<having>]
>> > > >
>> > > > Let's say <having> are groups where x>2 (just one group in this
>> > > > example) :
>> > > >
>> > > >   X[Y[x>2],sum(z)] same but faster than X[,sum(z),by=x][x>2]
>> > > >
>> > > > which is the same as
>> > > >
>> > > >   X[J(3),sum(z)]
>> > > >
>> > > > if we knew we wanted group '3' in advance for example.
>> > > >
>> > > > These constructs (e.g. 'by without by') generalise to list() of
>> > > > expressions and function calls of column variables in the usual
>> > > > way.
>> > > >
>> > > > Sometimes you do want mult='all', and run the j expression on the
>> > > > result as a whole, not by row of Y.  In that case, assuming Y has
>> > > > fewer columns than key(X) meaning mult='all' (as it is in this
>> > > > example) :
>> > > >
>> > > >     X[Y,length(z)]    # j eval'd by row of Y, result 3 rows
>> > > >     X[Y][,length(z)]  # length 1 vector value 12
>> > > >
>> > > > HTH?
>> > > > Matthew
>> > > >
>> > > >
>> > > > On Fri, 2010-07-30 at 19:28 -0700, Harish wrote:
>> > > > > I am getting some unexpected behavior with mult="all".
>> > > > >
>> > > > > 1) Getting a data table when I expect a vector
>> > > > > 2) Not getting NA's when I expect them (because of nomatch=NA)
>> > > > >
>> > > > > ==========
>> > > > >
>> > > > > Common code for examples below
>> > > > >
>> > > > > x <- data.table(a=c("a","b","d","e"),b=c("A","A","B","B"),d=c(1,2,3,4), key="a,b")
>> > > > > y <- data.table(g=c("a","b","c","d"),h=c("A","A","A","A"))
>> > > > >
>> > > > > ==========
>> > > > >
>> > > > > Issue #1: Getting a data table when I expect a vector
>> > > > >
>> > > > > I am not following the logic of when a data.table is returned
>> > > > > and when a vector is returned.  Initially, I thought that if j
>> > > > > had only one item without a list(), a vector is returned, but I
>> > > > > am seeing some contrary behavior.
>> > > > >
>> > > > > x[y,d]  # Returns a vector as expected
>> > > > > x[y,d,mult="all"]  # Returns a data.table.  Why?
>> > > > >
>> > > > > Would someone help me understand why I should not expect a
>> > > > > vector in the last query?
>> > > > >
>> > > > > ==========
>> > > > > Issue #2: Not getting NA's when I expect them (because of nomatch=NA)
>> > > > >
>> > > > > x[y,d,nomatch=NA]  # Expected: returns a vector with NAs in them
>> > > > > x[y,d,nomatch=NA,mult="all"]  # Unexpected: NAs not appearing
>> > > > >
>> > > > > Am I missing something?
>> > > > >
>> > > > > Harish
>> > > > >
>> > > > > _______________________________________________
>> > > > > datatable-help mailing list
>> > > > > datatable-help at lists.r-forge.r-project.org
>> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help