[datatable-help] Can crash R with a data.table query

Short, Tom TShort at epri.com
Tue Jul 13 14:07:14 CEST 2010


Matthew/Harish, I like the existing functionality (no NA's). I'd also
prefer Matthew's nonull function idea to the nonullj option. The number
of options to [.data.table is high, and I don't think this warrants
another. 

- Tom
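
For readers following the thread, here is a minimal base-R sketch of the nonull idea discussed in the quoted message below. It is not data.table code: `nonull` is a hypothetical helper (it is only proposed in this thread), `j` and `by_B` are illustrative names, and the split/lapply/rbind pattern stands in for the way Matthew describes 'by'. The data are the five rows from Harish's original report.

```r
# The five-row example from the original report, as a plain data.frame.
DF <- data.frame(A = c(25L, 85L, 25L, 25L, 85L),
                 B = c("a", "a", "b", "c", "c"),
                 C = c(2L, 65L, 9L, 82L, 823L),
                 stringsAsFactors = FALSE)

# Hypothetical helper: turn a zero-length result into a single NA.
nonull <- function(x) if (length(x) == 0L) NA_integer_ else x

# The j expression from the thread: C where A==25 plus C where A==85.
j <- function(SUBSET) with(SUBSET, C[A == 25] + C[A == 85])

# Emulate 'by' as split + lapply + rbind, with an optional wrapper around j.
by_B <- function(wrap = identity) {
  pieces <- lapply(split(DF, DF$B), function(SUBSET) {
    v <- wrap(j(SUBSET))
    data.frame(B = rep(SUBSET$B[1], length(v)), V1 = v,
               stringsAsFactors = FALSE)
  })
  out <- do.call("rbind", unname(pieces))
  rownames(out) <- NULL
  out
}

without_nonull <- by_B()        # group "b" yields no rows and disappears
with_nonull    <- by_B(nonull)  # group "b" is kept with V1 = NA
```

In this sketch `without_nonull` has rows for groups a and c only, while `with_nonull` keeps group b with an NA, which is the behaviour Harish asks for. Making it opt-in through a wrapper function avoids adding another argument to [.data.table.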
 

> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org 
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org] 
> On Behalf Of Matthew Dowle
> Sent: Monday, July 12, 2010 20:11
> To: Harish
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Can crash R with a data.table query
> 
> Harish,
> 
> You're right about the example, thanks. It was a typo. '<' 
> should have been '>' :
> 
>    DT[, date[date-min(date)>7], by=var1]
> 
> That may not return data for all groups.
> 
> Basically, the way I think about 'by', and the way I explain 
> it sometimes is that it does this general form :
> 
> dfsplit = split(iris,iris$Species)
> do.call("rbind",lapply(dfsplit,function(SUBSET)
>     with(SUBSET,data.frame(Species[1],mean(Sepal.Length)))))
> 
> That's not a good example because every item of the result of 
> lapply will have data.  Imagine an example where not all 
> groups return data though.
> Then the do.call("rbind",...) will collapse it all together 
> and the no-data groups will be gone.  I'm not saying at all 
> that data.table 'by' should be the same, but just that it is
> the same at the moment and how I think about it.
> 
> How do by(), doBy() and plyr treat this ?
> 
> The issue basically is that the j in your example :
> 
>    DT[ , .SD[ A==25, C ] + .SD[ A==85, C ], by=B ]
> 
> can return NULL, or more specifically a data.table with no rows.
> That may be a deliberate thing the programmer wants to do, 
> and a nice feature.
> 
> I'm not really comfortable with the sub-queries inside the j. 
> We would usually join in the i, or match in the j.
> 
> Another way might be to create a function nonull which 
> replaces 0 row data.tables with a single row of NAs. That 
> could be added to data.table.
> 
> DT[ , nonull(.SD[A==25,C] + .SD[A==85,C]), by=B ]
> 
> but I realise the discussion is about defaults so this post 
> was to add to the discussion not at all to close it. My 
> preferred option currently is to add an option 'nonullj' 
> which does that, default FALSE. You would prefer TRUE by 
> default I think.
> 
> I suspect Tom may be able to see the wood for the trees, 
> hopefully.
> 
> Matthew
> 
> 
> 
> 
> On Sun, 2010-07-11 at 02:04 -0700, Harish wrote:
> > I was thinking more about this and I am unsure how the use-case
> > you mentioned will break if NAs are returned when length(j)==0.
> >    DT[, date[date-min(date)<7], by=var1]
> > 
> > In the above code, you will at least have one date selected for
> > each var1 since min(date) - min(date) is always < 7.  So when does
> > length(j) equal 0 for the proposed code returning NA to even get
> > triggered?
> > 
> > My point was that we should have all the "by" variable values
> > represented in the output.  So in your example above, if var1 was
> > c("A","B","C"), then the result must have at least 3 rows with at
> > least 1 row for each var1.  If no values are selected for whatever
> > reason for var1=="B", then NA is returned.
> > 
> > To take this a little further:
> >    DT[ blah1, blah2, by=list(var1,var2)]
> > Now suppose var1 and var2 were:
> >    var1   var2
> >    A      x
> >    A      x
> >    A      y
> >    B      x
> > 
> > In the above case, the output will have three rows at least:
> >    var1   var2
> >    A      x
> >    A      y
> >    B      x
> > and it need not have (B,y) since that does not even exist in the
> > data.
> > 
> > If I did not select any values for (B,x) because of a row filter,
> > I am proposing that I get an NA for all values that cannot be
> > computed in that row.
> > 
> > I suppose I am not understanding how this is implemented because
> > I see the example that you mentioned to be very different from
> > what I am talking about.
> > 
> > Thanks for being so patient.
> > 
> > 
> > Regards,
> > Harish
> > 
> > 
> > --- On Fri, 7/9/10, Harish <harishv_99 at yahoo.com> wrote:
> > 
> > > From: Harish <harishv_99 at yahoo.com>
> > > Subject: Re: [datatable-help] Can crash R with a data.table query
> > > To: mdowle at mdowle.plus.com
> > > Cc: datatable-help at lists.r-forge.r-project.org
> > > Date: Friday, July 9, 2010, 10:01 PM
> > > 
> > > Thanks for the fix.  I did use a workaround to perform the same
> > > computation; thanks.
> > > 
> > > I think that if data.tables returned NA's for all cases -- even
> > > when length(j)==0, we will easily be able to accomplish all our
> > > goals:
> > >    1) Conveniently remove rows with NA's in some cases -- Use 
> > > complete.cases(DT)
> > >    2) Be informed about missing data -- NAs are propagated during 
> > > computations and are easy to detect.
> > > 
> > > Also, a parameter can be used in case Goal #1 (above) is not met 
> > > with complete.cases() efficiently.
> > > 
> > > I think the behavior of not returning NA's when length(j) == 0
> > > might cause missing data to be overlooked.
> > > 
> > > In my opinion, the default behavior -- in case a parameter is
> > > used -- should be the "safe" scenario where the fact that data are
> > > missing is mentioned (just like na.rm=FALSE by default for a lot
> > > of the functions).  This prevents the analyst from unknowingly
> > > proceeding with subsets of data or inaccurate data.  Such errors
> > > will be hard to find with large and complex data sets.  Returning
> > > NA will always ensure that the NA is propagated -- therefore making
> > > it easier to catch the issue after a lot of computation.
> > > 
> > > Would love to hear other perspectives.
> > > 
> > > 
> > > Regards,
> > > Harish
> > > 
> > > 
> > > --- On Thu, 7/8/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > > 
> > > > From: Matthew Dowle <mdowle at mdowle.plus.com>
> > > > Subject: RE: [datatable-help] Can crash R with a data.table query
> > > > To: "Harish" <harishv_99 at yahoo.com>
> > > > Cc: datatable-help at lists.r-forge.r-project.org
> > > > Date: Thursday, July 8, 2010, 8:35 PM
> > > > 
> > > > Crash bug fixed (#983 reported by Harish, thanks).  Tests 171
> > > > and 172 added.
> > > > 
> > > > If one or more columns of the j evaluate to length>0, then any
> > > > zero length columns are replaced with an NA vector with length
> > > > the longest column of the j.  That's pretty clear.
> > > > 
> > > > If all columns in the j have zero length however, then it is
> > > > not replaced with a single NA row, at the moment at least
> > > > unfortunately. I couldn't get that to work because putting NAs
> > > > there stops other nice features working, which I know several
> > > > users depend on, for example :
> > > > 
> > > >     DT[, date[date-min(date)<7], by=var1]
> > > > 
> > > > Happy to discuss further and come up with some solution.  Maybe
> > > > we need a new parameter. How did you get on Harish with the
> > > > alternatives using a join rather than by?
> > > > 
> > > > Here are the current results :
> > > > 
> > > > > DT
> > > >       A B   C
> > > > [1,] 25 a   2
> > > > [2,] 85 a  65
> > > > [3,] 25 b   9
> > > > [4,] 25 c  82
> > > > [5,] 85 c 823
> > > > 
> > > > > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ], by=B ]
> > > >      B  V1
> > > > [1,] a  67
> > > > [2,] c 905
> > > > 
> > > > > DT[ , list(3,data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ]), by=B ]
> > > >      B V1  V2
> > > > [1,] a  3  67
> > > > [2,] b  3  NA
> > > > [3,] c  3 905
> > > > 
> > > > Matthew
> > > > 
> > > > 
> > > > On Thu, 2010-07-01 at 09:21 -0700, Harish wrote:
> > > > > Tom and Matthew -- Thanks for confirming the issue.
> > > > > 
> > > > > I had to pull out each number (i.e. A==85 and A==25) separately
> > > > > because the real computation I had to do is not associative --
> > > > > involves division, etc.  So the other approaches you suggested
> > > > > won't quite work.
> > > > > 
> > > > > I think that returning NA is quite acceptable and preferred; it
> > > > > is better than having the row missing.  It provides an
> > > > > opportunity for the person analyzing the data to realize that
> > > > > something was amiss (i.e. A==85 was missing for B=="b" in
> > > > > example).  It is also consistent with reshaping the data table
> > > > > by having the A's as columns where we would get NAs for missing
> > > > > data.  Then performing the same computation will give an NA.
> > > > > 
> > > > > Regards,
> > > > > Harish
> > > > > 
> > > > > 
> > > > > --- On Thu, 7/1/10, mdowle at mdowle.plus.com <mdowle at mdowle.plus.com> wrote:
> > > > > 
> > > > > > From: mdowle at mdowle.plus.com <mdowle at mdowle.plus.com>
> > > > > > Subject: RE: [datatable-help] Can crash R with a data.table query
> > > > > > To: "Short, Tom" <TShort at epri.com>
> > > > > > Cc: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>,
> > > > > >     datatable-help at lists.r-forge.r-project.org
> > > > > > Date: Thursday, July 1, 2010, 5:43 AM
> > > > > > 
> > > > > > I see that too now. It'll be inside dogroups.c.  Harish - can
> > > > > > you add as bug please to tracker, good spot.  What should the
> > > > > > result be though?  No rows, for group "b", or NA?  The way the
> > > > > > j is constructed it can't be 9.
> > > > > > 
> > > > > > Other ways to do that :
> > > > > > 
> > > > > > DT[A%in%c(25,85),sum(C),by=B]  # ok
> > > > > >      B  V1
> > > > > > [1,] a  67
> > > > > > [2,] b   9
> > > > > > [3,] c 905
> > > > > > 
> > > > > > DT[,.SD[A%in%c(85,25),sum(C)],by=B]  # ok
> > > > > >      B  V1
> > > > > > [1,] a  67
> > > > > > [2,] b   9
> > > > > > [3,] c 905
> > > > > > 
> > > > > > DT[,.SD[A==25,C]+.SD[A==85,C],by=B] # crash too
> > > > > > 
> > > > > > > setkey(DT,A)
> > > > > > > DT[J(c(25,85)),sum(C),by=B,mult="all"]  # ok, likely fastest
> > > > > >      B  V1
> > > > > > [1,] a  67
> > > > > > [2,] b   9
> > > > > > [3,] c 905
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > That crashes R for me, too, somewhere in data.table.dll.
> > > > > > >
> > > > > > > - Tom
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >> -----Original Message-----
> > > > > > > >> From: datatable-help-bounces at lists.r-forge.r-project.org
> > > > > > > >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > > > > > > >> On Behalf Of mdowle at mdowle.plus.com
> > > > > > > >> Sent: Thursday, July 01, 2010 05:33
> > > > > > > >> To: Harish
> > > > > > > >> Cc: datatable-help at lists.r-forge.r-project.org
> > > > > > > >> Subject: Re: [datatable-help] Can crash R with a data.table query
> > > > > > >>
> > > > > > > >> What do you mean by 'crash'? R simply stops or there's a
> > > > > > > >> message? Try the clean install of latest 1.5, as per recent
> > > > > > > >> reply on other thread, and can go from there...
> > > > > > >>
> > > > > > >> > Hi,
> > > > > > >> >
> > > > > > > >> > I am crashing R with the following code (and it might have
> > > > > > > >> > something to do with data tables as well):
> > > > > > >> >
> > > > > > >> > =========
> > > > > > >> >
> > > > > > >> >
> > > > > > > >> > DT <- structure(list(A = c(25L, 85L, 25L, 25L, 85L),
> > > > > > > >> >     B = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("a", "b", "c"),
> > > > > > > >> >         class = "factor"),
> > > > > > > >> >     C = c(2L, 65L, 9L, 82L, 823L)), .Names = c("A", "B", "C"),
> > > > > > > >> >     class = c("data.table", "data.frame"), row.names = c(NA, -5L))
> > > > > > >> >
> > > > > > > >> > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ], by=B ]
> > > > > > >> >
> > > > > > >> > =========
> > > > > > >> >
> > > > > > > >> > For every B, I am trying to sum the C's where A is 25 and 85.
> > > > > > >> >
> > > > > > > >> > The crash has something to do with my row selection criteria.
> > > > > > > >> > First, note that for B=="b", I don't have A==85.  It looks like
> > > > > > > >> > a numeric(0) is being returned in this case.
> > > > > > >> >
> > > > > > > >> > In order to avoid the crash, I had to do something like:
> > > > > > > >> >    if ( ! identical( DT[ blah ], numeric( 0 ) ) )
> > > > > > >> >
> > > > > > > >> > It isn't just that R is unable to handle operations on
> > > > > > > >> > numeric(0) because I don't get a crash when I just type in
> > > > > > > >> > "numeric(0) + 2".  So, my guess is that it has something to
> > > > > > > >> > do with data.table as well.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Harish
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > >
> > > _______________________________________________
> > > > > > >> > datatable-help mailing list 
> > > > > > >> > datatable-help at lists.r-forge.r-project.org
> > > > > > >> >
> > > > > > >> 
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinf
> > > > > > >> o/datatable
> > > > > > >> > -help
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > _______________________________________________
> > > > > > >> datatable-help mailing list 
> > > > > > >> datatable-help at lists.r-forge.r-project.org
> > > > > > >> 
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinf
> > > > > > >> o/d
> > > > > > > atatable-help
> > > > > > >>
> > > > > > >
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > >       
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > >       
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org
> > > 
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatab
> > > le-help
> > >
> > 
> > 
> > 
> >       
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
atatable-help
> 

