[datatable-help] Can crash R with a data.table query

Matthew Dowle mdowle at mdowle.plus.com
Sun Jul 18 12:52:20 CEST 2010


Another simple approach might be to have a global option
'datatable-warnjempty', by default TRUE. It would issue a warning saying
"j evaluated to empty data.table for 2 out of 634 groups".  That would
give the analyst the feedback that Harish was looking for so they can
then check into it and add the nonull() wrapper if that is appropriate,
or fix up the data in other ways.
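
For illustration, a minimal base-R sketch of what a nonull() wrapper might look like. nonull() does not exist in data.table; the body below is an assumption, and split()/lapply() stand in for 'by' so that the sketch runs without the package. DF mirrors the DT example from earlier in the thread.

```r
# Hypothetical helper (not part of data.table): turn an empty result
# into a single NA so the group still shows up in the output.
nonull <- function(x) {
  if (is.data.frame(x)) {
    # 0-row data.frame -> one row of NAs with the same columns
    if (nrow(x) == 0L) x[NA_integer_, , drop = FALSE] else x
  } else if (length(x) == 0L) {
    x[NA_integer_]   # numeric(0) -> NA_real_, integer(0) -> NA_integer_
  } else {
    x
  }
}

# Same data as the DT example in the thread, as a plain data.frame.
DF <- data.frame(A = c(25L, 85L, 25L, 25L, 85L),
                 B = c("a", "a", "b", "c", "c"),
                 C = c(2L, 65L, 9L, 82L, 823L))

# Emulate a grouped j of the form nonull(C[A==25] + C[A==85]) per B.
per_group <- lapply(split(DF, DF$B), function(SUBSET) {
  with(SUBSET, nonull(C[A == 25L] + C[A == 85L]))
})
result <- unlist(per_group)
print(result)
```

Group "b" has no A==85 row, so its sum is a zero-length vector and the wrapper turns it into NA instead of letting the group vanish.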

The same count could also be reported in verbose=TRUE mode. Keeping a
counter inside dogroups.c, incremented each time j evaluates to 0 rows,
should add no measurable compute time.
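
A rough sketch of the counting idea, in plain R rather than the C of dogroups.c. The function name warn_empty_groups() and the list-of-results representation are assumptions made for illustration; only the warning text follows the wording proposed in this message. NULL results are deliberately not counted, so a side-effect-only j stays silent.

```r
# Hypothetical: count groups whose j result is empty (but not NULL)
# and issue a single summary warning.
warn_empty_groups <- function(results) {
  is_empty <- vapply(results,
                     function(r) !is.null(r) && NROW(r) == 0L,
                     logical(1))
  n_empty <- sum(is_empty)
  if (n_empty > 0L) {
    warning(sprintf("j evaluated to empty data.table for %d out of %d groups",
                    n_empty, length(results)), call. = FALSE)
  }
  invisible(n_empty)
}

DF <- data.frame(A = c(25L, 85L, 25L, 25L, 85L),
                 B = c("a", "a", "b", "c", "c"),
                 C = c(2L, 65L, 9L, 82L, 823L))

# One result per group of B; group "b" yields a zero-length vector.
results <- lapply(split(DF, DF$B), function(SUBSET) {
  with(SUBSET, C[A == 25L] + C[A == 85L])
})

# Capture the warning text instead of letting it print.
msg <- tryCatch(warn_empty_groups(results),
                warning = function(w) conditionMessage(w))
print(msg)
```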

It can be turned off on a per-query basis with suppressWarnings() or
globally in .Rprofile with options("datatable-warnjempty"=FALSE).
For side-effect-only 'by' such as plotting, the j would be NULL for all
groups and no warning would be issued in that case.
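
A small sketch of how the two switches might behave. The option name is the one proposed here and does not exist in data.table; because it contains a hyphen it has to be quoted inside options(). report_empty() is a hypothetical stand-in for the check the grouping code would do.

```r
# Hypothetical check: warn only if the (proposed) option is not FALSE.
report_empty <- function(n_empty, n_groups) {
  enabled <- isTRUE(getOption("datatable-warnjempty", TRUE))
  if (enabled && n_empty > 0L) {
    warning(sprintf("j evaluated to empty data.table for %d out of %d groups",
                    n_empty, n_groups), call. = FALSE)
  }
  invisible(n_empty)
}

# Demo helper: TRUE if evaluating expr raised a warning.
warned <- function(expr) {
  tryCatch({ expr; FALSE }, warning = function(w) TRUE)
}

# Default: the warning is issued.
default_on <- warned(report_empty(2L, 634L))

# Per-query: silenced with suppressWarnings().
per_query <- warned(suppressWarnings(report_empty(2L, 634L)))

# Global: switched off via options(), then restored.
old <- options("datatable-warnjempty" = FALSE)
global_off <- warned(report_empty(2L, 634L))
options(old)

print(c(default_on = default_on, per_query = per_query, global_off = global_off))
```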

Thinking about it, I might find it quite useful too.

Tom and Harish - does that sound ok?

Matthew


On Tue, 2010-07-13 at 20:35 -0700, Harish wrote:
> I'm sold.  The nonull function is easy enough to create and is acceptable.
> 
> I just wanted to pitch my thoughts out and get some perspectives.
> 
> Thanks,
> Harish
> 
> 
> --- On Tue, 7/13/10, Short, Tom <TShort at epri.com> wrote:
> 
> > From: Short, Tom <TShort at epri.com>
> > Subject: RE: [datatable-help] Can crash R with a data.table query
> > To: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>
> > Cc: datatable-help at lists.r-forge.r-project.org
> > Date: Tuesday, July 13, 2010, 5:07 AM
> > Matthew/Harish, I like the existing functionality (no NA's). I'd also
> > prefer Matthew's nonull function idea to the nonullj option. The number
> > of options to [.data.table is high, and I don't think this warrants
> > another.
> > 
> > - Tom
> >  
> > 
> > > -----Original Message-----
> > > From: datatable-help-bounces at lists.r-forge.r-project.org
> > > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > > On Behalf Of Matthew Dowle
> > > Sent: Monday, July 12, 2010 20:11
> > > To: Harish
> > > Cc: datatable-help at lists.r-forge.r-project.org
> > > Subject: Re: [datatable-help] Can crash R with a data.table query
> > > 
> > > Harish,
> > > 
> > > You're right about the example, thanks. It was a typo. '<' should
> > > have been '>' :
> > > 
> > >    DT[, date[date-min(date)>7], by=var1]
> > > 
> > > That may not return data for all groups.
> > > 
> > > Basically, the way I think about 'by', and the way I explain it
> > > sometimes, is that it does this general form :
> > > 
> > > dfsplit = split(iris,iris$Species)
> > > do.call("rbind",lapply(dfsplit,function(SUBSET)with(SUBSET,data.frame(Species[1],mean(Sepal.Length)))))
> > > 
> > > That's not a good example because every item of the result of
> > > lapply will have data.  Imagine an example where not all groups
> > > return data though. Then the do.call("rbind",...) will collapse it
> > > all together and the no-data groups will be gone.  I'm not saying at
> > > all that data.table 'by' should be the same, but just that it is the
> > > same at the moment and how I think about it.
> > > 
> > > How do by(), doBy() and plyr treat this ?
> > > 
> > > The issue basically is that the j in your example :
> > > 
> > >    DT[ , .SD[ A==25, C ] + .SD[ A==85, C ], by=B ]
> > > 
> > > can return NULL, or more specifically a data.table with no rows.
> > > That may be a deliberate thing the programmer wants to do, and a
> > > nice feature.
> > > 
> > > I'm not really comfortable with the sub-queries inside the j. We
> > > would usually join in the i, or match in the j.
> > > 
> > > Another way might be to create a function nonull which replaces 0
> > > row data.tables with a single row of NAs. That could be added to
> > > data.table.
> > > 
> > > DT[ , nonull(.SD[A==25,C] + .SD[A==85,C]), by=B ]
> > > 
> > > but I realise the discussion is about defaults so this post was to
> > > add to the discussion not at all to close it. My preferred option
> > > currently is to add an option 'nonullj' which does that, default
> > > FALSE. You would prefer TRUE by default I think.
> > > 
> > > I suspect Tom may be able to see the wood through these trees
> > > hopefully.
> > > 
> > > Matthew
> > > 
> > > 
> > > 
> > > 
> > > On Sun, 2010-07-11 at 02:04 -0700, Harish wrote:
> > > > I was thinking more about this and I am unsure how the use-case
> > > > you mentioned will break if NAs are returned when length(j)==0.
> > > >    DT[, date[date-min(date)<7], by=var1]
> > > > 
> > > > In the above code, you will at least have one date selected for
> > > > each var1 since min(date) - min(date) is always < 7.  So when
> > > > does length(j) equal 0 for the proposed code returning NA to even
> > > > get triggered?
> > > > 
> > > > My point was that we should have all the "by" variable values
> > > > represented in the output.  So in your example above, if var1 was
> > > > c("A","B","C"), then the result must have at least 3 rows with at
> > > > least 1 row for each var1.  If no values are selected for whatever
> > > > reason for var1=="B", then NA is returned.
> > > > 
> > > > To take this a little further:
> > > >    DT[ blah1, blah2, by=list(var1,var2)]
> > > > Now suppose var1 and var2 were:
> > > >    var1   var2
> > > >    A      x
> > > >    A      x
> > > >    A      y
> > > >    B      x
> > > > 
> > > > In the above case, the output will have three rows at least:
> > > >    var1   var2
> > > >    A      x
> > > >    A      y
> > > >    B      x
> > > > and it need not have (B,y) since that does not even exist in the
> > > > data.
> > > > 
> > > > If I did not select any values for (B,x) because of a row filter,
> > > > I am proposing that I get an NA for all values that cannot be
> > > > computed in that row.
> > > > 
> > > > I suppose I am not understanding how this is implemented because I
> > > > see the example that you mentioned to be very different from what
> > > > I am talking about.
> > > > 
> > > > Thanks for being so patient.
> > > > 
> > > > 
> > > > Regards,
> > > > Harish
> > > > 
> > > > 
> > > > --- On Fri, 7/9/10, Harish <harishv_99 at yahoo.com> wrote:
> > > > 
> > > > > From: Harish <harishv_99 at yahoo.com>
> > > > > Subject: Re: [datatable-help] Can crash R with a data.table query
> > > > > To: mdowle at mdowle.plus.com
> > > > > Cc: datatable-help at lists.r-forge.r-project.org
> > > > > Date: Friday, July 9, 2010, 10:01 PM
> > > > > 
> > > > > Thanks for the fix.  I did use a workaround to perform the same
> > > > > computation; thanks.
> > > > > 
> > > > > I think that if data.tables returned NA's for all cases -- even
> > > > > when length(j)==0, we will easily be able to accomplish all our
> > > > > goals:
> > > > >    1) Conveniently remove rows with NA's in some cases -- Use
> > > > >       complete.cases(DT)
> > > > >    2) Be informed about missing data -- NAs are propagated during
> > > > >       computations and are easy to detect.
> > > > > 
> > > > > Also, a parameter can be used in case Goal #1 (above) is not met
> > > > > with complete.cases() efficiently.
> > > > > 
> > > > > I think the behavior of not returning NA's when length(j) == 0
> > > > > might cause missing data to be overlooked.
> > > > > 
> > > > > In my opinion, the default behavior -- in case a parameter is
> > > > > used -- should be the "safe" scenario where the fact that data
> > > > > are missing is mentioned (just like na.rm=FALSE by default for a
> > > > > lot of the functions).  This prevents the analyst from
> > > > > unknowingly proceeding with subsets of data or inaccurate data.
> > > > > Such errors will be hard to find with large and complex data
> > > > > sets.  Returning NA will always ensure that the NA is propagated
> > > > > -- therefore making it easier to catch the issue after a lot of
> > > > > computation.
> > > > > 
> > > > > Would love to hear other perspectives.
> > > > > 
> > > > > 
> > > > > Regards,
> > > > > Harish
> > > > > 
> > > > > 
> > > > > --- On Thu, 7/8/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > > > > 
> > > > > > From: Matthew Dowle <mdowle at mdowle.plus.com>
> > > > > > Subject: RE: [datatable-help] Can crash R with a data.table query
> > > > > > To: "Harish" <harishv_99 at yahoo.com>
> > > > > > Cc: datatable-help at lists.r-forge.r-project.org
> > > > > > Date: Thursday, July 8, 2010, 8:35 PM
> > > > > > 
> > > > > > Crash bug fixed (#983 reported by Harish, thanks). Tests 171 and
> > > > > > 172 added.
> > > > > > 
> > > > > > If one or more columns of the j evaluate to length>0, then any
> > > > > > zero length columns are replaced with an NA vector with length
> > > > > > the longest column of the j.  That's pretty clear.
> > > > > > 
> > > > > > If all columns in the j have zero length however, then it is not
> > > > > > replaced with a single NA row, at the moment at least
> > > > > > unfortunately. I couldn't get that to work because putting NAs
> > > > > > there stops other nice features working, which I know several
> > > > > > users depend on, for example :
> > > > > > 
> > > > > >     DT[, date[date-min(date)<7], by=var1]
> > > > > > 
> > > > > > Happy to discuss further and come up with some solution. Maybe we
> > > > > > need a new parameter. How did you get on Harish with the
> > > > > > alternatives using a join rather than by?
> > > > > > 
> > > > > > Here are the current results :
> > > > > > 
> > > > > > > DT
> > > > > >       A B   C
> > > > > > [1,] 25 a   2
> > > > > > [2,] 85 a  65
> > > > > > [3,] 25 b   9
> > > > > > [4,] 25 c  82
> > > > > > [5,] 85 c 823
> > > > > > 
> > > > > > > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ], by=B ]
> > > > > >      B  V1
> > > > > > [1,] a  67
> > > > > > [2,] c 905
> > > > > > 
> > > > > > > DT[ , list(3,data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ]), by=B ]
> > > > > >      B V1  V2
> > > > > > [1,] a  3  67
> > > > > > [2,] b  3  NA
> > > > > > [3,] c  3 905
> > > > > > 
> > > > > > Matthew
> > > > > > 
> > > > > > 
> > > > > > On Thu, 2010-07-01 at 09:21 -0700, Harish wrote:
> > > > > > > Tom and Matthew -- Thanks for confirming the issue.
> > > > > > > 
> > > > > > > I had to pull out each number (i.e. A==85 and A==25) separately
> > > > > > > because the real computation I had to do is not associative --
> > > > > > > involves division, etc.  So the other approaches you suggested
> > > > > > > won't quite work.
> > > > > > > 
> > > > > > > I think that returning NA is quite acceptable and preferred; it
> > > > > > > is better than having the row missing.  It provides an
> > > > > > > opportunity for the person analyzing the data to realize that
> > > > > > > something was amiss (i.e. A==85 was missing for B=="b" in
> > > > > > > example).  It is also consistent with reshaping the data table
> > > > > > > by having the A's as columns where we would get NAs for missing
> > > > > > > data.  Then performing the same computation will give an NA.
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Harish
> > > > > > > 
> > > > > > > 
> > > > > > > --- On Thu, 7/1/10, mdowle at mdowle.plus.com <mdowle at mdowle.plus.com> wrote:
> > > > > > > 
> > > > > > > > From: mdowle at mdowle.plus.com <mdowle at mdowle.plus.com>
> > > > > > > > Subject: RE: [datatable-help] Can crash R with a data.table query
> > > > > > > > To: "Short, Tom" <TShort at epri.com>
> > > > > > > > Cc: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>,
> > > > > > > > datatable-help at lists.r-forge.r-project.org
> > > > > > > > Date: Thursday, July 1, 2010, 5:43 AM
> > > > > > > > 
> > > > > > > > I see that too now. It'll be inside dogroups.c. Harish - can
> > > > > > > > you add as bug please to tracker, good spot.  What should the
> > > > > > > > result be though?  No rows, for group "b", or NA?  The way the
> > > > > > > > j is constructed it can't be 9.
> > > > > > > > 
> > > > > > > > Other ways to do that :
> > > > > > > > 
> > > > > > > > > DT[A%in%c(25,85),sum(C),by=B]  # ok
> > > > > > > >      B  V1
> > > > > > > > [1,] a  67
> > > > > > > > [2,] b   9
> > > > > > > > [3,] c 905
> > > > > > > > 
> > > > > > > > > DT[,.SD[A%in%c(85,25),sum(C)],by=B]  # ok
> > > > > > > >      B  V1
> > > > > > > > [1,] a  67
> > > > > > > > [2,] b   9
> > > > > > > > [3,] c 905
> > > > > > > > 
> > > > > > > > > DT[,.SD[A==25,C]+.SD[A==85,C],by=B] # crash too
> > > > > > > > 
> > > > > > > > > setkey(DT,A)
> > > > > > > > > DT[J(c(25,85)),sum(C),by=B,mult="all"]  # ok, likely fastest
> > > > > > > >      B  V1
> > > > > > > > [1,] a  67
> > > > > > > > [2,] b   9
> > > > > > > > [3,] c 905
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > That crashes R for me, too, somewhere in data.table.dll.
> > > > > > > > >
> > > > > > > > > - Tom
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >> -----Original Message-----
> > > > > > > > >> From: datatable-help-bounces at lists.r-forge.r-project.org
> > > > > > > > >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > > > > > > > >> On Behalf Of mdowle at mdowle.plus.com
> > > > > > > > >> Sent: Thursday, July 01, 2010 05:33
> > > > > > > > >> To: Harish
> > > > > > > > >> Cc: datatable-help at lists.r-forge.r-project.org
> > > > > > > > >> Subject: Re: [datatable-help] Can crash R with a data.table query
> > > > > > > > >>
> > > > > > > > >> What do you mean by 'crash'? R simply stops or there's a message?
> > > > > > > > >> Try the clean install of latest 1.5, as per recent reply on
> > > > > > > > >> other thread, and can go from there...
> > > > > > > > >>
> > > > > > > > >> > Hi,
> > > > > > > > >> >
> > > > > > > > >> > I am crashing R with the following code (and it might have
> > > > > > > > >> > something to do with data tables as well):
> > > > > > > > >> >
> > > > > > > > >> > =========
> > > > > > > > >> >
> > > > > > > > >> > DT <- structure(list(A = c(25L, 85L, 25L, 25L, 85L), B =
> > > > > > > > >> > structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("a", "b", "c"),
> > > > > > > > >> > class = "factor"), C = c(2L, 65L, 9L, 82L, 823L)),
> > > > > > > > >> > .Names = c("A", "B", "C"),
> > > > > > > > >> > class = c("data.table", "data.frame"), row.names = c(NA, -5L))
> > > > > > > > >> >
> > > > > > > > >> > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ],
> > > > > > > > >> > by=B ]
> > > > > > > > >> >
> > > > > > > > >> > =========
> > > > > > > > >> >
> > > > > > > > >> > For every B, I am trying to sum the C's where A is 25 and 85.
> > > > > > > > >> >
> > > > > > > > >> > The crash has something to do with my row selection criteria.
> > > > > > > > >> > First, note that for B=="b", I don't have A==85.  It looks
> > > > > > > > >> > like a numeric(0) is being returned in this case.
> > > > > > > > >> >
> > > > > > > > >> > In order to avoid the crash, I had to do something like:
> > > > > > > > >> >    if ( ! identical( DT[ blah ], numeric( 0 ) )
> > > > > > > > >> >
> > > > > > > > >> > It isn't just that R is unable to handle operations on
> > > > > > > > >> > numeric(0) because I don't get a crash when I just type in
> > > > > > > > >> > "numeric(0) + 2".  So, my guess is that it has something to do
> > > > > > > > >> > with data.table as well.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > Harish
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > >
> > > > >
> > _______________________________________________
> > > > > > > > >> > datatable-help
> > mailing list 
> > > > > > > > >> > datatable-help at lists.r-forge.r-project.org
> > > > > > > > >> >
> > > > > > > > >> 
> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinf
> > > > > > > > >> o/datatable
> > > > > > > > >> > -help
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > >       
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > >       
> > > > >
> > > > >
> > > > 
> > > > 
> > > > 
> > > >       
> > > 
> > > 
> > > 
> > 
> 
> 
>       



