[datatable-help] Can crash R with a data.table query

Harish harishv_99 at yahoo.com
Sat Jul 10 07:01:04 CEST 2010


Thanks for the fix.  I did use a workaround to perform the same computation.

I think that if data.table returned NAs in all cases, even when length(j)==0, we would easily be able to accomplish both of our goals:
   1) Conveniently remove rows with NAs when they are not wanted: use complete.cases(DT).
   2) Be informed about missing data: NAs are propagated during computations and are easy to detect.

Also, a parameter could be added in case complete.cases() does not meet Goal #1 (above) efficiently.
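
To illustrate the two goals, here is a minimal sketch in base R. It assumes a hypothetical result table `res` in which group "b" came back as NA (values taken from the example quoted below), and it uses a plain data.frame rather than a data.table so that it runs without the package.

```r
# Hypothetical grouped result where group "b" carries an NA.
# A plain data.frame is used here as a stand-in for a data.table.
res <- data.frame(B = c("a", "b", "c"), V1 = c(67L, NA, 905L))

# Goal 1: drop the incomplete rows when they are not wanted.
res_complete <- res[complete.cases(res), ]

# Goal 2: the missing group stays visible and can be counted.
n_missing <- sum(!complete.cases(res))
```

With the NA row present, both options remain open: keep it to see that something is missing, or filter it out in one step.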

I think the behavior of not returning NAs when length(j) == 0 might cause missing data to be overlooked.
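
A rough base R analogue of how a zero-length result can go unnoticed is sketched below. It is not how data.table works internally; it only assumes the same five rows as the DT in this thread, held in a plain data.frame called `DF` (a name made up for this sketch).

```r
# Same data as the DT in this thread, as a plain data.frame.
DF <- data.frame(A = c(25L, 85L, 25L, 25L, 85L),
                 B = c("a", "a", "b", "c", "c"),
                 C = c(2L, 65L, 9L, 82L, 823L))

# Per group: C where A == 25 plus C where A == 85.
per_group <- lapply(split(DF, DF$B),
                    function(g) g$C[g$A == 25] + g$C[g$A == 85])

# Group "b" has no A == 85 row, so its result has length zero.
len_b <- length(per_group$b)

# When the pieces are combined, the zero-length group simply vanishes.
combined <- unlist(per_group)
remaining_groups <- names(combined)
```

Nothing in `combined` signals that group "b" was dropped, which is the kind of silent loss I am worried about.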

In my opinion, if a parameter is added, its default should be the "safe" behavior, in which missing data are flagged (just as na.rm=FALSE is the default for many functions).  This prevents the analyst from unknowingly proceeding with a subset of the data or with inaccurate data.  Such errors are hard to find in large and complex data sets.  Returning NA ensures that the NA is always propagated, which makes the issue easier to catch even after a lot of computation.
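
As a small illustration of the na.rm analogy, the following base R sketch uses a made-up vector `x` standing in for per-group results with one missing value.

```r
# Made-up per-group results; the middle group is missing.
x <- c(67, NA, 905)

# Default (na.rm = FALSE): the missing value is reported, not hidden.
total_default <- sum(x)

# Dropping NAs has to be requested explicitly.
total_narm <- sum(x, na.rm = TRUE)

# The NA survives later arithmetic, so it can still be detected downstream.
ratio <- x / 2
has_missing <- any(is.na(ratio))
```

Here `total_default` is NA while `total_narm` is 972, so ignoring the missing group is a deliberate choice rather than an accident.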

Would love to hear other perspectives.


Regards,
Harish


--- On Thu, 7/8/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> From: Matthew Dowle <mdowle at mdowle.plus.com>
> Subject: RE: [datatable-help] Can crash R with a data.table query
> To: "Harish" <harishv_99 at yahoo.com>
> Cc: datatable-help at lists.r-forge.r-project.org
> Date: Thursday, July 8, 2010, 8:35 PM
> 
> Crash bug fixed (#983 reported by Harish, thanks). Tests 171 and 172
> added.
>
> If one or more columns of the j evaluate to length>0, then any zero
> length columns are replaced with an NA vector whose length is that of
> the longest column of the j.  That's pretty clear.
>
> If all columns in the j have zero length, however, then the result is
> not replaced with a single NA row, at the moment at least, unfortunately.
> I couldn't get that to work because putting NAs there stops other nice
> features working, which I know several users depend on, for example:
>
>     DT[, date[date-min(date)<7], by=var1]
>
> Happy to discuss further and come up with some solution. Maybe we need a
> new parameter. How did you get on, Harish, with the alternatives using a
> join rather than by?
> 
> Here are the current results :
> 
> > DT
>       A B   C
> [1,] 25 a   2
> [2,] 85 a  65
> [3,] 25 b   9
> [4,] 25 c  82
> [5,] 85 c 823
> 
> > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ], by=B ]
>      B  V1
> [1,] a  67
> [2,] c 905
> 
> > DT[ , list(3,data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ]), by=B ]
>      B V1  V2
> [1,] a  3  67
> [2,] b  3  NA
> [3,] c  3 905
> 
> Matthew
> 
> 
> On Thu, 2010-07-01 at 09:21 -0700, Harish wrote:
> > Tom and Matthew -- Thanks for confirming the issue.
> > 
> > I had to pull out each number (i.e. A==85 and A==25) separately because the real computation I had to do is not associative -- involves division, etc.  So the other approaches you suggested won't quite work.
> > 
> > I think that returning NA is quite acceptable and preferred; it is better than having the row missing.  It provides an opportunity for the person analyzing the data to realize that something was amiss (i.e. A==85 was missing for B=="b" in example).  It is also consistent with reshaping the data table by having the A's as columns where we would get NAs for missing data.  Then performing the same computation will give an NA.
> > 
> > Regards,
> > Harish
> > 
> > 
> > --- On Thu, 7/1/10, mdowle at mdowle.plus.com <mdowle at mdowle.plus.com> wrote:
> > 
> > > From: mdowle at mdowle.plus.com <mdowle at mdowle.plus.com>
> > > Subject: RE: [datatable-help] Can crash R with a data.table query
> > > To: "Short, Tom" <TShort at epri.com>
> > > Cc: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>, datatable-help at lists.r-forge.r-project.org
> > > Date: Thursday, July 1, 2010, 5:43 AM
> > > 
> > > I see that too now. It'll be inside dogroups.c. Harish - can you add as
> > > bug please to tracker, good spot.  What should the result be though?  No
> > > rows, for group "b", or NA?  The way the j is constructed it can't be 9.
> > > 
> > > Other ways to do that :
> > > 
> > > DT[A%in%c(25,85),sum(C),by=B]  # ok
> > >      B  V1
> > > [1,] a  67
> > > [2,] b   9
> > > [3,] c 905
> > > 
> > > DT[,.SD[A%in%c(85,25),sum(C)],by=B]  # ok
> > >      B  V1
> > > [1,] a  67
> > > [2,] b   9
> > > [3,] c 905
> > > 
> > > DT[,.SD[A==25,C]+.SD[A==85,C],by=B] # crash too
> > > 
> > > > setkey(DT,A)
> > > > DT[J(c(25,85)),sum(C),by=B,mult="all"]  # ok, likely fastest
> > >      B  V1
> > > [1,] a  67
> > > [2,] b   9
> > > [3,] c 905
> > > 
> > > 
> > > 
> > > > That crashes R for me, too, somewhere in data.table.dll.
> > > >
> > > > - Tom
> > > >
> > > >
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: datatable-help-bounces at lists.r-forge.r-project.org
> > > >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > > >> On Behalf Of mdowle at mdowle.plus.com
> > > >> Sent: Thursday, July 01, 2010 05:33
> > > >> To: Harish
> > > >> Cc: datatable-help at lists.r-forge.r-project.org
> > > >> Subject: Re: [datatable-help] Can crash R with a data.table query
> > > >>
> > > >> What do you mean by 'crash'? R simply stops or there's a message?
> > > >> Try the clean install of latest 1.5, as per recent reply on
> > > >> other thread, and can go from there...
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > I am crashing R with the following code (and it might have something
> > > >> > to do with data tables as well):
> > > >> >
> > > >> > =========
> > > >> >
> > > >> >
> > > >> > DT <- structure(list(A = c(25L, 85L, 25L, 25L, 85L),
> > > >> >     B = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
> > > >> >     C = c(2L, 65L, 9L, 82L, 823L)),
> > > >> >     .Names = c("A", "B", "C"),
> > > >> >     class = c("data.table", "data.frame"),
> > > >> >     row.names = c(NA, -5L))
> > > >> >
> > > >> > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ], by=B ]
> > > >> >
> > > >> > =========
> > > >> >
> > > >> > For every B, I am trying to sum the C's where A is 25 and 85.
> > > >> >
> > > >> > The crash has something to do with my row selection criteria.  First,
> > > >> > note that for B=="b", I don't have A==85.  It looks like a numeric(0)
> > > >> > is being returned in this case.
> > > >> >
> > > >> > In order to avoid the crash, I had to do something like:
> > > >> >    if ( ! identical( DT[ blah ], numeric( 0 ) ) )
> > > >> >
> > > >> > It isn't just that R is unable to handle operations on numeric(0)
> > > >> > because I don't get a crash when I just type in "numeric(0) + 2".  So,
> > > >> > my guess is that it has something to do with data.table as well.
> > > >> >
> > > >> >
> > > >> > Harish
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > _______________________________________________
> > > >> > datatable-help mailing list
> > > >> > datatable-help at lists.r-forge.r-project.org
> > > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >>
> > > >