[datatable-help] Can crash R with a data.table query

Matthew Dowle mdowle at mdowle.plus.com
Tue Jul 13 02:10:37 CEST 2010


Harish,

You're right about the example, thanks. It was a typo: '<' should have
been '>':

   DT[, date[date-min(date)>7], by=var1]

That may not return data for all groups: a group whose dates all fall
within 7 days of its own minimum selects nothing.
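
To make that concrete, here is a minimal sketch with made-up data (the
table below is an assumption for illustration only, and it assumes a
current data.table is installed):

```r
library(data.table)

# Hypothetical data: group "A" spans 11 days, group "B" has a single date
DT <- data.table(
  var1 = c("A", "A", "B"),
  date = as.Date(c("2010-07-01", "2010-07-12", "2010-07-05"))
)

# Per group, keep the dates more than 7 days after that group's first date
res <- DT[, date[date - min(date) > 7], by = var1]
print(res)
# Group "B" selects nothing, so it has no row in the result
```

Only group "A" appears in the result; "B" is silently absent, which is
the behaviour under discussion.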

Basically, the way I think about 'by', and the way I sometimes explain
it, is that it follows this general form:

dfsplit = split(iris, iris$Species)
do.call("rbind", lapply(dfsplit, function(SUBSET)
    with(SUBSET, data.frame(Species[1], mean(Sepal.Length)))))

That's not a good example because every item of the lapply result will
have data.  Imagine an example where not all groups return data, though.
Then the do.call("rbind",...) will collapse it all together and the
no-data groups will be gone.  I'm not saying at all that data.table 'by'
should work the same way, just that it does at the moment, and that this
is how I think about it.
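
A sketch of that split/rbind form where one group does return no data,
using base R only. The data.frame copies the DT values from earlier in
this thread (with B as character rather than factor, for simplicity);
the names DF, pieces and result are just for illustration:

```r
# Same values as the DT in this thread, as a plain data.frame
DF <- data.frame(
  A = c(25L, 85L, 25L, 25L, 85L),
  B = c("a", "a", "b", "c", "c"),
  C = c(2L, 65L, 9L, 82L, 823L),
  stringsAsFactors = FALSE
)

# One result piece per group; group "b" has no A==85, so v is integer(0)
pieces <- lapply(split(DF, DF$B), function(SUBSET) {
  v <- with(SUBSET, C[A == 25L] + C[A == 85L])
  data.frame(B = rep(SUBSET$B[1], length(v)), V1 = v,
             stringsAsFactors = FALSE)
})

print(sapply(pieces, nrow))  # the piece for "b" has 0 rows

# rbind collapses the pieces; the zero-row piece contributes nothing
result <- do.call("rbind", pieces)
print(result)
```

The piece for "b" exists in the list but has no rows, so after the rbind
there is no trace of that group.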

How do by(), doBy and plyr treat this?

The issue basically is that the j in your example :

   DT[ , .SD[ A==25, C ] + .SD[ A==85, C ], by=B ]

can return NULL, or more specifically a data.table with no rows.
That may be a deliberate thing the programmer wants to do, and
a nice feature.

I'm not really comfortable with the sub-queries inside the
j. We would usually join in the i, or match in the j.
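
For illustration, "match in the j" might look like the sketch below. It
assumes a current data.table, rebuilds the DT from this thread (B as
character rather than factor), and assumes each value of A occurs at
most once per group, since match() only finds the first occurrence:

```r
library(data.table)

DT <- data.table(
  A = c(25L, 85L, 25L, 25L, 85L),
  B = c("a", "a", "b", "c", "c"),
  C = c(2L, 65L, 9L, 82L, 823L)
)

# match() gives the position of the A==25 / A==85 row within each group,
# or NA if there is none; C[NA] is NA, so the sum is NA for group "b"
res <- DT[, C[match(25L, A)] + C[match(85L, A)], by = B]
print(res)
```

Because match() yields NA rather than a zero-length result, every group
keeps a row and the missing A==85 for "b" shows up as an NA.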

Another way might be to create a function nonull which
replaces 0-row data.tables with a single row of NAs. That
could be added to data.table.

DT[ , nonull(.SD[A==25,C] + .SD[A==85,C]), by=B ]

But I realise the discussion is about defaults, so this post
is meant to add to the discussion, not to close it. My
preferred option currently is to add an option 'nonullj' which
does that, default FALSE. I think you would prefer TRUE by
default.
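
A rough sketch of what such a helper could look like. nonull is
hypothetical and not part of data.table; this version only handles the
case where the j evaluates to a plain vector (which is what .SD[A==25,C]
returns in a current data.table), and the 0-row data.table case is left
out:

```r
library(data.table)

# Hypothetical helper, not part of data.table: if the j result is a
# zero-length vector, return a single NA of the same type; otherwise
# return the input unchanged
nonull <- function(x) {
  if (length(x) == 0L) x[NA_integer_] else x
}

DT <- data.table(
  A = c(25L, 85L, 25L, 25L, 85L),
  B = c("a", "a", "b", "c", "c"),
  C = c(2L, 65L, 9L, 82L, 823L)
)

# Without the helper, group "b" is dropped
without <- DT[, .SD[A == 25L, C] + .SD[A == 85L, C], by = B]
print(without)

# With the helper, group "b" is kept with an NA
with_na <- DT[, nonull(.SD[A == 25L, C] + .SD[A == 85L, C]), by = B]
print(with_na)
```

Indexing with NA_integer_ keeps the NA the same type as the other
groups' results, so the column type stays consistent across groups.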

I suspect Tom may be able to see the wood for the trees here, hopefully.

Matthew




On Sun, 2010-07-11 at 02:04 -0700, Harish wrote:
> I was thinking more about this and I am unsure how the use-case you mentioned will break if NAs are returned when length(j)==0.
>    DT[, date[date-min(date)<7], by=var1]
> 
> In the above code, you will at least have one date selected for each var1 since min(date) - min(date) is always < 7.  So when does length(j) equal 0 for the proposed code returning NA to even get triggered?
> 
> My point was that we should have all the "by" variable values represented in the output.  So in your example above, if var1 was c("A","B","C"), then the result must have at least 3 rows with at least 1 row for each var1.  If no values are selected for whatever reason for var1=="B", then NA is returned.
> 
> To take this a little further:
>    DT[ blah1, blah2, by=list(var1,var2)]
> Now suppose var1 and var2 were:
>    var1   var2
>    A      x
>    A      x
>    A      y
>    B      x
> 
> In the above case, the output will have three rows at least:
>    var1   var2
>    A      x
>    A      y
>    B      x
> and it need not have (B,y) since that does not even exist in the data.
> 
> If I did not select any values for (B,x) because of a row filter, I am proposing that I get an NA for all values that cannot be computed in that row.
> 
> I suppose I am not understanding how this is implemented because I see the example that you mentioned to be very different from what I am talking about.
> 
> Thanks for being so patient.
> 
> 
> Regards,
> Harish
> 
> 
> --- On Fri, 7/9/10, Harish <harishv_99 at yahoo.com> wrote:
> 
> > From: Harish <harishv_99 at yahoo.com>
> > Subject: Re: [datatable-help] Can crash R with a data.table query
> > To: mdowle at mdowle.plus.com
> > Cc: datatable-help at lists.r-forge.r-project.org
> > Date: Friday, July 9, 2010, 10:01 PM
> > Thanks for the fix.  I did use a workaround to perform the same
> > computation; thanks.
> > 
> > I think that if data.tables returned NA's for all cases -- even when
> > length(j)==0, we will easily be able to accomplish all our goals:
> >    1) Conveniently remove rows with NA's in some cases -- Use
> >       complete.cases(DT)
> >    2) Be informed about missing data -- NAs are propagated during
> >       computations and are easy to detect.
> > 
> > Also, a parameter can be used in case Goal #1 (above) is not met with
> > complete.cases() efficiently.
> > 
> > I think the behavior of not returning NA's when length(j) == 0 might
> > cause missing data to be overlooked.
> > 
> > In my opinion, the default behavior -- in case a parameter is used --
> > should be the "safe" scenario where the fact that data are missing is
> > mentioned (just like na.rm=FALSE by default for a lot of the
> > functions).  This prevents the analyst from unknowingly proceeding
> > with subsets of data or inaccurate data.  Such errors will be hard to
> > find with large and complex data sets.  Return NA will always ensure
> > that the NA is propagated -- therefore making it easier to catch the
> > issue after a lot of computation.
> > 
> > Would love to hear other perspectives.
> > 
> > 
> > Regards,
> > Harish
> > 
> > 
> > --- On Thu, 7/8/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > 
> > > From: Matthew Dowle <mdowle at mdowle.plus.com>
> > > Subject: RE: [datatable-help] Can crash R with a data.table query
> > > To: "Harish" <harishv_99 at yahoo.com>
> > > Cc: datatable-help at lists.r-forge.r-project.org
> > > Date: Thursday, July 8, 2010, 8:35 PM
> > > 
> > > Crash bug fixed (#983 reported by Harish, thanks). Tests 171 and 172
> > > added.
> > > 
> > > If one or more columns of the j evaluate to length>0, then any zero
> > > length columns are replaced with an NA vector with length the longest
> > > column of the j.  Thats pretty clear.
> > > 
> > > If all columns in the j have zero length however, then it is not
> > > replaced with a single NA row, at the moment at least unfortunately. I
> > > couldn't get that to work because putting NAs there stop other nice
> > > features working, which I know several users depend on for example :
> > > 
> > >     DT[, date[date-min(date)<7], by=var1]
> > > 
> > > Happy to discuss further and come up with some solution. Maybe we need a
> > > new parameter. How did you get on Harish with the alternatives using a
> > > join rather than by?
> > > 
> > > Here are the current results :
> > > 
> > > > DT
> > >       A B   C
> > > [1,] 25 a   2
> > > [2,] 85 a  65
> > > [3,] 25 b   9
> > > [4,] 25 c  82
> > > [5,] 85 c 823
> > > 
> > > > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ], by=B ]
> > >      B  V1
> > > [1,] a  67
> > > [2,] c 905
> > > 
> > > > DT[ , list(3,data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ]), by=B ]
> > >      B V1  V2
> > > [1,] a  3  67
> > > [2,] b  3  NA
> > > [3,] c  3 905
> > > 
> > > Matthew
> > > 
> > > 
> > > On Thu, 2010-07-01 at 09:21 -0700, Harish wrote:
> > > > Tom and Matthew -- Thanks for confirming the issue.
> > > > 
> > > > I had to pull out each number (i.e. A==85 and A==25) separately
> > > > because the real computation I had to do is not associative --
> > > > involves division, etc.  So the other approaches you suggested won't
> > > > quite work.
> > > > 
> > > > I think that returning NA is quite acceptable and preferred; it is
> > > > better than having the row missing.  It provides an opportunity for
> > > > the person analyzing the data to realize that something was amiss
> > > > (i.e. A==85 was missing for B=="b" in example).  It is also
> > > > consistent with reshaping the data table by having the A's as columns
> > > > where we would get NAs for missing data.  Then performing the same
> > > > computation will give an NA.
> > > > 
> > > > Regards,
> > > > Harish
> > > > 
> > > > 
> > > > --- On Thu, 7/1/10, mdowle at mdowle.plus.com <mdowle at mdowle.plus.com> wrote:
> > > > 
> > > > > From: mdowle at mdowle.plus.com <mdowle at mdowle.plus.com>
> > > > > Subject: RE: [datatable-help] Can crash R with a data.table query
> > > > > To: "Short, Tom" <TShort at epri.com>
> > > > > Cc: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>, datatable-help at lists.r-forge.r-project.org
> > > > > Date: Thursday, July 1, 2010, 5:43 AM
> > > > > 
> > > > > I see that too now. It'll be inside dogroups.c. Harish - can you add
> > > > > as bug please to tracker, good spot.  What should the result be
> > > > > though?  No rows, for group "b", or NA?  The way the j is constructed
> > > > > it can't be 9.
> > > > > 
> > > > > Other ways to do that :
> > > > > 
> > > > > DT[A%in%c(25,85),sum(C),by=B]  # ok
> > > > >      B  V1
> > > > > [1,] a  67
> > > > > [2,] b   9
> > > > > [3,] c 905
> > > > > 
> > > > > DT[,.SD[A%in%c(85,25),sum(C)],by=B]  # ok
> > > > >      B  V1
> > > > > [1,] a  67
> > > > > [2,] b   9
> > > > > [3,] c 905
> > > > > 
> > > > > DT[,.SD[A==25,C]+.SD[A==85,C],by=B] # crash too
> > > > > 
> > > > > > setkey(DT,A)
> > > > > > DT[J(c(25,85)),sum(C),by=B,mult="all"]  # ok, likely fastest
> > > > >      B  V1
> > > > > [1,] a  67
> > > > > [2,] b   9
> > > > > [3,] c 905
> > > > > 
> > > > > 
> > > > > 
> > > > > > That crashes R for me, too, somewhere in data.table.dll.
> > > > > >
> > > > > > - Tom
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >> -----Original Message-----
> > > > > >> From: datatable-help-bounces at lists.r-forge.r-project.org
> > > > > >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > > > > >> On Behalf Of mdowle at mdowle.plus.com
> > > > > >> Sent: Thursday, July 01, 2010 05:33
> > > > > >> To: Harish
> > > > > >> Cc: datatable-help at lists.r-forge.r-project.org
> > > > > >> Subject: Re: [datatable-help] Can crash R with a data.table query
> > > > > >>
> > > > > >> What you mean by 'crash'? R simply stops or theres a message?
> > > > > >> Try the clean install of latest 1.5, as per recent reply on
> > > > > >> other thread, and can go from there...
> > > > > >>
> > > > > >> > Hi,
> > > > > >> >
> > > > > >> > I am crashing R with the following code (and it might have something
> > > > > >> > to do with data tables as well):
> > > > > >> >
> > > > > >> > =========
> > > > > >> >
> > > > > >> >
> > > > > >> > DT <- structure(list(A = c(25L, 85L, 25L, 25L, 85L), B =
> > > > > >> > structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("a", "b", "c"),
> > > > > >> > class = "factor"),
> > > > > >> >     C = c(2L, 65L, 9L, 82L, 823L)), .Names = c("A", "B", "C"),
> > > > > >> > class = c("data.table", "data.frame"), row.names = c(NA, -5L))
> > > > > >> >
> > > > > >> > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ],
> > > > > >> > by=B ]
> > > > > >> >
> > > > > >> > =========
> > > > > >> >
> > > > > >> > For every B, I am trying to sum the C's where A is 25 and 85.
> > > > > >> >
> > > > > >> > The crash has something to do with my row selection criteria.  First,
> > > > > >> > note that for B=="b", I don't have A==85.  It looks like a numeric(0)
> > > > > >> > is being returned in this case.
> > > > > >> >
> > > > > >> > In order to avoid the crash, I had to do something like:
> > > > > >> >    if ( ! identical( DT[ blah ], numeric( 0 ) )
> > > > > >> >
> > > > > >> > It isn't just that R is unable to handle operations on numeric(0)
> > > > > >> > because I don't get a crash when I just type in "numeric(0) + 2".  So,
> > > > > >> > my guess is that it has something to do with data.table as well.
> > > > > >> >
> > > > > >> >
> > > > > >> > Harish
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > >
> > _______________________________________________
> > > > > >> > datatable-help mailing list
> > > > > >> > datatable-help at lists.r-forge.r-project.org
> > > > > >> >
> > > > > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable
> > > > > >> > -help
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > >       
> > > 
> > > 
> > > 
> > 
> > 
> >       
> >
> 
> 
> 
>       



