[datatable-help] Can crash R with a data.table query

Matthew Dowle mdowle at mdowle.plus.com
Fri Jul 9 05:35:14 CEST 2010


Crash bug fixed (#983 reported by Harish, thanks). Tests 171 and 172
added.

If one or more columns of the j evaluate to length > 0, then any
zero-length columns are replaced with an NA vector as long as the
longest column of the j.  That's pretty clear.
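
As a rough illustration of that rule in base R (a hypothetical helper,
pad_zero_length, written only for this example; it is not the code
inside data.table and only assumes each column is an atomic vector):

```r
# Hypothetical helper, NOT data.table internals: pad zero-length columns
# of a j result with NA up to the length of the longest column.
pad_zero_length <- function(cols) {
  n <- max(vapply(cols, length, integer(1)))
  if (n == 0L) return(cols)  # every column is empty: nothing is padded
  lapply(cols, function(x) {
    # Indexing past the end of an empty vector yields NA of the same type.
    if (length(x) == 0L) x[seq_len(n)] else x
  })
}

# One column of length 1 and one empty column, as for a group with no match.
padded <- pad_zero_length(list(V1 = 3, V2 = integer(0)))
str(padded)
```

The early return for n == 0 mirrors the case where every column is
empty and nothing gets padded.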

If all columns in the j have zero length, however, then they are not
replaced with a single NA row, at the moment at least, unfortunately. I
couldn't get that to work because putting NAs there stops other nice
features working, which I know several users depend on, for example:

	DT[, date[date-min(date)<7], by=var1]

Happy to discuss further and come up with some solution. Maybe we need a
new parameter. Harish, how did you get on with the alternatives using a
join rather than by?

Here are the current results :

> DT
      A B   C
[1,] 25 a   2
[2,] 85 a  65
[3,] 25 b   9
[4,] 25 c  82
[5,] 85 c 823

> DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ], by=B ]
     B  V1
[1,] a  67
[2,] c 905

> DT[ , list(3, data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ]), by=B ]
     B V1  V2
[1,] a  3  67
[2,] b  3  NA
[3,] c  3 905
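
To see where the empty result for group "b" comes from, here is a
minimal base R sketch. It uses plain vectors rather than data.table, so
it says nothing about what dogroups.c does internally; per_group is just
an illustrative name.

```r
# Same data as DT above, held as plain vectors (illustration only).
A <- c(25L, 85L, 25L, 25L, 85L)
B <- c("a", "a", "b", "c", "c")
C <- c(2L, 65L, 9L, 82L, 823L)

# Evaluate the j expression once per group of B.
per_group <- lapply(split(seq_along(B), B), function(i) {
  a <- A[i]
  cc <- C[i]
  cc[a == 25L] + cc[a == 85L]
})

# Group "b" has no A==85 row, so its sum is 9L + integer(0),
# which R evaluates to a zero-length integer vector.
sapply(per_group, length)
```

Groups "a" and "c" each give one value, while "b" gives a zero-length
vector. That is the case that is dropped when it is the only column of
the j and shown as NA when another column has length > 0.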

Matthew


On Thu, 2010-07-01 at 09:21 -0700, Harish wrote:
> Tom and Matthew -- Thanks for confirming the issue.
> 
> I had to pull out each number (i.e. A==85 and A==25) separately because the real computation I had to do is not associative -- involves division, etc.  So the other approaches you suggested won't quite work.
> 
> I think that returning NA is quite acceptable and preferred; it is better than having the row missing.  It provides an opportunity for the person analyzing the data to realize that something was amiss (i.e. A==85 was missing for B=="b" in example).  It is also consistent with reshaping the data table by having the A's as columns where we would get NAs for missing data.  Then performing the same computation will give an NA.
> 
> Regards,
> Harish
> 
> 
> --- On Thu, 7/1/10, mdowle at mdowle.plus.com <mdowle at mdowle.plus.com> wrote:
> 
> > From: mdowle at mdowle.plus.com <mdowle at mdowle.plus.com>
> > Subject: RE: [datatable-help] Can crash R with a data.table query
> > To: "Short, Tom" <TShort at epri.com>
> > Cc: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>, datatable-help at lists.r-forge.r-project.org
> > Date: Thursday, July 1, 2010, 5:43 AM
> > 
> > I see that too now. It'll be inside dogroups.c. Harish - can you add
> > it as a bug to the tracker please, good spot.  What should the result
> > be though?  No rows for group "b", or NA?  The way the j is
> > constructed it can't be 9.
> > 
> > Other ways to do that :
> > 
> > DT[A%in%c(25,85),sum(C),by=B]  # ok
> >      B  V1
> > [1,] a  67
> > [2,] b   9
> > [3,] c 905
> > 
> > DT[,.SD[A%in%c(85,25),sum(C)],by=B]  # ok
> >      B  V1
> > [1,] a  67
> > [2,] b   9
> > [3,] c 905
> > 
> > DT[,.SD[A==25,C]+.SD[A==85,C],by=B] # crash too
> > 
> > > setkey(DT,A)
> > > DT[J(c(25,85)),sum(C),by=B,mult="all"]  # ok, likely fastest
> >      B  V1
> > [1,] a  67
> > [2,] b   9
> > [3,] c 905
> > 
> > 
> > 
> > > That crashes R for me, too, somewhere in data.table.dll.
> > >
> > > - Tom
> > >
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: datatable-help-bounces at lists.r-forge.r-project.org
> > >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > >> On Behalf Of mdowle at mdowle.plus.com
> > >> Sent: Thursday, July 01, 2010 05:33
> > >> To: Harish
> > >> Cc: datatable-help at lists.r-forge.r-project.org
> > >> Subject: Re: [datatable-help] Can crash R with a data.table query
> > >>
> > >> What do you mean by 'crash'? Does R simply stop, or is there a
> > >> message? Try a clean install of the latest 1.5, as per the recent
> > >> reply on the other thread, and we can go from there...
> > >>
> > >> > Hi,
> > >> >
> > >> > I am crashing R with the following code (and it might have something
> > >> > to do with data tables as well):
> > >> >
> > >> > =========
> > >> >
> > >> >
> > >> > DT <- structure(list(A = c(25L, 85L, 25L, 25L, 85L),
> > >> >     B = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("a", "b", "c"),
> > >> >         class = "factor"),
> > >> >     C = c(2L, 65L, 9L, 82L, 823L)), .Names = c("A", "B", "C"),
> > >> >     class = c("data.table", "data.frame"), row.names = c(NA, -5L))
> > >> >
> > >> > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ],
> > >> >     by=B ]
> > >> >
> > >> > =========
> > >> >
> > >> > For every B, I am trying to sum the C's where A is 25 and 85.
> > >> >
> > >> > The crash has something to do with my row selection criteria.  First,
> > >> > note that for B=="b", I don't have A==85.  It looks like a numeric(0)
> > >> > is being returned in this case.
> > >> >
> > >> > In order to avoid the crash, I had to do something like:
> > >> >    if ( ! identical( DT[ blah ], numeric( 0 ) ) )
> > >> >
> > >> > It isn't just that R is unable to handle operations on numeric(0)
> > >> > because I don't get a crash when I just type in "numeric(0) + 2".  So,
> > >> > my guess is that it has something to do with data.table as well.
> > >> >
> > >> >
> > >> > Harish
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > _______________________________________________
> > >> > datatable-help mailing list
> > >> > datatable-help at lists.r-forge.r-project.org
> > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >> >
> > >>
> > >>
> > >>
> > >
> > 
> > 
> > 
> 
> 
>       



