[datatable-help] Can crash R with a data.table query

Harish harishv_99 at yahoo.com
Wed Jul 14 05:35:12 CEST 2010


I'm sold.  The nonull function is easy enough to create and is acceptable.
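
For reference, a minimal base-R sketch of such a helper. Only the name nonull and its intent (turn an empty result into NAs) come from this thread; the body below is an assumption, written for plain vectors and data.frames and not checked against data.table internals.

```r
# Hypothetical helper: only the name and intent come from the thread.
nonull <- function(x) {
  if (is.null(x)) return(NA)                       # NULL -> a single logical NA
  if (is.data.frame(x)) {
    if (nrow(x) == 0L) return(x[NA_integer_, , drop = FALSE])  # one all-NA row
    return(x)
  }
  if (length(x) == 0L) return(x[NA_integer_])      # e.g. numeric(0) -> NA_real_
  x
}

# Same values as the DT in this thread, held in a plain data.frame.
DF <- data.frame(A = c(25L, 85L, 25L, 25L, 85L),
                 B = c("a", "a", "b", "c", "c"),
                 C = c(2L, 65L, 9L, 82L, 823L))

# Per-group C at A==25 plus C at A==85; group "b" has no A==85 row.
res <- sapply(split(DF, DF$B),
              function(g) nonull(g$C[g$A == 25] + g$C[g$A == 85]))
res   # a = 67, b = NA, c = 905
```

Without the nonull() wrapper, group "b" would yield integer(0) rather than a visible NA.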

I just wanted to pitch my thoughts out and get some perspectives.

Thanks,
Harish


--- On Tue, 7/13/10, Short, Tom <TShort at epri.com> wrote:

> From: Short, Tom <TShort at epri.com>
> Subject: RE: [datatable-help] Can crash R with a data.table query
> To: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>
> Cc: datatable-help at lists.r-forge.r-project.org
> Date: Tuesday, July 13, 2010, 5:07 AM
> 
> Matthew/Harish, I like the existing functionality (no NA's). I'd also
> prefer Matthew's nonull function idea to the nonullj option. The number
> of options to [.data.table is high, and I don't think this warrants
> another.
> 
> - Tom
>  
> 
> > -----Original Message-----
> > From: datatable-help-bounces at lists.r-forge.r-project.org
> > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > On Behalf Of Matthew Dowle
> > Sent: Monday, July 12, 2010 20:11
> > To: Harish
> > Cc: datatable-help at lists.r-forge.r-project.org
> > Subject: Re: [datatable-help] Can crash R with a data.table query
> > 
> > Harish,
> > 
> > You're right about the example, thanks. It was a typo. '<'
> > should have been '>':
> > 
> >    DT[, date[date-min(date)>7], by=var1]
> > 
> > That may not return data for all groups.
> > 
> > Basically, the way I think about 'by', and the way I explain
> > it sometimes, is that it does this general form:
> > 
> > dfsplit = split(iris,iris$Species)
> > do.call("rbind",lapply(dfsplit,function(SUBSET)with(SUBSET,data.frame(Species[1],mean(Sepal.Length)))))
> > 
> > That's not a good example because every item of the result of
> > lapply will have data.  Imagine an example where not all
> > groups return data though.
> > Then the do.call("rbind",...) will collapse it all together
> > and the no-data groups will be gone.  I'm not saying at all
> > that data.table 'by' should be the same, but just that it is
> > the same at the moment and how I think about it.
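
A small base-R sketch of the "not all groups return data" case, using the same split/lapply/rbind form. The Sepal.Length > 7 filter is an assumption, chosen only so that some species return zero rows.

```r
# Split iris by species, then keep only the rows that pass a per-group filter.
dfsplit <- split(iris, iris$Species)
pieces <- lapply(dfsplit, function(SUBSET)
  with(SUBSET, data.frame(Species, Sepal.Length)[Sepal.Length > 7, ]))

sapply(pieces, nrow)               # setosa and versicolor contribute 0 rows

# rbind collapses the pieces; groups with no rows simply disappear.
res <- do.call("rbind", pieces)
unique(as.character(res$Species))  # only "virginica" is left
```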
> > 
> > How do by(), doBy() and plyr treat this ?
> > 
> > The issue basically is that the j in your example :
> > 
> >    DT[ , .SD[ A==25, C ] + .SD[ A==85, C ], by=B ]
> > 
> > can return NULL, or more specifically a data.table with no rows.
> > That may be a deliberate thing the programmer wants to do,
> > and a nice feature.
> > 
> > I'm not really comfortable with the sub-queries inside the j.
> > We would usually join in the i, or match in the j.
> > 
> > Another way might be to create a function nonull which
> > replaces 0 row data.tables with a single row of NAs. That
> > could be added to data.table.
> > 
> > DT[ , nonull(.SD[A==25,C] + .SD[A==85,C]), by=B ]
> > 
> > but I realise the discussion is about defaults, so this post
> > was to add to the discussion, not at all to close it. My
> > preferred option currently is to add an option 'nonullj'
> > which does that, default FALSE. You would prefer TRUE by
> > default, I think.
> > 
> > I suspect Tom may be able to see the wood through these trees,
> > hopefully.
> > 
> > Matthew
> > 
> > 
> > 
> > 
> > On Sun, 2010-07-11 at 02:04 -0700, Harish wrote:
> > > I was thinking more about this and I am unsure how the
> > > use-case you mentioned will break if NAs are returned when
> > > length(j)==0.
> > >    DT[, date[date-min(date)<7], by=var1]
> > > 
> > > In the above code, you will at least have one date selected
> > > for each var1 since min(date) - min(date) is always < 7.  So
> > > when does length(j) equal 0 for the proposed code returning
> > > NA to even get triggered?
> > > 
> > > My point was that we should have all the "by" variable
> > > values represented in the output.  So in your example above,
> > > if var1 was c("A","B","C"), then the result must have at
> > > least 3 rows with at least 1 row for each var1.  If no values
> > > are selected for whatever reason for var1=="B", then NA is returned.
> > > 
> > > To take this a little further:
> > >    DT[ blah1, blah2, by=list(var1,var2)]
> > > Now suppose var1 and var2 were:
> > >    var1   var2
> > >    A      x
> > >    A      x
> > >    A      y
> > >    B      x
> > > 
> > > In the above case, the output will have three rows at least:
> > >    var1   var2
> > >    A      x
> > >    A      y
> > >    B      x
> > > and it need not have (B,y) since that does not even exist
> > > in the data.
> > > 
> > > If I did not select any values for (B,x) because of a row
> > > filter, I am proposing that I get an NA for all values that
> > > cannot be computed in that row.
> > > 
> > > I suppose I am not understanding how this is implemented
> > > because I see the example that you mentioned to be very
> > > different from what I am talking about.
> > > 
> > > Thanks for being so patient.
> > > 
> > > 
> > > Regards,
> > > Harish
> > > 
> > > 
> > > --- On Fri, 7/9/10, Harish <harishv_99 at yahoo.com> wrote:
> > > 
> > > > From: Harish <harishv_99 at yahoo.com>
> > > > Subject: Re: [datatable-help] Can crash R with a data.table query
> > > > To: mdowle at mdowle.plus.com
> > > > Cc: datatable-help at lists.r-forge.r-project.org
> > > > Date: Friday, July 9, 2010, 10:01 PM
> > > > 
> > > > Thanks for the fix.  I did use a workaround to perform the same
> > > > computation; thanks.
> > > > 
> > > > I think that if data.tables returned NA's for all cases -- even when
> > > > length(j)==0, we will easily be able to accomplish all our goals:
> > > >    1) Conveniently remove rows with NA's in some cases -- Use
> > > >       complete.cases(DT)
> > > >    2) Be informed about missing data -- NAs are propagated during
> > > >       computations and are easy to detect.
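
A short base-R illustration of these two goals, assuming a grouped result that carries an NA for group "b". The data.frame below is typed in by hand for illustration; it is not output produced by data.table.

```r
# Hypothetical grouped result in which group "b" could not be computed.
res <- data.frame(B = c("a", "b", "c"), V1 = c(67L, NA, 905L))

# Goal 1: drop the incomplete rows when they are not wanted.
complete <- res[complete.cases(res), ]

# Goal 2: the NA stays visible and propagates through later computations.
anyNA(res$V1)          # TRUE
total <- sum(res$V1)   # NA, because na.rm = FALSE is the default
```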
> > > > 
> > > > Also, a parameter can be used in case Goal
> #1 (above) is not met 
> > > > with complete.cases() efficiently.
> > > > 
> > > > I think the behavior of not returning NA's when length(j) == 0 might
> > > > cause missing data to be overlooked.
> > > > 
> > > > In my opinion, the default behavior -- in case a parameter is used
> > > > -- should be the "safe" scenario where the fact that data are
> > > > missing is mentioned (just like na.rm=FALSE by default for a lot of
> > > > the functions).  This prevents the analyst from unknowingly
> > > > proceeding with subsets of data or inaccurate data.  Such errors
> > > > will be hard to find with large and complex data sets.  Returning NA
> > > > will always ensure that the NA is propagated -- therefore making it
> > > > easier to catch the issue after a lot of computation.
> > > > 
> > > > Would love to hear other perspectives.
> > > > 
> > > > 
> > > > Regards,
> > > > Harish
> > > > 
> > > > 
> > > > --- On Thu, 7/8/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > > > 
> > > > > From: Matthew Dowle <mdowle at mdowle.plus.com>
> > > > > Subject: RE: [datatable-help] Can crash R with a data.table query
> > > > > To: "Harish" <harishv_99 at yahoo.com>
> > > > > Cc: datatable-help at lists.r-forge.r-project.org
> > > > > Date: Thursday, July 8, 2010, 8:35 PM
> > > > > 
> > > > > Crash bug fixed (#983 reported by Harish, thanks). Tests
> > > > > 171 and 172 added.
> > > > > 
> > > > > If one or more columns of the j evaluate to length>0,
> > > > > then any zero length columns are replaced with an NA vector
> > > > > with the length of the longest column of the j.  That's pretty clear.
> > > > > 
> > > > > If all columns in the j have zero length however, then it
> > > > > is not replaced with a single NA row, at the moment at least
> > > > > unfortunately. I couldn't get that to work because putting NAs
> > > > > there stops other nice features working, which I know several
> > > > > users depend on, for example:
> > > > > 
> > > > >     DT[, date[date-min(date)<7], by=var1]
> > > > > 
> > > > > Happy to discuss further and come up with some solution.
> > > > > Maybe we need a new parameter. How did you get on, Harish,
> > > > > with the alternatives using a join rather than by?
> > > > > 
> > > > > Here are the current results :
> > > > > 
> > > > > > DT
> > > > >       A B   C
> > > > > [1,] 25 a   2
> > > > > [2,] 85 a  65
> > > > > [3,] 25 b   9
> > > > > [4,] 25 c  82
> > > > > [5,] 85 c 823
> > > > > 
> > > > > > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ], by=B ]
> > > > >      B  V1
> > > > > [1,] a  67
> > > > > [2,] c 905
> > > > > 
> > > > > > DT[ , list(3,data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ]), by=B ]
> > > > >      B V1  V2
> > > > > [1,] a  3  67
> > > > > [2,] b  3  NA
> > > > > [3,] c  3 905
> > > > > 
> > > > > Matthew
> > > > > 
> > > > > 
> > > > > On Thu, 2010-07-01 at 09:21 -0700, Harish wrote:
> > > > > > Tom and Matthew -- Thanks for confirming the issue.
> > > > > > 
> > > > > > I had to pull out each number (i.e. A==85 and A==25)
> > > > > > separately because the real computation I had to do is not
> > > > > > associative -- involves division, etc.  So the other approaches
> > > > > > you suggested won't quite work.
> > > > > > 
> > > > > > I think that returning NA is quite acceptable and
> > > > > > preferred; it is better than having the row missing.
> > > > > > It provides an opportunity for the person analyzing the data
> > > > > > to realize that something was amiss (i.e. A==85 was missing
> > > > > > for B=="b" in example).  It is also consistent with reshaping the
> > > > > > data table by having the A's as columns where
> > > > > > we would get NAs for missing data.  Then performing the
> > > > > > same computation will give an NA.
> > > > > > 
> > > > > > Regards,
> > > > > > Harish
> > > > > > 
> > > > > > 
> > > > > > --- On Thu, 7/1/10, mdowle at mdowle.plus.com <mdowle at mdowle.plus.com> wrote:
> > > > > > 
> > > > > > > From: mdowle at mdowle.plus.com <mdowle at mdowle.plus.com>
> > > > > > > Subject: RE: [datatable-help] Can crash R with a data.table query
> > > > > > > To: "Short, Tom" <TShort at epri.com>
> > > > > > > Cc: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>, datatable-help at lists.r-forge.r-project.org
> > > > > > > Date: Thursday, July 1, 2010, 5:43 AM
> > > > > > > 
> > > > > > > I see that too now. It'll be inside dogroups.c. Harish -
> > > > > > > can you add it as a bug to the tracker please, good spot.  What
> > > > > > > should the result be though?  No rows for group "b", or
> > > > > > > NA?  The way the j is constructed it can't be 9.
> > > > > > > 
> > > > > > > Other ways to do that:
> > > > > > > 
> > > > > > > > DT[A%in%c(25,85),sum(C),by=B]  # ok
> > > > > > >      B  V1
> > > > > > > [1,] a  67
> > > > > > > [2,] b   9
> > > > > > > [3,] c 905
> > > > > > > 
> > > > > > > > DT[,.SD[A%in%c(85,25),sum(C)],by=B]  # ok
> > > > > > >      B  V1
> > > > > > > [1,] a  67
> > > > > > > [2,] b   9
> > > > > > > [3,] c 905
> > > > > > > 
> > > > > > > > DT[,.SD[A==25,C]+.SD[A==85,C],by=B] # crash too
> > > > > > > 
> > > > > > > > setkey(DT,A)
> > > > > > > > DT[J(c(25,85)),sum(C),by=B,mult="all"] # ok, likely fastest
> > > > > > >      B  V1
> > > > > > > [1,] a  67
> > > > > > > [2,] b   9
> > > > > > > [3,] c 905
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > That crashes R for me, too, somewhere in
> > > > > > > > data.table.dll.
> > > > > > > >
> > > > > > > > - Tom
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >> -----Original Message-----
> > > > > > > >> From: datatable-help-bounces at lists.r-forge.r-project.org
> > > > > > > >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > > > > > > >> On Behalf Of mdowle at mdowle.plus.com
> > > > > > > >> Sent: Thursday, July 01, 2010 05:33
> > > > > > > >> To: Harish
> > > > > > > >> Cc: datatable-help at lists.r-forge.r-project.org
> > > > > > > >> Subject: Re: [datatable-help] Can crash R with a data.table query
> > > > > > > >>
> > > > > > > >> What do you mean by 'crash'? R simply stops, or there's a message?
> > > > > > > >> Try the clean install of latest 1.5, as per recent reply on
> > > > > > > >> other thread, and can go from there...
> > > > > > > >>
> > > > > > > >> > Hi,
> > > > > > > >> >
> > > > > > > >> > I am crashing R with the following code (and it might have
> > > > > > > >> > something to do with data tables as well):
> > > > > > > >> >
> > > > > > > >> > =========
> > > > > > > >> >
> > > > > > > >> > DT <- structure(list(A = c(25L, 85L, 25L, 25L, 85L), B =
> > > > > > > >> > structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("a", "b", "c"),
> > > > > > > >> > class = "factor"),
> > > > > > > >> >     C = c(2L, 65L, 9L, 82L, 823L)), .Names = c("A", "B", "C"),
> > > > > > > >> > class = c("data.table", "data.frame"), row.names = c(NA, -5L))
> > > > > > > >> >
> > > > > > > >> > DT[ , data.table( A, C )[ A==25, C ] + data.table( A, C )[ A==85, C ],
> > > > > > > >> > by=B ]
> > > > > > > >> >
> > > > > > > >> > =========
> > > > > > > >> >
> > > > > > > >> > For every B, I am trying to sum the C's where A is 25 and 85.
> > > > > > > >> >
> > > > > > > >> > The crash has something to do with my row selection criteria.  First,
> > > > > > > >> > note that for B=="b", I don't have A==85.  It looks like a numeric(0)
> > > > > > > >> > is being returned in this case.
> > > > > > > >> >
> > > > > > > >> > In order to avoid the crash, I had to do something like:
> > > > > > > >> >    if ( ! identical( DT[ blah ], numeric( 0 ) ) )
> > > > > > > >> >
> > > > > > > >> > It isn't just that R is unable to handle operations on numeric(0)
> > > > > > > >> > because I don't get a crash when I just type in "numeric(0) + 2".  So,
> > > > > > > >> > my guess is that it has something to do with data.table as well.
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > Harish
> > > > > > > >> >
> > > > > > > >> > _______________________________________________
> > > > > > > >> > datatable-help mailing list
> > > > > > > >> > datatable-help at lists.r-forge.r-project.org
> > > > > > > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > >
> _______________________________________________
> > > > > > > >> datatable-help
> mailing list 
> > > > > > > >> datatable-help at lists.r-forge.r-project.org
> > > > > > > >> 
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinf
> > > > > > > >> o/d
> > > > > > > > atatable-help
> > > > > > > >>
> > > > > > > >
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > >       
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > >       
> > > >
> > > 
> > > 
> > > 
> > >       
> > 
> > 
> > 
> 


      

