[datatable-help] Subsetting that behaves right for both data frames and data.tables?

Chris Neff caneff at gmail.com
Wed Jul 20 16:55:02 CEST 2011


On 20 July 2011 10:42, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
> Thanks, makes sense. Yes, as.data.frame.data.table currently removes the
> 'sorted' attribute, which is all a key is. I suppose that line could be
> removed so the key would be left on the data.frame.  You would then need
> to change the class back to data.table at the end of the function, though,
> and make sure you didn't change the order of the rows otherwise that key
> would be invalid.
>
> However, packages I use, use other packages I don't use directly and know
> nothing about. I don't see the issue. Disk space? Memory space? The
> banner?
>

The issue is behaving nicely in a build environment that is more complicated
than a normal R setup.


> There is also this related FR :
>
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=984&group_id=240&atid=978
>
> Just to check you know that the result of j in data.table can happily be a
> data.frame?  So if your user is using data.table to call your function, he
> won't mind. If he's passing the entire data.table to your function, then
> he's not going to be wanting to retain the key anyway. You're returning
> some statistical result to him (not the original data back) so why does the
> key make sense to retain?
>
>
Well, I explicitly crafted an example where I return the entire data frame.
Now, in this dumb example I ruined the ordering, so the key is lost anyway.
But I think I have cases where I want to take an entire data.(table|frame),
do some processing, and return the full data.(table|frame) back like it was.

> Noticing that strictly, your MyFunc 'returned' two columns, so it might be
> written like this :
>
> MyFunc <- function(numerator, denominator)
> {
>     o = order(numerator)   # row order by increasing numerator
>     data.frame(numerator[o], cumsum((numerator/denominator)[o]))
> }
>
> Then the user can decide if he wants to cbind it to his data.frame, or
> fast assign it into a data.table,  or by group,  or whatever.  That seems
> to me to be up to your user.  Perhaps, the job of MyFunc is to return its
> output given the input (and that's all).
>


I think my issues are coming more from inexperience/uneasiness with some of
the data.table idioms still. When you list it all out like that it becomes
crystal clear though, and I think refactoring of my code is correct.  I'm
just not in the data.table mindset yet I guess.
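
A minimal sketch of what the refactored function might look like, using base
R only. The output column names and the toy data are invented for
illustration, not taken from any real code in this thread:

```r
# Hypothetical refactoring: take plain vectors, return only the new columns.
MyFunc <- function(numerator, denominator) {
  o <- order(numerator)  # row order by increasing numerator
  data.frame(
    numerator  = numerator[o],
    cum.metric = cumsum((numerator / denominator)[o])
  )
}

# Toy input (made up for this example)
DF <- data.frame(x = c(3, 1, 2), y = c(6, 4, 8))
res <- MyFunc(DF$x, DF$y)
print(res)
```

Since the function only sees vectors, it never has to know whether the
caller holds a data.frame or a data.table; the caller decides how to attach
the result to their own object.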


> Matthew
>
>
> > Mainly it is that I am writing some library functions that I and a few
> > others may be using. I don't want those functions to have to depend on
> > data.table because I don't want it to need to be installed for a purpose
> > that has nothing to do with it. But I use data.tables as input. Here is a
> > pseudo example
> >
> > MyFunc <- function(data, numerator.var, denominator.var)
> > {
> >   data <- data[order(data[, numerator.var]), ]  # reorder rows (data.frame syntax)
> >   data$metric <- data[, numerator.var] / data[, denominator.var]
> >   data$cum.metric <- cumsum(data$metric)
> >
> >   return(data)
> > }
> >
> > I make this example to show that I need to preserve the whole data
> > variable the whole way through and return a modified version.  If I do
> >
> > data <- as.data.frame(data)
> >
> > as the first line of that function, then I lose the keys in a potential
> > data.table that is passed in.  If I use
> >
> > data <- as.data.table(data)
> >
> > and change the subsetting to be data.table compliant, then I am forcing
> > someone to have a whole package loaded for something that can be done in
> > the base language fine. There must be an agnostic way to do this.
> > Apparently subset doesn't do it either if keys get lost.
> >
> > -Chris
> >
> > On 20 July 2011 08:48, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> >
> >>
> >> Hi Chris,
> >>
> >> If you're writing a package and don't want to worry if someone passes
> >> your package a data.table, then don't worry; just use data.frame syntax
> >> and your non-datatable-aware package will work fine.
> >>
> >> If you're writing your own code you're in control of, just embrace the
> >> data.table ;)
> >>
> >> If you're writing a function in an environment which is data.table
> >> aware, but you want your function to accept either data.frame or
> >> data.table, then at the beginning of your function just do :
> >>
> >> f = function(x) {
> >>    x = as.data.table(x)
> >>    # proceed with data.table syntax
> >> }
> >>
> >> or
> >>
> >> f = function(x) {
> >>    x = as.data.frame(x)
> >>    # proceed with data.frame syntax
> >> }
> >>
> >> Some of the CRAN packages that depend on data.table are doing that, I
> >> think.
> >>
> >> In R itself it is common practice to coerce arguments to a common type
> >> and then proceed with the appropriate syntax for that type.  Consider
> >> that matrix syntax is different from data.frame syntax. You often see
> >> as.classiwant() at the beginning of functions, or switches depending on
> >> the type of object.
> >>
> >> Remember that is.data.frame() is TRUE for both data.frame and
> >> data.table, but is.data.table() is TRUE only for data.table.
> >> as.data.table() does nothing if x is already a data.table, and is an
> >> efficient class change if x is a data.frame.  Is efficiency the issue?
> >>
> >> Does that help?  If not, more info about the problem will be needed
> >> please.
> >>
> >> Matthew
> >>
> >>
> >> > I'm used to seeing the column names at the bottom of the column too,
> >> > but that is only if the data.table is long enough. My example was too
> >> > short for that, so I made the same sort of mistake you did :(
> >> >
> >> > Okay, that is a way, but is it a good way? Not sure...
> >> >
> >> > 2011/7/20 Timothée Carayol <timothee.carayol at gmail.com>
> >> >
> >> >> Sorry my mistake -- subset does return a data.table.
> >> >> (I was using as an example a data.table with 100 rows, and stupidly
> >> >> using the fact that it printed the whole thing rather than the first
> >> >> 10 rows only as my criterion for whether it worked or not, forgetting
> >> >> that print.data.table does print up to 100 rows. I feel a bit stupid.)
> >> >>
> >> >> Why doesn't it work for you if that is the case?
> >> >>
> >> >> library(data.table)
> >> >> DF <- data.frame(a=1:200, b=1:10)
> >> >> DT <- as.data.table(DF)
> >> >> subDT <- subset(DT, select=a)
> >> >> class(subDT)
> >> >> subDF <- subset(DF, select=a)
> >> >> class(subDF)
> >> >> identical(as.data.frame(DT), DF)
> >> >>
> >> >>
> >> >>
> >> >> On Wed, Jul 20, 2011 at 12:50 PM, Chris Neff <caneff at gmail.com>
> >> >> wrote:
> >> >>
> >> >>> Yeah I realized that myself.
> >> >>>
> >> >>> Another one: the function "with" doesn't seem to do what I want...
> >> >>> but at least it is consistent!
> >> >>>
> >> >>>
> >> >>> 2011/7/20 Timothée Carayol <timothee.carayol at gmail.com>
> >> >>>
> >> >>>> Sorry --
> >> >>>>
> >> >>>> subset() was a poor idea, as it will return a data.frame even if
> >> >>>> the argument is a data.table..
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> 2011/7/20 Timothée Carayol <timothee.carayol at gmail.com>
> >> >>>>
> >> >>>>> Hi--
> >> >>>>>
> >> >>>>> You can use the subset() command with the select= option; not sure
> >> >>>>> it's the best solution, though.
> >> >>>>>
> >> >>>>> Timothee
> >> >>>>>
> >> >>>>>
> >> >>>>> On Wed, Jul 20, 2011 at 12:26 PM, Chris Neff <caneff at gmail.com>
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>>> I have a function where I pass a data frame and some variable
> >> >>>>>> names to calculate statistics on. However, I am at a loss as to
> >> >>>>>> how to write it correctly so that both data.frame and data.table
> >> >>>>>> work with it. If I have:
> >> >>>>>>
> >> >>>>>> DF = data.frame(x=1:10,y=2:11,z=3:12)
> >> >>>>>>
> >> >>>>>> DT = data.table(DF)
> >> >>>>>>
> >> >>>>>> var.names = c("x","y")
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> I can do the following things to subset:
> >> >>>>>>
> >> >>>>>> DT[,var.names,with=FALSE]
> >> >>>>>> DF[,var.names]
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> but of course DT[,var.names] won't give me back what I want, and
> >> >>>>>> DF[,var.names,with=FALSE] returns an error because with doesn't
> >> >>>>>> exist there.
> >> >>>>>> So how do I do this?
> >> >>>>>>
> >> >>>>>> Thanks,
> >> >>>>>> -Chris
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> _______________________________________________
> >> >>>>>> datatable-help mailing list
> >> >>>>>> datatable-help at lists.r-forge.r-project.org
> >> >>>>>>
> >> >>>>>>
> >>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> >>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>