[datatable-help] Subsetting that behaves right for both data frames and data.tables?

Matthew Dowle mdowle at mdowle.plus.com
Wed Jul 20 16:42:03 CEST 2011


Thanks, makes sense. Yes, as.data.frame.data.table currently removes the
'sorted' attribute, which is all a key is. I suppose that line could be
removed so the key would be left on the data.frame.  You would then need
to change the class back to data.table at the end of the function, though,
and make sure you didn't change the order of the rows otherwise that key
would be invalid.
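
A minimal sketch of what the key is, assuming a current data.table is
installed; the table and its column names are made up for illustration :

```r
library(data.table)

DT <- data.table(id = c("b", "a", "c"), x = 1:3)
setkey(DT, id)              # sorts the rows by id and marks id as the key

key(DT)                     # the key column name
attr(DT, "sorted")          # the same information, stored as an attribute

DF <- as.data.frame(DT)     # plain data.frame copy of DT
attr(DF, "sorted")          # the attribute is dropped by the conversion
```

So a function that converts its input this way hands back something
without a key, whatever it does afterwards.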

However, the packages I use themselves depend on other packages that I
don't use directly and know nothing about, so I don't see the issue with
an extra dependency. Disk space? Memory space? The banner?

There is also this related FR :
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=984&group_id=240&atid=978

Just to check you know that the result of j in data.table can happily be a
data.frame?  So if your user is using data.table to call your function, he
won't mind. If he's passing the entire data.table to your function, then
he's not going to be wanting to retain the key anyway. You're returning
some statistical result to him (not the original data back) so why does the
key make sense to retain?

The functional idiom you're showing is one of the things I don't like
about data.frame in R.  It's one of the reasons the syntax in data.table
is different. I'll translate it with comments to what really happens :

MyFunc <- function(data, numerator.var, denominator.var)
{
  data <- data[order(        # reorder the rows of every column of the data
  data[, numerator.var]), ]  # copy one column to a new vector
  data$metric <-             # copy all of 'data' (doesn't just add a column)
  data[, numerator.var] /    # new copy of vector
  data[, denominator.var]    # new copy of vector
  data$cum.metric <-         # copy all of 'data' again, and
                             # lock user into your choice of column name
  cumsum(data$metric)        # new copy of metric vector (not sure)
  return(data)               # finally, gosh, I'm worn out after all that
}

contrast to this :

MyFunc <- function(numerator,denominator)
{
     o = order(numerator)
     cumsum((numerator/denominator)[o])
     # oh, that's what it does!
}

A data.table user would call the latter like this :

    DT[,MyFunc(colA,colB)]

So there aren't any copies of the columns going on because colA and colB
are vectors right there, and it's much faster.  Or the user can do :

    DT[,MyFunc(colA,colB),by=grp]

and that saves you adding a grouping variable to MyFunc.
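
Those calls can be tried end to end with something like the following.
This is only a sketch: DT, grp, colA and colB are made-up example data,
and MyFunc is the vector version with its argument names spelled
consistently :

```r
library(data.table)

# Vector in, vector out: no data.frame or data.table syntax inside.
MyFunc <- function(numerator, denominator) {
  o <- order(numerator)
  cumsum((numerator / denominator)[o])
}

DT <- data.table(grp  = c("a", "a", "b", "b"),
                 colA = c(2, 1, 4, 3),
                 colB = c(4, 2, 8, 6))

whole  <- DT[, MyFunc(colA, colB)]             # one vector for the whole table
by_grp <- DT[, MyFunc(colA, colB), by = grp]   # one result per group, in column V1

# The same function also serves a plain data.frame user.
DF <- as.data.frame(DT)
whole_df <- with(DF, MyFunc(colA, colB))
```

Because MyFunc only ever sees vectors, it needs no knowledge of which
class the caller's data has.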

Or, if MyFunc is already locked into accepting a data.frame, the
data.table user can (and does) use it like this :

    DT[,MyFunc(data=.SD,"colA","colB"),by=grp]

and it doesn't matter that the j comes back as data.frame, that's still a
list which is fine to j.  Of course it's less efficient, because
the data is being copied and added to, but the inefficiency is up in
MyFunc. The data.table user might decide to take MyFunc, chop out all the
inefficiency and keep just the bits that do the real work.
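
A sketch of that usage, under some assumptions: the function is renamed
MyFuncDF here (a hypothetical name) and coerces its input with
as.data.frame() first, so that the data.frame syntax inside applies
whatever class is passed in. The example data is made up :

```r
library(data.table)

# Hypothetical data.frame-style function; the coercion on the first line
# is an addition for this sketch so the body can use data.frame syntax.
MyFuncDF <- function(data, numerator.var, denominator.var) {
  data <- as.data.frame(data)
  data <- data[order(data[, numerator.var]), ]
  data$metric <- data[, numerator.var] / data[, denominator.var]
  data$cum.metric <- cumsum(data$metric)
  data
}

DT <- data.table(grp  = c("a", "a", "b", "b"),
                 colA = c(2, 1, 4, 3),
                 colB = c(4, 2, 8, 6))

# .SD holds the non-grouping columns of each group; the data.frame that
# comes back is treated by j as a list of columns.
res <- DT[, MyFuncDF(data = .SD, "colA", "colB"), by = grp]
```

The copies still happen inside MyFuncDF, once per group, which is the
inefficiency referred to above.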

Strictly, your MyFunc 'returned' two columns, so it might be
written like this :

MyFunc <- function(numerator,denominator)
{
     o = order(numerator)
     data.frame(numerator[o], cumsum((numerator/denominator)[o]))
}

Then the user can decide if he wants to cbind it to his data.frame, or
fast assign it into a data.table,  or by group,  or whatever.  That seems
to me to be up to your user.  Perhaps the job of MyFunc is to return its
output given the input (and that's all).
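
For instance, a plain data.frame user might attach the result as below.
A sketch only: the column names given to the result (numerator,
cum.metric) and the example data are choices made for illustration :

```r
# Two-column version; the result is in sorted order of the numerator.
MyFunc <- function(numerator, denominator) {
  o <- order(numerator)
  data.frame(numerator  = numerator[o],
             cum.metric = cumsum((numerator / denominator)[o]))
}

DF <- data.frame(colA = c(2, 1, 3), colB = c(4, 2, 6))

res <- with(DF, MyFunc(colA, colB))

# The caller decides how to combine it. Here the rows of DF are put in
# the same order as the result before binding the new column on.
combined <- cbind(DF[order(DF$colA), ], cum.metric = res$cum.metric)
```

Note that the result follows the sorted order, so the caller has to line
the rows up, as done with order() above, before binding.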

Writing quickly, probably with errors and typos. There are many ways to do
things, and the above is just one way. Maybe a more complicated example from
you is needed, please, for me to see.  My main concern is efficiency on large
datasets; passing the large dataset into a function for it to be copied
and copied just isn't a good idiom as far as I can see. That's why in
data.table the idea is to pass functions the columns themselves within the
scope of the data.table, i.e. call the function in j.

Matthew


> Mainly it is that I am writing some library functions that I and a few
> others may be using. I don't want those functions to have to depend on
> data.table because I don't want it to need to be installed for a purpose
> that has nothing to do with it. But I use data.tables as input. Here is a
> pseudo example
>
> MyFunc <- function(data, numerator.var, denominator.var)
> {
>   data <- data[order(data[,numerator.var]), ]
>   data$metric <- data[, numerator.var] / data[, denominator.var]
>   data$cum.metric <- cumsum(data$metric)
>
>   return(data)
> }
>
> I make this example to show that I need to preserve the whole data variable
> the whole way through and return a modified version.  If I do
>
> data <- as.data.frame(data)
>
> as the first line of that function, then I lose the keys in a potential
> data.table that is passed in.  If I use
>
> data <- as.data.table(data)
>
> and change the subsetting to be data.table compliant, then I am forcing
> someone to have a whole package loaded for something that can be done in
> the base language fine. There must be an agnostic way to do this. Apparently
> subset doesn't do it either if keys get lost.
>
> -Chris
>
> On 20 July 2011 08:48, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>> Hi Chris,
>>
>> If you're writing a package and don't want to worry if someone passes your
>> package a data.table, then don't worry; just use data.frame syntax and
>> your non-datatable-aware package will work fine.
>>
>> If you're writing your own code you're in control of, just embrace the
>> data.table ;)
>>
>> If you're writing a function in an environment which is data.table aware,
>> but you want your function to accept either data.frame or data.table, then
>> at the beginning of your function just do :
>>
>> f = function(x) {
>>    x = as.data.table(x)
>>    # proceed with data.table syntax
>> }
>>
>> or
>>
>> f = function(x) {
>>    x = as.data.frame(x)
>>    # proceed with data.frame syntax
>> }
>>
>> Some of the CRAN packages that depend on data.table are doing that, I
>> think.
>>
>> In R itself it is common practice to coerce arguments to a common type and
>> then proceed with the appropriate syntax for that type.  Consider that
>> matrix syntax is different syntax to data.frame syntax. You often see
>> as.classiwant() at the beginning of functions, or switches depending on
>> the type of object.
>>
>> Remember that is.data.frame() is TRUE for both data.frame and data.table,
>> but is.data.table() is TRUE only for data.table.  as.data.table() does
>> nothing if x is already a data.table, and is an efficient class change if
>> x is a data.frame.  Is efficiency the issue?
>>
>> Does that help?  If not, more info about the problem will be needed
>> please.
>>
>> Matthew
>>
>>
>> > I'm used to seeing the column names at the bottom of the column too,
>> but
>> > that is only if the data.table is long enough. My example was too
>> short
>> > for
>> > that, so I made the same sort of mistake you did :(
>> >
>> > Okay, that is a way, but is it a good way? Not sure...
>> >
>> > 2011/7/20 Timothée Carayol <timothee.carayol at gmail.com>
>> >
>> >> Sorry my mistake -- subset does return a data.table.
>> >> (I was using as an example a data.table with 100 rows, and stupidly
>> >> using
>> >> the fact that it printed the whole thing rather than the 10 first
>> rows
>> >> only
>> >> as my criterion for whether it worked or not.. Omitting that
>> >> print.data.table does print up to 100 rows. I feel a bit stupid.)
>> >>
>> >> Why doesn't it work for you if that is the case?
>> >>
>> >> DF <- data.frame(a=1:200, b=1:10)
>> >> DT <- as.data.table(DF)
>> >> subDT <- subset(DT, select=a)
>> >> class(subDT)
>> >> subDF <- subset(DF, select=a)
>> >> class(subDF)
>> >> identical(as.data.frame(DT), DF)
>> >>
>> >>
>> >>
>> >> On Wed, Jul 20, 2011 at 12:50 PM, Chris Neff <caneff at gmail.com>
>> wrote:
>> >>
>> >>> Yeah I realized that myself.
>> >>>
>> >>> Another one: the function "with" doesn't seem to do what I want...
>> but
>> >>> at
>> >>> least it is consistent!
>> >>>
>> >>>
>> >>> 2011/7/20 Timothée Carayol <timothee.carayol at gmail.com>
>> >>>
>> >>>> Sorry --
>> >>>>
>> >>>> subset() was a poor idea, as it will return a data.frame even if
>> the
>> >>>> argument is a data.table..
>> >>>>
>> >>>>
>> >>>>
>> >>>> 2011/7/20 Timothée Carayol <timothee.carayol at gmail.com>
>> >>>>
>> >>>>> Hi--
>> >>>>>
>> >>>>> You can use the subset() command with the select= option; not sure
>> >>>>> it's
>> >>>>> the best solution, though.
>> >>>>>
>> >>>>> Timothee
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Jul 20, 2011 at 12:26 PM, Chris Neff <caneff at gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> I have a function where I pass a data frame and some variable
>> names
>> >>>>>> to
>> >>>>>> calculate statistics on. However, I am at a loss as to how to
>> write
>> >>>>>> it
>> >>>>>> correctly so that both data.frame and data.table work with it. If
>> I
>> >>>>>> have:
>> >>>>>>
>> >>>>>> DF = data.frame(x=1:10,y=2:11,z=3:12)
>> >>>>>>
>> >>>>>> DT = data.table(DF)
>> >>>>>>
>> >>>>>> var.names = c("x","y")
>> >>>>>>
>> >>>>>>
>> >>>>>> I can do the following things to subset:
>> >>>>>>
>> >>>>>> DT[,var.names,with=FALSE]
>> >>>>>> DF[,var.names]
>> >>>>>>
>> >>>>>>
>> >>>>>> but of course DT[,var.names] won't give me back what I want, and
>> >>>>>> DF[,var.names,with=FALSE] returns an error because with doesn't
>> >>>>>> exist there.
>> >>>>>> So how do I do this?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> -Chris
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> _______________________________________________
>> >>>>>> datatable-help mailing list
>> >>>>>> datatable-help at lists.r-forge.r-project.org
>> >>>>>>
>> >>>>>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>>
>>
>



