[datatable-help] New package dataframe

Matthew Dowle mdowle at mdowle.plus.com
Fri May 25 12:34:07 CEST 2012


Tim Hesterberg <rocket <at> google.com> writes:
>
> * however, Luke Tierney is looking at that too and trying to
> change R to make those tricks unnecessary.  He made a change to
> the development version that may make the "attributes<-"(...)
> trick I use unnecessary.

Great. That's now made its way through to NEWS, and it seems to cover
names()<- too :

    R 2.15.0 patched, PERFORMANCE IMPROVEMENTS :

    * There is less copying when using primitive replacement functions
      such as 'names()', 'attr()' and 'attributes()'.

Will look forward to testing that out, and maybe we can simplify some of
data.table too. Hopefully they won't copy the data.frame at all?
Assignment by reference in data.table is about avoiding even a single
copy. If names()<-, attr()<- and attributes()<- still copy the
data.frame, even just once, then the set* functions in data.table remain
infinitely faster in a strict sense (any time / 0 = Inf), but in practice
the concern is running out of memory, or the (later) time to garbage
collect; not the Inf speedup factor really. Copying a 50GB data.table
once on a 128GB machine isn't an insignificant time, either. Say that
takes 2 seconds; but what about the other users on the server who are
then squeezed into the remaining 28GB (the two 50GB copies occupy 100GB
of the 128GB), or swapped out to disk? When you're swapped out,
performance falls off a cliff even for the simplest task.
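
For instance, here's a quick sketch of the difference using base R's
tracemem() (which needs an R build with memory profiling enabled; the
standard builds have it), untested here :

    library(data.table)

    # Base R replacement functions may copy the whole object :
    DF <- data.frame(a = 1:5, b = 6:10)
    tracemem(DF)              # prints a tracemem[...] line on each copy
    names(DF)[1] <- "A"       # any lines printed here mean DF was copied
    untracemem(DF)

    # data.table's set* functions assign by reference, so no copy :
    DT <- data.table(a = 1:5, b = 6:10)
    tracemem(DT)
    setnames(DT, "a", "A")      # rename in place; no tracemem output expected
    setattr(DT, "myattr", 42L)  # set an attribute in place, likewise
    untracemem(DT)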

As far as I could see, the announcement about dataframe described
reducing the number of copies, but to counts still greater than 0. And
the item in NEWS says "less copying", which leaves it unclear whether any
cases now make no copies at all. In example(setnames), the copy counts
shown are 4,3,1...0 (in 1.8.0) and 4,3,2,1...0 (in 1.8.1).
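
One way to check for 0 copies yourself is to count the tracemem lines
around a single replacement call. A rough sketch (count_copies is just a
hypothetical helper, not in data.table, and it assumes tracemem output is
captured by capture.output on your build) :

    # Count how many times names()<- copies a data.frame, by counting
    # the tracemem[...] lines it prints. Hypothetical helper.
    count_copies <- function(n = 1e6) {
        DF <- data.frame(a = seq_len(n), b = seq_len(n))
        tracemem(DF)
        out <- capture.output({ names(DF) <- c("x", "y") })
        untracemem(DF)
        sum(grepl("^tracemem", out))   # 0 would mean no copy was made
    }
    count_copies()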

If base R can manage to reduce copies to 0 in many cases, it would be
fantastic. That's why I posted to r-devel: "confused about NAMED" (Nov
2011) trying to get changes like that made. Luke said he would look at it
then so it's exciting he has :
http://r.789695.n4.nabble.com/Confused-about-NAMED-tp4103326p4105017.html
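
For anyone who wants to see NAMED for themselves, the (internal,
unsupported) inspect utility prints it; its output format isn't
guaranteed, but e.g. :

    x <- runif(5)
    .Internal(inspect(x))   # the NAM( ) field is the NAMED value
    y <- x                  # a second binding bumps NAMED
    .Internal(inspect(x))   # modifying x or y now has to copy first
    x[1] <- 0               # duplication happens here because NAMED > 1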

Also, have you seen the last paragraph of data.table FAQ 1.8? :

    A second proposal was to use memcpy in duplicate.c, which is much
    faster than a for loop in C. This would improve the way that R copies
    data internally (on some measures by 13 times). The thread on r-devel
    is here : http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.

It isn't just the number of copies, but the _way_ R copies. Prof Ripley
placed a FIXME in duplicate.c in 2006 (iirc). Perhaps someone could take
a look at the thread linked in the FAQ and help grease the cogs there?
If r-devel just fixed that FIXME, it could speed up R a lot on large
objects when it does copy.
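
For context, the proposal is essentially the difference below
(illustrative C only, not R's actual duplicate.c code) :

    #include <stdlib.h>
    #include <string.h>

    /* Illustrative sketch, not R's actual duplicate.c code. */

    static void copy_loop(double *dst, const double *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)   /* one element per iteration */
            dst[i] = src[i];
    }

    static void copy_block(double *dst, const double *src, size_t n)
    {
        memcpy(dst, src, n * sizeof(double));   /* one block copy */
    }

    int main(void)
    {
        size_t n = 10 * 1000 * 1000;
        double *src = calloc(n, sizeof(double));
        double *dst = calloc(n, sizeof(double));
        copy_loop(dst, src, n);   /* time these two to see the difference */
        copy_block(dst, src, n);
        free(src);
        free(dst);
        return 0;
    }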

Matthew




