[datatable-help] Using data.table to run a function on every row

Stian Håklev shaklev at gmail.com
Fri Sep 27 18:20:20 CEST 2013


> system.time( db[T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
 19.610   0.475  20.304
> system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
Error in `[.data.table`(db, .(T), `:=`(matches, str_match_all(text,
url_pattern))) :
  All items in j=list(...) should be atomic vectors or lists. If you are
trying something like j=list(.SD,newcol=mean(colA)) then use := by group
instead (much quicker), or cbind or merge afterwards.
Timing stopped at: 6.339 0.043 6.403
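
The gap in timings above is consistent with how data.table reads `i`: a bare `T` is a logical value recycled over every row, whereas `.(T)` is treated as a join against the key and so only touches rows where `has_url` is TRUE. Here is a minimal sketch on made-up data; the table `toy` and its columns are hypothetical stand-ins, not the real `db`, and it does not try to reproduce the error itself.

```r
library(data.table)

# Hypothetical toy table standing in for db; has_url is a logical column.
toy <- data.table(id = 1:5, has_url = c(TRUE, FALSE, TRUE, FALSE, TRUE))
setkey(toy, has_url)

# Logical i: the single TRUE is recycled, so every row is selected.
all_rows <- toy[TRUE]

# List i: joins on the key column, so only rows with has_url == TRUE come back.
url_rows <- toy[.(TRUE)]

nrow(all_rows)
nrow(url_rows)
```

If that reading is right, the `T` version runs the regular expression over the whole table, which would explain why it takes roughly three times as long as the subset.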


On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta <
saporta at scarletmail.rutgers.edu> wrote:

> Hi Stian,
>
> Try the following two and look at the difference:
>
>   db[T, matches := str_match_all(text, url_pattern)]
>  db[.(T), matches := str_match_all(text, url_pattern)]
>
> ;)
>
>
>
> On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev <shaklev at gmail.com> wrote:
>
>> I really appreciate all your help - amazingly supportive community. I
>> could probably figure out a "brute-force" way of doing things, but since
>> I'm going to be writing a lot of R in the future too, I always want to find
>> the "correct" way of doing it, which both looks clear, and is quick. (I
>> come from a background in Ruby, and am always interested in writing very
>> clear and DRY (do not repeat yourself) code, but I find I still spend a lot
>> of time in R struggling with various data formats - lists, nested lists,
>> vectors, matrices, different forms of apply/ddply/for loops etc).
>>
>> Anyway, a few different points.
>>
>> I tried db[has_url,], but got "object has_url not found"
>>
>> I then tried setkey(db, "has_url"), and using this, but somehow it was a
>> lot slower than what I used to do (I repeated a few times). Not sure if I'm
>> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
>> this once. But good to understand the underlying principles).
>>
>> setkey(db, "has_url")
>> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>>    user  system elapsed
>>  17.514   0.334  17.847
>> > system.time( db[has_url == T, matches := str_match_all(text,
>> url_pattern)] )
>>    user  system elapsed
>>   5.943   0.040   5.984
>>
>> The second point was how to get out the matches. The idea was that you
>> have a text field which might contain several urls, which I want to
>> extract, but I need each URL tagged with the row it came from (so I can
>> link it back to properties of the post and author, look at whether certain
>> students are more likely to post certain kinds of URLs etc).
>>
>> Instead of a function, you'll see above that I rewrote it to use :=,
>> which creates a new column that holds a list. That worked wonderfully, but
>> now how do I get these "out" of this data.table and into a new one?
>>
>> Made-up example data:
>> a <- c(1,2,3)
>> b <- c(2,3,4)
>> dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b,
>> NULL))
>>
>> Now my goal is to have a new data.table that looks like this
>> Name Number
>> Stian 1
>> Stian 2
>> Stian 3
>> Christian 2
>> Christian 3
>> Christian 4
>>
>> Again, I'm sure I could do this with a for() or lapply(), but I'd love to
>> see the most elegant solution.
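
One possible way to flatten the made-up example above is to unlist the list column by group. This is only a sketch under the assumption that each cell holds a plain numeric vector (or NULL); the result name `long` and the output column names `Name` and `Number` are chosen here to match the desired table.

```r
library(data.table)

a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# For each name, unlist that row's list cell into one value per output row.
# as.numeric() turns the NULL cell into numeric(0), so that group
# contributes zero rows and simply drops out of the result.
long <- dt[, .(Number = as.numeric(unlist(numbers))), by = .(Name = names)]
long
```

Grouping by the name keeps each number tagged with the row it came from, which is the same idea as the `by=id` call used for the URLs.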
>>
>> Note that this:
>>
>> getUrls <- function(text, id) {
>>   matches <- str_match_all(text, url_pattern)
>>   data.frame(urls=unlist(matches), id=id)
>> }
>>
>> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>>
>> Works perfectly, the result is
>>    id urls id
>> 1  16 https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166 16
>> 2  24 http://www.youtube.com/watch?v=JUiGF4TGI9w 24
>> 3  44 http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/ 44
>> 4  61 http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html 61
>> 5  75 http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html 75
>> 6  75 https://www.facebook.com/photo.php?fbid=10151324672623754 75
>>
>> which is exactly what I was looking for. So I've really reached my goal,
>> but I'm curious about the other method as well.
>>
>> Thanks!
>> Stian
>>
>>
>> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>>>
>>> That was my thought too. I don't know what str_match_all is, but given
>>> the unlist() in getUrls(), it seems to return a list. Rather than
>>> unlist(), leave it as a list, and data.table should happily make a `list`
>>> column where each cell is itself a vector. In fact each cell can be
>>> anything at all, even an embedded data.table, function definitions, or any
>>> type of object.
>>> You might need a list(list(str_match_all(...))) in j to do that.
>>>
>>> Or what Rick has suggested here might work first time.  It's hard to
>>> visualise it without a small reproducible example, so we're having to make
>>> educated guesses.
>>>
>>> Many thanks for the kind words about data.table.
>>>
>>> Matthew
>>>
>>>
>>>
>>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>>
>>> In fact, you should be able to skip the function altogether and just
>>> use:
>>>
>>>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>>>
>>>
>>>  (and now, my apologies to all for the email clutter)
>>> good night
>>>
>>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
>>> saporta at scarletmail.rutgers.edu> wrote:
>>>
>>>> sorry, I probably should have elaborated  (it's late here, in NJ)
>>>>
>>>>  The error you are seeing is most likely coming from your getUrls
>>>> function, in that you are adding several ids to a data.frame with a
>>>> different number of rows, and `R` cannot recycle them correctly.
>>>>
>>>>  If you instead break it down by id, then each time you are only assigning
>>>> one id and R will be able to recycle appropriately, without issue.
>>>>
>>>>  good luck!
>>>> Rick
>>>>
>>>>
>>>>  Ricardo Saporta
>>>>  Graduate Student, Data Analytics
>>>> Rutgers University, New Jersey
>>>> e: saporta at rutgers.edu
>>>>
>>>>
>>>>
>>>>   On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>>>> saporta at scarletmail.rutgers.edu> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>>  Try inserting a `by=id` in
>>>>>
>>>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>>>
>>>>>  Also, no need for "has_url == T"
>>>>> instead, use
>>>>>   (has_url)
>>>>> If the variable is already logical.  (Otherwise, you are just slowing
>>>>> things down ;)
>>>>>
>>>>>
>>>>>
>>>>>  Ricardo Saporta
>>>>> Graduate Student, Data Analytics
>>>>> Rutgers University, New Jersey
>>>>>  e: saporta at rutgers.edu
>>>>>
>>>>>
>>>>>
>>>>>  On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <shaklev at gmail.com> wrote:
>>>>>
>>>>>>  I'm trying to run a function on every row fulfilling a certain
>>>>>> criterion, which returns a data frame - the idea is then to take the list
>>>>>> of data frames and rbindlist them together for a totally separate
>>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>>> tagging them with the forum post they came from).
>>>>>>
>>>>>>  I tried doing this with a data.table
>>>>>>
>>>>>>  a <- db[has_url == T, getUrls(text, id)]
>>>>>>
>>>>>>  and get the message
>>>>>>
>>>>>>  Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>>>> 4L,  :
>>>>>>   replacement has 11007 rows, data has 29787
>>>>>>
>>>>>>  Because some rows have several URLs... However, I don't care that
>>>>>> these row lengths don't match, I still want these rows :) I thought j would
>>>>>> just let me execute arbitrary R code with the columns available as
>>>>>> variable names, etc.
>>>>>>
>>>>>>  Here's the function it's running, but that shouldn't be relevant
>>>>>>
>>>>>>  getUrls <- function(text, id) {
>>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>>   a$id <- id
>>>>>>   a
>>>>>> }
>>>>>>
>>>>>>
>>>>>>  Thanks, and thanks for an amazing package - data.table has made my
>>>>>> life so much easier. It should be part of base, I think.
>>>>>> Stian Haklev, University of Toronto
>>>>>>
>>>>>>  --
>>>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>>>
>>>>>>  _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>


-- 
http://reganmian.net/blog -- Random Stuff that Matters