[datatable-help] Using data.table to run a function on every row

Stian Håklev shaklev at gmail.com
Fri Sep 27 17:21:43 CEST 2013


I really appreciate all your help; this is an amazingly supportive community. I could
probably figure out a "brute-force" way of doing things, but since I'm
going to be writing a lot of R in the future, I want to find the
"correct" way of doing it, one that both looks clear and runs quickly. (I come
from a background in Ruby and am always interested in writing very clear
and DRY (don't repeat yourself) code, but I find I still spend a lot of
time in R struggling with various data formats: lists, nested lists,
vectors, matrices, different forms of apply/ddply/for loops, etc.)

Anyway, a few different points.

I tried db[has_url,], but got "object has_url not found".
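
For reference, here is a minimal sketch of why the bare column name fails and what works instead. The three-row db is invented for illustration (only the column names id, text and has_url come from this thread). As far as I understand it, data.table looks up a lone symbol in i in the calling scope rather than among the columns, so wrapping it in parentheses or writing a comparison makes it evaluate as a column:

```r
library(data.table)

# Hypothetical stand-in for the real forum-post table
db <- data.table(
  id      = c(16L, 24L, 30L),
  text    = c("see http://a.example/x",
              "two: http://b.example and https://c.example/y",
              "no link here"),
  has_url = c(TRUE, TRUE, FALSE)
)

# A lone symbol in i is searched for outside the table, so this raises an error
bare <- tryCatch(db[has_url, ], error = function(e) e)

# Either of these evaluates has_url as a column of db
with_parens  <- db[(has_url)]
with_compare <- db[has_url == TRUE]
```

Both working forms select the same two rows; the parenthesised form just skips the comparison.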

I then tried setkey(db, "has_url") and subsetting on the key, but somehow it
was a lot slower than what I did before (I repeated the timing a few times).
I'm not sure whether I'm doing it wrong. (It's not important: even 15 sec is
totally fine, since I'll only run this once. But it's good to understand the
underlying principles.)

setkey(db, "has_url")
> system.time( db[T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
 17.514   0.334  17.847
> system.time( db[has_url == T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
  5.943   0.040   5.984
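
One possible explanation for the difference, offered as a guess rather than a diagnosis: as far as I understand data.table, a single logical value in i is treated as a row filter and recycled, so db[T, ...] touches every row and never uses the key. A keyed lookup needs the value wrapped in J() or .(). The sketch below uses a hypothetical four-row db to show which rows each form selects:

```r
library(data.table)

# Hypothetical data; only the column name has_url comes from the thread
db <- data.table(
  id      = 1:4,
  has_url = c(TRUE, FALSE, TRUE, FALSE)
)
setkey(db, has_url)

all_rows   <- db[TRUE]      # logical i is recycled: every row is selected
keyed_rows <- db[.(TRUE)]   # join on the key: only rows where has_url is TRUE

nrow(all_rows)
nrow(keyed_rows)
```

If that reading is right, the first timing above ran str_match_all on all posts, which would account for it being slower than the has_url == T version.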

The second point was how to get out the matches. The idea was that you have
a text field which might contain several urls, which I want to extract, but
I need each URL tagged with the row it came from (so I can link it back to
properties of the post and author, look at whether certain students are
more likely to post certain kinds of URLs etc).

Instead of a function, you'll see above that I rewrote it to use :=, which
creates a new column that holds a list. That worked wonderfully, but now
how do I get these "out" of this data.table and into a new one?
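
To make the list column concrete, here is a small sketch on invented data. The url_pattern and the helper extract_urls() are assumptions for illustration, and base R's regmatches()/gregexpr() stand in for str_match_all() so that only data.table is needed:

```r
library(data.table)

url_pattern <- "https?://[^ ]+"   # assumed pattern, for illustration only

db <- data.table(
  id      = c(16L, 24L, 30L),
  text    = c("see http://a.example/x",
              "two: http://b.example and https://c.example/y",
              "no link here"),
  has_url = c(TRUE, TRUE, FALSE)
)

# Hypothetical helper: returns a list with one character vector per input string
extract_urls <- function(text) regmatches(text, gregexpr(url_pattern, text))

# The right-hand side is a list with one element per selected row,
# so := stores it as a list column; rows outside the subset are left NULL
db[(has_url), matches := extract_urls(text)]

lengths(db$matches)
```

Each cell of matches is then a character vector of that post's URLs, which is the shape the question below starts from.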

Made-up example data:
library(data.table)
a <- c(1,2,3)
b <- c(2,3,4)
dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b,NULL))

Now my goal is to have a new data.table that looks like this:
Name       Number
Stian      1
Stian      2
Stian      3
Christian  2
Christian  3
Christian  4

Again, I'm sure I could do this with for() or lapply(), but I'd love to see
the most elegant solution.
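
Two possible ways to do this on the made-up data, sketched here as options rather than as the canonical answer. The first repeats each name once per element of its list cell; the second groups by name and unlists within each group. The output column names Name and Number are taken from the target table above:

```r
library(data.table)

a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# Vectorised: repeat each name once per element of its list cell
long <- data.table(
  Name   = rep(dt$names, lengths(dt$numbers)),
  Number = unlist(dt$numbers)
)

# Grouped alternative; as.numeric() turns John's NULL into a zero-length
# vector, so that group simply contributes no rows
long_by <- dt[, .(Number = as.numeric(unlist(numbers))), by = .(Name = names)]
```

Both give six rows, with John dropped because his cell is empty. The vectorised form avoids one j evaluation per row, which may matter on a large table.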

Note that this:

getUrls <- function(text, id) {
  matches <- str_match_all(text, url_pattern)
  data.frame(urls=unlist(matches), id=id)
}

system.time( a <- db[(has_url), getUrls(text, id), by=id] )

Works perfectly, the result is
   id  urls                                                                                                id
1  16  https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166                              16
2  24  http://www.youtube.com/watch?v=JUiGF4TGI9w                                                          24
3  44  http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/          44
4  61  http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html                     61
5  75  http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html  75
6  75  https://www.facebook.com/photo.php?fbid=10151324672623754                                           75

which is exactly what I was looking for. So I've really reached my goal,
but I'm curious about the other method as well.
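
For completeness, a self-contained sketch of the by=id approach on invented data. The url_pattern, the three posts and the helper extract_urls() are assumptions, and base R's regmatches()/gregexpr() stand in for str_match_all() so that only data.table is needed. Returning only the urls column avoids the duplicated id column in the result above, since by=id already adds one:

```r
library(data.table)

url_pattern <- "https?://[^ ]+"   # assumed pattern, for illustration only

db <- data.table(
  id      = c(16L, 24L, 30L),
  text    = c("see http://a.example/x",
              "two: http://b.example and https://c.example/y",
              "no link here"),
  has_url = c(TRUE, TRUE, FALSE)
)

# Hypothetical helper: returns a list with one character vector per input string
extract_urls <- function(text) regmatches(text, gregexpr(url_pattern, text))

# One group per post, so each id is recycled over that post's URLs only
urls <- db[(has_url), .(urls = unlist(extract_urls(text))), by = id]
```

Without by=id, the function sees all ids and all URLs at once, and the two lengths no longer line up, which matches the recycling error from the first message in this thread.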

Thanks!
Stian


On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
> That was my thought too. I don't know what str_match_all is, but given
> the unlist() in getUrls(), it seems to return a list. Rather than
> unlist(), leave it as a list, and data.table should happily make a `list`
> column where each cell is itself a vector. In fact each cell can be
> anything at all, even an embedded data.table, a function definition, or any
> other type of object.
> You might need a list(list(str_match_all(...))) in j to do that.
>
> Or what Rick has suggested here might work first time.  It's hard to
> visualise it without a small reproducible example, so we're having to make
> educated guesses.
>
> Many thanks for the kind words about data.table.
>
> Matthew
>
>
>
> On 27/09/13 07:44, Ricardo Saporta wrote:
>
> In fact, you should be able to skip the function altogether and just use:
>
>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>
>
>  (and now, my apologies to all for the email clutter)
> good night
>
> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
> saporta at scarletmail.rutgers.edu> wrote:
>
>> sorry, I probably should have elaborated  (it's late here, in NJ)
>>
>>  The error you are seeing is most likely coming from your getUrls
>> function, in that you are adding several ids to a data.frame with a different
>> number of rows, and `R` cannot recycle them correctly.
>>
>>  If you instead break it down by id, then each time you are only assigning
>> one id and R will be able to recycle appropriately, without issue.
>>
>>  good luck!
>> Rick
>>
>>
>>  Ricardo Saporta
>>  Graduate Student, Data Analytics
>> Rutgers University, New Jersey
>> e: saporta at rutgers.edu
>>
>>
>>
>>   On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>> saporta at scarletmail.rutgers.edu> wrote:
>>
>>> Hi there,
>>>
>>>  Try inserting a `by=id` in
>>>
>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>
>>>  Also, no need for "has_url == T";
>>> instead, use
>>>   (has_url)
>>> if the variable is already logical. (Otherwise, you are just slowing
>>> things down ;)
>>>
>>>
>>>
>>>  Ricardo Saporta
>>> Graduate Student, Data Analytics
>>> Rutgers University, New Jersey
>>>  e: saporta at rutgers.edu
>>>
>>>
>>>
>>>  On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <shaklev at gmail.com> wrote:
>>>
>>>>  I'm trying to run a function on every row fulfilling a certain
>>>> criterion, which returns a data frame - the idea is then to take the list
>>>> of data frames and rbindlist them together into a totally separate
>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>> tagging them with the forum post they came from.)
>>>>
>>>>  I tried doing this with a data.table
>>>>
>>>>  a <- db[has_url == T, getUrls(text, id)]
>>>>
>>>>  and get the message
>>>>
>>>>  Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>> 4L,  :
>>>>   replacement has 11007 rows, data has 29787
>>>>
>>>>  Because some rows have several URLs... However, I don't care that
>>>> these row lengths don't match, I still want these rows :) I thought j would
>>>> just let me execute arbitrary R code with the columns available as variable
>>>> names, etc.
>>>>
>>>>  Here's the function it's running, but that shouldn't be relevant
>>>>
>>>>  getUrls <- function(text, id) {
>>>>   matches <- str_match_all(text, url_pattern)
>>>>   a <- data.frame(urls=unlist(matches))
>>>>   a$id <- id
>>>>   a
>>>> }
>>>>
>>>>
>>>>  Thanks, and thanks for an amazing package - data.table has made my
>>>> life so much easier. It should be part of base, I think.
>>>> Stian Haklev, University of Toronto
>>>>
>>>>  --
>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>
>>>>  _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>>
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>
>>>
>>>
>>
>
>
>
>
>


-- 
http://reganmian.net/blog -- Random Stuff that Matters