[datatable-help] Using data.table to run a function on every row

Stian Håklev shaklev at gmail.com
Fri Sep 27 17:39:35 CEST 2013


OK, so I just realized a few things.

First of all, I should have wrapped has_url in parentheses to use it as an
index (like Ricardo did; I just didn't notice that it was important).
However, this still doesn't make much of a difference, because we're only
talking about 146k entries, and most of the time is spent on the string
extraction:

> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
   user  system elapsed
 10.246   0.027  10.275
> system.time( a <- db[has_url == T, getUrls(text, id), by=id] )
   user  system elapsed
 10.094   0.029  10.123

Either way, good to know!
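For anyone reading along, a minimal sketch of the two equivalent filter forms
(toy data standing in for the real db table, which isn't shown in this thread):

```r
library(data.table)

# Toy stand-in for db: a logical has_url column marks rows with links
dt <- data.table(id = 1:4,
                 text = c("see http://a.com", "plain", "http://b.org", "plain"),
                 has_url = c(TRUE, FALSE, TRUE, FALSE))

# Bare logical column in i vs. an explicit comparison: same rows either way
identical(dt[(has_url)]$id, dt[has_url == TRUE]$id)  # TRUE
```

With only 146k rows both forms are fast; the `== TRUE` version just builds an
extra logical vector before subsetting.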

Secondly, I tried this form:
system.time( b <- db[(has_url),
                     j=list(urls = str_match_all(text, url_pattern)),
                     by=id] )


The problem is that this form only produces one row per id, so although the
output format looks exactly like what I want, the counts differ:
> nrow(db) # all records
[1] 146058
> nrow(a) # using the function getUrls
[1] 30019
> nrow(b) # using str_match_all directly with j=list
[1] 11007
> length(unique(a$id)) # similar number of IDs, but not similar number of URLs
[1] 11007
> length(unique(b$id))
[1] 11007
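The shape difference comes from str_match_all returning one list element per
input text, while getUrls unlists every match into its own row. A base-R sketch
of the same effect (regmatches/gregexpr standing in for stringr's
str_match_all, with a deliberately simplified URL pattern, not the real
url_pattern):

```r
texts <- c("see http://a.com and http://b.org", "just http://c.net")
pattern <- "https?://[^ ]+"  # toy pattern for illustration only

matches <- regmatches(texts, gregexpr(pattern, texts))
length(matches)          # 2: one list element per input text (like nrow(b))
length(unlist(matches))  # 3: one entry per extracted URL (like nrow(a))
```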

thanks again,
Stian


On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev <shaklev at gmail.com> wrote:

> I really appreciate all your help - amazingly supportive community. I
> could probably figure out a "brute-force" way of doing things, but since
> I'm going to be writing a lot of R in the future too, I always want to find
> the "correct" way of doing it, which both looks clear, and is quick. (I
> come from a background in Ruby, and am always interested in writing very
> clear and DRY (don't repeat yourself) code, but I find I still spend a lot
> of time in R struggling with various data formats - lists, nested lists,
> vectors, matrices, different forms of apply/ddply/for loops etc).
>
> Anyway, a few different points.
>
> I tried db[has_url,], but got "object has_url not found"
>
> I then tried setkey(db, "has_url"), and using this, but somehow it was a
> lot slower than what I used to do (I repeated a few times). Not sure if I'm
> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
> this once. But good to understand the underlying principles).
>
> setkey(db, "has_url")
> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>  17.514   0.334  17.847
> > system.time( db[has_url == T, matches := str_match_all(text,
> url_pattern)] )
>    user  system elapsed
>   5.943   0.040   5.984
>
> The second point was how to get out the matches. The idea was that you
> have a text field which might contain several urls, which I want to
> extract, but I need each URL tagged with the row it came from (so I can
> link it back to properties of the post and author, look at whether certain
> students are more likely to post certain kinds of URLs etc).
>
> Instead of a function, you'll see above that I rewrote it to use :=, which
> creates a new column that holds a list. That worked wonderfully, but now
> how do I get these "out" of this data.table, and into a new one.
>
> Made-up example data:
> a <- c(1,2,3)
> b <- c(2,3,4)
> dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b,
> NULL))
>
> Now my goal is to have a new data.table that looks like this
> Name Number
> Stian 1
> Stian 2
> Stian 3
> Christian 2
> Christian 3
> Christian 4
>
> Again, I'm sure I could do this with a for() or lapply? But I'd love to
> see the most elegant solution.
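> One idiomatic data.table answer is to unlist the list column per group; this
> is a sketch against the made-up example above (John drops out automatically,
> because unlist(NULL) has length zero and zero-row groups are dropped):

```r
library(data.table)

a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# One output row per element of each cell's vector; NULL cells contribute none
long <- dt[, list(number = unlist(numbers)), by = names]
long  # 6 rows: Stian 1/2/3, Christian 2/3/4
```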
>
> Note that this:
>
> getUrls <- function(text, id) {
>   matches <- str_match_all(text, url_pattern)
>   data.frame(urls=unlist(matches), id=id)
> }
>
> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>
> Works perfectly, the result is
>     id urls                                                                                               id
> 1:  16 https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166                            16
> 2:  24 http://www.youtube.com/watch?v=JUiGF4TGI9w                                                        24
> 3:  44 http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/        44
> 4:  61 http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html                   61
> 5:  75 http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html 75
> 6:  75 https://www.facebook.com/photo.php?fbid=10151324672623754                                         75
>
> which is exactly what I was looking for. So I've really reached my goal,
> but I'm curious about the other method as well.
>
> Thanks!
> Stian
>
>
> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
>
>>
>> That was my thought too.  I don't know what str_match_all is,  but given
>> the unlist() in getUrls(),  it seems to return a list.   Rather than
>> unlist(),  leave it as list,  and data.table should happily make a `list`
>> column where each cell is itself a vector.  In fact each cell can be
>> anything at all,  even embedded data.table, function definitions, or any
>> type of object.
>> You might need a list(list(str_match_all(...))) in j to do that.
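>> For the archives, a small sketch of such a `list` column, where each cell
>> holds a whole character vector (base-R regmatches standing in for
>> str_match_all, with a toy pattern and toy data):

```r
library(data.table)

dt <- data.table(id = 1:3,
                 text = c("a http://x.com b",
                          "no links",
                          "http://y.org and http://z.net"))

# regmatches() returns one character vector per row, so := stores a list column
dt[, urls := regmatches(text, gregexpr("https?://[^ ]+", text))]

dt$urls[[3]]  # c("http://y.org", "http://z.net")
dt$urls[[2]]  # character(0) for the row with no matches
```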
>>
>> Or what Rick has suggested here might work first time.  It's hard to
>> visualise it without a small reproducible example, so we're having to make
>> educated guesses.
>>
>> Many thanks for the kind words about data.table.
>>
>> Matthew
>>
>>
>>
>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>
>> In fact, you should be able to skip the function altogether and just
>> use:
>>
>>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>>
>>
>>  (and now, my apologies to all for the email clutter)
>> good night
>>
>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
>> saporta at scarletmail.rutgers.edu> wrote:
>>
>>> sorry, I probably should have elaborated  (it's late here, in NJ)
>>>
>>>  The error you are seeing is most likely coming from your getURL
>>> function in that you are adding several ids to a data.frame of varying
>>> rows, and `R` cannot recycle it correctly.
>>>
>>>  If you instead breakdown by id, then each time you are only assigning
>>> one id and R will be able to recycle appropriately, without issue.
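>>> The recycling failure can be reproduced in miniature (toy values, same
>>> shape as the original error):

```r
# 3 extracted URLs but only 2 ids: R cannot recycle 2 into 3
df <- data.frame(urls = c("http://a.com", "http://b.org", "http://c.net"))
res <- try(df$id <- c(1, 2), silent = TRUE)
# Error in `$<-.data.frame`: replacement has 2 rows, data has 3

# Within a by=id group there is a single id, and recycling a
# length-1 value always works
df$id <- 7
```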
>>>
>>>  good luck!
>>> Rick
>>>
>>>
>>>  Ricardo Saporta
>>>  Graduate Student, Data Analytics
>>> Rutgers University, New Jersey
>>> e: saporta at rutgers.edu
>>>
>>>
>>>
>>>   On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>>> saporta at scarletmail.rutgers.edu> wrote:
>>>
>>>> Hi there,
>>>>
>>>>  Try inserting a `by=id` in
>>>>
>>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>>
>>>>  Also, no need for "has_url == T"
>>>> instead, use
>>>>   (has_url)
>>>> If the variable is already logical.  (Otherwise, you are just slowing
>>>> things down ;)
>>>>
>>>>
>>>>
>>>>  Ricardo Saporta
>>>> Graduate Student, Data Analytics
>>>> Rutgers University, New Jersey
>>>>  e: saporta at rutgers.edu
>>>>
>>>>
>>>>
>>>>  On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <shaklev at gmail.com>wrote:
>>>>
>>>>>  I'm trying to run a function on every row fulfilling a certain
>>>>> criterium, which returns a data frame - the idea is then to take the list
>>>>> of data frames and rbindlist them together for a totally separate
>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>> tagging them with the forum post they came from).
>>>>>
>>>>>  I tried doing this with a data.table
>>>>>
>>>>>  a <- db[has_url == T, getUrls(text, id)]
>>>>>
>>>>>  and get the message
>>>>>
>>>>>  Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>>> 4L,  :
>>>>>   replacement has 11007 rows, data has 29787
>>>>>
>>>>>  Because some rows have several URLs... However, I don't care that
>>>>> these row lengths don't match, I still want these rows :) I thought j would
>>>>> just let me execute arbitrary R code in the context of the rows as variable
>>>>> names, etc.
>>>>>
>>>>>  Here's the function it's running, but that shouldn't be relevant
>>>>>
>>>>>  getUrls <- function(text, id) {
>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>   a$id <- id
>>>>>   a
>>>>> }
>>>>>
>>>>>
>>>>>  Thanks, and thanks for an amazing package - data.table has made my
>>>>> life so much easier. It should be part of base, I think.
>>>>> Stian Haklev, University of Toronto
>>>>>
>>>>>  --
>>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>>
>>>>>  _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>
>
> --
> http://reganmian.net/blog -- Random Stuff that Matters
>



-- 
http://reganmian.net/blog -- Random Stuff that Matters