[datatable-help] Using data.table to run a function on every row
Stian Håklev
shaklev at gmail.com
Fri Sep 27 17:21:43 CEST 2013
I really appreciate all your help - amazingly supportive community. I could
probably figure out a "brute-force" way of doing things, but since I'm
going to be writing a lot of R in the future too, I always want to find the
"correct" way of doing it, one that is both clear and quick. (I come
from a background in Ruby, and am always interested in writing very clear
and DRY (do not repeat yourself) code, but I find I still spend a lot of
time in R struggling with various data formats: lists, nested lists,
vectors, matrices, different forms of apply/ddply/for loops, etc.)
Anyway, a few different points.
I tried db[has_url,], but got "object has_url not found".
I then tried setkey(db, "has_url") and subsetting with db[T, ...], but somehow
that was a lot slower than what I used to do (I repeated the timing a few
times). Not sure if I'm doing it wrong. (Not important - even 15 sec is totally
fine, I'll only run this once. But it's good to understand the underlying
principles.)
setkey(db, "has_url")
> system.time( db[T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
 17.514   0.334  17.847
> system.time( db[has_url == T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
  5.943   0.040   5.984
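A likely explanation for the timing gap, offered as a sketch rather than a diagnosis: a bare T in i is just a logical of length 1, so it is recycled over every row instead of being looked up in the key, and str_match_all then runs on all rows rather than only those with a URL. The toy table below is made up (dt and flag are hypothetical stand-ins for db and has_url) and only illustrates the three forms of i:

```r
library(data.table)

# Made-up stand-in for db; `flag` plays the role of has_url.
dt <- data.table(id = 1:4, flag = c(TRUE, FALSE, TRUE, FALSE))

# A bare TRUE in i is a logical of length 1, recycled over every row.
all_rows <- dt[TRUE]

# Parentheses make data.table evaluate the logical column itself.
flagged <- dt[(flag)]

# Comparing to TRUE selects the same rows, with an extra comparison.
flagged_eq <- dt[flag == TRUE]
```

Here all_rows has all four rows, while flagged and flagged_eq both keep only the rows where flag is TRUE.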
The second point was how to get out the matches. The idea was that you have
a text field which might contain several urls, which I want to extract, but
I need each URL tagged with the row it came from (so I can link it back to
properties of the post and author, look at whether certain students are
more likely to post certain kinds of URLs etc).
Instead of a function, you'll see above that I rewrote it to use :=, which
creates a new column that holds a list. That worked wonderfully, but now
how do I get these "out" of this data.table and into a new one?
Made-up example data:
library(data.table)
a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a, b, NULL))
Now my goal is to have a new data.table that looks like this
Name       Number
Stian      1
Stian      2
Stian      3
Christian  2
Christian  3
Christian  4
Again, I'm sure I could do this with for() or lapply(), but I'd love to see
the most elegant solution.
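One possible way to flatten such a list column, sketched on the made-up data above. The names n, long1 and long2 are introduced here for illustration only, and lengths() assumes a reasonably recent base R. The first option repeats each name once per element of its list cell; the second unlists within each group and skips rows whose cell is empty:

```r
library(data.table)

a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# Option 1: repeat each name once per element of its list cell, then flatten.
n <- lengths(dt$numbers)   # number of elements in each cell
long1 <- data.table(Name = rep(dt$names, n),
                    Number = unlist(dt$numbers))

# Option 2: unlist within each group, skipping rows whose cell is empty.
long2 <- dt[lengths(numbers) > 0,
            .(Number = unlist(numbers)),
            by = .(Name = names)]
```

Both give one row per (name, number) pair; John drops out because his cell is empty.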
Note that this:
getUrls <- function(text, id) {
  matches <- str_match_all(text, url_pattern)
  data.frame(urls=unlist(matches), id=id)
}
system.time( a <- db[(has_url), getUrls(text, id), by=id] )
Works perfectly, the result is
   id  urls  id
1  16  https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166  16
2  24  http://www.youtube.com/watch?v=JUiGF4TGI9w  24
3  44  http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/  44
4  61  http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html  61
5  75  http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html  75
6  75  https://www.facebook.com/photo.php?fbid=10151324672623754  75
which is exactly what I was looking for. So I've really reached my goal,
but I'm curious about the other method as well.
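For anyone who wants to try the by=id approach without the original data, here is a minimal self-contained sketch. Everything in it is made up for illustration: the three-row db, the deliberately simplistic url_pattern, and the use of base R's regmatches()/gregexpr() in place of str_match_all, so that only data.table is needed. Grouping by id means each call sees a single post, so the one id is recycled cleanly across however many URLs that post contains:

```r
library(data.table)

# Hypothetical, deliberately simplistic pattern; not the one used on the real data.
url_pattern <- "https?://[^[:space:]]+"

# Made-up forum posts standing in for db.
db <- data.table(
  id   = c(16L, 24L, 30L),
  text = c("see http://a.example/x and http://b.example/y",
           "no links here",
           "watch https://c.example/z")
)
db[, has_url := grepl(url_pattern, text)]

# One output row per URL, tagged with the id of the post it came from.
urls <- db[(has_url),
           .(url = unlist(regmatches(text, gregexpr(url_pattern, text)))),
           by = id]
```

Post 16 contributes two rows, post 30 one, and post 24 none.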
Thanks!
Stian
On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
> That was my thought too. I don't know what str_match_all is, but given
> the unlist() in getUrls(), it seems to return a list. Rather than
> unlist(), leave it as list, and data.table should happily make a `list`
> column where each cell is itself a vector. In fact each cell can be
> anything at all, even embedded data.table, function definitions, or any
> type of object.
> You might need a list(list(str_match_all(...))) in j to do that.
>
> Or what Rick has suggested here might work first time. It's hard to
> visualise it without a small reproducible example, so we're having to make
> educated guesses.
>
> Many thanks for the kind words about data.table.
>
> Matthew
>
>
>
> On 27/09/13 07:44, Ricardo Saporta wrote:
>
> In fact, you should be able to skip the function altogether and just use:
>
> db[ (has_url), str_match_all(text, url_pattern), by=id]
>
>
> (and now, my apologies to all for the email clutter)
> good night
>
> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
> saporta at scarletmail.rutgers.edu> wrote:
>
>> sorry, I probably should have elaborated (it's late here, in NJ)
>>
>> The error you are seeing is most likely coming from your getUrls
>> function, in that you are adding several ids to a data.frame with a
>> different number of rows, and `R` cannot recycle them correctly.
>>
>> If you instead break it down by id, then each time you are only assigning
>> one id and R will be able to recycle appropriately, without issue.
>>
>> good luck!
>> Rick
>>
>>
>> Ricardo Saporta
>> Graduate Student, Data Analytics
>> Rutgers University, New Jersey
>> e: saporta at rutgers.edu
>>
>>
>>
>> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>> saporta at scarletmail.rutgers.edu> wrote:
>>
>>> Hi there,
>>>
>>> Try inserting a `by=id` in
>>>
>>> a <- db[(has_url), getUrls(text, id), by=id]
>>>
>>> Also, no need for "has_url == T";
>>> instead, use
>>> (has_url)
>>> if the variable is already logical. (Otherwise, you are just slowing
>>> things down ;)
>>>
>>>
>>>
>>> Ricardo Saporta
>>> Graduate Student, Data Analytics
>>> Rutgers University, New Jersey
>>> e: saporta at rutgers.edu
>>>
>>>
>>>
>>> On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <shaklev at gmail.com> wrote:
>>>
>>>> I'm trying to run a function on every row fulfilling a certain
>>>> criterion, which returns a data frame - the idea is then to take the list
>>>> of data frames and rbindlist them together into a totally separate
>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>> tagging them with the forum post they came from.)
>>>>
>>>> I tried doing this with a data.table
>>>>
>>>> a <- db[has_url == T, getUrls(text, id)]
>>>>
>>>> and get the message
>>>>
>>>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L, :
>>>>   replacement has 11007 rows, data has 29787
>>>>
>>>> Because some rows have several URLs... However, I don't care that
>>>> these row lengths don't match, I still want these rows :) I thought j would
>>>> just let me execute arbitrary R code in the context of the rows, with the
>>>> columns available as variable names, etc.
>>>>
>>>> Here's the function it's running, but that shouldn't be relevant
>>>>
>>>> getUrls <- function(text, id) {
>>>>   matches <- str_match_all(text, url_pattern)
>>>>   a <- data.frame(urls=unlist(matches))
>>>>   a$id <- id
>>>>   a
>>>> }
>>>>
>>>>
>>>> Thanks, and thanks for an amazing package - data.table has made my
>>>> life so much easier. It should be part of base, I think.
>>>> Stian Haklev, University of Toronto
>>>>
>>>> --
>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>>
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>
>>>
>>>
>>
>
>
--
http://reganmian.net/blog -- Random Stuff that Matters