[datatable-help] Using data.table to run a function on every row
Stian Håklev
shaklev at gmail.com
Fri Sep 27 17:39:35 CEST 2013
OK, so I just realized a few things.
First of all, I should have put has_url in parentheses to use it as an
index (like Ricardo did, I just didn't notice that it was important).
However, this still doesn't make much of a difference, because we're only
talking about 146k entries, and most of the time is spent on the string
extraction:
> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
   user  system elapsed
 10.246   0.027  10.275
> system.time( a <- db[has_url == T, getUrls(text, id), by=id] )
   user  system elapsed
 10.094   0.029  10.123
Either way, good to know!
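
For anyone reading along, here is a minimal sketch of the two indexing forms compared above. The toy table and its values are made up for illustration; only the column names (id, text, has_url) come from this thread.

```r
library(data.table)

# Toy data (hypothetical values); column names mirror the ones used in this thread.
db <- data.table(
  id      = 1:4,
  text    = c("see http://a.example", "no link", "http://b.example", "nothing"),
  has_url = c(TRUE, FALSE, TRUE, FALSE)
)

# Parentheses make data.table evaluate has_url as an expression,
# so the logical column itself is used as the row filter.
by_parens <- db[(has_url)]

# The explicit comparison selects the same rows.
by_compare <- db[has_url == TRUE]

identical(by_parens$id, by_compare$id)
```

Both forms pick the same rows, which fits the timings above: the cost is in the string extraction, not in the subsetting.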
Secondly, I tried this form:
system.time( b <- db[(has_url),
j=list(urls = str_match_all(text, url_pattern)),
by=id] )
The problem is that it only accepts one value per row, so although the
output format looks exactly like what I want, the row counts don't match:
> nrow(db) # all records
[1] 146058
> nrow(a) # using the function getUrls
[1] 30019
> nrow(b) # using str_match_all directly with j=list
[1] 11007
> length(unique(a$id)) # same number of IDs, but not the same number of URLs
[1] 11007
> length(unique(b$id))
[1] 11007
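
One possible way to get one row per URL without a helper function is to unlist the matches inside j while grouping by id. The following is only a sketch under some assumptions: url_pattern is a hypothetical stand-in (the real pattern is not shown in this thread), the posts are made up, and extract_urls is a hypothetical base-R substitute for str_match_all built on regmatches() and gregexpr().

```r
library(data.table)

# Hypothetical pattern and toy posts; the real url_pattern is not shown in the thread.
url_pattern <- "https?://[^[:space:]]+"
db <- data.table(
  id      = c(1L, 2L, 3L),
  text    = c("http://a.example and http://b.example",
              "no link here",
              "http://c.example"),
  has_url = c(TRUE, FALSE, TRUE)
)

# Base-R stand-in for str_match_all (an assumption, not the original call):
# returns a list holding one character vector of matches per input string.
extract_urls <- function(x) regmatches(x, gregexpr(url_pattern, x))

# unlist() flattens the matches of each group into a plain vector, and
# by = id makes data.table repeat the id once for every URL in that group.
urls <- db[(has_url), list(url = unlist(extract_urls(text))), by = id]

nrow(urls)  # one row per extracted URL rather than one per id
```

The same `list(col = unlist(listcol)), by = key` pattern should also flatten a list column that was created earlier with :=, such as the made-up names/numbers example quoted below.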
thanks again,
Stian
On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev <shaklev at gmail.com> wrote:
> I really appreciate all your help - amazingly supportive community. I
> could probably figure out a "brute-force" way of doing things, but since
> I'm going to be writing a lot of R in the future too, I always want to find
> the "correct" way of doing it, which both looks clear, and is quick. (I
> come from a background in Ruby, and am always interested in writing very
> clear and DRY (do not repeat yourself) code, but I find I still spend a lot
> of time in R struggling with various data formats - lists, nested lists,
> vectors, matrices, different forms of apply/ddply/for loops etc).
>
> Anyway, a few different points.
>
> I tried db[has_url,], but got "object has_url not found"
>
> I then tried setkey(db, "has_url"), and using this, but somehow it was a
> lot slower than what I used to do (I repeated a few times). Not sure if I'm
> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
> this once. But good to understand the underlying principles).
>
> setkey(db, "has_url")
> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>  17.514   0.334  17.847
> > system.time( db[has_url == T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>   5.943   0.040   5.984
>
> The second point was how to get out the matches. The idea was that you
> have a text field which might contain several urls, which I want to
> extract, but I need each URL tagged with the row it came from (so I can
> link it back to properties of the post and author, look at whether certain
> students are more likely to post certain kinds of URLs etc).
>
> Instead of a function, you'll see above that I rewrote it to use :=, which
> creates a new column that holds a list. That worked wonderfully, but now
> how do I get these "out" of this data.table, and into a new one.
>
> Made-up example data:
> a <- c(1,2,3)
> b <- c(2,3,4)
> dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b,NULL))
>
> Now my goal is to have a new data.table that looks like this
> Name Number
> Stian 1
> Stian 2
> Stian 3
> Christian 2
> Christian 3
> Christian 4
>
> Again, I'm sure I could do this with a for() or lapply? But I'd love to
> see the most elegant solution.
>
> Note that this:
>
> getUrls <- function(text, id) {
>   matches <- str_match_all(text, url_pattern)
>   data.frame(urls=unlist(matches), id=id)
> }
>
> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>
> Works perfectly, the result is
>    id  urls  id
> 1  16  https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166  16
> 2  24  http://www.youtube.com/watch?v=JUiGF4TGI9w  24
> 3  44  http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/  44
> 4  61  http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html  61
> 5  75  http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html  75
> 6  75  https://www.facebook.com/photo.php?fbid=10151324672623754  75
>
> which is exactly what I was looking for. So I've really reached my goal,
> but I'm curious about the other method as well.
>
> Thanks!
> Stian
>
>
> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>> That was my thought too. I don't know what str_match_all is, but given
>> the unlist() in getUrls(), it seems to return a list. Rather than
>> unlist(), leave it as list, and data.table should happily make a `list`
>> column where each cell is itself a vector. In fact each cell can be
>> anything at all, even embedded data.table, function definitions, or any
>> type of object.
>> You might need a list(list(str_match_all(...))) in j to do that.
>>
>> Or what Rick has suggested here might work first time. It's hard to
>> visualise it without a small reproducible example, so we're having to make
>> educated guesses.
>>
>> Many thanks for the kind words about data.table.
>>
>> Matthew
>>
>>
>>
>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>
>> In fact, you should be able to skip the function altogether and just
>> use:
>>
>> db[ (has_url), str_match_all(text, url_pattern), by=id]
>>
>>
>> (and now, my apologies to all for the email clutter)
>> good night
>>
>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
>> saporta at scarletmail.rutgers.edu> wrote:
>>
>>> sorry, I probably should have elaborated (it's late here, in NJ)
>>>
>>> The error you are seeing is most likely coming from your getUrls
>>> function, in that you are adding several ids to a data.frame with a
>>> different number of rows, and `R` cannot recycle them correctly.
>>>
>>> If you instead break it down by id, then each time you are only assigning
>>> one id and R will be able to recycle appropriately, without issue.
>>>
>>> good luck!
>>> Rick
>>>
>>>
>>> Ricardo Saporta
>>> Graduate Student, Data Analytics
>>> Rutgers University, New Jersey
>>> e: saporta at rutgers.edu
>>>
>>>
>>>
>>> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>>> saporta at scarletmail.rutgers.edu> wrote:
>>>
>>>> Hi there,
>>>>
>>>> Try inserting a `by=id` in
>>>>
>>>> a <- db[(has_url), getUrls(text, id), by=id]
>>>>
>>>> Also, no need for "has_url == T"
>>>> instead, use
>>>> (has_url)
>>>> if the variable is already logical. (Otherwise, you are just slowing
>>>> things down ;)
>>>>
>>>>
>>>>
>>>> Ricardo Saporta
>>>> Graduate Student, Data Analytics
>>>> Rutgers University, New Jersey
>>>> e: saporta at rutgers.edu
>>>>
>>>>
>>>>
>>>> On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <shaklev at gmail.com> wrote:
>>>>
>>>>> I'm trying to run a function on every row fulfilling a certain
>>>>> criterion, which returns a data frame - the idea is then to take the list
>>>>> of data frames and rbindlist them together for a totally separate
>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>> tagging them with the forum post they came from).
>>>>>
>>>>> I tried doing this with a data.table
>>>>>
>>>>> a <- db[has_url == T, getUrls(text, id)]
>>>>>
>>>>> and get the message
>>>>>
>>>>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>>> 4L, :
>>>>> replacement has 11007 rows, data has 29787
>>>>>
>>>>> Because some rows have several URLs... However, I don't care that
>>>>> these row lengths don't match, I still want these rows :) I thought j would
>>>>> just let me execute arbitrary R code in the context of the rows, with the
>>>>> columns available as variable names, etc.
>>>>>
>>>>> Here's the function it's running, but that shouldn't be relevant
>>>>>
>>>>> getUrls <- function(text, id) {
>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>   a$id <- id
>>>>>   a
>>>>> }
>>>>>
>>>>>
>>>>> Thanks, and thanks for an amazing package - data.table has made my
>>>>> life so much easier. It should be part of base, I think.
>>>>> Stian Haklev, University of Toronto
>>>>>
>>>>> --
>>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>>
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>>
>
>
> --
> http://reganmian.net/blog -- Random Stuff that Matters
>
--
http://reganmian.net/blog -- Random Stuff that Matters