[datatable-help] Using data.table to run a function on every row

Ricardo Saporta saporta at scarletmail.rutgers.edu
Fri Sep 27 19:25:19 CEST 2013


Hm... not sure about `j` (sorry, I haven't taken a close look at your
code), but my comment was to point out that these two statements are
different:

   DT [  TRUE,   ]
   DT [ .(TRUE), ]

The first one is giving you the whole data.table
   DT[TRUE, ]  is the same as DT
(since TRUE is getting recycled)

The second one is giving you all rows within DT where the first column of
the key has a value of TRUE.
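
A minimal sketch of that difference, assuming a small made-up table (the
name DT, its contents, and the result names are hypothetical and not taken
from the original post; only the column names has_url and text mirror the
thread):

```r
library(data.table)

# Hypothetical toy table standing in for the poster's `db`.
DT <- data.table(
  id      = 1:4,
  has_url = c(TRUE, FALSE, TRUE, FALSE),
  text    = c("a", "b", "c", "d")
)
setkey(DT, has_url)  # key on the logical column; rows are sorted by it

all_rows  <- DT[TRUE, ]       # plain logical TRUE is recycled: every row is returned
true_rows <- DT[.(TRUE), ]    # keyed join: only rows whose key column equals TRUE
same_rows <- DT[(has_url), ]  # logical subset on the column itself; no key required

nrow(all_rows)   # number of rows in the whole table
nrow(true_rows)  # number of rows where has_url is TRUE
```

With a logical column, the parenthesised form (has_url) picks the same rows
as the keyed join and does not depend on the key being set.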



Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: saporta at rutgers.edu



On Fri, Sep 27, 2013 at 12:20 PM, Stian Håklev <shaklev at gmail.com> wrote:

> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>  19.610   0.475  20.304
> > system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
> Error in `[.data.table`(db, .(T), `:=`(matches, str_match_all(text,
> url_pattern))) :
>   All items in j=list(...) should be atomic vectors or lists. If you are
> trying something like j=list(.SD,newcol=mean(colA)) then use := by group
> instead (much quicker), or cbind or merge afterwards.
> Timing stopped at: 6.339 0.043 6.403
>
>
> On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta <
> saporta at scarletmail.rutgers.edu> wrote:
>
>> Hi Stian,
>>
>> Try the following two and look at the difference:
>>
>>   db[T, matches := str_match_all(text, url_pattern)]
>>  db[.(T), matches := str_match_all(text, url_pattern)]
>>
>> ;)
>>
>>
>>
>> On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev <shaklev at gmail.com> wrote:
>>
>>> I really appreciate all your help - amazingly supportive community. I
>>> could probably figure out a "brute-force" way of doing things, but since
>>> I'm going to be writing a lot of R in the future too, I always want to find
>>> the "correct" way of doing it, which both looks clear, and is quick. (I
>>> come from a background in Ruby, and am always interested in writing very
>>> clear and DRY (do not repeat yourself) code, but I find I still spend a lot
>>> of time in R struggling with various data formats - lists, nested lists,
>>> vectors, matrices, different forms of apply/ddply/for loops etc).
>>>
>>> Anyway, a few different points.
>>>
>>> I tried db[has_url,], but got "object has_url not found"
>>>
>>> I then tried setkey(db, "has_url"), and using this, but somehow it was a
>>> lot slower than what I used to do (I repeated a few times). Not sure if I'm
>>> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
>>> this once. But good to understand the underlying principles).
>>>
>>> setkey(db, "has_url")
>>> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>>>    user  system elapsed
>>>  17.514   0.334  17.847
>>> > system.time( db[has_url == T, matches := str_match_all(text,
>>> url_pattern)] )
>>>    user  system elapsed
>>>   5.943   0.040   5.984
>>>
>>> The second point was how to get out the matches. The idea was that you
>>> have a text field which might contain several urls, which I want to
>>> extract, but I need each URL tagged with the row it came from (so I can
>>> link it back to properties of the post and author, look at whether certain
>>> students are more likely to post certain kinds of URLs etc).
>>>
>>> Instead of a function, you'll see above that I rewrote it to use :=,
>>> which creates a new column that holds a list. That worked wonderfully, but
>>> now how do I get these "out" of this data.table, and into a new one.
>>>
>>> Made-up example data:
>>> a <- c(1,2,3)
>>> b <- c(2,3,4)
>>> dt <- data.table(names=c("Stian", "Christian", "John"),
>>> numbers=list(a,b, NULL))
>>>
>>> Now my goal is to have a new data.table that looks like this
>>> Name Number
>>> Stian 1
>>> Stian 2
>>> Stian 3
>>> Christian 2
>>> Christian 3
>>> Christian 4
>>>
>>> Again, I'm sure I could do this with a for() or lapply? But I'd love to
>>> see the most elegant solution.
>>>
>>> Note that this:
>>>
>>> getUrls <- function(text, id) {
>>>   matches <- str_match_all(text, url_pattern)
>>>   data.frame(urls=unlist(matches), id=id)
>>> }
>>>
>>> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>>>
>>> Works perfectly, the result is
>>>   id urls id
>>> 1 16 https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166 16
>>> 2 24 http://www.youtube.com/watch?v=JUiGF4TGI9w 24
>>> 3 44 http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/ 44
>>> 4 61 http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html 61
>>> 5 75 http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html 75
>>> 6 75 https://www.facebook.com/photo.php?fbid=10151324672623754 75
>>>
>>> which is exactly what I was looking for. So I've really reached my goal,
>>> but I'm curious about the other method as well.
>>>
>>> Thanks!
>>> Stian
>>>
>>>
>>> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>>
>>>>
>>>> That was my thought too.  I don't know what str_match_all is,  but
>>>> given the unlist() in getUrls(),  it seems to return a list.   Rather than
>>>> unlist(),  leave it as list,  and data.table should happily make a `list`
>>>> column where each cell is itself a vector.  In fact each cell can be
>>>> anything at all,  even embedded data.table, function definitions, or any
>>>> type of object.
>>>> You might need a list(list(str_match_all(...))) in j to do that.
>>>>
>>>> Or what Rick has suggested here might work first time.  It's hard to
>>>> visualise it without a small reproducible example, so we're having to make
>>>> educated guesses.
>>>>
>>>> Many thanks for the kind words about data.table.
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>>>
>>>> In fact, you should be able to skip the function altogether and just
>>>> use:
>>>>
>>>>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>>>>
>>>>
>>>>  (and now, my apologies to all for the email clutter)
>>>> good night
>>>>
>>>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
>>>> saporta at scarletmail.rutgers.edu> wrote:
>>>>
>>>>> sorry, I probably should have elaborated  (it's late here, in NJ)
>>>>>
>>>>>  The error you are seeing is most likely coming from your getUrls
>>>>> function: it assigns several ids to a data.frame with a different
>>>>> number of rows, and `R` cannot recycle them correctly.
>>>>>
>>>>>  If you instead break it down by id, then each call assigns only
>>>>> one id and R will be able to recycle it appropriately, without
>>>>> issue.
>>>>>
>>>>>  good luck!
>>>>> Rick
>>>>>
>>>>>
>>>>>  Ricardo Saporta
>>>>>  Graduate Student, Data Analytics
>>>>> Rutgers University, New Jersey
>>>>> e: saporta at rutgers.edu
>>>>>
>>>>>
>>>>>
>>>>>   On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>>>>> saporta at scarletmail.rutgers.edu> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>>  Try inserting a `by=id` in
>>>>>>
>>>>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>>>>
>>>>>>  Also, there is no need for "has_url == T";
>>>>>> instead, use
>>>>>>   (has_url)
>>>>>> if the variable is already logical.  (Otherwise, you are just slowing
>>>>>> things down ;)
>>>>>>
>>>>>>
>>>>>>
>>>>>>  Ricardo Saporta
>>>>>> Graduate Student, Data Analytics
>>>>>> Rutgers University, New Jersey
>>>>>>  e: saporta at rutgers.edu
>>>>>>
>>>>>>
>>>>>>
>>>>>>  On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <shaklev at gmail.com> wrote:
>>>>>>
>>>>>>>  I'm trying to run a function on every row fulfilling a certain
>>>>>>> criterion, which returns a data frame - the idea is then to take the list
>>>>>>> of data frames and rbindlist them together for a totally separate
>>>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>>>> tagging them with the forum post they came from).
>>>>>>>
>>>>>>>  I tried doing this with a data.table
>>>>>>>
>>>>>>>  a <- db[has_url == T, getUrls(text, id)]
>>>>>>>
>>>>>>>  and get the message
>>>>>>>
>>>>>>>  Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>>>>> 4L,  :
>>>>>>>   replacement has 11007 rows, data has 29787
>>>>>>>
>>>>>>>  Because some rows have several URLs... However, I don't care that
>>>>>>> these row lengths don't match, I still want these rows :) I thought `j`
>>>>>>> would just let me execute arbitrary R code in the context of the rows,
>>>>>>> with the columns available as variable names, etc.
>>>>>>>
>>>>>>>  Here's the function it's running, but that shouldn't be relevant
>>>>>>>
>>>>>>>  getUrls <- function(text, id) {
>>>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>>>   a$id <- id
>>>>>>>   a
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>  Thanks, and thanks for an amazing package - data.table has made my
>>>>>>> life so much easier. It should be part of base, I think.
>>>>>>> Stian Haklev, University of Toronto
>>>>>>>
>>>>>>>  --
>>>>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>>>>
>>>>>>>  _______________________________________________
>>>>>>> datatable-help mailing list
>>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>>
>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>