[datatable-help] Using data.table to run a function on every row

Matthew Dowle mdowle at mdowle.plus.com
Fri Sep 27 14:48:41 CEST 2013


That was my thought too. I don't know what str_match_all is, but given
the unlist() in getUrls(), it seems to return a list. Rather than
calling unlist(), leave it as a list, and data.table should happily make a
`list` column where each cell is itself a vector. In fact each cell can
be anything at all, even an embedded data.table, a function definition, or
any other type of object.
You might need a list(list(str_match_all(...))) in j to do that.
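
As a rough sketch of the list-column idea (not run against your data): the
table `db`, the deliberately simplistic `url_pattern`, and the helper
`find_urls()` below are all made up for illustration, with base R's
regmatches()/gregexpr() standing in for str_match_all().

```r
library(data.table)

# Hypothetical toy data standing in for the forum posts
db <- data.table(
  id      = 1:3,
  text    = c("see http://a.org and http://b.org",
              "no links here",
              "only http://c.org"),
  has_url = c(TRUE, FALSE, TRUE)
)

# Assumed pattern, far too simple for real URLs
url_pattern <- "http://[^ ]+"

# Stand-in for str_match_all(): returns a list holding one character
# vector of matches per input string
find_urls <- function(x) regmatches(x, gregexpr(url_pattern, x))

# j returns a list whose second element is itself a list, so `urls`
# becomes a list column: one cell per row, each cell a vector of URLs
res <- db[(has_url), list(id, urls = find_urls(text))]
```

Here `res$urls` is a list column with one cell per matching post. If one
row per URL is what you want in the end,
res[, list(url = unlist(urls)), by = id] should flatten it.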

Or what Rick has suggested here might work first time.  It's hard to 
visualise it without a small reproducible example, so we're having to 
make educated guesses.
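
Something like the following is the kind of small reproducible example
that would help. It is only a guess at the shape of your data: `db`,
`url_pattern` and `find_urls()` are hypothetical, and base R's
regmatches()/gregexpr() replace str_match_all() so that it runs without
extra packages. It groups by id, as Rick suggests, so each group's single
id is recycled against however many URLs that post contains.

```r
library(data.table)

# Hypothetical toy data: one forum post per row
db <- data.table(
  id      = 1:3,
  text    = c("see http://a.org and http://b.org",
              "no links here",
              "only http://c.org"),
  has_url = c(TRUE, FALSE, TRUE)
)

# Assumed pattern, far too simple for real URLs
url_pattern <- "http://[^ ]+"

# Stand-in for str_match_all(): one character vector of matches per string
find_urls <- function(x) regmatches(x, gregexpr(url_pattern, x))

# Within each id group, j returns a vector of any length;
# data.table repeats the group's id alongside it
long <- db[(has_url), list(url = unlist(find_urls(text))), by = id]
```

With this toy data `long` has one row per URL, tagged with the id of the
post it came from, and no data.frame assignment is involved.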

Many thanks for the kind words about data.table.

Matthew


On 27/09/13 07:44, Ricardo Saporta wrote:
> In fact, you should be able to skip the function altogether and just use:
>
>    db[ (has_url), str_match_all(text, url_pattern), by=id]
>
>
> (and now, my apologies to all for the email clutter)
> good night
>
> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta
> <saporta at scarletmail.rutgers.edu> wrote:
>
>     sorry, I probably should have elaborated  (it's late here, in NJ)
>
>     The error you are seeing is most likely coming from your getUrls
>     function: it assigns several ids to a data.frame with a different
>     number of rows, and `R` cannot recycle them correctly.
>
>     If you instead break it down by id, then each time you are only
>     assigning one id and R will be able to recycle appropriately,
>     without issue.
>
>     good luck!
>     Rick
>
>
>     Ricardo Saporta
>     Graduate Student, Data Analytics
>     Rutgers University, New Jersey
>     e: saporta at rutgers.edu
>
>
>
>     On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta
>     <saporta at scarletmail.rutgers.edu> wrote:
>
>         Hi there,
>
>         Try inserting a `by=id` in
>
>         a <- db[(has_url), getUrls(text, id), by=id]
>
>         Also, no need for "has_url == T";
>         instead, use
>         (has_url)
>         if the variable is already logical.  (Otherwise, you are just
>         slowing things down ;)
>
>
>
>         Ricardo Saporta
>         Graduate Student, Data Analytics
>         Rutgers University, New Jersey
>         e: saporta at rutgers.edu
>
>
>
>         On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev
>         <shaklev at gmail.com> wrote:
>
>             I'm trying to run a function that returns a data frame on
>             every row fulfilling a certain criterion - the idea
>             is then to take the list of data frames and rbindlist them
>             together into a totally separate data.table. (I'm
>             extracting several URL links from each forum post, and
>             tagging them with the forum post they came from).
>
>             I tried doing this with a data.table
>
>             a <- db[has_url == T, getUrls(text, id)]
>
>             and get the message
>
>             Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L,
>             1L, 2L, 4L,  :
>               replacement has 11007 rows, data has 29787
>
>             Because some rows have several URLs... However, I don't
>             care that these row counts don't match, I still want these
>             rows :) I thought j would just let me execute arbitrary R
>             code with the columns available as variable names, etc.
>
>             Here's the function it's running, but that shouldn't be
>             relevant
>
>             getUrls <- function(text, id) {
>               matches <- str_match_all(text, url_pattern)
>               a <- data.frame(urls=unlist(matches))
>               a$id <- id
>               a
>             }
>
>
>             Thanks, and thanks for an amazing package - data.table has
>             made my life so much easier. It should be part of base, I
>             think.
>             Stian Haklev, University of Toronto
>
>             -- 
>             http://reganmian.net/blog -- Random Stuff that Matters
>
>             _______________________________________________
>             datatable-help mailing list
>             datatable-help at lists.r-forge.r-project.org
>             https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
