[datatable-help] Using data.table to run a function on every row
Matthew Dowle
mdowle at mdowle.plus.com
Fri Sep 27 14:48:41 CEST 2013
That was my thought too. I don't know what str_match_all is, but given
the unlist() in getUrls(), it seems to return a list. Rather than
unlist(), leave it as list, and data.table should happily make a
`list` column where each cell is itself a vector. In fact each cell can
be anything at all, even embedded data.table, function definitions, or
any type of object.
You might need a list(list(str_match_all(...))) in j to do that.
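As a rough sketch of the list-column idea (not taken from the original post): the toy `db`, its contents, and `url_pattern` below are assumptions, since neither was shared in the thread. Wrapping the per-post matches in list() inside j, grouped by id, should give one row per post with all of its URLs held in a single cell.

```r
library(data.table)
library(stringr)

# Hypothetical pattern and toy data; the real db and url_pattern were not posted.
url_pattern <- "https?://\\S+"
db <- data.table(
  id      = 1:3,
  text    = c("see http://a.com and http://b.org",
              "no links here",
              "only http://c.net"),
  has_url = c(TRUE, FALSE, TRUE)
)

# str_match_all() returns a list holding one character matrix per input string.
# Within each id group, unlist() flattens that to a character vector, and the
# outer list() makes it a single cell of a list column.
urls_by_post <- db[(has_url),
                   .(urls = list(unlist(str_match_all(text, url_pattern)))),
                   by = id]

# Each cell can hold a different number of URLs.
lengths(urls_by_post$urls)
```

The list column keeps the posts and their URLs in one table; it can be flattened later with unlist() grouped by id if a long format is needed.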
Or what Rick has suggested here might work first time. It's hard to
visualise it without a small reproducible example, so we're having to
make educated guesses.
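For comparison, a small sketch of the `by=id` route Rick describes, again on made-up data (`db` and `url_pattern` are assumptions, not the poster's). It first reproduces the recycling error from the ungrouped call, then shows a grouped call returning one row per URL.

```r
library(data.table)
library(stringr)

# Hypothetical pattern and toy data; the real db and url_pattern were not posted.
url_pattern <- "https?://\\S+"
db <- data.table(
  id      = 1:3,
  text    = c("see http://a.com and http://b.org",
              "no links here",
              "only http://c.net"),
  has_url = c(TRUE, FALSE, TRUE)
)

# The function as posted in the thread.
getUrls <- function(text, id) {
  matches <- str_match_all(text, url_pattern)
  a <- data.frame(urls = unlist(matches))
  a$id <- id
  a
}

# Ungrouped: id has 2 values but the data.frame has 3 rows, so the
# assignment a$id <- id cannot recycle and fails.
err <- tryCatch(db[(has_url), getUrls(text, id)],
                error = function(e) conditionMessage(e))

# Grouped: j runs once per id and only needs to return the URLs;
# by = id adds the id column itself and repeats it for every URL.
a <- db[(has_url), .(url = unlist(str_match_all(text, url_pattern))), by = id]
```

Here j returns only the URLs rather than calling getUrls(), because the grouped result already carries an id column and the function would add a second one.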
Many thanks for the kind words about data.table.
Matthew
On 27/09/13 07:44, Ricardo Saporta wrote:
> In fact, you should be able to skip the function altogether and just use:
>
> db[ (has_url), str_match_all(text, url_pattern), by=id]
>
>
> (and now, my apologies to all for the email clutter)
> good night
>
> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta
> <saporta at scarletmail.rutgers.edu
> <mailto:saporta at scarletmail.rutgers.edu>> wrote:
>
> sorry, I probably should have elaborated (it's late here, in NJ)
>
> The error you are seeing is most likely coming from your getUrls
> function: it assigns a vector of several ids to a data.frame with a
> different number of rows, and `R` cannot recycle it correctly.
>
> If you instead break it down by id, then each time you are only
> assigning one id and R will be able to recycle appropriately,
> without issue.
>
> good luck!
> Rick
>
>
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu <mailto:saporta at rutgers.edu>
>
>
>
> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta
> <saporta at scarletmail.rutgers.edu
> <mailto:saporta at scarletmail.rutgers.edu>> wrote:
>
> Hi there,
>
> Try inserting a `by=id` in
>
> a <- db[(has_url), getUrls(text, id), by=id]
>
> Also, no need for "has_url == T"
> instead, use
> (has_url)
> If the variable is already logical. (Otherwise, you are just
> slowing things down ;)
>
>
>
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu <mailto:saporta at rutgers.edu>
>
>
>
> On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev
> <shaklev at gmail.com <mailto:shaklev at gmail.com>> wrote:
>
> I'm trying to run a function on every row fulfilling a
> certain criterion, which returns a data frame - the idea
> is then to take the list of data frames and rbindlist them
> together for a totally separate data.table. (I'm
> extracting several URL links from each forum post, and
> tagging them with the forum post they came from).
>
> I tried doing this with a data.table
>
> a <- db[has_url == T, getUrls(text, id)]
>
> and get the message
>
> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L,
> 1L, 2L, 4L, :
> replacement has 11007 rows, data has 29787
>
> Because some rows have several URLs... However, I don't
> care that these row lengths don't match, I still want these
> rows :) I thought j would just let me execute arbitrary R
> code in the context of the rows as variable names, etc.
>
> Here's the function it's running, but that shouldn't be
> relevant
>
> getUrls <- function(text, id) {
> matches <- str_match_all(text, url_pattern)
> a <- data.frame(urls=unlist(matches))
> a$id <- id
> a
> }
>
>
> Thanks, and thanks for an amazing package - data.table has
> made my life so much easier. It should be part of base, I
> think.
> Stian Haklev, University of Toronto
>
> --
> http://reganmian.net/blog -- Random Stuff that Matters
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help