[datatable-help] Using data.table to run a function on every row

Ricardo Saporta saporta at scarletmail.rutgers.edu
Fri Sep 27 08:44:37 CEST 2013


In fact, you should be able to skip the function altogether and just use:

   db[ (has_url), str_match_all(text, url_pattern), by=id]
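
A minimal sketch of that idea on made-up data, in case it helps. The toy `db` and the `url_pattern` here are assumptions of mine (the real ones were not posted), and I have wrapped the call in `unlist()` and named the column `urls`, which is an addition to the one-liner above, so that each match ends up in its own row:

```r
library(data.table)
library(stringr)

# Hypothetical pattern and toy data, for illustration only.
url_pattern <- "https?://\\S+"
db <- data.table(
  id      = 1:3,
  text    = c("see http://a.org and http://b.org",
              "no links here",
              "try https://c.net"),
  has_url = c(TRUE, FALSE, TRUE)
)

# Grouping by id means j is evaluated once per post. The matches for that
# post become a column, and data.table repeats the single id next to each one.
urls <- db[(has_url),
           list(urls = unlist(str_match_all(text, url_pattern))),
           by = id]

print(urls)
```

Because each group carries exactly one id, there is nothing to recycle by hand, and posts with several URLs simply contribute several rows.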


(and now, my apologies to all for the email clutter)
good night

On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
saporta at scarletmail.rutgers.edu> wrote:

> Sorry, I probably should have elaborated (it's late here in NJ).
>
> The error you are seeing is most likely coming from your getUrls function:
> it assigns the ids of all the selected rows at once to a data.frame that
> has a different number of rows (one per URL found), and `R` cannot recycle
> them to fit.
>
> If you instead break the call down by id, then each call assigns only one
> id, and R can recycle it across that post's URLs without issue.
>
> good luck!
> Rick
>
>
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu
>
>
>
> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
> saporta at scarletmail.rutgers.edu> wrote:
>
>> Hi there,
>>
>> Try inserting a `by=id` in
>>
>>    a <- db[(has_url), getUrls(text, id), by=id]
>>
>> Also, there is no need for "has_url == T". Instead, use
>>   (has_url)
>> if the variable is already logical. (Otherwise, you are just slowing
>> things down ;)
>>
>>
>> On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <shaklev at gmail.com> wrote:
>>
>>> I'm trying to run a function on every row fulfilling a certain
>>> criterion. The function returns a data frame, and the idea is then to take
>>> the list of data frames and rbindlist them together into a totally separate
>>> data.table. (I'm extracting several URL links from each forum post, and
>>> tagging them with the forum post they came from.)
>>>
>>> I tried doing this with a data.table
>>>
>>> a <- db[has_url == T, getUrls(text, id)]
>>>
>>> and get the message
>>>
>>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L,
>>>  :
>>>   replacement has 11007 rows, data has 29787
>>>
>>> Because some rows have several URLs... However, I don't care that these
>>> row counts don't match, I still want these rows :) I thought j would just
>>> let me execute arbitrary R code with the columns available as variable
>>> names, etc.
>>>
>>> Here's the function it's running, but that shouldn't be relevant
>>>
>>> getUrls <- function(text, id) {
>>>   matches <- str_match_all(text, url_pattern)
>>>   a <- data.frame(urls=unlist(matches))
>>>   a$id <- id
>>>   a
>>> }
>>>
>>>
>>> Thanks, and thanks for an amazing package - data.table has made my life
>>> so much easier. It should be part of base, I think.
>>> Stian Haklev, University of Toronto
>>>
>>> --
>>> http://reganmian.net/blog -- Random Stuff that Matters
>>
>>
>