[datatable-help] Using data.table to run a function on every row
Matthew Dowle
mdowle at mdowle.plus.com
Fri Sep 27 20:49:15 CEST 2013
Stian,
datatable-help isn't really for this kind of question. It's a very good
question and belongs on S.O. where you can edit it given comments.
datatable-help is more for discussion about future developments,
notices, things that aren't allowed on S.O., etc.
This was your example :
> a <- c(1,2,3)
> b <- c(2,3,4)
> dt <- data.table(names=c("Stian", "Christian", "John"),
                   numbers=list(a,b, NULL))
The output of that is :
> dt
       names numbers
1:     Stian   1,2,3
2: Christian   2,3,4
3:      John
Are you possibly mistaken about the output of list columns? Those
commas are just how it displays. They aren't strings in the numbers
column. The `numbers` column is a list column where each item is a vector.
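A quick way to confirm this, as a minimal sketch on the same example data (it assumes only that the data.table package is installed):

```r
library(data.table)

a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# The column itself is a list, not a character vector
class(dt$numbers)

# Each cell is an ordinary numeric vector (or NULL for John)
dt$numbers[[1]]
is.numeric(dt$numbers[[1]])

# Number of elements held in each cell
sapply(dt$numbers, length)
```

Double brackets (`[[1]]`) pull out the vector stored in the first cell, whereas single brackets would return a one-item list.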
To get the output you asked for it's just :
> dt[,unlist(numbers),by=names]
       names V1
1:     Stian  1
2:     Stian  2
3:     Stian  3
4: Christian  2
5: Christian  3
6: Christian  4
>
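For completeness, here is a sketch of the same flattening with a named result column, next to an ungrouped alternative. The object names `long1` and `long2` and the column name `number` are made up for illustration; `lengths()` needs base R 3.2.0 or later.

```r
library(data.table)

a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# Grouped form, as above; the unnamed result column is called V1, so rename it
long1 <- dt[, unlist(numbers), by = names]
setnames(long1, "V1", "number")

# Ungrouped alternative: repeat each name once per element of its vector,
# then flatten the whole list column in one call
long2 <- data.table(names  = rep(dt$names, lengths(dt$numbers)),
                    number = unlist(dt$numbers))

long1
long2
```

John does not appear in either result, because his cell is NULL and so contributes no elements. The ungrouped form may be worth trying when there are very many groups, since it avoids evaluating `j` once per group.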
If I've misunderstood, then please start again with a new question on S.O.
http://stackoverflow.com/questions/tagged/data.table
Thanks,
Matthew
On 27/09/13 18:25, Ricardo Saporta wrote:
> hm... not sure about `j` (sorry, I haven't taken a close look at your
> code), but my comment was to point out that these two statements are
> different:
>
> DT [ TRUE, ]
> DT [ .(TRUE), ]
>
> The first one is giving you the whole data.table
> DT[TRUE, ] is the same as DT
> (since TRUE is getting recycled)
>
> The second one is giving you all rows within DT where the first column
> of the key has a value of TRUE.
>
>
>
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu
>
>
>
> On Fri, Sep 27, 2013 at 12:20 PM, Stian Håklev <shaklev at gmail.com> wrote:
>
> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
> user system elapsed
> 19.610 0.475 20.304
> > system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
> Error in `[.data.table`(db, .(T), `:=`(matches,
> str_match_all(text, url_pattern))) :
> All items in j=list(...) should be atomic vectors or lists. If
> you are trying something like j=list(.SD,newcol=mean(colA)) then
> use := by group instead (much quicker), or cbind or merge afterwards.
> Timing stopped at: 6.339 0.043 6.403
>
>
> On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta <saporta at scarletmail.rutgers.edu> wrote:
>
> Hi Stian,
>
> Try the following two and look at the difference:
>
> db[T, matches := str_match_all(text, url_pattern)]
> db[.(T), matches := str_match_all(text, url_pattern)]
>
> ;)
>
>
>
> On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev <shaklev at gmail.com> wrote:
>
> I really appreciate all your help - amazingly supportive
> community. I could probably figure out a "brute-force" way
> of doing things, but since I'm going to be writing a lot
> of R in the future too, I always want to find the
> "correct" way of doing it, which both looks clear, and is
> quick. (I come from a background in Ruby, and am always
> interested in writing very clear and DRY (do not repeat
> yourself) code, but I find I still spend a lot of time in
> R struggling with various data formats - lists, nested
> lists, vectors, matrices, different forms of
> apply/ddply/for loops etc).
>
> Anyway, a few different points.
>
> I tried db[has_url,], but got "object has_url not found"
>
> I then tried setkey(db, "has_url"), and using this, but
> somehow it was a lot slower than what I used to do (I
> repeated a few times). Not sure if I'm doing it wrong.
> (Not important - even 15 sec is totally fine, I'll only
> run this once. But good to understand the underlying
> principles).
>
> setkey(db, "has_url")
> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
> user system elapsed
> 17.514 0.334 17.847
> > system.time( db[has_url == T, matches := str_match_all(text, url_pattern)] )
> user system elapsed
> 5.943 0.040 5.984
>
> The second point was how to get out the matches. The idea
> was that you have a text field which might contain several
> urls, which I want to extract, but I need each URL tagged
> with the row it came from (so I can link it back to
> properties of the post and author, look at whether certain
> students are more likely to post certain kinds of URLs etc).
>
> Instead of a function, you'll see above that I rewrote it
> to use :=, which creates a new column that holds a list.
> That worked wonderfully, but now how do I get these "out"
> of this data.table, and into a new one?
>
> Made-up example data:
> a <- c(1,2,3)
> b <- c(2,3,4)
> dt <- data.table(names=c("Stian", "Christian", "John"),
> numbers=list(a,b, NULL))
>
> Now my goal is to have a new data.table that looks like this
> Name Number
> Stian 1
> Stian 2
> Stian 3
> Christian 2
> Christian 3
> Christian 4
>
> Again, I'm sure I could do this with a for() or lapply(),
> but I'd love to see the most elegant solution.
>
> Note that this:
>
> getUrls <- function(text, id) {
>   matches <- str_match_all(text, url_pattern)
>   data.frame(urls=unlist(matches), id=id)
> }
>
> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>
> Works perfectly, the result is
>
> id urls id
> 1 16 https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166 16
> 2 24 http://www.youtube.com/watch?v=JUiGF4TGI9w 24
> 3 44 http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/ 44
> 4 61 http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html 61
> 5 75 http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html 75
> 6 75 https://www.facebook.com/photo.php?fbid=10151324672623754 75
>
>
> which is exactly what I was looking for. So I've really
> reached my goal, but I'm curious about the other method as
> well.
>
> Thanks!
> Stian
>
>
> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>
> That was my thought too. I don't know what
> str_match_all is, but given the unlist() in
> getUrls(), it seems to return a list. Rather than
> unlist(), leave it as list, and data.table should
> happily make a `list` column where each cell is itself
> a vector. In fact each cell can be anything at all,
> even embedded data.table, function definitions, or any
> type of object.
> You might need a list(list(str_match_all(...))) in j
> to do that.
>
> Or what Rick has suggested here might work first
> time. It's hard to visualise it without a small
> reproducible example, so we're having to make educated
> guesses.
>
> Many thanks for the kind words about data.table.
>
> Matthew
>
>
>
> On 27/09/13 07:44, Ricardo Saporta wrote:
>> In fact, you should be able to skip the function
>> altogether and just use:
>>
>> db[ (has_url), str_match_all(text, url_pattern),
>> by=id]
>>
>>
>> (and now, my apologies to all for the email clutter)
>> good night
>>
>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <saporta at scarletmail.rutgers.edu> wrote:
>>
>> sorry, I probably should have elaborated (it's
>> late here, in NJ)
>>
>> The error you are seeing is most likely coming
>> from your getUrls function, in that you are adding
>> several ids to a data.frame with a different number
>> of rows, and `R` cannot recycle them correctly.
>>
>> If you instead break it down by id, then each time
>> you are only assigning one id and R will be able
>> to recycle appropriately, without issue.
>>
>> good luck!
>> Rick
>>
>>
>> Ricardo Saporta
>> Graduate Student, Data Analytics
>> Rutgers University, New Jersey
>> e: saporta at rutgers.edu
>>
>>
>>
>> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <saporta at scarletmail.rutgers.edu> wrote:
>>
>> Hi there,
>>
>> Try inserting a `by=id` in
>>
>> a <- db[(has_url), getUrls(text, id), by=id]
>>
>> Also, no need for "has_url == T";
>> instead, use
>> (has_url)
>> if the variable is already logical.
>> (Otherwise, you are just slowing things down ;)
>>
>>
>>
>> Ricardo Saporta
>> Graduate Student, Data Analytics
>> Rutgers University, New Jersey
>> e: saporta at rutgers.edu
>>
>>
>>
>> On Thu, Sep 26, 2013 at 11:16 PM, Stian Håklev <shaklev at gmail.com> wrote:
>>
>> I'm trying to run a function on every row
>> fulfilling a certain criterion, which
>> returns a data frame - the idea is then
>> to take the list of data frames and
>> rbindlist them together for a totally
>> separate data.table. (I'm extracting
>> several URL links from each forum post,
>> and tagging them with the forum post they
>> came from).
>>
>> I tried doing this with a data.table
>>
>> a <- db[has_url == T, getUrls(text, id)]
>>
>> and get the message
>>
>> Error in `$<-.data.frame`(`*tmp*`, "id",
>> value = c(1L, 6L, 1L, 2L, 4L, :
>> replacement has 11007 rows, data has 29787
>>
>> Because some rows have several URLs...
>> However, I don't care that these
>> row lengths don't match, I still want
>> these rows :) I thought `j` would just let
>> me execute arbitrary R code in the
>> context of the rows as variable names, etc.
>>
>> Here's the function it's running, but
>> that shouldn't be relevant
>>
>> getUrls <- function(text, id) {
>>   matches <- str_match_all(text, url_pattern)
>>   a <- data.frame(urls=unlist(matches))
>>   a$id <- id
>>   a
>> }
>>
>>
>> Thanks, and thanks for an amazing package
>> - data.table has made my life so much
>> easier. It should be part of base, I think.
>> Stian Haklev, University of Toronto
>>
>> --
>> http://reganmian.net/blog -- Random Stuff
>> that Matters
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help