[datatable-help] Using data.table to run a function on every row

Matthew Dowle mdowle at mdowle.plus.com
Fri Sep 27 20:49:15 CEST 2013


Stian,

datatable-help isn't really for this kind of question. It's a very good
question and belongs on S.O., where you can edit it in response to
comments. datatable-help is more for discussion of future developments,
notices, things that aren't allowed on S.O., etc.

This was your example :

 > a <- c(1,2,3)
 > b <- c(2,3,4)
 > dt <- data.table(names=c("Stian", "Christian", "John"),
 +                  numbers=list(a,b, NULL))

The output of that is :

 > dt
        names numbers
1:     Stian   1,2,3
2: Christian   2,3,4
3:      John

Are you possibly mistaken about the output of list columns?  Those 
commas are just how it displays.  They aren't strings in the numbers 
column.  The `numbers` column is a list column where each item is a vector.
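
A minimal sketch of how one could check this, rebuilding the example table above (the variable name `cell_lengths` is only illustrative, and the data.table package is assumed to be installed):

```r
library(data.table)

a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# The column itself is a list, not a character vector.
class(dt$numbers)

# Each cell holds an ordinary numeric vector (or NULL for John).
dt$numbers[[1]]
is.numeric(dt$numbers[[1]])

# Number of values stored in each cell.
cell_lengths <- vapply(dt$numbers, length, integer(1))
cell_lengths
```

If the cells were strings such as "1,2,3", `is.numeric()` would return FALSE and every cell would have length 1.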

To get the output you asked for it's just :

 > dt[,unlist(numbers),by=names]
        names V1
1:     Stian  1
2:     Stian  2
3:     Stian  3
4: Christian  2
5: Christian  3
6: Christian  4
 >
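
As a self-contained sketch of the same call, with the result column renamed afterwards (the names `long` and `number` are just illustrative choices, not anything the original question fixed):

```r
library(data.table)

dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(c(1, 2, 3), c(2, 3, 4), NULL))

# One row per value in each list cell, tagged with the name it came from.
long <- dt[, unlist(numbers), by = names]

# The unnamed j result is called V1 by default; rename it by reference.
setnames(long, "V1", "number")

# John's cell is NULL, so that group contributes no rows.
long
```

Grouping by `names` assumes the names are unique; with duplicates, the values of the duplicated rows would be pooled into one group, so a row id would be the safer grouping column.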

If I've misunderstood,  then please start again with a new question on S.O.

http://stackoverflow.com/questions/tagged/data.table

Thanks,
Matthew




On 27/09/13 18:25, Ricardo Saporta wrote:
> hm... not sure about `j`  (sorry, I haven't taken a close look at your 
> code), but my comment was to point out that these two statements are 
> different:
>
>    DT [  TRUE,   ]
>    DT [ .(TRUE), ]
>
> The first one is giving you the whole data.table
>    DT[TRUE, ]  is the same as DT
> (since TRUE is getting recycled)
>
> The second one is giving you all rows within DT where the first column 
> of the key has a value of TRUE.
>
>
>
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu
>
>
>
> On Fri, Sep 27, 2013 at 12:20 PM, Stian Håklev <shaklev at gmail.com> wrote:
>
>     > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>        user  system elapsed
>      19.610   0.475  20.304
>     > system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
>     Error in `[.data.table`(db, .(T), `:=`(matches,
>     str_match_all(text, url_pattern))) :
>       All items in j=list(...) should be atomic vectors or lists. If
>     you are trying something like j=list(.SD,newcol=mean(colA)) then
>     use := by group instead (much quicker), or cbind or merge afterwards.
>     Timing stopped at: 6.339 0.043 6.403
>
>
>     On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta
>     <saporta at scarletmail.rutgers.edu> wrote:
>
>         Hi Stian,
>
>         Try the following two and look at the difference:
>
>          db[T, matches := str_match_all(text, url_pattern)]
>          db[.(T), matches := str_match_all(text, url_pattern)]
>
>         ;)
>
>
>
>         On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev
>         <shaklev at gmail.com> wrote:
>
>             I really appreciate all your help - amazingly supportive
>             community. I could probably figure out a "brute-force" way
>             of doing things, but since I'm going to be writing a lot
>             of R in the future too, I always want to find the
>             "correct" way of doing it, which both looks clear, and is
>             quick. (I come from a background in Ruby, and am always
>             interested in writing very clear and DRY (do not repeat
>             yourself) code, but I find I still spend a lot of time in
>             R struggling with various data formats - lists, nested
>             lists, vectors, matrices, different forms of
>             apply/ddply/for loops etc).
>
>             Anyway, a few different points.
>
>             I tried db[has_url,], but got "object has_url not found"
>
>             I then tried setkey(db, "has_url"), and using this, but
>             somehow it was a lot slower than what I used to do (I
>             repeated a few times). Not sure if I'm doing it wrong.
>             (Not important - even 15 sec is totally fine, I'll only
>             run this once. But good to understand the underlying
>             principles).
>
>             setkey(db, "has_url")
>             > system.time( db[T, matches := str_match_all(text,
>             url_pattern)] )
>                user  system elapsed
>              17.514   0.334  17.847
>             > system.time( db[has_url == T, matches :=
>             str_match_all(text, url_pattern)] )
>                user  system elapsed
>               5.943   0.040   5.984
>
>             The second point was how to get out the matches. The idea
>             was that you have a text field which might contain several
>             urls, which I want to extract, but I need each URL tagged
>             with the row it came from (so I can link it back to
>             properties of the post and author, look at whether certain
>             students are more likely to post certain kinds of URLs etc).
>
>             Instead of a function, you'll see above that I rewrote it
>             to use :=, which creates a new column that holds a list.
>             That worked wonderfully, but now how do I get these "out"
>             of this data.table and into a new one?
>
>             Made-up example data:
>             a <- c(1,2,3)
>             b <- c(2,3,4)
>             dt <- data.table(names=c("Stian", "Christian", "John"),
>             numbers=list(a,b, NULL))
>
>             Now my goal is to have a new data.table that looks like this
>             Name Number
>             Stian 1
>             Stian 2
>             Stian 3
>             Christian 2
>             Christian 3
>             Christian 4
>
>             Again, I'm sure I could do this with a for() or lapply?
>             But I'd love to see the most elegant solution.
>
>             Note that this:
>
>             getUrls <- function(text, id) {
>               matches <- str_match_all(text, url_pattern)
>               data.frame(urls=unlist(matches), id=id)
>             }
>
>             system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>
>             Works perfectly, the result is
>
>                id  urls  id
>             1  16  https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166  16
>             2  24  http://www.youtube.com/watch?v=JUiGF4TGI9w  24
>             3  44  http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/  44
>             4  61  http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html  61
>             5  75  http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html  75
>             6  75  https://www.facebook.com/photo.php?fbid=10151324672623754  75
>
>
>             which is exactly what I was looking for. So I've really
>             reached my goal, but I'm curious about the other method as
>             well.
>
>             Thanks!
>             Stian
>
>
>             On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle
>             <mdowle at mdowle.plus.com> wrote:
>
>
>                 That was my thought too.  I don't know what
>                 str_match_all is,  but given the unlist() in
>                 getUrls(),  it seems to return a list.   Rather than
>                 unlist(),  leave it as list,  and data.table should
>                 happily make a `list` column where each cell is itself
>                 a vector.  In fact each cell can be anything at all, 
>                 even embedded data.table, function definitions, or any
>                 type of object.
>                 You might need a list(list(str_match_all(...))) in j
>                 to do that.
>
>                 Or what Rick has suggested here might work first
>                 time.  It's hard to visualise it without a small
>                 reproducible example, so we're having to make educated
>                 guesses.
>
>                 Many thanks for the kind words about data.table.
>
>                 Matthew
>
>
>
>                 On 27/09/13 07:44, Ricardo Saporta wrote:
>>                 In fact, you should be able to skip the function
>>                 altogether and just use:
>>
>>                    db[ (has_url), str_match_all(text, url_pattern),
>>                 by=id]
>>
>>
>>                 (and now, my apologies to all for the email clutter)
>>                 good night
>>
>>                 On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta
>>                 <saporta at scarletmail.rutgers.edu> wrote:
>>
>>                     sorry, I probably should have elaborated  (it's
>>                     late here, in NJ)
>>
>>                     The error you are seeing is most likely coming
>>                     from your getUrls function: without `by`, it
>>                     receives every matching row at once, so you are
>>                     assigning the full vector of ids to a data.frame
>>                     that has one row per URL, and R cannot recycle
>>                     the ids to that length.
>>
>>                     If you instead break it down by id, then each
>>                     call assigns only one id and R can recycle it
>>                     to however many URLs that row produced.
>>
>>                     good luck!
>>                     Rick
>>
>>
>>                     On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta
>>                     <saporta at scarletmail.rutgers.edu> wrote:
>>
>>                         Hi there,
>>
>>                         Try inserting a `by=id` in
>>
>>                         a <- db[(has_url), getUrls(text, id), by=id]
>>
>>                         Also, there is no need for "has_url == T".
>>                         Instead, use
>>                         (has_url)
>>                         if the variable is already logical.
>>                          (Otherwise, you are just slowing things down ;)
>>
>>
>>
>>
>>                         On Thu, Sep 26, 2013 at 11:16 PM, Stian
>>                         Håklev <shaklev at gmail.com> wrote:
>>
>>                             I'm trying to run a function on every row
>>                             fulfilling a certain criterion, which
>>                             returns a data frame - the idea is then
>>                             to take the list of data frames and
>>                             rbindlist them together for a totally
>>                             separate data.table. (I'm extracting
>>                             several URL links from each forum post,
>>                             and tagging them with the forum post they
>>                             came from).
>>
>>                             I tried doing this with a data.table
>>
>>                             a <- db[has_url == T, getUrls(text, id)]
>>
>>                             and get the message
>>
>>                             Error in `$<-.data.frame`(`*tmp*`, "id",
>>                             value = c(1L, 6L, 1L, 2L, 4L,  :
>>                             replacement has 11007 rows, data has 29787
>>
>>                             Because some rows have several URLs...
>>                             However, I don't care that these
>>                             row lengths don't match, I still want
>>                             these rows :) I thought `j` would just let
>>                             me execute arbitrary R code in the
>>                             context of the rows, with columns as
>>                             variable names, etc.
>>
>>                             Here's the function it's running, but
>>                             that shouldn't be relevant
>>
>>                             getUrls <- function(text, id) {
>>                               matches <- str_match_all(text, url_pattern)
>>                               a <- data.frame(urls=unlist(matches))
>>                               a$id <- id
>>                               a
>>                             }
>>
>>
>>                             Thanks, and thanks for an amazing package
>>                             - data.table has made my life so much
>>                             easier. It should be part of base, I think.
>>                             Stian Haklev, University of Toronto
>>
>>                             -- 
>>                             http://reganmian.net/blog -- Random Stuff
>>                             that Matters
>>
>>                             _______________________________________________
>>                             datatable-help mailing list
>>                             datatable-help at lists.r-forge.r-project.org
>>                             https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
