<div dir="ltr">OK, so I just realized a few things. <div><br>First of all, I should have wrapped has_url in parentheses to use it as a logical index (as Ricardo did; I just didn't notice that it was important). However, this still doesn't make much of a difference, because we're only talking about 146k entries, and most of the time is spent on the string extraction:</div>

<div><br></div><div><div>> system.time( a <- db[(has_url), getUrls(text, id), by=id] )</div><div>   user  system elapsed </div><div> 10.246   0.027  10.275 </div><div>> system.time( a <- db[has_url == T, getUrls(text, id), by=id] )</div>

<div>   user  system elapsed </div><div> 10.094   0.029  10.123 </div></div><div><br></div><div style>Either way, good to know!</div><div style><br></div><div style>Secondly, I tried this form:</div><div style><div>system.time( b <- db[(has_url), </div>

<div>                     j=list(urls = str_match_all(text, url_pattern)), </div><div>                     by=id] )</div><div><br></div><div><br></div><div style>The problem is that it only accepts one value per row, so the output format looks exactly like what I want, but I get only one row per id rather than one row per URL:</div>

<div style><div>> nrow(db) # all records</div><div>[1] 146058</div><div>> nrow(a) # using the function getUrls</div><div>[1] 30019</div><div>> nrow(b) # using str_match_all directly with j=list</div><div style>[1] 11007 </div>

<div>> length(unique(a$id)) # similar number of IDs, but not similar number of URLs</div><div>[1] 11007</div><div>> length(unique(b$id))<br></div><div>[1] 11007</div><div><br></div><div style>thanks again,</div><div style>

Stian</div></div></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev <span dir="ltr"><<a href="mailto:shaklev@gmail.com" target="_blank">shaklev@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I really appreciate all your help - amazingly supportive community. I could probably figure out a "brute-force" way of doing things, but since I'm going to be writing a lot of R in the future too, I always want to find the "correct" way of doing it, which both looks clear, and is quick. (I come from a background in Ruby, and am always interested in writing very clear and DRY (do not repeat yourself) code, but I find I still spend a lot of time in R struggling with various data formats - lists, nested lists, vectors, matrices, different forms of apply/ddply/for loops etc). <div>



<br></div><div>Anyway, a few different points.</div><div><br></div><div>I tried db[has_url,], but got "object has_url not found"</div><div><br></div><div>I then tried setkey(db, "has_url"), and using this, but somehow it was a lot slower than what I used to do (I repeated a few times). Not sure if I'm doing it wrong. (Not important - even 15 sec is totally fine, I'll only run this once. But good to understand the underlying principles).</div>



<div><br></div><div><div>setkey(db, "has_url")</div><div>> system.time( db[T, matches := str_match_all(text, url_pattern)] )</div><div>   user  system elapsed </div><div> 17.514   0.334  17.847 </div><div>> system.time( db[has_url == T, matches := str_match_all(text, url_pattern)] )</div>



<div>   user  system elapsed </div><div>  5.943   0.040   5.984 </div></div><div><br></div><div>The second point was how to get out the matches. The idea was that you have a text field which might contain several urls, which I want to extract, but I need each URL tagged with the row it came from (so I can link it back to properties of the post and author, look at whether certain students are more likely to post certain kinds of URLs etc).</div>
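<div><br></div><div>A note on the keyed timing above, as a minimal sketch on made-up data (the table toy and its values are hypothetical, not taken from the real db). A bare T in i is a logical value that gets recycled, so it selects every row rather than doing a key lookup, which may be why that call was the slowest of the two. A lookup on a logical key would look more like toy[J(TRUE)]:</div>

```r
library(data.table)

# Hypothetical toy table standing in for db
toy <- data.table(id = 1:6,
                  has_url = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE))

n_bare_true <- nrow(toy[TRUE])            # logical TRUE is recycled: all rows
n_vector    <- nrow(toy[(has_url)])       # logical column used as a row filter
n_compare   <- nrow(toy[has_url == TRUE]) # same rows, via a comparison

setkey(toy, has_url)                      # sorts by has_url and marks it as key
n_keyed     <- nrow(toy[J(TRUE)])         # binary-search join on the key
```

<div>Here only the last three forms restrict the rows to those where has_url is TRUE.</div>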



<div><br></div><div>Instead of a function, you'll see above that I rewrote it to use :=, which creates a new column that holds a list. That worked wonderfully, but now how do I get these "out" of this data.table, and into a new one.</div>



<div><br></div><div>Made-up example data:</div><div><div>a <- c(1,2,3)</div><div>b <- c(2,3,4)</div><div>dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b, NULL))</div>



<div><br></div><div>Now my goal is to have a new data.table that looks like this</div><div><div><div>Name <span style="white-space:pre-wrap">   </span>Number</div><div>Stian <span style="white-space:pre-wrap">     </span>1</div>



<div>Stian <span style="white-space:pre-wrap">    </span>2</div><div>Stian <span style="white-space:pre-wrap">  </span>3</div><div>Christian <span style="white-space:pre-wrap">      </span>2</div><div>Christian <span style="white-space:pre-wrap">      </span>3</div>



<div>Christian <span style="white-space:pre-wrap">        </span>4</div></div><div><br></div></div><div>Again, I'm sure I could do this with a for() or lapply? But I'd love to see the most elegant solution.</div>
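<div><br></div><div>One possible way to get that shape, as a sketch on the made-up data above (the result name long is hypothetical, and this is not necessarily the most idiomatic data.table answer): repeat each name once per element of its list cell, then flatten the cells with unlist().</div>

```r
library(data.table)

a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# Repeat each name once per element of its list cell, then flatten the cells.
# lengths() gives 3, 3, 0 here, so "John" (NULL cell) contributes no rows.
long <- data.table(Name   = rep(dt$names, lengths(dt$numbers)),
                   Number = unlist(dt$numbers))
```

<div>Because the row counts come from lengths(), each value stays tagged with the row it came from, which is the property needed for linking URLs back to posts.</div>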
<div><br></div><div>Note that this:</div><div><br></div><div><div class="im"><div>getUrls <- function(text, id) {</div><div>  matches <- str_match_all(text, url_pattern)</div></div><div>  data.frame(urls=unlist(matches), id=id)</div>


<div>}</div><div><br></div><div>system.time( a <- db[(has_url), getUrls(text, id), by=id] )</div><div><br></div><div>Works perfectly, the result is</div><div><table border="1" style="font-family:Times"><tbody><tr>
<th></th><th>id</th><th>urls</th><th>id</th></tr><tr><td align="right">1</td><td align="right">16</td><td><a href="https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166" target="_blank">https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166</a></td>


<td align="right">16</td></tr><tr><td align="right">2</td><td align="right">24</td><td><a href="http://www.youtube.com/watch?v=JUiGF4TGI9w" target="_blank">http://www.youtube.com/watch?v=JUiGF4TGI9w</a></td><td align="right">

24</td></tr>
<tr><td align="right">3</td><td align="right">44</td><td><a href="http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/" target="_blank">http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/</a></td>


<td align="right">44</td></tr><tr><td align="right">4</td><td align="right">61</td><td><a href="http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html" target="_blank">http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html</a></td>


<td align="right">61</td></tr><tr><td align="right">5</td><td align="right">75</td><td><a href="http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html" target="_blank">http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html</a></td>


<td align="right">75</td></tr><tr><td align="right">6</td><td align="right">75</td><td><a href="https://www.facebook.com/photo.php?fbid=10151324672623754" target="_blank">https://www.facebook.com/photo.php?fbid=10151324672623754</a></td>


<td align="right">75</td></tr></tbody></table></div><div><br></div><div>which is exactly what I was looking for. So I've really reached my goal, but I'm curious about the other method as well.</div></div><div>
<br></div><div>Thanks!<span class="HOEnZb"><font color="#888888"><br>Stian</font></span></div></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <span dir="ltr"><<a href="mailto:mdowle@mdowle.plus.com" target="_blank">mdowle@mdowle.plus.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <div><br>
      That was my thought too.  I don't know what str_match_all is,  but
      given the unlist() in getUrls(),  it seems to return a list.  
      Rather than unlist(),  leave it as list,  and data.table should
      happily make a `list` column where each cell is itself a vector. 
      In fact each cell can be anything at all,  even embedded
      data.table, function definitions, or any type of object.<br>
      You might need a list(list(str_match_all(...))) in j to do that.<br>
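      <br>
      A minimal sketch of such a `list` column, on made-up data (the table
      posts and its URLs are hypothetical): wrapping each group's vector in
      list() inside j gives one row per id, with each cell holding a vector
      of any length.<br>

```r
library(data.table)

# Hypothetical data: one row per (id, url) pair
posts <- data.table(id  = c(16L, 75L, 75L),
                    url = c("http://example.com/a",
                            "http://example.com/b",
                            "http://example.com/c"))

# list(url) wraps each group's vector so it fits in a single cell
nested <- posts[, .(urls = list(url)), by = id]

# One row per id; the cell for id 75 holds two URLs
cell_sizes <- lengths(nested$urls)
```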
      <br>
      Or what Rick has suggested here might work first time.  It's hard
      to visualise it without a small reproducible example, so we're
      having to make educated guesses.<br>
      <br>
      Many thanks for the kind words about data.table.<span><font color="#888888"><br>
      <br>
      Matthew</font></span><div><div><br>
      <br>
      <br>
      On 27/09/13 07:44, Ricardo Saporta wrote:<br>
    </div></div></div><div><div>
    <blockquote type="cite">
      <div dir="ltr">In fact, you should be able to skip the function
        altogether and just use: 
        <div><br>
        </div>
        <div>   db[ (has_url), str_match_all(text, url_pattern), by=id]<br>
        </div>
        <div><br>
        </div>
        <div><br>
        </div>
        <div>(and now, my apologies to all for the email clutter)</div>
        <div>good night</div>
        <div class="gmail_extra"><br>
          <div class="gmail_quote">On Fri, Sep 27, 2013 at 2:41 AM,
            Ricardo Saporta <span dir="ltr"><<a href="mailto:saporta@scarletmail.rutgers.edu" target="_blank">saporta@scarletmail.rutgers.edu</a>></span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div dir="ltr">sorry, I probably should have elaborated
                 (it's late here, in NJ)
                <div><br>
                </div>
                <div>The error you are seeing is most likely coming from
                  your getUrls function, in that you are adding several
                  ids to a data.frame of varying rows, and `R` cannot
                  recycle it correctly.   </div>
                <div><br>
                </div>
                <div>If you instead break it down by id, then each time you
                  are only assigning one id and R will be able to
                  recycle appropriately, without issue. </div>
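                <div><br>
                </div>
                <div>The recycling problem described above can be
                  reproduced in base R on made-up values (the name urls_df
                  and the ids are hypothetical, chosen only for
                  illustration):</div>

```r
# Three extracted URLs, but only two ids supplied: the lengths do not match
urls_df <- data.frame(urls = c("u1", "u2", "u3"))
msg <- tryCatch({
  urls_df$id <- c(10L, 20L)
  "no error"
}, error = function(e) conditionMessage(e))

# With a single id (as happens per group under by=id), R recycles it to every row
urls_df$id <- 10L
```

                <div>The first assignment fails with the same kind of
                  "replacement has ... rows" message, while the second
                  succeeds because a length-one value can always be
                  recycled.</div>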
                <div><br>
                </div>
                <div>good luck! </div>
                <div>Rick</div>
                <div>
                  <br>
                </div>
              </div>
              <div class="gmail_extra">
                <div><br clear="all">
                  <div>
                    <div style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
                      <div style="font-size:13px">Ricardo Saporta</div>
                      <div style="font-size:13px">
                        Graduate Student, Data Analytics</div>
                      <div style="font-size:13px"><span style="font-size:13px">Rutgers University, New
                          Jersey</span></div>
                      <div style="font-size:13px"><span style="font-size:13px">e: </span><a href="mailto:saporta@rutgers.edu" style="color:rgb(17,85,204);font-size:13px" target="_blank">saporta@rutgers.edu</a></div>
                      <div><br>
                      </div>
                    </div>
                  </div>
                  <br>
                  <br>
                </div>
                <div>
                  <div>
                    <div class="gmail_quote">On Fri, Sep 27, 2013 at
                      2:37 AM, Ricardo Saporta <span dir="ltr"><<a href="mailto:saporta@scarletmail.rutgers.edu" target="_blank">saporta@scarletmail.rutgers.edu</a>></span>
                      wrote:<br>
                      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                        <div dir="ltr">Hi there, 
                          <div><br>
                          </div>
                          <div>Try inserting a `by=id` in </div>
                          <div><br>
                          </div>
                          <div>   <span style="font-family:arial,sans-serif;font-size:13px">a
                              <- db[(has_url), getUrls(text, id),
                              by=id]</span></div>
                          <div>
                            <span style="font-family:arial,sans-serif;font-size:13px"><br>
                            </span></div>
                          <div><span style="font-family:arial,sans-serif;font-size:13px">Also,
                              no need for "</span><span style="font-family:arial,sans-serif;font-size:13px">has_url
                              == T"</span></div>
                          <div><span style="font-family:arial,sans-serif;font-size:13px">instead,
                              use </span></div>
                          <div><span style="font-family:arial,sans-serif;font-size:13px"> 
                              (</span><span style="font-family:arial,sans-serif;font-size:13px">has_url) </span></div>
                          <div><span style="font-family:arial,sans-serif;font-size:13px">If
                              the variable is already logical.
                               (Otherwise, you are just slowing things
                              down ;) </span></div>
                          <div><span style="font-family:arial,sans-serif;font-size:13px"><br>
                            </span></div>
                          <div><span style="font-family:arial,sans-serif;font-size:13px"><br>
                            </span></div>
                        </div>
                        <div class="gmail_extra"><br clear="all">
                          <div>
                            <div style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
                              <div style="font-size:13px">Ricardo
                                Saporta</div>
                              <div style="font-size:13px">Graduate
                                Student, Data Analytics</div>
                              <div style="font-size:13px"><span style="font-size:13px">Rutgers
                                  University, New Jersey</span></div>
                              <div style="font-size:13px">
                                <span style="font-size:13px">e: </span><a href="mailto:saporta@rutgers.edu" style="color:rgb(17,85,204);font-size:13px" target="_blank">saporta@rutgers.edu</a></div>
                              <div><br>
                              </div>
                            </div>
                          </div>
                          <br>
                          <br>
                          <div class="gmail_quote">
                            <div>
                              <div>On Thu, Sep 26, 2013 at 11:16 PM,
                                Stian Håklev <span dir="ltr"><<a href="mailto:shaklev@gmail.com" target="_blank">shaklev@gmail.com</a>></span>
                                wrote:<br>
                              </div>
                            </div>
                            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                              <div>
                                <div>
                                  <div dir="ltr">I'm trying to run a
                                    function on every row fulfilling a
                                    certain criterion, which returns a
                                    data frame - the idea is then to
                                    take the list of data frames and
                                    rbindlist them together for a
                                    totally separate data.table. (I'm
                                    extracting several URL links from
                                    each forum post, and tagging them
                                    with the forum post they came
                                    from). 
                                    <div>
                                      <br>
                                    </div>
                                    <div>I tried doing this with a
                                      data.table</div>
                                    <div><br>
                                    </div>
                                    <div>a <- db[has_url == T,
                                      getUrls(text, id)]</div>
                                    <div><br>
                                    </div>
                                    <div>and get the message</div>
                                    <div><br>
                                    </div>
                                    <div>
                                      <div>Error in
                                        `$<-.data.frame`(`*tmp*`,
                                        "id", value = c(1L, 6L, 1L, 2L,
                                        4L,  : </div>
                                      <div>  replacement has 11007 rows,
                                        data has 29787 </div>
                                    </div>
                                    <div><br>
                                    </div>
                                    <div>Because some rows have several
                                      URLs... However, I don't care that
                                      these rowlengths don't match, I
                                      still want these rows :) I thought
                                      j would just let me execute
                                      arbitrary R code in the context of
                                      the rows as variable names, etc. </div>
                                    <div><br>
                                    </div>
                                    <div>Here's the function it's
                                      running, but that shouldn't be
                                      relevant</div>
                                    <div><br>
                                    </div>
                                    <div>
                                      <div>getUrls <- function(text,
                                        id) {</div>
                                      <div>  matches <-
                                        str_match_all(text, url_pattern)</div>
                                      <div>  a <-
                                        data.frame(urls=unlist(matches))</div>
                                      <div>  a$id <- id</div>
                                      <div>  a</div>
                                      <div>}</div>
                                      <div><br>
                                      </div>
                                      <div><br>
                                      </div>
                                      <div>Thanks, and thanks for an
                                        amazing package - data.table has
                                        made my life so much easier. It
                                        should be part of base, I think.</div>
                                      <div>Stian Haklev, University of
                                        Toronto</div>
                                    </div>
                                    <span><font color="#888888">
                                        <div>
                                          <div><br>
                                          </div>
                                          -- <br>
                                          <a href="http://reganmian.net/blog" target="_blank">http://reganmian.net/blog</a>
                                          -- Random Stuff that Matters<br>
                                        </div>
                                      </font></span></div>
                                  <br>
                                </div>
                              </div>
_______________________________________________<br>
                              datatable-help mailing list<br>
                              <a href="mailto:datatable-help@lists.r-forge.r-project.org" target="_blank">datatable-help@lists.r-forge.r-project.org</a><br>
                              <a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>
                            </blockquote>
                          </div>
                          <br>
                        </div>
                      </blockquote>
                    </div>
                    <br>
                  </div>
                </div>
              </div>
            </blockquote>
          </div>
          <br>
        </div>
      </div>
      <br>
      <fieldset></fieldset>
      <br>
    </blockquote>
    <br>
  </div></div></div>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><a href="http://reganmian.net/blog" target="_blank">http://reganmian.net/blog</a> -- Random Stuff that Matters<br>
</div>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><a href="http://reganmian.net/blog">http://reganmian.net/blog</a> -- Random Stuff that Matters<br>
</div>