<div dir="ltr">Hi Matt,<div><br></div><div>You have the right of it. The problem is somewhat complicated, however, since I would want to substitute "DT[word=="good"..." with "DT[J("good")..." after setting the key to word, which reorders the rows. Hence the two-step process I have now: I key by document and position first and create the lag_word column, then key by the word and lag_word columns and query by row.</div>
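<div>A minimal sketch of the two-step process described above, on hypothetical toy data (the sample words and the names good_after_not and good_other are illustrative assumptions, not taken from the real dataset):</div>
<pre>
```r
library(data.table)

# Hypothetical toy data: one row per word, with its document and position.
DT <- data.table(
  word     = c("this", "is", "good", "this", "is", "not", "good"),
  document = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  position = c(1:3, 1:4)
)

# Step 1: key by document and position, then look one word back.
# The first word of each document has no predecessor and gets NA.
setkey(DT, document, position)
DT[, lag_word := DT[list(document, position - 1L), word]]

# Step 2: re-key by word and lag_word so lookups use binary search.
setkey(DT, word, lag_word)
good_after_not <- DT[J("good", "not")]                    # "good" preceded by "not"
good_other     <- DT[J("good")][!(lag_word %in% "not")]   # every other "good"
```
</pre>
<div>The second setkey() reorders the rows, which is why the lag column has to exist before the table is re-keyed.</div>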
<div><br></div><div><br></div><div>Matt</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jun 27, 2014 at 3:17 PM, Matt Dowle <span dir="ltr">&lt;<a href="mailto:mdowle@mdowle.plus.com" target="_blank">mdowle@mdowle.plus.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <div><br>
      Hi,<br>
      <br>
      Not sure exactly what you need but looks interesting.<br>
      <br>
      Something a bit like this ?<br>
      <br>
      DT[ word == "good", .SD[ lag(word, N) != "not" ],  by=document]<br>
      <br>
      Your idea being you don't want to have to repeat all the pre and
      post words alongside each word but rather express it in the query.
      Makes sense.   Leads to classifying "not good" and "not very good"
      as both negative phrases I guess.<br>
      <br>
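      <div>One way the query above could be made runnable is sketched below. It assumes a data.table version that provides shift(), uses hypothetical toy data, and picks N = 1; the name hits is illustrative only.</div>
<pre>
```r
library(data.table)

# Hypothetical toy data: one row per word, ordered by position within document.
DT <- data.table(
  word     = c("this", "is", "good", "this", "is", "not", "good"),
  document = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  position = c(1:3, 1:4)
)
setkey(DT, document, position)

N <- 1L  # how many words back to look

# shift() lags word within each document, so the lag is computed inside the
# query instead of being stored as an extra column. .I gives the row numbers
# of the matching rows in DT; %in% treats the NA at a document start as
# "not preceded by 'not'".
hits <- DT[, .I[word == "good" & !(shift(word, N) %in% "not")], by = document]$V1

DT[hits]  # rows where "good" is not preceded by "not" N words earlier
```
</pre>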
      Matt<div><div class="h5"><br>
      <br>
      <br>
      On 26/06/14 21:56, Matthew DeAngelis wrote:<br>
    </div></div></div>
    <blockquote type="cite"><div><div class="h5">
      <div dir="ltr">Hello data.table gurus,
        <div><br>
        </div>
        <div>I have been using data.table to efficiently work with
          textual data and I love it for that purpose. I have
          transformed my data so that it looks something like this:</div>
        <div><br>
        </div>
        <div>
          <table dir="ltr" style="table-layout:fixed;font-size:13px;font-family:arial,sans,sans-serif;border-collapse:collapse;border:1px solid rgb(204,204,204)" cellpadding="0" cellspacing="0" border="1">
            <colgroup><col width="100"><col width="100"><col width="100"></colgroup><tbody>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">word</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">document</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">position</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">I</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  1</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">have</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  2</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">transformed</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  3</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">my</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  4</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">data</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  5</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">so</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  1</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">that</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  2</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">it</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  3</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">looks</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  4</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">something</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  5</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">like</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  6</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">this</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  7</td>
              </tr>
            </tbody>
          </table>
          <br>
        </div>
        <div>(I actually use a unique number for each word, so that I am
          able to use data.table's excellent features to do
          lightning-fast word counts. This has revolutionized my
          workflow over looping through text files with Perl.)</div>
        <div><br>
        </div>
        <div>My problem is that I sometimes need to search for phrases
          or to select words based on their context (for instance, I may
          want to exclude a word if it is preceded by "not" or followed
          by a word that changes its meaning). Currently, I am using the
          solution <a href="http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression" target="_blank">here</a> to
          create a new column for a word in another position, like this:</div>
        <div><br>
        </div>
        <div>
          <table dir="ltr" style="table-layout:fixed;font-size:13px;font-family:arial,sans,sans-serif;border-collapse:collapse;border:1px solid rgb(204,204,204)" cellpadding="0" cellspacing="0" border="1">
            <colgroup><col width="100"><col width="100"><col width="100"><col width="100"></colgroup><tbody>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">word</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">document</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">position</td>
                <td style="padding:2px 3px;vertical-align:bottom">lead_word</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">I</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom">have</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">have</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom">transformed</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">transformed</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">3</td>
                <td style="padding:2px 3px;vertical-align:bottom">
                  my</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">my</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  4</td>
                <td style="padding:2px 3px;vertical-align:bottom">data</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">data</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  1</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">5</td>
                <td style="padding:2px 3px;vertical-align:bottom">NA</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">so</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
                <td style="padding:2px 3px;vertical-align:bottom">that</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">that</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom">
                  it</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">it</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  3</td>
                <td style="padding:2px 3px;vertical-align:bottom">looks</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">looks</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">4</td>
                <td style="padding:2px 3px;vertical-align:bottom">something</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">
                  something</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">5</td>
                <td style="padding:2px 3px;vertical-align:bottom">like</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">like</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  6</td>
                <td style="padding:2px 3px;vertical-align:bottom">this</td>
              </tr>
              <tr style="height:21px">
                <td style="padding:2px 3px;vertical-align:bottom">this</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">
                  2</td>
                <td style="padding:2px 3px;vertical-align:bottom;text-align:center">7</td>
                <td style="padding:2px 3px;vertical-align:bottom">NA</td>
              </tr>
            </tbody>
          </table>
          <br>
          using a command like:
          DT[,lead_word:=DT[list(document,position+1),word]].<br>
          <br>
        </div>
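        <div>For reference, a self-contained sketch of that command on the sample table above. It assumes the table is keyed by document and position, which the self-join needs in order to match rows:</div>
<pre>
```r
library(data.table)

# Toy version of the word/document/position table shown above.
DT <- data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = c(rep(1L, 5), rep(2L, 7)),
  position = c(1:5, 1:7)
)

# The self-join looks rows up by (document, position), so that must be the key.
setkey(DT, document, position)

# For each row, fetch the word at position + 1 in the same document.
# The last word of each document has no successor and gets NA.
DT[, lead_word := DT[list(document, position + 1L), word]]
```
</pre>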
        <div>This approach has two problems, however. First, it consumes
          more resources as the dataset grows. I am currently working
          with a file containing over 150 million rows, so adding a
          column is costly. Second, I may want to check both one and two
          words ahead, so that I have to add two columns, and this can
          quickly get out of hand.</div>
        <div><br>
        </div>
        <div>Is there a better way to use data.table to check the value
          in a row N distance from the row of interest within a group
          and select a row based on that value? Perhaps the .I variable
          could be useful here?</div>
        <div><br>
        </div>
        <div>I appreciate any suggestions.</div>
        <div><br>
        </div>
        <div><br>
        </div>
        <div>Regards,</div>
        <div>Matt</div>
      </div>
      <br>
      <br>
      </div></div><pre>_______________________________________________
datatable-help mailing list
<a href="mailto:datatable-help@lists.r-forge.r-project.org" target="_blank">datatable-help@lists.r-forge.r-project.org</a>
<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a></pre>
    </blockquote>
    <br>
  </div>

</blockquote></div><br></div>