<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix"><br>
Hi Matt,<br>
<br>
Great. If you can prepare some dummy data with the appropriate
properties and a parameter or two to scale up the size (or just
provide an online large example to download) and a query that gets
to the right answer but is slow or ugly, then we've got
something to chew on ...<br>
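<br>
For illustration, a minimal sketch of such dummy data. The generator name make_dummy, its parameters n_docs and words_per_doc, and the tiny vocabulary are all made up here; only the column names word, document and position come from the thread.<br>
<pre>
library(data.table)

# Hypothetical generator: n_docs documents of words_per_doc words each,
# drawn at random from a small made-up vocabulary.
make_dummy = function(n_docs, words_per_doc, seed = 1L) {
  set.seed(seed)
  vocab = c("not", "very", "good", "bad", "the", "movie", "was")
  data.table(
    word     = sample(vocab, n_docs * words_per_doc, replace = TRUE),
    document = rep(seq_len(n_docs), each = words_per_doc),
    position = rep(seq_len(words_per_doc), times = n_docs)
  )
}

# Increase either parameter to scale the table up.
DT = make_dummy(n_docs = 1000L, words_per_doc = 200L)
setkey(DT, document, position)
</pre>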
<br>
Matt<br>
<br>
On 28/06/14 10:55, Matthew DeAngelis wrote:<br>
</div>
<blockquote
cite="mid:CAMjp+0cc3d9r3mSnM5DRfd9ygTGQgDe2wdbWXOHc2rH-JR617Q@mail.gmail.com"
type="cite">
<div dir="ltr">Hi Matt,
<div><br>
</div>
<div>You have the right of it. The problem is somewhat
complicated, however, since I would want to substitute
"DT[word=="good"..." with "DT[J("good")..." after setting the
key to word, and setting that key reorders the rows, so
neighbouring rows no longer hold neighbouring words. Hence the
two-step process I have now: I key by document and position
first, create the lag_word column, then key by the word and
lag_word columns and query by row.</div>
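<div>A minimal sketch of the two-step process described above, on a made-up seven-row table. The data and the name hits are illustrative assumptions, not taken from the real dataset.</div>
<pre>
library(data.table)

DT = data.table(
  word     = c("this", "is", "good", "this", "is", "not", "good"),
  document = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  position = c(1L, 2L, 3L, 1L, 2L, 3L, 4L)
)

# Step 1: key by document and position, then fetch the previous word
# with a keyed self-join (no match, e.g. position 0, gives NA).
setkey(DT, document, position)
DT[, lag_word := DT[list(document, position - 1L), word]]

# Step 2: re-key by word and lag_word so the lookup is a binary search,
# then drop the occurrences that are preceded by "not".
setkey(DT, word, lag_word)
hits = DT[J("good")][is.na(lag_word) | lag_word != "not"]
</pre>
<div>Here hits keeps only the "good" in document 1; the one in document 2 follows "not" and is dropped.</div>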
<div><br>
</div>
<div><br>
</div>
<div>Matt</div>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jun 27, 2014 at 3:17 PM, Matt
Dowle <span dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:mdowle@mdowle.plus.com" target="_blank">mdowle@mdowle.plus.com</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div><br>
Hi,<br>
<br>
Not sure exactly what you need but looks interesting.<br>
<br>
Something a bit like this ?<br>
<br>
DT[ word == "good", .SD[ lag(word, N) != "not" ],
by=document]<br>
<br>
The idea being that you don't want to store the
preceding and following words alongside each word, but
rather express the context in the query. Makes sense. I
guess that leads to classifying both "not good" and
"not very good" as negative phrases.<br>
<br>
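One way the query above might be made runnable, as a sketch only: base R's lag() is meant for time series and does not shift a plain vector, so this assumes a data.table release that provides shift(). The sample data, N = 2 and the names negated, keep and res are illustrative.<br>
<pre>
library(data.table)

DT = data.table(
  word     = c("this", "is", "good", "this", "is", "not", "good", "not", "very", "good"),
  document = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  position = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L)
)
setkey(DT, document, position)

N = 2L  # how many preceding words to inspect (illustrative choice)

# Within each document, keep "good" only when none of the N preceding
# words is "not". fill = "" avoids NA at the start of a document.
res = DT[, {
  negated = Reduce(`|`, lapply(seq_len(N), function(k) shift(word, k, fill = "") == "not"))
  keep = word == "good"
  keep[negated] = FALSE
  .SD[keep]
}, by = document]
</pre>
With N = 2 both "not good" (document 2) and "not very good" (document 3) are excluded, so res holds only the "good" from document 1.<br>
<br>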
Matt
<div>
<div class="h5"><br>
<br>
<br>
On 26/06/14 21:56, Matthew DeAngelis wrote:<br>
</div>
</div>
</div>
<blockquote type="cite">
<div>
<div class="h5">
<div dir="ltr">Hello data.table gurus,
<div><br>
</div>
<div>I have been using data.table to efficiently
work with textual data and I love it for that
purpose. I have transformed my data so that it
looks something like this:</div>
<div><br>
</div>
<div>
<table dir="ltr"
style="table-layout:fixed;font-size:13px;font-family:arial,sans,sans-serif;border-collapse:collapse;border:1px
solid rgb(204,204,204)" cellpadding="0"
cellspacing="0" border="1">
<colgroup><col width="100"><col width="100"><col
width="100"></colgroup><tbody>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">word</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">document</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">position</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">I</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
1</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">have</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
2</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">transformed</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
3</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">my</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
4</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">data</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
5</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">so</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
1</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">that</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
2</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">it</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
3</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">looks</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
4</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">something</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
5</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">like</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
6</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">this</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
7</td>
</tr>
</tbody>
</table>
<br>
</div>
<div>(I actually use a unique number for each
word, so that I am able to use data.table's
excellent features to do lightning-fast word
counts. This has revolutionized my workflow over
looping through text files with Perl.)</div>
<div><br>
</div>
<div>My problem is that I sometimes need to search
for phrases or to select words based on their
context (for instance, I may want to exclude a
word if it is preceded by "not" or followed by a
word that changes its meaning). Currently, I am
using the solution <a moz-do-not-send="true"
href="http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression"
target="_blank">here</a> to create a new
column for a word in another position, like
this:</div>
<div><br>
</div>
<div>
<table dir="ltr"
style="table-layout:fixed;font-size:13px;font-family:arial,sans,sans-serif;border-collapse:collapse;border:1px
solid rgb(204,204,204)" cellpadding="0"
cellspacing="0" border="1">
<colgroup><col width="100"><col width="100"><col
width="100"><col width="100"></colgroup><tbody>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">word</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">document</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">position</td>
<td style="padding:2px
3px;vertical-align:bottom">lead_word</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">I</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom">have</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">have</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom">transformed</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">transformed</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">3</td>
<td style="padding:2px
3px;vertical-align:bottom"> my</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">my</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
4</td>
<td style="padding:2px
3px;vertical-align:bottom">data</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">data</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
1</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">5</td>
<td style="padding:2px
3px;vertical-align:bottom">NA</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">so</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px
3px;vertical-align:bottom">that</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">that</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom"> it</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">it</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
3</td>
<td style="padding:2px
3px;vertical-align:bottom">looks</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">looks</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">4</td>
<td style="padding:2px
3px;vertical-align:bottom">something</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom"> something</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">5</td>
<td style="padding:2px
3px;vertical-align:bottom">like</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">like</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
6</td>
<td style="padding:2px
3px;vertical-align:bottom">this</td>
</tr>
<tr style="height:21px">
<td style="padding:2px
3px;vertical-align:bottom">this</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">
2</td>
<td style="padding:2px
3px;vertical-align:bottom;text-align:center">7</td>
<td style="padding:2px
3px;vertical-align:bottom">NA</td>
</tr>
</tbody>
</table>
<br>
using a command like:
DT[,lead_word:=DT[list(document,position+1),word]]<br>
<br>
</div>
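<div>As a self-contained sketch, the command above can be run on the sample table from this message. It assumes DT is keyed by document and position, which the keyed self-join relies on.</div>
<pre>
library(data.table)

DT = data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = rep(c(1L, 2L), times = c(5L, 7L)),
  position = c(1:5, 1:7)
)

# The self-join below matches on the key columns (document, position).
setkey(DT, document, position)

# Look up the word one position ahead in the same document;
# the last word of each document has no match and gets NA.
DT[, lead_word := DT[list(document, position + 1L), word]]
</pre>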
<div>This approach has two problems, however.
First, it consumes more resources as the dataset
grows. I am currently working with a file
containing over 150 million rows, so adding a
column is costly. Second, I may want to check
both one and two words ahead, so that I have to
add two columns, and this can quickly get out of
hand.</div>
<div><br>
</div>
<div>Is there a better way to use data.table to
check the value in a row N distance from the row
of interest within a group and select a row
based on that value? Perhaps the .I variable
could be useful here?</div>
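<div>One possible shape for such a query, sketched under assumptions: it uses .I to collect row numbers of the word of interest and then inspects the neighbouring rows by index, so no column is added to the table. The sample data and the names idx, prev, same_doc, negated and res are made up for illustration, and only one word back is checked.</div>
<pre>
library(data.table)

DT = data.table(
  word     = c("this", "is", "good", "this", "is", "not", "good", "not", "very", "good"),
  document = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  position = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L)
)
setkey(DT, document, position)   # rows are now in reading order

idx  = DT[, .I[word == "good"]]  # row numbers of the candidate word
prev = pmax(idx - 1L, 1L)        # row just before each candidate, clamped at row 1

# The previous row only counts if it belongs to the same document.
same_doc = DT$document[prev] == DT$document[idx]
same_doc[idx == 1L] = FALSE      # the very first row has no predecessor

# A candidate is negated when that previous word is "not".
negated = same_doc
negated[DT$word[prev] != "not"] = FALSE

res = DT[idx[!negated]]
</pre>
<div>The temporary vectors are only as long as the number of matches, not the whole table. Checking two words back would mean repeating the step with idx - 2L.</div>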
<div><br>
</div>
<div>I appreciate any suggestions.</div>
<div><br>
</div>
<div><br>
</div>
<div>Regards,</div>
<div>Matt</div>
</div>
<br>
<br>
</div>
</div>
<pre>_______________________________________________
datatable-help mailing list
<a moz-do-not-send="true" href="mailto:datatable-help@lists.r-forge.r-project.org" target="_blank">datatable-help@lists.r-forge.r-project.org</a>
<a moz-do-not-send="true" href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a></pre>
</blockquote>
<br>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>