<div dir="ltr">Hello data.table gurus,<div><br></div><div>I have been using data.table to efficiently work with textual data and I love it for that purpose. I have transformed my data so that it looks something like this:</div>
<div><br></div><div><table cellspacing="0" cellpadding="0" dir="ltr" border="1" style="table-layout:fixed;font-size:13px;font-family:arial,sans,sans-serif;border-collapse:collapse;border:1px solid rgb(204,204,204)"><colgroup><col width="100"><col width="100"><col width="100"></colgroup><tbody><tr style="height:21px">
<td style="padding:2px 3px;vertical-align:bottom">word</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">document</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">position</td>
</tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">I</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
1</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">have</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
2</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">transformed</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
3</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">my</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
4</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">data</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
5</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">so</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
1</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">that</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
2</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">it</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
3</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">looks</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
4</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">something</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
5</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">like</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
6</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">this</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
7</td></tr></tbody></table><br></div><div>(I actually use a unique number for each word, so that I am able to use data.table's excellent features to do lightning-fast word counts. This has revolutionized my workflow over looping through text files with Perl.)</div>
<div><br></div><div>My problem is that I sometimes need to search for phrases or to select words based on their context (for instance, I may want to exclude a word if it is preceded by "not" or followed by a word that changes its meaning). Currently, I am using the solution <a href="http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression">here</a> to create a new column for a word in another position, like this:</div>
<div><br></div><div><table cellspacing="0" cellpadding="0" dir="ltr" border="1" style="table-layout:fixed;font-size:13px;font-family:arial,sans,sans-serif;border-collapse:collapse;border:1px solid rgb(204,204,204)"><colgroup><col width="100"><col width="100"><col width="100"><col width="100"></colgroup><tbody><tr style="height:21px">
<td style="padding:2px 3px;vertical-align:bottom">word</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">document</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">position</td>
<td style="padding:2px 3px;vertical-align:bottom">lead_word</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">I</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td>
<td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom">have</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">have</td>
<td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom">transformed</td></tr><tr style="height:21px">
<td style="padding:2px 3px;vertical-align:bottom">transformed</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">3</td><td style="padding:2px 3px;vertical-align:bottom">
my</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">my</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
4</td><td style="padding:2px 3px;vertical-align:bottom">data</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">data</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
1</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">5</td><td style="padding:2px 3px;vertical-align:bottom">NA</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">so</td>
<td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">1</td><td style="padding:2px 3px;vertical-align:bottom">that</td></tr><tr style="height:21px">
<td style="padding:2px 3px;vertical-align:bottom">that</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom">
it</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">it</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
3</td><td style="padding:2px 3px;vertical-align:bottom">looks</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">looks</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">4</td><td style="padding:2px 3px;vertical-align:bottom">something</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">
something</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">5</td><td style="padding:2px 3px;vertical-align:bottom">like</td>
</tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">like</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
6</td><td style="padding:2px 3px;vertical-align:bottom">this</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom">this</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">
2</td><td style="padding:2px 3px;vertical-align:bottom;text-align:center">7</td><td style="padding:2px 3px;vertical-align:bottom">NA</td></tr></tbody></table><br>using a command like: DT[,lead_word:=DT[list(document,position+1),word]].<br>
<br></div><div>This approach has two problems, however. First, it consumes more resources as the dataset grows. I am currently working with a file containing over 150 million rows, so adding a column is costly. Second, I may want to check both one and two words ahead, which means adding two columns, and this quickly gets out of hand.</div>
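<div><br></div><div>For reference, here is a minimal reproducible sketch of the lead_word approach described above, run on the toy data from the first table. The setkey call and the integer column types are assumptions about the setup, and the real data stores a numeric word ID rather than the word itself.</div>

```r
library(data.table)

# Toy version of the table above: one row per word occurrence.
DT <- data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = c(rep(1L, 5), rep(2L, 7)),
  position = c(1:5, 1:7)
)

# Assumption: the table is keyed on (document, position), so the
# list(...) passed as i below is joined against those two key columns.
setkey(DT, document, position)

# Self-join: for each row, look up the word at position + 1 in the same
# document. Rows with no successor (end of a document) get NA.
DT[, lead_word := DT[list(document, position + 1L), word]]
```

<div>Looking two words ahead means repeating the last line with position + 2L and a second column name, which is exactly how the number of columns starts to grow.</div>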
<div><br></div><div>Is there a better way to use data.table to check the value in the row N positions away from the row of interest within a group, and to select a row based on that value? Perhaps the .I variable could be useful here?</div>
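<div><br></div><div>To make the question concrete, the kind of thing I have in mind is sketched below: instead of storing the neighbouring word in a column, look the neighbour up with a join at selection time. This is only an illustration under assumptions, not a tested solution at scale. It assumes a data.table version that supports on= joins and nomatch = NULL, and the names after_that and hits are made up for the example.</div>

```r
library(data.table)

# Same toy table as in the post: one row per word occurrence.
DT <- data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = c(rep(1L, 5), rep(2L, 7)),
  position = c(1:5, 1:7)
)

# Step 1: locate every occurrence of the context word ("that") and
# shift its position by +1, giving the coordinates of the following word.
after_that <- DT[word == "that", .(document, position = position + 1L)]

# Step 2: join those coordinates back onto DT. Only rows that sit
# immediately after "that" in the same document are returned;
# nomatch = NULL drops coordinates that fall past the end of a document.
hits <- DT[after_that, on = .(document, position), nomatch = NULL]
```

<div>No column is added to DT here, and a two-word look-ahead would only change the offset in step 1. Whether this is actually cheaper than the extra column on 150 million rows is something I have not measured.</div>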
<div><br></div><div>I appreciate any suggestions.</div><div><br></div><div><br></div><div>Regards,</div><div>Matt</div></div>