[datatable-help] Efficiently checking value of other row in data.table

Matthew DeAngelis ronin78 at gmail.com
Thu Jun 26 22:56:40 CEST 2014


Hello data.table gurus,

I have been using data.table to efficiently work with textual data and I
love it for that purpose. I have transformed my data so that it looks
something like this:

word         document  position
I                   1         1
have                1         2
transformed         1         3
my                  1         4
data                1         5
so                  2         1
that                2         2
it                  2         3
looks               2         4
something           2         5
like                2         6
this                2         7

(I actually use a unique number for each word, so that I am able to use
data.table's excellent features to do lightning-fast word counts. This has
revolutionized my workflow over looping through text files with Perl.)

My problem is that I sometimes need to search for phrases or to select
words based on their context (for instance, I may want to exclude a word if
it is preceded by "not" or followed by a word that changes its meaning).
Currently, I am using the solution here
<http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression>
to create a new column for the word in another position, like this:

word         document  position  lead_word
I                   1         1  have
have                1         2  transformed
transformed         1         3  my
my                  1         4  data
data                1         5  NA
so                  2         1  that
that                2         2  it
it                  2         3  looks
looks               2         4  something
something           2         5  like
like                2         6  this
this                2         7  NA

using a command like (with DT keyed on document and position):
DT[, lead_word := DT[list(document, position + 1), word]]
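
For anyone who wants to reproduce this, here is a minimal, self-contained sketch of that command on the toy data from the table. It assumes plain character words rather than the integer ids I actually use, and it adds 1L rather than 1 so the lookup value stays an integer like the key column.

```r
library(data.table)

# Toy data matching the table above (the real data uses integer word ids).
DT <- data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = c(rep(1L, 5), rep(2L, 7)),
  position = c(1:5, 1:7)
)

# The inner join looks rows up by (document, position), so key on those columns.
setkey(DT, document, position)

# For each row, fetch the word at position + 1 in the same document.
# Rows with no match (the last word of each document) get NA.
DT[, lead_word := DT[list(document, position + 1L), word]]

print(DT)
```

Changing `position + 1L` to `position - 1L` would give the preceding word instead, which is what the "not" case needs.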

This approach has two problems, however. First, it consumes more resources
as the dataset grows. I am currently working with a file containing over
150 million rows, so adding a column is costly. Second, I may want to check
both one and two words ahead, so that I have to add two columns, and this
can quickly get out of hand.

Is there a better way to use data.table to check the value in a row N
positions away from the row of interest, within the same group, and to
select the row based on that value? Perhaps the .I variable could be
useful here?
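
To make the question concrete, here is a minimal sketch of the kind of selection meant here: keep occurrences of a word unless the preceding word in the same document is "not". It uses the same keyed join as above but only looks up neighbours for the candidate rows, so no column is added to the full table. The toy data and the names `hits`, `prev_word` and `ok` are purely illustrative, and this is not claimed to be the idiomatic or the fastest way to do it.

```r
library(data.table)

# Hypothetical toy data: "good" is preceded by "not" in document 1 only.
DT <- data.table(
  word     = c("this", "is", "not", "good", "good", "not"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L),
  position = c(1L, 2L, 3L, 4L, 1L, 2L)
)
setkey(DT, document, position)

# Candidate rows: every occurrence of the word of interest.
hits <- DT[word == "good"]

# Look up the word one position earlier, for the candidate rows only.
# Candidates at the start of a document have no match and yield NA.
prev_word <- DT[list(hits$document, hits$position - 1L), word]

# Keep candidates that are not preceded by "not".
ok   <- is.na(prev_word) | prev_word != "not"
keep <- hits[which(ok)]

print(keep)
```

Checking two words back would mean a second lookup with `hits$position - 2L`, again only over the candidate rows.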

I appreciate any suggestions.


Regards,
Matt