[datatable-help] Efficiently checking value of other row in data.table

Sat Jun 28 11:55:12 CEST 2014

Hi Matt,

You have the right of it. The problem is somewhat more complicated, however,
because I would want to replace "DT[word=="good"..." with the keyed lookup
"DT[J("good")...", which requires setting the key to word and so reorders
the rows. Hence the two-step process I have now: I key by document and
position first, create the lag_word column, then key by the word and
lag_word columns and query by row.
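
A minimal sketch of that two-step process on toy data. The table contents and the names good, good_not_negated and not_good are made up for illustration, and the lag_word join mirrors the lead_word command quoted further down with position - 1 in place of position + 1:

```r
library(data.table)

# Toy data (hypothetical); the real table is far larger.
DT <- data.table(
  word     = c("this", "is", "not", "good", "but", "that", "is", "good"),
  document = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  position = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L)
)

# Step 1: key by document and position, then fetch the previous word
# with a keyed join on (document, position - 1).
# The first word of each document has no predecessor and gets NA.
setkey(DT, document, position)
DT[, lag_word := DT[list(document, position - 1L), word]]

# Step 2: re-key by word and lag_word so lookups use binary search.
setkey(DT, word, lag_word)

# All occurrences of "good" (join on the first key column only).
good <- DT[J("good")]

# Occurrences of "good" that are not directly preceded by "not".
good_not_negated <- good[is.na(lag_word) | lag_word != "not"]

# The phrase "not good" (join on both key columns).
not_good <- DT[J("good", "not")]
```

The cost is the one the original question describes: lag_word is a full extra column, and each further offset needs another one.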

Matt

On Fri, Jun 27, 2014 at 3:17 PM, Matt Dowle <mdowle at mdowle.plus.com> wrote:

>
> Hi,
>
> Not sure exactly what you need but looks interesting.
>
> Something a bit like this ?
>
> DT[ word == "good", .SD[ lag(word, N) != "not" ],  by=document]
>
> Your idea being that you don't want to have to repeat all the pre and post
> words alongside each word, but rather express it in the query. Makes
> sense. Leads to classifying "not good" and "not very good" as both
> negative phrases, I guess.
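
One way the sketch above might be made runnable without storing a lag column, assuming a data.table version that provides shift() (1.9.6 or later, which postdates this thread). The helper preceded_by() and the toy data are hypothetical, and N is read here as "look back up to n words within the same document":

```r
library(data.table)

# Toy data (hypothetical).
DT <- data.table(
  word     = c("this", "is", "not", "good",
               "it", "is", "not", "very", "good",
               "all", "good"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L),
  position = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 1L, 2L)
)
# Rows must be sorted by document and position so that the per-group
# result below lines up with the rows of the table.
setkey(DT, document, position)

# Hypothetical helper: for every row, TRUE if `negator` occurs among the
# n words immediately before it in the same document.
preceded_by <- function(dt, negator, n = 1L) {
  dt[, {
    hit <- rep(FALSE, .N)
    for (k in seq_len(n)) {
      hit <- hit | shift(word, k, fill = "") == negator
    }
    hit
  }, by = document]$V1
}

# "good" not directly preceded by "not".
good_1 <- DT[word == "good" & !preceded_by(DT, "not", 1L)]

# "good" with no "not" in the two preceding words.
good_2 <- DT[word == "good" & !preceded_by(DT, "not", 2L)]
```

This still builds a temporary logical vector as long as the table, but nothing is added to DT and the look-back distance is a query parameter rather than a stored column.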
>
> Matt
>
>
>
> On 26/06/14 21:56, Matthew DeAngelis wrote:
>
> Hello data.table gurus,
>
>  I have been using data.table to efficiently work with textual data and I
> love it for that purpose. I have transformed my data so that it looks
> something like this:
>
>    word         document  position
>    I            1         1
>    have         1         2
>    transformed  1         3
>    my           1         4
>    data         1         5
>    so           2         1
>    that         2         2
>    it           2         3
>    looks        2         4
>    something    2         5
>    like         2         6
>    this         2         7
>
>  (I actually use a unique number for each word, so that I am able to use
> data.table's excellent features to do lightning-fast word counts. This has
> revolutionized my workflow over looping through text files with Perl.)
>
>  My problem is that I sometimes need to search for phrases or to select
> words based on their context (for instance, I may want to exclude a word if
> it is preceded by "not" or followed by a word that changes its meaning).
> Currently, I am using the solution here
> <http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression> to
> create a new column for a word in another position, like this:
>
>    word         document  position  lead_word
>    I            1         1         have
>    have         1         2         transformed
>    transformed  1         3         my
>    my           1         4         data
>    data         1         5         NA
>    so           2         1         that
>    that         2         2         it
>    it           2         3         looks
>    looks        2         4         something
>    something    2         5         like
>    like         2         6         this
>    this         2         7         NA
>
> using a command like: DT[,lead_word:=DT[list(document,position+1),word]].
>
>  This approach has two problems, however. First, it consumes more
> resources as the dataset grows. I am currently working with a file
> containing over 150 million rows, so adding a column is costly. Second, I
> may want to check both one and two words ahead, so that I have to add two
> columns, and this can quickly get out of hand.
>
>  Is there a better way to use data.table to check the value in a row N
> distance from the row of interest within a group and select a row based on
> that value? Perhaps the .I variable could be useful here?
>
>  I appreciate any suggestions.
>
>
>  Regards,
> Matt