[datatable-help] Efficiently checking value of other row in data.table
Matthew DeAngelis
ronin78 at gmail.com
Sat Jun 28 11:55:12 CEST 2014
Hi Matt,
You have the right of it. The problem is somewhat complicated, however,
since I would want to replace "DT[word=="good"..." with the keyed lookup
"DT[J("good")...", which means setting the key to word and so reordering
the rows. Once the rows are sorted by word, adjacent rows no longer hold
adjacent positions within a document. Hence the two-step process I have
now: I key by document and position first, create the lag_word column,
then key by the word and lag_word columns and query by row.
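
A minimal sketch of that two-step process, on toy data. The table contents
and the names good, not_good and plain_good are made up for illustration;
only word, document, position and lag_word come from the description above.

```r
library(data.table)

# Toy data; the real table is assumed to have the same three columns.
DT <- data.table(
  word     = c("this", "is", "not", "good", "this", "is", "good"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  position = c(1L, 2L, 3L, 4L, 1L, 2L, 3L)
)

# Step 1: key by document and position, then build lag_word within each document.
setkey(DT, document, position)
DT[, lag_word := c(NA_character_, head(word, -1L)), by = document]

# Step 2: re-key by word and lag_word so lookups use binary search on the key.
setkey(DT, word, lag_word)

# All occurrences of "good" (keyed equivalent of DT[word == "good"]).
good <- DT[J("good")]

# Occurrences of "good" immediately preceded by "not".
not_good <- DT[J("good", "not"), nomatch = 0L]

# "good" not preceded by "not"; a document-initial "good" has lag_word NA and is kept.
plain_good <- good[is.na(lag_word) | lag_word != "not"]
```

The cost is the extra lag_word column, which is the part that gets
expensive on a table with many rows.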
Matt
On Fri, Jun 27, 2014 at 3:17 PM, Matt Dowle <mdowle at mdowle.plus.com> wrote:
>
> Hi,
>
> Not sure exactly what you need but looks interesting.
>
> Something a bit like this ?
>
> DT[ word == "good", .SD[ lag(word, N) != "not" ], by=document]
>
> Your idea being you don't want to have to repeat all the pre and post
> words alongside each word but rather express it in the query. Makes
> sense. Leads to classifying "not good" and "not very good" as both
> negative phrases I guess.
>
> Matt
>
>
>
> On 26/06/14 21:56, Matthew DeAngelis wrote:
>
> Hello data.table gurus,
>
> I have been using data.table to efficiently work with textual data and I
> love it for that purpose. I have transformed my data so that it looks
> something like this:
>
> word         document  position
> I            1         1
> have         1         2
> transformed  1         3
> my           1         4
> data         1         5
> so           2         1
> that         2         2
> it           2         3
> looks        2         4
> something    2         5
> like         2         6
> this         2         7
>
> (I actually use a unique number for each word, so that I am able to use
> data.table's excellent features to do lightning-fast word counts. This has
> revolutionized my workflow over looping through text files with Perl.)
>
> My problem is that I sometimes need to search for phrases or to select
> words based on their context (for instance, I may want to exclude a word if
> it is preceded by "not" or followed by a word that changes its meaning).
> Currently, I am using the solution here
> <http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression> to
> create a new column for a word in another position, like this:
>
> word         document  position  lead_word
> I            1         1         have
> have         1         2         transformed
> transformed  1         3         my
> my           1         4         data
> data         1         5         NA
> so           2         1         that
> that         2         2         it
> it           2         3         looks
> looks        2         4         something
> something    2         5         like
> like         2         6         this
> this         2         7         NA
>
> using a command like: DT[, lead_word := DT[list(document, position+1), word]].
>
> This approach has two problems, however. First, it consumes more
> resources as the dataset grows. I am currently working with a file
> containing over 150 million rows, so adding a column is costly. Second, I
> may want to check both one and two words ahead, so that I have to add two
> columns, and this can quickly get out of hand.
>
> Is there a better way to use data.table to check the value in a row N
> distance from the row of interest within a group and select a row based on
> that value? Perhaps the .I variable could be useful here?
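
One possible way to use .I for this, sketched on toy data: compute the
neighbouring word on the fly inside each document and return only the
matching row numbers, so no column is added to DT. The helper prev_word
and the name idx are hypothetical, and this has not been timed on a table
of the size discussed above; it still allocates a temporary vector per
document.

```r
library(data.table)

# Toy data with the same three columns as the table above.
DT <- data.table(
  word     = c("this", "is", "not", "good", "this", "is", "good"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  position = c(1L, 2L, 3L, 4L, 1L, 2L, 3L)
)
setkey(DT, document, position)

# Hypothetical helper: the word n positions earlier in one document (NA at the start).
prev_word <- function(x, n = 1L) {
  c(rep(NA_character_, min(n, length(x))), head(x, -n))
}

# Row numbers of "good" not immediately preceded by "not", found per document.
# .I holds the row numbers of the current group within DT.
idx <- DT[, {
  p <- prev_word(word, 1L)
  .I[word == "good" & (is.na(p) | p != "not")]
}, by = document]$V1

# Select the matching rows; DT itself gains no new column.
DT[idx]
```

Checking two words back would be a second call, prev_word(word, 2L), in
the same expression. Recent data.table releases also provide shift(),
which could stand in for the helper.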
>
> I appreciate any suggestions.
>
>
> Regards,
> Matt
>
>