[datatable-help] Efficiently checking value of other row in data.table

Matthew DeAngelis ronin78 at gmail.com
Mon Jun 30 15:24:00 CEST 2014


Hi Matt,

Thanks for the suggestion. I am placing an example below that I hope
illustrates the problem more clearly. Please let me know if I can provide
additional detail or clarification.


Regards,
Matt



First we create a dummy dataset with ten documents containing one million
words. There are three unique words in the set.

library(data.table)
options(scipen = 2)
set.seed(1000)
DT <- data.table(wordindex = sample(1:3, 1000000, replace = TRUE),
                 docindex  = sample(1:10, 1000000, replace = TRUE))
setkey(DT, docindex)
DT[, position := seq_len(.N), by = docindex]

##          wordindex docindex position
##       1:         1        1        1
##       2:         1        1        2
##       3:         3        1        3
##       4:         3        1        4
##       5:         1        1        5
##      ---
##  999996:         2       10    99811
##  999997:         2       10    99812
##  999998:         3       10    99813
##  999999:         1       10    99814
## 1000000:         3       10    99815

This is a query to count the occurrences of the first unique word across
all documents. It is also beautiful.

setkey(DT, wordindex)
count <- DT[J(1), list(count.1 = .N), by = docindex]
count

##     docindex count.1
##  1:        1   33533
##  2:        2   33067
##  3:        3   33538
##  4:        4   33053
##  5:        5   33231
##  6:        6   33002
##  7:        7   33369
##  8:        8   33353
##  9:        9   33485
## 10:       10   33225

It gets messier when we have to take the word one position ahead into
account. This is a query to count the occurrences of the first unique word
across all documents UNLESS it is followed by the second unique word. We
create a new column containing the word one position ahead and then key on
both words.

setkey(DT, docindex, position)
DT[, lead_wordindex := DT[list(docindex, position + 1)][, wordindex]]

##          wordindex docindex position lead_wordindex
##       1:         1        1        1              1
##       2:         1        1        2              3
##       3:         3        1        3              3
##       4:         3        1        4              1
##       5:         1        1        5              2
##      ---
##  999996:         2       10    99811              2
##  999997:         2       10    99812              3
##  999998:         3       10    99813              1
##  999999:         1       10    99814              3
## 1000000:         3       10    99815             NA

setkey(DT, wordindex, lead_wordindex)
countr2 <- DT[J(c(1, 1), c(1, 3)), list(count.1 = .N), by = docindex]
countr2

##     docindex count.1
##  1:        1   22301
##  2:        2   21835
##  3:        3   22490
##  4:        4   21830
##  5:        5   22218
##  6:        6   21914
##  7:        7   22370
##  8:        8   22265
##  9:        9   22211
## 10:       10   22190

I have a very large dataset for which the above query fails with a
memory-allocation error. As an alternative, we can create the new column for
only the relevant subset of the data by filtering the original dataset and
then joining it back on the desired position:

setkey(DT, wordindex)
filter <- DT[J(1), list(wordindex, docindex, position)]
filter[, lead_position := position + 1]

##         wordindex wordindex docindex position lead_position
##      1:         1         1        2    99717         99718
##      2:         1         1        3    99807         99808
##      3:         1         1        4   100243        100244
##      4:         1         1        1        1             2
##      5:         1         1        1       42            43
##     ---
## 332852:         1         1       10    99785         99786
## 332853:         1         1       10    99787         99788
## 332854:         1         1       10    99798         99799
## 332855:         1         1       10    99804         99805
## 332856:         1         1       10    99814         99815

setkey(DT, docindex, position)
filter[, lead_wordindex := DT[J(filter[, list(docindex, lead_position)])][, wordindex]]

##         wordindex wordindex docindex position lead_position lead_wordindex
##      1:         1         1        2    99717         99718             NA
##      2:         1         1        3    99807         99808             NA
##      3:         1         1        4   100243        100244             NA
##      4:         1         1        1        1             2              1
##      5:         1         1        1       42            43              1
##     ---
## 332852:         1         1       10    99785         99786              3
## 332853:         1         1       10    99787         99788              3
## 332854:         1         1       10    99798         99799              3
## 332855:         1         1       10    99804         99805              3
## 332856:         1         1       10    99814         99815              3

setkey(filter, wordindex, lead_wordindex)
countr2.1 <- filter[J(c(1, 1), c(1, 3)), list(count.1 = .N), by = docindex]
countr2.1

##     docindex count.1
##  1:        1   22301
##  2:        2   21835
##  3:        3   22490
##  4:        4   21830
##  5:        5   22218
##  6:        6   21914
##  7:        7   22370
##  8:        8   22265
##  9:        9   22211
## 10:       10   22190

Pretty ugly, I think. In addition, we may want to look more than one word
ahead, which means creating yet another column. The easy but costly way is:

setkey(DT, docindex, position)
DT[, lead_lead_wordindex := DT[list(docindex, position + 2)][, wordindex]]

##          wordindex docindex position lead_wordindex lead_lead_wordindex
##       1:         1        1        1              1                   3
##       2:         1        1        2              3                   3
##       3:         3        1        3              3                   1
##       4:         3        1        4              1                   2
##       5:         1        1        5              2                   3
##      ---
##  999996:         2       10    99811              2                   3
##  999997:         2       10    99812              3                   1
##  999998:         3       10    99813              1                   3
##  999999:         1       10    99814              3                  NA
## 1000000:         3       10    99815             NA                  NA

setkey(DT, wordindex, lead_wordindex, lead_lead_wordindex)
countr23 <- DT[J(1, 2, 3), list(count.1 = .N), by = docindex]
countr23

##     docindex count.1
##  1:        1    3684
##  2:        2    3746
##  3:        3    3717
##  4:        4    3727
##  5:        5    3700
##  6:        6    3779
##  7:        7    3702
##  8:        8    3756
##  9:        9    3702
## 10:       10    3744

However, I currently have to use the ugly filter-and-join way because of
size.

So the question is, is there an easier and more beautiful way?
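One idea of my own (a sketch, not something already in this thread): compute the
lead inside j, grouped by document, so the lookahead is never materialized as a
column or joined back. With rows in position order within each document,
c(wordindex[-1], NA) is the one-step lead for that group, and %in% treats the
trailing NA as a non-match, which mirrors the join on J(c(1,1), c(1,3)) above.
A toy version with two small documents:

```r
library(data.table)

# Toy data: two documents, words coded as integers (same scheme as above)
DT <- data.table(wordindex = c(1, 2, 1, 1, 3, 1, 1, 2),
                 docindex  = c(1, 1, 1, 1, 1, 2, 2, 2))
setkey(DT, docindex)                      # rows already in position order
DT[, position := seq_len(.N), by = docindex]

# Count word 1 per document UNLESS the next word is 2, i.e. keep
# occurrences whose one-step lead is 1 or 3 (an NA lead drops out via %in%)
count <- DT[, list(count.1 = sum(wordindex == 1 &
                                 c(wordindex[-1], NA) %in% c(1, 3))),
            by = docindex]
count
```

For document 1 (words 1,2,1,1,3) this gives 2 and for document 2 (words 1,1,2)
it gives 1. Looking two words ahead needs no extra column either: the two-step
lead is c(wordindex[-(1:2)], NA, NA). Whether this beats the keyed join at 150
million rows I have not measured, but it only allocates one logical vector per
group rather than a new column on the whole table.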


On Sat, Jun 28, 2014 at 6:00 PM, Matt Dowle <mdowle at mdowle.plus.com> wrote:

>
> Hi Matt,
>
> Great.  If you can prepare some dummy data with the appropriate properties
> and a parameter or two to scale up the size (or just provide an online
> large example to download) and a query that gets to the right answer but is
> slow or ugly,   then we've got something to chew on ...
>
> Matt
>
>
> On 28/06/14 10:55, Matthew DeAngelis wrote:
>
> Hi Matt,
>
>  You have the right of it. The problem is somewhat complicated, however,
> since I would want to substitute "DT[word=="good"..." with
> "DT[J("good")..." after setting the key to word and reordering the rows.
> Hence the two-step process I have now where I key by document and position
> first, create the lag_word column, key by the word and lag_word columns and
> query by row.
>
>
>  Matt
>
>
> On Fri, Jun 27, 2014 at 3:17 PM, Matt Dowle <mdowle at mdowle.plus.com>
> wrote:
>
>>
>> Hi,
>>
>> Not sure exactly what you need but looks interesting.
>>
>> Something a bit like this ?
>>
>> DT[ word == "good", .SD[ lag(word, N) != "not" ],  by=document]
>>
>> Your idea being you don't want to have to repeat all the pre and post
>> words alongside each word but rather express it in the query. Makes
>> sense.   Leads to classifying "not good" and "not very good" as both
>> negative phrases I guess.
>>
>> Matt
>>
>>
>>
>> On 26/06/14 21:56, Matthew DeAngelis wrote:
>>
>>  Hello data.table gurus,
>>
>>  I have been using data.table to efficiently work with textual data and
>> I love it for that purpose. I have transformed my data so that it looks
>> something like this:
>>
>>    word        document position
>>    I           1        1
>>    have        1        2
>>    transformed 1        3
>>    my          1        4
>>    data        1        5
>>    so          2        1
>>    that        2        2
>>    it          2        3
>>    looks       2        4
>>    something   2        5
>>    like        2        6
>>    this        2        7
>>  (I actually use a unique number for each word, so that I am able to use
>> data.table's excellent features to do lightning-fast word counts. This has
>> revolutionized my workflow over looping through text files with Perl.)
>>
>>  My problem is that I sometimes need to search for phrases or to select
>> words based on their context (for instance, I may want to exclude a word if
>> it is preceded by "not" or followed by a word that changes its meaning).
>> Currently, I am using the solution here
>> <http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression> to
>> create a new column for a word in another position, like this:
>>
>>    word        document position lead_word
>>    I           1        1        have
>>    have        1        2        transformed
>>    transformed 1        3        my
>>    my          1        4        data
>>    data        1        5        NA
>>    so          2        1        that
>>    that        2        2        it
>>    it          2        3        looks
>>    looks       2        4        something
>>    something   2        5        like
>>    like        2        6        this
>>    this        2        7        NA
>>
>> using a command like DT[, lead_word := DT[list(document, position + 1), word]].
>>
>>  This approach has two problems, however. First, it consumes more
>> resources as the dataset grows. I am currently working with a file
>> containing over 150 million rows, so adding a column is costly. Second, I
>> may want to check both one and two words ahead, so that I have to add two
>> columns, and this can quickly get out of hand.
>>
>>  Is there a better way to use data.table to check the value in a row N
>> distance from the row of interest within a group and select a row based on
>> that value? Perhaps the .I variable could be useful here?
>>
>>  I appreciate any suggestions.
>>
>>
>>  Regards,
>> Matt
>>
>>
>>  _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>
>