[datatable-help] Efficiently checking value of other row in data.table

Sun Jun 29 00:00:58 CEST 2014

Hi Matt,

Great. If you can prepare some dummy data with the appropriate
properties, plus a parameter or two to scale up the size (or just provide
a large example online to download), and a query that gets the right
answer but is slow or ugly, then we've got something to chew on ...
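
For illustration, dummy data of this kind could be generated along the
following lines. This is only a sketch: the helper name make_dummy, its
parameters n_docs and words_per_doc, and the tiny vocabulary are all
assumptions, not something taken from the thread.

```r
library(data.table)

# Hypothetical helper (not from the thread): builds a token table with
# n_docs documents of words_per_doc words each, drawn from a toy vocabulary.
make_dummy <- function(n_docs = 1000L, words_per_doc = 50L, seed = 1L) {
  set.seed(seed)
  vocab <- c("not", "very", "good", "bad", "the", "results", "were", "is")
  n <- n_docs * words_per_doc
  DT <- data.table(
    word     = sample(vocab, n, replace = TRUE),
    document = rep(seq_len(n_docs), each = words_per_doc),
    position = rep(seq_len(words_per_doc), times = n_docs)
  )
  # Key by document and position, as described in the quoted messages.
  setkey(DT, document, position)
  DT
}

# Small instance for checking a query; raise n_docs to scale up.
DT <- make_dummy(n_docs = 3L, words_per_doc = 5L)
```

Raising n_docs and words_per_doc scales the table up toward the size
mentioned in the original post, so a candidate query can be timed on
progressively larger inputs.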

Matt

On 28/06/14 10:55, Matthew DeAngelis wrote:
> Hi Matt,
>
> You have the right of it. The problem is somewhat complicated,
> however, since I would want to substitute "DT[word=="good"..." with
> "DT[J("good")..." after setting the key to word and reordering the
> rows. Hence the two-step process I have now: I key by document
> and position first, create the lag_word column, then key by the word
> and lag_word columns and query by row.
>
>
> Matt
>
>
> On Fri, Jun 27, 2014 at 3:17 PM, Matt Dowle <mdowle at mdowle.plus.com> wrote:
>
>
>     Hi,
>
>     Not sure exactly what you need but looks interesting.
>
>     Something a bit like this ?
>
>     DT[ word == "good", .SD[ lag(word, N) != "not" ], by=document]
>
>     Your idea being you don't want to have to repeat all the pre and
>     post words alongside each word but rather express it in the query.
>     Makes sense. That leads to classifying both "not good" and
>     "not very good" as negative phrases, I guess.
>
>     Matt
>
>
>
>     On 26/06/14 21:56, Matthew DeAngelis wrote:
>>     Hello data.table gurus,
>>
>>     I have been using data.table to efficiently work with textual
>>     data and I love it for that purpose. I have transformed my data
>>     so that it looks something like this:
>>
>>     word         document  position
>>     I            1         1
>>     have         1         2
>>     transformed  1         3
>>     my           1         4
>>     data         1         5
>>     so           2         1
>>     that         2         2
>>     it           2         3
>>     looks        2         4
>>     something    2         5
>>     like         2         6
>>     this         2         7
>>
>>
>>     (I actually use a unique number for each word, so that I am able
>>     to use data.table's excellent features to do lightning-fast word
>>     counts. This has revolutionized my workflow over looping through
>>     text files with Perl.)
>>
>>     My problem is that I sometimes need to search for phrases or to
>>     select words based on their context (for instance, I may want to
>>     exclude a word if it is preceded by "not" or followed by a word
>>     that changes its meaning). Currently, I am using the solution
>>     here
>>     <http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression> to
>>     create a new column for a word in another position, like this:
>>
>>     word         document  position  lead_word
>>     I            1         1         have
>>     have         1         2         transformed
>>     transformed  1         3         my
>>     my           1         4         data
>>     data         1         5         NA
>>     so           2         1         that
>>     that         2         2         it
>>     it           2         3         looks
>>     looks        2         4         something
>>     something    2         5         like
>>     like         2         6         this
>>     this         2         7         NA
>>
>>
>>     using a command like:
>>     DT[, lead_word := DT[list(document, position + 1), word]]
>>
>>     This approach has two problems, however. First, it consumes more
>>     resources as the dataset grows. I am currently working with a
>>     file containing over 150 million rows, so adding a column is
>>     costly. Second, I may want to check both one and two words ahead,
>>     so that I have to add two columns, and this can quickly get out
>>     of hand.
>>
>>     Is there a better way to use data.table to check the value in a
>>     row N distance from the row of interest within a group and select
>>     a row based on that value? Perhaps the .I variable could be
>>     useful here?
>>
>>     I appreciate any suggestions.
>>
>>
>>     Regards,
>>     Matt
>>
>>
>
>
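
A runnable variant of the lagged query discussed in the thread might look
like the sketch below. It assumes data.table 1.9.6 or later, where shift()
is available (that function postdates this thread), and it uses made-up
sample data. shift() still builds a temporary lagged vector within each
group, but no permanent column is added to DT.

```r
library(data.table)

# Made-up sample data in the word/document/position layout from the thread.
DT <- data.table(
  word     = c("this", "is", "not", "good",
               "that", "is", "not", "very", "good",
               "it", "is", "good"),
  document = rep(1:3, c(4L, 5L, 3L)),
  position = c(1:4, 1:5, 1:3)
)
setkey(DT, document, position)

# Row numbers of "good" NOT immediately preceded by "not" in the same document.
# %in% treats the NA at the start of each document as "no match".
idx1 <- DT[, .I[word == "good" & !(shift(word, 1L) %in% "not")],
           by = document]$V1

# Same idea, but also excluding "not" two words back ("not very good").
idx2 <- DT[, .I[word == "good" &
                !(shift(word, 1L) %in% "not") &
                !(shift(word, 2L) %in% "not")],
           by = document]$V1

hits <- DT[idx2]  # the selected rows themselves
```

Here .I returns row numbers in the full table, so the result can be used to
subset DT directly. Looking ahead instead of back would use
shift(word, 1L, type = "lead").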
