[datatable-help] Efficiently checking value of other row in data.table

Matt Dowle mdowle at mdowle.plus.com
Fri Jun 27 21:17:18 CEST 2014


Hi,

Not sure exactly what you need but looks interesting.

Something a bit like this ?

DT[ word == "good", .SD[ lag(word, N) != "not" ],  by=document]

Your idea being that you don't want to repeat all the preceding and
following words alongside each word, but rather express the context in
the query. Makes sense. That would classify both "not good" and
"not very good" as negative phrases, I guess.
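
A runnable sketch of that idea might look like the following. It assumes a
data.table version that provides shift() (v1.9.6 or later, so newer than this
message) and rows ordered by position within each document. The toy data and
the value of N are made up for illustration. %in% is used rather than != so
that the NA that shift() produces at the start of each document does not drop
the row.

```r
library(data.table)

# Toy data (hypothetical): one row per word, ordered by position within document.
DT <- data.table(
  document = c(1L, 1L, 1L, 2L, 2L, 2L, 3L),
  position = c(1L, 2L, 3L, 1L, 2L, 3L, 1L),
  word     = c("not", "good", "enough", "very", "good", "indeed", "good")
)
setkey(DT, document, position)

N <- 1L  # how many words back to look

# Keep "good" unless the word N positions earlier in the same document is "not".
# shift() pads the start of each group with NA; %in% treats NA as "no match".
res <- DT[, .SD[word == "good" & !(shift(word, N) %in% "not")], by = document]
print(res)
```

The filter on word has to sit inside the grouped expression: subsetting to
word == "good" first would leave only the "good" rows in .SD, so there would
be no neighbouring words left to look at.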

Matt


On 26/06/14 21:56, Matthew DeAngelis wrote:
> Hello data.table gurus,
>
> I have been using data.table to efficiently work with textual data and 
> I love it for that purpose. I have transformed my data so that it 
> looks something like this:
>
> word         document  position
> I            1         1
> have         1         2
> transformed  1         3
> my           1         4
> data         1         5
> so           2         1
> that         2         2
> it           2         3
> looks        2         4
> something    2         5
> like         2         6
> this         2         7
>
>
> (I actually use a unique number for each word, so that I am able to 
> use data.table's excellent features to do lightning-fast word counts. 
> This has revolutionized my workflow over looping through text files 
> with Perl.)
>
> My problem is that I sometimes need to search for phrases or to select 
> words based on their context (for instance, I may want to exclude a 
> word if it is preceded by "not" or followed by a word that changes its 
> meaning). Currently, I am using the solution here 
> <http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression> to 
> create a new column for a word in another position, like this:
>
> word         document  position  lead_word
> I            1         1         have
> have         1         2         transformed
> transformed  1         3         my
> my           1         4         data
> data         1         5         NA
> so           2         1         that
> that         2         2         it
> it           2         3         looks
> looks        2         4         something
> something    2         5         like
> like         2         6         this
> this         2         7         NA
>
>
> using a command like: DT[, lead_word := DT[list(document, position + 1), word]].
>
> This approach has two problems, however. First, it consumes more 
> resources as the dataset grows. I am currently working with a file 
> containing over 150 million rows, so adding a column is costly. 
> Second, I may want to check both one and two words ahead, so that I 
> have to add two columns, and this can quickly get out of hand.
>
> Is there a better way to use data.table to check the value in a row N 
> distance from the row of interest within a group and select a row 
> based on that value? Perhaps the .I variable could be useful here?
>
> I appreciate any suggestions.
>
>
> Regards,
> Matt
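
On the .I question at the end of the quoted message, one possible sketch is
shown here. It again assumes shift() from data.table v1.9.6 or later, and the
toy data is made up. .I holds each row's index in the full table, so the
grouped call returns only an integer vector of matching rows, and no lead or
lag column is ever stored in DT.

```r
library(data.table)

# Toy data (hypothetical), in the same layout as the table in the question.
DT <- data.table(
  word     = c("this", "is", "not", "good", "a", "good", "day"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  position = c(1L, 2L, 3L, 4L, 1L, 2L, 3L)
)
setkey(DT, document, position)

# Collect row numbers of "good" that is not preceded by "not" in the same document.
idx <- DT[, .I[word == "good" & !(shift(word, 1L) %in% "not")], by = document]$V1

# Subset once by row number; DT itself gains no extra columns.
hits <- DT[idx]
print(hits)
```

Checking two words ahead or behind only means adding another shift() term to
the condition, rather than another column to the table.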
