[datatable-help] Efficiently checking value of other row in data.table
Matt Dowle
mdowle at mdowle.plus.com
Sun Jun 29 00:00:58 CEST 2014
Hi Matt,
Great. If you can prepare some dummy data with the appropriate
properties and a parameter or two to scale up the size (or just provide
an online large example to download) and a query that gets to the right
answer but is slow or ugly, then we've got something to chew on ...
Matt
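A dummy data set along the lines requested above could be generated as in the sketch below. The function name make_dummy, its parameters n_docs, doc_len and seed, and the small vocabulary are all invented for illustration. The baseline query reuses the lookup-join idea from the original question further down the thread.

```r
library(data.table)

# Hypothetical generator (names and vocabulary are illustrative only):
# n_docs documents of doc_len words each, one row per word.
make_dummy <- function(n_docs = 1000L, doc_len = 50L, seed = 1L) {
  set.seed(seed)
  vocab <- c("good", "bad", "not", "very", "the", "is", "result", "data")
  DT <- data.table(
    word     = sample(vocab, n_docs * doc_len, replace = TRUE),
    document = rep(seq_len(n_docs), each = doc_len),
    position = rep(seq_len(doc_len), times = n_docs)
  )
  setkey(DT, document, position)
  DT
}

DT <- make_dummy(n_docs = 100L, doc_len = 20L)  # scale up via the two parameters

# Baseline that is correct but costly: store the previous word in a
# full-length column (a lookup join on the key), then filter on it.
DT[, lag_word := DT[list(document, position - 1L), word]]
baseline <- DT[word == "good" & (is.na(lag_word) | lag_word != "not")]
```

Raising n_docs and doc_len should make the cost of the extra column visible, which is the part a better query would need to avoid.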
On 28/06/14 10:55, Matthew DeAngelis wrote:
> Hi Matt,
>
> You have the right of it. The problem is somewhat complicated,
> however, since I would want to substitute "DT[word=="good"..." with
> "DT[J("good")..." after setting the key to word and reordering the
> rows. Hence the two-step process I have now where I key by document
> and position first, create the lag_word column, key by the word and
> lag_word columns and query by row.
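The two-step process described above might look roughly like the following sketch. The toy data and the result names all_good, not_good and plain_good are assumptions made for illustration, not taken from the real data set.

```r
library(data.table)

# Toy data (illustrative only).
DT <- data.table(
  word     = c("this", "is", "not", "good", "that", "is", "good"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  position = c(1L, 2L, 3L, 4L, 1L, 2L, 3L)
)

# Step 1: key by document and position, then look up the preceding word.
setkey(DT, document, position)
DT[, lag_word := DT[list(document, position - 1L), word]]

# Step 2: re-key by word and lag_word so that J() lookups use binary search.
setkey(DT, word, lag_word)

all_good   <- DT[J("good")]          # every occurrence of "good"
not_good   <- DT[J("good", "not")]   # "good" directly preceded by "not"
plain_good <- all_good[is.na(lag_word) | lag_word != "not"]
```

The re-keying in step 2 reorders the rows, which is why the lag column has to be created before it.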
>
>
> Matt
>
>
> On Fri, Jun 27, 2014 at 3:17 PM, Matt Dowle <mdowle at mdowle.plus.com> wrote:
>
>
> Hi,
>
> Not sure exactly what you need but looks interesting.
>
> Something a bit like this ?
>
> DT[ word == "good", .SD[ lag(word, N) != "not" ], by=document]
>
> Your idea being you don't want to have to repeat all the pre and
> post words alongside each word but rather express it in the query.
> Makes sense. Leads to classifying "not good" and "not very good"
> as both negative phrases I guess.
>
> Matt
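One way the query idea above could be made runnable is sketched below, with three caveats. Base R's lag() does not move the elements of a plain vector, so the sketch assumes a data.table release that provides shift(), which as far as I know arrived after this thread. The lag is taken over the whole document before filtering, because filtering on word == "good" first would lag over the "good" rows only. The toy data and the value of N are illustrative.

```r
library(data.table)

# Toy data (illustrative only).
DT <- data.table(
  word     = c("that", "is", "good", "this", "is", "not", "good",
               "not", "very", "good"),
  document = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  position = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L)
)
setkey(DT, document, position)

N <- 2L  # how many preceding words to inspect (illustrative value)

# Per document: flag rows with "not" among the N preceding words, then keep
# the "good" rows that are not flagged. No column is added to DT.
hits <- DT[, {
  negated <- Reduce(`|`, lapply(seq_len(N), function(k) {
    shift(word, k, type = "lag") %in% "not"
  }))
  keep <- word == "good" & !negated
  list(word = word[keep], position = position[keep])
}, by = document]
```

With N set to 2, both "not good" and "not very good" are treated as negated, so only the "good" in document 1 is returned.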
>
>
>
> On 26/06/14 21:56, Matthew DeAngelis wrote:
>> Hello data.table gurus,
>>
>> I have been using data.table to efficiently work with textual
>> data and I love it for that purpose. I have transformed my data
>> so that it looks something like this:
>>
>> word         document  position
>> I                   1         1
>> have                1         2
>> transformed         1         3
>> my                  1         4
>> data                1         5
>> so                  2         1
>> that                2         2
>> it                  2         3
>> looks               2         4
>> something           2         5
>> like                2         6
>> this                2         7
>>
>>
>> (I actually use a unique number for each word, so that I am able
>> to use data.table's excellent features to do lightning-fast word
>> counts. This has revolutionized my workflow over looping through
>> text files with Perl.)
>>
>> My problem is that I sometimes need to search for phrases or to
>> select words based on their context (for instance, I may want to
>> exclude a word if it is preceded by "not" or followed by a word
>> that changes its meaning). Currently, I am using the solution
>> here
>> <http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression> to
>> create a new column for a word in another position, like this:
>>
>> word         document  position  lead_word
>> I                   1         1  have
>> have                1         2  transformed
>> transformed         1         3  my
>> my                  1         4  data
>> data                1         5  NA
>> so                  2         1  that
>> that                2         2  it
>> it                  2         3  looks
>> looks               2         4  something
>> something           2         5  like
>> like                2         6  this
>> this                2         7  NA
>>
>>
>> using a command like:
>> DT[, lead_word := DT[list(document, position + 1), word]].
>>
>> This approach has two problems, however. First, it consumes more
>> resources as the dataset grows. I am currently working with a
>> file containing over 150 million rows, so adding a column is
>> costly. Second, I may want to check both one and two words ahead,
>> so that I have to add two columns, and this can quickly get out
>> of hand.
>>
>> Is there a better way to use data.table to check the value in a
>> row N distance from the row of interest within a group and select
>> a row based on that value? Perhaps the .I variable could be
>> useful here?
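As a sketch of how .I might be used here: once the table is sorted by document and position, a neighbouring word can be read by row-number arithmetic for the matching rows only, so the extra memory grows with the number of matches rather than with the 150 million rows. The helper context_word and the toy data are hypothetical, and the approach assumes the rows really are in reading order.

```r
library(data.table)

# Toy data (illustrative only).
DT <- data.table(
  word     = c("not", "good", "very", "good", "good", "not"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L),
  position = c(1L, 2L, 3L, 4L, 1L, 2L)
)
setkey(DT, document, position)  # physical row order is now reading order

# Hypothetical helper: the word `offset` rows away from each row in `idx`
# (negative = before, positive = after), or NA when that row falls outside
# the table or belongs to a different document.
context_word <- function(DT, idx, offset) {
  nbr <- idx + offset
  ok  <- nbr >= 1L & nbr <= nrow(DT)
  ok[ok] <- DT$document[nbr[ok]] == DT$document[idx[ok]]
  out <- rep(NA_character_, length(idx))
  out[ok] <- DT$word[nbr[ok]]
  out
}

idx  <- DT[, .I[word == "good"]]     # row numbers of the word of interest
prev <- context_word(DT, idx, -1L)   # one word before
nxt  <- context_word(DT, idx,  1L)   # one word after

plain_good <- DT[idx[!(prev %in% "not")]]
```

Checking two words ahead is then another call with offset 2L rather than another column.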
>>
>> I appreciate any suggestions.
>>
>>
>> Regards,
>> Matt
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>