[datatable-help] jaccard index calculation

Ben Weinstein benweinstein2010 at gmail.com
Tue Apr 14 15:51:18 CEST 2015


Hi,

Just following up on RMach's question with a bit of an example and further
explanation, since this is something I've always wondered about.

I often find myself trying to compute pairwise distances on a series of
rows. Typically we have a keyed data.table with 5,000 columns and 10,000
rows, which equates to n*(n-1)/2 comparisons, or about 50 million in this
case. The basic data structure and design looks like this:

library(reshape2)
a<-data.frame(ID=1:10,Site1=rbinom(10,1,.5),Site2=rbinom(10,1,.5),Site3=rbinom(10,1,0.5))

dista<-dist(a[,-1])

pairwise<-melt(as.matrix(dista))

colnames(pairwise)<-c("To","From","Dist")

head(pairwise)


We use a parallel computing strategy to split the work into chunks, but it's
a real mess keeping track of them. The goal would be to find a data.table
solution, especially one that does not repeat pairwise comparisons. For
example, comparing row 4 to row 9 is the same as comparing row 9 to row 4.
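
To make the "no repeated comparisons" point concrete, here is a minimal base R
sketch (not a data.table solution). It relies on the fact that a dist object
already stores each unordered pair exactly once, in the same order that
combn(n, 2) enumerates index pairs. The name unique_pairs is only
illustrative, and for very large n the index matrix from combn() will itself
be big, so treat this as a sketch rather than a scalable answer.

```r
set.seed(1)  # reproducible toy data

# Toy presence/absence table: 10 rows (IDs), 3 binary columns
a <- data.frame(ID = 1:10,
                Site1 = rbinom(10, 1, 0.5),
                Site2 = rbinom(10, 1, 0.5),
                Site3 = rbinom(10, 1, 0.5))

d <- dist(a[, -1])       # lower triangle only: n*(n-1)/2 values
n <- attr(d, "Size")     # number of rows that were compared

# combn(n, 2) lists pairs (1,2), (1,3), ..., (n-1,n), which matches the
# column-wise lower-triangle order in which dist stores its values
idx <- combn(n, 2)

unique_pairs <- data.frame(From = a$ID[idx[1, ]],
                           To   = a$ID[idx[2, ]],
                           Dist = as.vector(d))

nrow(unique_pairs)       # 45 rows for n = 10, versus 100 from melt(as.matrix(d))
head(unique_pairs)
```

Each pair such as (4, 9) appears once here, and the diagonal (a row compared
with itself) is dropped entirely.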

The same could be done for any distance metric, including those computed by
the vegdist function in the vegan package.
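
On the Jaccard question specifically, here is a small sketch of the input
layout, assuming presence/absence (0/1) data: one row per unit being compared
and one column per feature. It uses base R's dist(method = "binary"), which
for 0/1 data gives the Jaccard distance. The vegan call is shown only as a
comment and is not run here; the matrix x and its row names are made up for
illustration.

```r
# Rows are the units to compare; columns are binary features
x <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), paste0("Site", 1:4)))

# "binary" distance: among columns where at least one row is 1,
# the proportion where exactly one row is 1 (1 - |A and B| / |A or B|)
jac <- dist(x, method = "binary")
jac
# r1 vs r2 share 1 of 3 occupied columns, so the distance is 2/3;
# r1 vs r3 share none, so the distance is 1

# With vegan installed, the comparable call would be (not run here):
# vegan::vegdist(x, method = "jaccard", binary = TRUE)
```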

Thanks for your thoughts,

Ben

On Tue, Apr 14, 2015 at 4:24 AM, RMach <rdpmachado at gmail.com> wrote:

> Hi all,
>
> how should the input matrix structure be in order to use vegdist(vegan) to
> compute jaccard index.
>
> thanks in advance.
> RMach
>
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/jaccard-index-calculation-tp4705824.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>



-- 
Ben Weinstein
PhD Candidate
Ecology and Evolution
Stony Brook University

http://benweinstein.weebly.com/