[datatable-help] jaccard index calculation
Ben Weinstein
benweinstein2010 at gmail.com
Tue Apr 14 15:51:18 CEST 2015
Hi,
Just following up on RMach's question with a bit of an example and further
explanation, since this is something I've always wondered about.
I often find myself trying to compute pairwise distances on a series of
rows. In a typical case we have a keyed data.table with 5,000 columns and
10,000 rows, which equates to n*(n-1)/2 comparisons, or about 50 million in
this case. The basic data structure and design looks like this:
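library(reshape2)
# Toy data: 10 rows with presence/absence (0/1) values at three sites
a <- data.frame(ID = 1:10, Site1 = rbinom(10, 1, 0.5), Site2 = rbinom(10, 1, 0.5), Site3 = rbinom(10, 1, 0.5))
# Euclidean distance between every pair of rows (ID column dropped)
dista <- dist(a[, -1])
# Expand to a full matrix and melt to long format: one row per ordered pair
pairwise <- melt(as.matrix(dista))
colnames(pairwise) <- c("To", "From", "Dist")
head(pairwise)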
We use a parallel computing strategy to split the work into chunks, but it's
a real mess to keep track of. The goal would be to find a data.table
solution, especially one that does not repeat pairwise comparisons. For
example, comparing row 4 to row 9 is the same as comparing row 9 to row 4.
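To make the "no repeated comparisons" goal concrete, here is a minimal sketch
in base R (not a data.table solution). It relies on the fact that a "dist"
object already stores each unordered pair only once, as the lower triangle in
column order. The names idx and pairs are just illustrative.

```r
set.seed(1)
a <- data.frame(ID = 1:10,
                Site1 = rbinom(10, 1, 0.5),
                Site2 = rbinom(10, 1, 0.5),
                Site3 = rbinom(10, 1, 0.5))

# dist() computes each unordered pair once (lower triangle, column by column)
d <- dist(a[, -1])

# combn() lists the pairs (i, j) with i < j in the same order as the dist vector
idx <- t(combn(nrow(a), 2))

# Long format with one row per unique pair: 45 rows instead of 100
pairs <- data.frame(From = a$ID[idx[, 1]],
                    To   = a$ID[idx[, 2]],
                    Dist = as.vector(d))
head(pairs)
```

For 10,000 rows the index matrix alone has about 50 million rows, so at that
scale it would probably still need to be built in chunks.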
The same approach should work for any distance metric, including those
provided by the vegdist function in the vegan package.
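On RMach's original question, as far as I understand it: vegdist expects a
matrix or data frame with one row per unit being compared and one column per
variable (species, in the usual vegan layout). Here is a small sketch with
made-up presence/absence data. It assumes 0/1 input, in which case base R's
dist(method = "binary") gives the Jaccard distance; the vegan call only runs
if the package is installed.

```r
# Hypothetical presence/absence matrix: rows are compared, columns are variables
x <- rbind(r1 = c(1, 1, 0, 0),
           r2 = c(1, 0, 1, 0),
           r3 = c(1, 1, 1, 0))

# Base R: "binary" distance is 1 - (shared presences / presences in either row)
jac <- dist(x, method = "binary")
print(jac)

# Same layout for vegan; binary = TRUE converts the data to presence/absence first
if (requireNamespace("vegan", quietly = TRUE)) {
  jac_vegan <- vegan::vegdist(x, method = "jaccard", binary = TRUE)
  print(jac_vegan)
}
```

For r1 and r2, one column is shared out of three that are present in either
row, so the distance is 1 - 1/3 = 2/3.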
Thanks for your thoughts,
Ben
On Tue, Apr 14, 2015 at 4:24 AM, RMach <rdpmachado at gmail.com> wrote:
> Hi all,
>
> how should the input matrix structure be in order to use vegdist(vegan) to
> compute jaccard index.
>
> thanks in advance.
> RMach
--
Ben Weinstein
PhD Candidate
Ecology and Evolution
Stony Brook University
http://benweinstein.weebly.com/