[adegenet-forum] find.clusters without PCA

Tue Nov 4 15:08:03 CET 2014

Dear all,

naive questions are welcome here of course. Both the question and the answer make sense here, though Fede's answer makes me think he is sometimes so rude he could be French ;)

Seriously though. The pre-PCA step has two purposes:
1) reduce the number of variables to its minimum
2) separate the noise from the structured signal

If you are not interested in #2, #1 still has a computational interest. find.clusters uses k-means, which works with squared Euclidean distances between individual profiles. Generally speaking, when you have 'N' individuals and 'P' alleles, the number of dimensions necessary to represent all the information (all the distances) is at most min(N-1, P). K-means runs faster with fewer variables, so running it on 'N-1' principal components (PCs) is generally faster than on 'P' alleles (assuming P > N-1, as is typical for genetic data). If all PCs are retained, there is no loss of information. So in short, you don't need to remove the PCA step; just keep all PCs.
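
To illustrate the "no loss of information" point, here is a minimal sketch in Python/NumPy (not adegenet code). The data are simulated toy allele counts, and the sizes N and P, the variable names, and the helper sq_dists are all made up for the illustration. It checks that the squared Euclidean distances between individuals are the same whether computed on the P alleles or on the N-1 retained PCs.

```python
import numpy as np

# Toy data: N individuals, P alleles (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
N, P = 20, 200
X = rng.integers(0, 3, size=(N, P)).astype(float)  # simulated allele counts

# PCA via SVD of the column-centred table.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# After centring, the rank is at most N-1, so N-1 PCs carry everything.
n_pcs = N - 1
scores = U[:, :n_pcs] * s[:n_pcs]  # coordinates of individuals on the PCs


def sq_dists(A):
    """All pairwise squared Euclidean distances between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)


d_alleles = sq_dists(X)     # distances in the original P-dimensional space
d_pcs = sq_dists(scores)    # distances in the (N-1)-dimensional PC space

# Largest discrepancy between the two distance matrices (rounding error only).
max_abs_diff = np.abs(d_alleles - d_pcs).max()
print(scores.shape, max_abs_diff)
```

Since k-means only depends on these distances, clustering on all the PCs should be equivalent to clustering on the raw alleles, only with fewer columns. In find.clusters, the number of retained PCs is set through the n.pca argument.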

Makes sense?

Cheers
Thibaut

________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Roberto Oliveira Santos [roberto at geodev.com.br]
Sent: 30 October 2014 18:41
To: adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] find.clusters without PCA

Hi Federico

"shaming reputations"? Sorry, I'm pretty sure I don't have any reputation :-) If anyone asks a naive question, should this be the response? I disagree... Anyway, thanks for the text. I'll keep it in mind.

Cheers,

Roberto

2014-10-30 16:16 GMT+00:00 Federico Calboli <f.calboli at imperial.ac.uk>:
You’re welcome.  I would not be presenting the results to referees, PhD examiners or colleagues.

http://judgestarling.tumblr.com/post/79974811093/shaming-reputations-as-a-means-of-reducing-the

Happy reading!

F

On 30 Oct 2014, at 16:02, Roberto Oliveira Santos <roberto at geodev.com.br> wrote:

> Dear Federico
>
> Many thanks. Very kind of you to add "It would also be completely and utterly idiotic.".
>
> Best wishes
>
> Roberto
>
>
> 2014-10-30 15:56 GMT+00:00 Federico Calboli <f.calboli at imperial.ac.uk>:
> On 30 Oct 2014, at 15:40, Roberto Oliveira Santos <roberto at geodev.com.br> wrote:
>
> > Dear all
> >
> > Is it possible to run find.clusters without the PCA analysis?
>
> I would not know whether find.clusters would like it, but in general you can surely find clusters without bothering with a PCA first — you have a formula, you input some data, you get your results.
>
> It would also be completely and utterly idiotic.
>
> You use a PCA beforehand because of correlation between the variables: the PCA transforms the data into a set of uncorrelated variables (and, into the bargain, shows you which linear combinations explain little or nothing).  You use a PCA to get some signal out of the noise.
>
> So, you can well cluster without a PCA.  You will get some results that might, or might not, look like the results you get after a PCA decomposition.  You will also have biased your clustering by an unknown amount, in a way whose meaning is not clear.
>
> BW
>
> F
>
>
> > I am interested in the clustering procedure but would like to compare the results with and without PCA transformation.
> >
> > Best wishes
> >
> > Roberto
> > _______________________________________________
> > adegenet-forum mailing list
> > adegenet-forum at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>
>
