[adegenet-forum] Question about how to interpret Cross validation in my analysis. Thanks!

Caitlin Collins caitiecollins at gmail.com
Fri Oct 24 19:03:11 CEST 2014


Hello again,


In response to your two questions:


*1)*


The output element “Median and Confidence Interval for Random Chance” provides
the values used to draw the horizontal solid (median) and dashed (CI) lines on
the plot generated for cross-validation.


In your case, the median and CI for random chance were 49.3% (42.9%, 59.6%).
The interpretation of this would be: if the highest success in outcome
prediction that you were able to achieve with any model fell between 42.9% and
59.6%, then you could be 95% confident that the ability of even the best model
to assign individuals to the correct group does not differ significantly from
the success rate you could achieve by assigning individuals to a group at
random, say, by flipping a coin. In that case, you would not have succeeded in
creating a useful model.


However, your results indicate that with 25 PCs retained, your model had a
success rate of 69.5%, so you *have* created a “useful” model. Even though it
is not a particularly successful model, its mean success rate is roughly 20
percentage points higher than the median success of the coin-toss approach,
and 10 points above the upper limit of the CI for random chance. So you can be
95% confident that the somewhat modest ability of your best model to
discriminate between groups is not just happening by chance; the model is
truly doing something useful.
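To see this numerically, you can compare the best mean success against the
random-chance interval. A minimal sketch in R, hard-coding the values from the
cross-validation output you posted:

```r
## Values copied from the xvalDapc() output quoted below in this thread
chance  <- c("2.5%" = 0.4294840, "50%" = 0.4928747, "97.5%" = 0.5962807)
success <- c("5"  = 0.5871429, "10" = 0.6000000, "15" = 0.5819048,
             "20" = 0.6014286, "25" = 0.6952381, "30" = 0.6747619,
             "35" = 0.6333333, "40" = 0.6109524)

## The model beats chance if its best mean success exceeds the upper
## 97.5% limit of the random-chance interval:
max(success) > chance["97.5%"]   # TRUE: 69.5% > 59.6%
names(which.max(success))        # "25" PCs achieves the highest mean success
```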

------

*2)*

       While your interpretation is generally true, in that group membership
is not well predicted by any model, I think you have misread the results. The
way they are laid out, at least in the text you copied into the e-mail, has
shifted the values given for the means to the right of the numbers of PCs they
should correspond to… With 25 PCs, your optimal model is actually achieving a
mean success of nearly 70%. Still not great, but better than 63%. The RMSE for
25 PCs is 32.4%, which is indeed quite high.
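This kind of misreading is easy to avoid by indexing the result vectors by
name rather than by eye. A sketch using the RMSE values copied from your
output:

```r
## RMSE values from your output, named by the number of PCs retained
rmse <- c("5"  = 0.4301795, "10" = 0.4141872, "15" = 0.4389381,
          "20" = 0.4131429, "25" = 0.3241735, "30" = 0.3531491,
          "35" = 0.3885084, "40" = 0.4145894)

rmse["25"]             # 0.3241735, i.e. ~32.4%
names(which.min(rmse)) # "25" -- the same model that maximises mean success
```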

However, the interpretation of this is not that you can only be “sure” of
correctly predicting around 20% of individuals to the right pre-defined group.
Rather, you can be “sure” of correctly predicting almost 70%! I think your
confusion here may come from your interpretation of what the random-chance
values mean. Finding that the mean success of your best model is 20 percentage
points above the median success for random chance does not mean you can only
be sure of 20% correct predictions. Rather, you could say that while you can
in fact expect a 70% success rate (your highest mean success), your model is
only providing an improvement of ~20 percentage points over the success rate
you could have achieved by tossing a coin.
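In other words, as a quick sanity check on the arithmetic, using the two
numbers from your output:

```r
best_success  <- 0.6952381  # mean success with 25 PCs retained
chance_median <- 0.4928747  # median success under random assignment

## The expected success rate is ~70%; the *improvement* over chance is
## ~20 percentage points. These are two different statements.
best_success - chance_median   # ~0.202
```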

This changes the severity of your final conclusion. First, I should mention
that it’s not quite fair to say that “[your] set of microsatellites can’t
explain well [your] pre-defined groups”. It would be more accurate to say,
“*With* the set of microsatellites available, you are unable to build a
*model* with DAPC that explains the variation between your pre-defined groups
well.” Finally, in light of the points above, while it is still true that the
model does not discriminate between groups particularly well, it does
correctly assign about 70% of individuals, so I wouldn’t consider it
“unsuccessful”.

-----

Sorry for the long answer, but I hope it helps a bit at least!

Please let me know if it doesn’t though, or if you have any more questions.



All the best,
Caitlin.



On Thu, Oct 16, 2014 at 11:30 PM, Angela Merino <
Angela.Merino at cawthron.org.nz> wrote:

> Thank you very much! It was really helpful! :)
>
>
>
> Then I understand that my model is not significantly the best model that
> could be found using my variables (in my case, microsatellites). If I use a
> model with n.pca=20 or =40 I get pretty much the same success of membership
> prediction (and the same large root mean squared error).
>
>
>
> 1)      My last question (I hope!) about the output of the
> *cross.validation* function: what does the Median and Confidence Interval
> for Random Chance mean (below in yellow)? I think it means that with 95%
> confidence the successful-assignment value under random chance would lie
> between 43% and 60%, which therefore means again that the optimization of
> my model was “not successful”. (??)
>
> 2)      About the global interpretation of these results, I would say that
> membership of my predefined groups is not well predicted by any model, as
> the mean successful assignment is not higher than 63% (maximum when
> n.pca=25) and in addition the root mean squared error is quite high
> (30-40%). I would be “sure” of predicting only around 20% of individuals to
> the right predefined group. In short, my set of microsatellites can’t
> explain my predefined groups well.
>
>
>
>
>
> $`Median and Confidence Interval for Random Chance`
>
>      2.5%       50%     97.5%
> 0.4294840 0.4928747 0.5962807
>
> $`Mean Successful Assignment by Number of PCs of PCA`
>
>         5        10        15        20        25        30        35        40
> 0.5871429 0.6000000 0.5819048 0.6014286 0.6952381 0.6747619 0.6333333 0.6109524
>
> $`Number of PCs Achieving Highest Mean Success`
>
> [1] "25"
>
> $`Root Mean Squared Error by Number of PCs of PCA`
>
>         5        10        15        20        25        30        35        40
> 0.4301795 0.4141872 0.4389381 0.4131429 0.3241735 0.3531491 0.3885084 0.4145894
>
> $`Number of PCs Achieving Lowest MSE`
>
> [1] "25"
>
> Thanks in advance! I am learning a lot about R and the adegenet package,
> and I find it really interesting for assessing weak genetic population
> structure.
>
>
>
> Kind regards,
>
>
>
> ‘Angela
>
>
> *From:* Caitlin Collins [mailto:caitiecollins at gmail.com]
> *Sent:* Friday, 17 October 2014 1:28 a.m.
> *To:* Angela Merino
> *Cc:* Collins, Caitlin; Jombart, Thibaut
> *Subject:* Re: Question about how to interpret Cross validation in my
> analysis. Thanks!
>
>
>
> Hi Angela,
>
> Well, I have two pieces of good news for you, and one piece of mediocre
> news.
>
> First, there’s nothing to worry about with respect to the “NULL” that you
> are seeing. It just gets printed when xval.plot=TRUE as an artefact of one
> of the lines of the printing function. It has no meaning, and certainly
> does not imply that your model is not valid. (Given the stress that I now
> realise this glaring “NULL” may cause, I’ve changed the way the plots print
> now, so in the next release of adegenet this won’t happen.)
>
> Second, you are absolutely correct in your interpretation of the results
> of xvalDapc (which are stored in whatever object you assigned the results
> to, in your case, “xval”).
>
>
>
> This brings me to the mediocre news: given that your interpretation is
> correct, it seems that the best model you can achieve with DAPC, where
> n.pca=25, is only able to predict the group membership of validation set
> individuals in 63% of the cases, with a 32% root mean squared error.
> Arguably, this is not great. Your final comment on the matter, though, is
> quite insightful. The fact that you can achieve the same modest level of
> success with 20-80 PCs indicates that the optimisation procedure has not
> been particularly successful. Ideally, one would like to see an arch, with
> a maximum success point somewhere in the middle. In your case, there is a
> bit of an arch, but it isn’t particularly striking.
>
>
>
> The only thing I might add to your interpretation of this result is that
> it’s not so much that the model is poor because a similar level of success
> can be achieved with variable numbers of PCs. If mean success was virtually
> constant, but varying around 90%, the interpretation would not be that the
> model is poor, but rather that most levels of PC retention can compose a
> model that effectively discriminates between groups.
>
> I hope this has helped answer some of your questions. If you have any
> more, please feel free to ask.
>
> Best,
> Caitlin.
>
>
>
>
>
> On Mon, Oct 13, 2014 at 11:48 PM, Angela Merino <
> Angela.Merino at cawthron.org.nz> wrote:
>
> Hi Caitlin Collins and Thibaut Jombart,
>
>
>
> My name is Angela Parody-Merino and I am a PhD student at Massey
> University (New Zealand). I am studying population genetic structure in a
> migratory bird (the New Zealand godwit) with 23 microsatellites. Maybe this
> is a very simple question, but I really want to understand and be sure
> about the meaning and interpretation of the output when doing
> cross-validation. I have spent some days searching the internet and reading
> explanations without being able to really understand what’s going on with
> my analysis. Could you help me please? :)
>
>
>
> This is the script of the analysis:
>
> > x <- ELpop
>
> > mat <- as.matrix(na.replace(x, method="mean"))
>
>
>
> Replaced 371 missing values
>
> > grp <- pop(x)
>
> > xval <- xvalDapc(mat, grp, n.pca.max = 40, training.set = 0.9,
>
> + result = "groupMean", center = TRUE, scale = FALSE,
>
> + n.pca = NULL, n.rep = 500, xval.plot = TRUE)
>
> NULL *>>> What does this NULL mean? Does it mean that the model is not
> valid?*
>
> $`Median and Confidence Interval for Random Chance`
>
>      2.5%       50%     97.5%
> 0.4294840 0.4928747 0.5962807
>
> $`Mean Successful Assignment by Number of PCs of PCA`
>
>         5        10        15        20        25        30        35        40
> 0.5871429 0.6000000 0.5819048 0.6014286 0.6952381 0.6747619 0.6333333 0.6109524
>
> $`Number of PCs Achieving Highest Mean Success`
>
> [1] "25"
>
> $`Root Mean Squared Error by Number of PCs of PCA`
>
>         5        10        15        20        25        30        35        40
> 0.4301795 0.4141872 0.4389381 0.4131429 0.3241735 0.3531491 0.3885084 0.4145894
>
> $`Number of PCs Achieving Lowest MSE`
>
> [1] "25"
>
>
>
> *From the screenshot and the output results of the cross-validation (in
> blue), I would say that my model (retaining 25 PCs) can predict with a mean
> success of 63%, but it is not such a good model because most of the models
> obtained by retaining 20, 40, 60, or 80 PCs are about equally successful.
> Is my interpretation correct?*
>
>
>
>
>
>
>
> Thanks in advance,
>
>
>
> Kind regards,
>
>
>
> ‘Angela Parody-Merino
>
>
>


More information about the adegenet-forum mailing list