[adegenet-forum] Question about how to interpret Cross validation in my analysis. Thanks!

Caitlin Collins caitiecollins at gmail.com
Wed Aug 19 21:38:33 CEST 2015


Hi Angela,


Sorry for the delay in getting back to you. I’ve just returned from
teaching in the woods without Internet.


I’m glad to see you’ve been thinking so much about cross-validation and
mean-squared error!


First of all, *I think you may be focusing too much on the spread of your
RMSE*. The RMSE is just a measure of how far, on average, your estimates are
from the target (in our case, 100% success). We use RMSE instead of the
mean success to pick the “best” n.pca in cross-validation because, as I
demonstrated with my toy example, when two sets of results have the same
mean, RMSE identifies which of them achieves that mean with the lower
error, ie. more consistently.


That said, I think there may be an error in your sample calculations for
the toy example you present regarding RMSE! *My calculations for those two
sets of numbers show that while they both have the same mean, their RMSEs
do differ*. Try running the following commands to get those results (and
then, if you want, try the examples that follow):



## your case 1
x <- c(30.5, 34.5, 31, 31, 35)
mean(x)
# 32.4
sqrt(mean(((100-x)^2)))
# 67.62766

## your case 2
x <- c(50, 15, 34, 38, 25)
mean(x)
# 32.4
sqrt(mean(((100-x)^2)))
# 68.62944

## Try these out too to get a feel for how RMSE can vary while the mean
## stays the same:
x <- c(5,95)
mean(x)
# 50
sqrt(mean(((100-x)^2)))
# 67.27

x <- c(25,75)
mean(x)
# 50
sqrt(mean(((100-x)^2)))
# 55.9

x <- c(5, 25, 75, 95)
mean(x)
# 50
sqrt(mean(((100-x)^2)))
# 61.85

x <- c(5, 5, 25, 25, 75, 75, 95, 95)
mean(x)
# 50
sqrt(mean(((100-x)^2)))
# 61.85

x <- c(5, 5, 25, 75, 95, 95)
mean(x)
# 50
sqrt(mean(((100-x)^2)))
# 63.71



In general, I would say that *because RMSE is able to tell you which n.pca
gives you the best model, you don’t necessarily need to delve much further
into the results to pick the optimum model*.


If you do, nevertheless, want to look into your results for each replicate
of cross-validation, that should be possible. When you ran the analysis to
which you are referring, did you happen to save the output of xvalDapc? If
so, *the first element of this output contains the results for each
replicate*, at each level of PC retention. If not, I would suggest running
it again with the same argument settings as before. And, as an added
suggestion, try running the command set.seed(1) directly before running
xvalDapc. This will allow you to control the random behaviour inherent in
xvalDapc’s random sampling: the procedure is still effectively random, but
you will get exactly the same results every time you run set.seed(1)
followed by xvalDapc, in case you want to be able to perfectly replicate
your results on a future occasion.
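
For example, something along these lines (just a sketch; I’m assuming your
data matrix and groupings are still in the objects mat and grp from your
original script, and that the other arguments are set as you had them
before):

library(adegenet)

set.seed(1)   # makes the random sampling reproducible
xval <- xvalDapc(mat, grp, n.pca.max = 40, training.set = 0.9,
                 result = "groupMean", center = TRUE, scale = FALSE,
                 n.pca = NULL, n.rep = 500, xval.plot = TRUE)

## the first element of the output holds the raw results:
## one success value per replicate, at each level of PC retention
head(xval[[1]])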



I’ve also posted something relatively recently that I think may help
address your final point. *Please take a look at one of my previous
posts to the adegenet forum on the subject of interpreting “accuracy” in
xvalDapc results:*
http://lists.r-forge.r-project.org/pipermail/adegenet-forum/2015-July/001212.html


Does that help at all? Overall, I would say that, *particularly because
you said you are not actually trying to build a model for the purposes of
prediction, you may be better off just reporting the proportion of
successful assignment achieved with the DAPC model built with the optimum
number of PCs selected via xvalDapc (ie. using assign.prop from
summary(dapc))*.
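
For instance (a minimal sketch, assuming the xvalDapc output is stored in
an object called xval, and that its final element, the DAPC fitted with the
optimal number of PCs, is named “DAPC”):

## DAPC built with the optimal number of PCs (25 in your case)
best.dapc <- xval$DAPC

## proportion of individuals successfully reassigned to their
## pre-defined groups; this is the figure I’d suggest reporting
summary(best.dapc)$assign.prop

## per-group success rates, if you want them as well
summary(best.dapc)$assign.per.pop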


If you still have questions after reading this e-mail and that post, of
course I’ll still be here to address them!


All the best,

Caitlin.

On Tue, Aug 4, 2015 at 4:05 AM, Angela Merino <Angela.Merino at cawthron.org.nz
> wrote:

>
>
> Thank you very much for your detailed and really useful explanations. :)
>
>
>
> I will try to explain what I aim to do:
>
>
>
> First of all, a bit of background: I am working with a migratory bird
> species on which no genetic analysis has been done before. This species
> migrates from New Zealand to Alaska, where it breeds. We don’t know much
> about its ecology in Alaska, but from data recorded in New Zealand, and
> from geolocators that have been fitted to some individuals and
> successfully retrieved, we have seen that there is a relationship between
> departure time from NZ and breeding latitude in Alaska (earlier departers
> breed further south, later departers further north: a cline). Moreover,
> although the species departs for migration over a span of time,
> individuals show high consistency in their departure time (so I
> categorized individuals into predefined clusters based on behavior, i.e.
> “early birds” and “late birds”). In other words, my hypothesis is that
> because there is a relationship between timing of migration and breeding
> site, the birds may show genetic structure (which could be detected using
> neutral markers such as microsatellites). If I find a sign of significant
> structure in the population using predefined clusters based on behavior
> (I have 145 individuals sampled that also have behavior data), I could say
> that migratory patterns are at least associated with subtle population
> structure.
>
>
>
> STRUCTURE v.2.3 (Bayesian clustering) found no structure (K = 1); however,
> K = 2 had quite similar support.
>
> In Arlequin (AMOVA) I got a very weak, but significant, Fst between the two
> predefined groups with the most extreme behaviors!
>
> The K-means algorithm finds K = 2 when I use the predefined subpopulations
> “early” vs. “late”, although the assignment to these inferred clusters
> doesn’t correspond very well with my predefined clusters: only about 60%
> correct assignment. But I think this is in accordance with the subtle
> population structure found in the other analyses.
>
>
>
> So my aim with DAPC is to explore further this very weak genetic structure
> that Bayesian clustering and AMOVA are suggesting. I don’t aim to build a
> model to use later for identifying the behavior of an individual from its
> markers; rather, I aim to see whether it is possible to build a model
> (using a number of PCs from the information provided by the
> microsatellites) that can identify/classify/group/distinguish individuals
> by their correct behavior, using those 145 individuals for which I have
> both genotypes (microsatellites) and behavior data. Therefore, if I can
> build a model from these microsatellites that distinguishes migratory
> behavior with better success than random chance, that would suggest that
> the neutral-marker structure in the population is associated with
> migratory behavior in this long-distance migratory bird: my predefined
> clusters based on behavior are not random groups but mean something (in
> relation to the population structure).
>
>
>
>
>
> ---------
>
> Answering this question:
>
> Based on the results of those other clustering methods you mention, I
> assume you have identified potential clusters based on the genetic markers
> you are using (right?). When running the DAPC analysis, you have grouped
> individuals into clusters of interest. So my first question is:* Have you
> used either (i) The clusters identified by other methods? Or (ii) Clusters
> based on “behaviour”, that are NOT the same as those identified by other
> methods?*
>
>  *If you have used (ii)* Clusters based on “behaviour”, that are NOT the
> same as those identified by other methods, *then can you please try to
> explain again what exactly you are trying to do? *
>
>
>
> Yes, I did not use the clusters identified by other methods, since there
> was not much accordance between my clustering and the K-means inferred
> clusters, as I said before. So I decided to keep my predefined clusters
> based on my behavior data (which is quite robust data, from observations
> and geolocators collected over about 6 years).
>
>
>
> I hope this clarifies what I am doing :). Please tell me if you think that
> this method would help me to achieve my aim.
>
>
>
> As for the specific question about the meaning of the value of MSE, thanks
> very much for such a detailed explanation; I now understand it much better. :)
>
>
>
> I think my situation is more similar to the case 2 you explained. However,
> I don’t have the data for each replicate (if I remember well, I did 15
> replicates (?), which correspond to each point per number of PCs in the
> graphic; see below, and don’t pay attention to the yellow highlighted
> text), so I don’t know how different the MSE for each replicate is from
> the others. It could be that the RMSE, which is 32.4%, came from:
>
>
>
> sqrt((((30.5-100)^2) + ((34.5-100)^2) + ((31-100)^2) + ((31-100)^2) + ((35-100)^2))
> / 5)
>
> *RMSE = 32.4% (in blue, the lower success value, in red, the higher
> success value) **→** more consistent*
>
>
>
> *Or*
>
>
>
> sqrt((((50-100)^2) + ((15-100)^2) + ((34-100)^2) + ((38-100)^2) +
> ((25-100)^2)) / 5)
>
> *RMSE = 32.4% (in blue, the lower success value, in red, the higher
> success value) **→** less consistent*
>
>
>
> *among many other possibilities…..*
>
>
>
> Therefore, one needs to see the values from which the mean success and the
> RMSE are calculated; otherwise, I can’t determine how consistent the
> error (or success) of my model is (right?).
>
> In other words, without this information (success and/or error for each
> replicate per X number of PCs), I don’t know whether the value of my RMSE
> (32.4%) is far from the highest error my model could have. Or, in other
> words, I won’t know (and indeed I don’t know) what is the lowest success I
> could see in my “best” model.
>
>
>
> If the mean success and RMSE were consistent across the replicates at
> PCs = 25 for my “best” model, it would mean that I have found a model that
> is able to assign individuals correctly to the behavior-group they belong
> to better than just classifying them randomly. And, although it is not a
> very useful model (not a great mean success and quite a high error, as
> discussed before), it is meaningful.
>
>
>
> Thank you very much,
>
>
>
> I will be looking forward to your answer. :) Let me know if you need
> more clarification.
>
>
>
> ‘Angela
>
>
>
>
>
> *From:* Caitlin Collins [mailto:caitiecollins at gmail.com
> <caitiecollins at gmail.com>]
> *Sent:* Wednesday, 29 April 2015 6:22 a.m.
> *To:* Angela Merino
> *Cc:* t.jombart at imperial.ac.uk
>
> *Subject:* Re: Question about how to interpret Cross validation in my
> analysis. Thanks!
>
>
>
> Hi Angela,
>
>
>
> Nice to hear from you.
>
>
>
> *Before I can answer you, I think I need to ask you to explain in a little
> more detail what you are trying to do. *
>
> My confusion is mainly with:
>
> (1) Your aim
>
> (2) Your clusters. (ie. DAPC clusters based on behaviour, versus clusters
> identified by Bayesian, K-means, and AMOVA)
>
>
>
> You say, “… if I can build a model from genotyping (neutral markers) I
> could say that the behavior I used to defined those subpopulations would be
> associated with the very weak pop. structure I found with other methods
> (Bayesian, K-means, AMOVA)”. This is mostly what I am confused about…
>
>
>
> Based on the results of those other clustering methods you mention, I
> assume you have identified potential clusters based on the genetic markers
> you are using (right?). When running the DAPC analysis, you have grouped
> individuals into clusters of interest. So my first question is:* Have you
> used either (i) The clusters identified by other methods? Or (ii) Clusters
> based on “behaviour”, that are NOT the same as those identified by other
> methods?*
>
>
>
> *If you have used (i)* The clusters identified by other methods, then I
> would take your aim in using DAPC to be: “Identify the genetic markers that
> best describe the differences between the (weakly-supported (??))
> population clusters identified by other methods”. *Does that sound like
> what you are trying to do?*
>
>
>
> *If you have used (ii)* Clusters based on “behaviour”, that are NOT the
> same as those identified by other methods, *then can you please try to
> explain again what exactly you are trying to do? *
>
>
>
>
> *--------------------------------------------------------------------*
>
>
>
> *For now, without fully understanding what you are trying to do, I can try
> to give you some input…*
>
>
>
> First, you are correct in saying that the MSE (32.4%) is quite high (ie.
> Not good). However, as we discussed in previous e-mails, because your mean
> success falls outside of the limits of the confidence interval for random
> chance, you will still be able to use the "best" model based on 25 PCs (ie.
> While it is not a great model, it is not a meaningless model either).
>
>
>
> Second, *let me try and explain why RMSE is given as a percent, and what
> RMSE really is:*
>
> -       The root-mean-square error or “root-mean-square deviation” (same
> thing, we’ll just call it “*RMSE*”) is just *a measure of the difference
> between the expected value for a model and the observed values*.
>
> -       In the case of cross-validation, we are trying to *build a model
> by combining some number of principal components (PCs).*
>
> -       As you can see in the cross-validation plot produced by xvalDapc,
> the *“value”* we are measuring is the *proportion of successful outcome
> prediction.*
>
> o   *NOTE*: In the plot, this value ranges from 0 to 1, representing the
> proportion of individuals, out of 1, for which we were able to successfully
> predict the cluster to which they belong, based on the model composed of
> however many PCs are indicated on the x-axis. However, this “proportion”
> can also be thought of as a percent. This is also the case for the *RMSE,
> which can be thought of approximately as the complement of the “Proportion
> successful outcome prediction” or “Mean successful assignment” **(as you
> know, these last two are the same thing, both representing “success”), so
> the RMSE is approximately (1 - success).  *
>
>
>
> *I think I will be able to explain this best with a simple example, based
> loosely on your results (but not identical to them): *
>
>
>
> *Example.1: *
>
> -       You see that the *highest mean success* you can get with any
> number of PCs in the model is *0.69*. This proportion if written as a
> percent would be *69%*.
>
> o   So you can say that, “the highest mean success you can achieve in
> correctly predicting the cluster of the individuals in your dataset is 69%
> (with 25 PCs in the model)”. (ie. 69 times out of 100, you would correctly
> identify the true cluster of an individual, with a model based on 25 PCs).
>
> -       You also see that the *lowest RMSE* you can achieve is *0.32417… *This
> proportion, if written as a percent, would be ~ *32.4%* RMSE.
>
> o   You may have noticed that the RMSE with a model based on 25 PCs (RMSE
> = 32.4%) is *approximately* (*100% - success)*, based on 25 PCs.
>
> -       This brings me to *the reason we used RMSE to choose the “best”
> number of PCs to keep in the model, instead of  mean success:*
>
> o   While in your case, *both* highest mean success *and *lowest RMSE
> were found to occur when the model was based on 25 PCs, * this is not
> always the case*.
>
> o   Sometimes, the model with the lowest RMSE is based on a *different *number
> of PCs than the model with the highest mean success. This is because, as
> you will have noticed, the* RMSE is NOT exactly (100% - success)*.
>
> o   *Let me illustrate this with another little example:*
>
>
>
> *Example 1.1:*
>
> o   Say you have performed your cross-validation analysis with only *5
> replications*.
>
> o   *Both** of the following sets of observed values would allow you to
> achieve a mean success of 69%:*
>
> §  *Case 1:* *69, 69, 69, 69, 69*
>
> §  *Case 2:* *65, 69, 73, 54.5, 83.5 *
>
> o   We can calculate the RMSE for both of these sets of values.
>
> o   First,* recall* that our hypothetical* “expected value”* is* always
> 100%.*
>
> o   The following is the equation for calculating the RMSE:
>
> §  *RMSE = sqrt( E[(x’ – x)^2] )*
>
> o   In English, this formula says that the root mean squared error is
> equal to the square root of the expected value (ie. the mean across all
> observations) of the squared difference between x’ (the observed value,
> ie. the observed success for a given point) and x (the “expected value”,
> which we have said is always 100% success).
>
> o   Let’s calculate RMSE for both cases:
>
> o   *Case 1: *
>
> o   RMSE =
>
> §  sqrt((((69-100)^2) + ((69-100)^2) + ((69-100)^2) + ((69-100)^2) + ((69-100)^2))
> / 5)
>
> §  *RMSE = 31%*
>
> o   *Case 2:*
>
> o   RMSE =
>
> §  sqrt((((65-100)^2) + ((69-100)^2) + ((73-100)^2) + ((54.5-100)^2) +
> ((83.5-100)^2)) / 5)
>
> §  *RMSE = 32.4%* (*just like yours!*)
>
> -       In *Case 1*, because for each point, we got a success rate of
> exactly 69%, the RMSE is *exactly (100% – success) because 100 – 69 = 31.
> *
>
> -     However, in *Case 2*, the success rates for each round varied
> (although they still gave a mean success of 69%), so the *RMSE is NOT
> exactly 100 – success, but a little more than this number (32.4%).*
>
> -       The reason Case 2 had a higher (read: *worse*) RMSE than Case 1
> is because, while they both had a mean success of 69%, the points in the
> Case 2 set varied from this value (ie. sometimes the success rate was much
> lower than 69, and sometimes it was higher).
>
> -       *This is why we use RMSE to determine which model has the “best”
> number of PCs, instead of mean success*. If, for example, we had gotten
> the values for Case 1 with 30 PCs in the model, and the values for Case 2
> with 25 PCs in the model, we would want to choose the model with 30 PCs as
> the “best” model. This is because, while both models were able to give a
> mean success rate of 69%, the model with 30 PCs (ie. Case 1) was able to do
> this *more consistently*. Therefore, if we choose the model with 30 PCs,
> we can be confident that we will consistently have about 69% success and
> 31% error, while if we chose the model with 25 PCs (Case 2), we would
> expect about 69% success, but we could see success as low as 54.5%
> (corresponding to errors as high as 45.5%).
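>
>
> In R, those two calculations look like this (just a sketch of the same
> arithmetic, taking 100% as the target value):
>
> ## Case 1: perfectly consistent replicates
> x <- c(69, 69, 69, 69, 69)
> sqrt(mean((100 - x)^2))
> # 31
>
> ## Case 2: same mean success (69%), but more spread
> x <- c(65, 69, 73, 54.5, 83.5)
> sqrt(mean((100 - x)^2))
> # ~32.4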
>
>
>
>
>
> Okay, now I will stop writing and say sorry for the very very long answer!
> But I hope this helps you in thinking about and interpreting RMSE.
>
>
>
>
> Please let me know more about my question in Part I, and I will do my best
> to help you there.
>
>
>
> All the best,
>
> Caitlin.
>
>
>
> On Fri, Apr 24, 2015 at 2:23 AM, Angela Merino <
> Angela.Merino at cawthron.org.nz> wrote:
>
> Hi Caitlin Collins,
>
>
>
> I have another question regarding the interpretation of the *root mean
> square error*. Sorry for this lapse of time between questions! I find this
> very interesting, but a challenge for me to really get the right
> interpretation.
>
>
>
> I understood everything you explained to me in previous emails (very
> helpful, thanks so much!!!). :) :)
>
>
>
> But now I have been thinking about the mean squared error and what it is
> telling me about the model for predicting predefined subpopulations.
> Refreshing my situation: I aim to test predefined subpopulations by using
> genotype data and DAPC to build a model that may be able to find a function
> to accurately distinguish those two hypothetical subpopulations. My
> rationale is that if I can build a model from genotyping (neutral markers),
> I could say that the behavior I used to define those subpopulations would
> be associated with the very weak population structure I found with other
> methods (Bayesian, K-means, AMOVA). I got a model with a success rate of
> 69% (random chance 49%, 43%-60%), and I understood/agree that my success
> rate being higher than the upper limit of random chance…etc. However, as
> you said, the MSE is quite high (*32.4%).*
>
>
>
> I read about MSE and understood that it gives information about how well
> data points fit the line of the function (right?). Well, it is actually
> the average of the distances of my set of data points from the line given
> by the function (I try to visualize it…). But I don’t understand well *why
> it is given in %?*
>
>
>
> I have the impression that what I have is lots of outliers that don’t fit
> well with the function (I visualize this as noise, a lot of noise… which
> for me means low accuracy, therefore not a good predictor of my predefined
> subpopulations, therefore the genetic data are not very useful to identify
> my hypothetical subpopulations… therefore my hypothetical subpopulations
> may not make sense given the weak genetic structure that other methods
> (i.e. Bayesian, AMOVA and K-means) suggested). *What do you think?*
>
>
>
> Thanks in advance, :)
>
>
>
> Kind regards,
>
>
>
> ‘Angela
>
>
>
> *From:* Caitlin Collins [mailto:caitiecollins at gmail.com]
> *Sent:* Saturday, 25 October 2014 6:03 a.m.
> *To:* Angela Merino; adegenet-forum at lists.r-forge.r-project.org
>
>
> *Cc:* Collins, Caitlin; Jombart, Thibaut
> *Subject:* Re: Question about how to interpret Cross validation in my
> analysis. Thanks!
>
>
>
> Hello again,
>
>
>
> In response to your two questions:
>
>
>
> *1) *
>
>
>
> The output element “mean and CI for random chance” provides the values
> that are used to draw the horizontal solid (mean) and dashed (CI) lines on
> the plot generated for cross-validation.
>
>
>
> In your case, the mean and CI for random chance was 49% (43%, 60%). The
> interpretation of this would be that if the highest success in outcome
> prediction that you were able to achieve with any model was between 43% and
> 60%, then you could be 95% confident that the ability of even the best
> model to assign individuals to the correct group does not differ
> significantly from the success rate you could achieve by assigning
> individuals to a group at random by, say, flipping a coin as a method of
> determining what group they belonged to. Ergo, you would not have succeeded
> in creating a useful model.
>
>
>
> However, your results indicate that with 25 PCs retained, your model had a
> success rate of 69.5%, so you *have* created a “useful” model. Even
> though it is not a particularly successful model, it still has a mean
> success rate that is 20% higher than the mean success for the coin toss
> approach, and 10% higher than the upper limit of the CI for random chance.
> So you can be 95% confident that the somewhat modest ability of your best
> model to discriminate between groups is not just happening by chance—the
> model is truly doing something useful.
>
> ------
>
> *2)*
>
>        While your interpretation is generally true, in that group
> membership is not well-predicted by any model, I think you have mis-read
> the results. The way they are laid out, at least in the text you copied
> into the e-mail, has skewed the values given for the means to the right of
> the number of PCs that they should be corresponding to… With 25 PCs, your
> optimal model is actually achieving a mean success of nearly 70%. Still not
> too good, but better than 63%. The MSE for 25 PCs is 32.4%, which is indeed
> quite high.
>
> However, the interpretation of this is not that you can only be “sure” of
> correctly predicting around 20% to the right pre-defined group. Rather, you
> can be “sure” of correctly predicting almost 70%! I think your confusion
> here may come from your interpretation of what the random chance values
> mean. Finding that the mean success for your best model is 20% above the
> mean success for random chance does not mean you can only be sure of 20%
> correct predictions. Rather, you could say that while you can in fact
> expect a 70% success rate (your highest mean success), your model is only
> providing an improvement of ~ 20% over the success rate you could have
> achieved by tossing a coin.
>
> This changes the severity of your final conclusion. First, I should
> mention that it’s not fair to say that “[your] set of microsatellites can’t
> explain well [your] pre-defined groups”. Instead, it might be more accurate
> to say, “*With* the set of microsatellites available, you are unable to
> build a *model* with DAPC that explains well the variation between your
> pre-defined groups.” Finally, in light of the points above, while it is
> still true that the model does not explain the variation between groups
> particularly well, it does explain about 70% of that variation, so I
> wouldn’t consider it to be “unsuccessful”.
>
> -----
>
> Sorry for the long answer, but I hope it helps a bit at least!
>
> Please let me know if it doesn’t though, or if you have any more
> questions.
>
>
>
> All the best,
> Caitlin.
>
>
>
> On Thu, Oct 16, 2014 at 11:30 PM, Angela Merino <
> Angela.Merino at cawthron.org.nz> wrote:
>
> Thank you very much! It was really helpful! :)
>
>
>
> Then I understand that my model is not significantly the best model that
> could be found using my variables (in my case, microsatellites). If I use a
> model with n.pca=20 or =40 I get pretty much the same success of membership
> prediction (and with the same big root mean squared error).
>
>
>
> 1)      My last question (I hope!) about the output of the
> *cross-validation* function: what does the Median and
> Confidence Interval for Random Chance mean (below, in yellow)? I think it
> means that with a confidence of 95% the value of successful assignment
> would be between 43% and 60%, which therefore means again that the
> optimization of my model was “not successful”. (??)
>
> 2)      About the global interpretation of these results, I would say that
> membership of my predefined groups is not well predicted by any model, as
> the mean successful assignment is not higher than 63% (maximum when
> n.pca=25) and, in addition, the root mean squared error is quite high
> (30-40%). I would be “sure” of predicting only around 20% to the right
> predefined group. In short, my set of microsatellites can’t explain my
> predefined groups well.
>
>
>
>
>
> *$`Median and Confidence Interval for Random Chance`*
>
> *     2.5%       50%     97.5% *
>
> *0.4294840 0.4928747 0.5962807*
>
> *$`Mean Successful Assignment by Number of PCs of PCA`*
>
> *        5        10        15        20        25        30        35        40 *
> *0.5871429 0.6000000 0.5819048 0.6014286 0.6952381 0.6747619 0.6333333 0.6109524 *
>
> *$`Number of PCs Achieving Highest Mean Success`*
>
> *[1] "25"*
>
> *$`Root Mean Squared Error by Number of PCs of PCA`*
>
> *        5        10        15        20        25        30        35        40 *
> *0.4301795 0.4141872 0.4389381 0.4131429 0.3241735 0.3531491 0.3885084 0.4145894 *
>
> *$`Number of PCs Achieving Lowest MSE`*
>
> *[1] "25"*
>
>
>
> Thanks in advance! I am learning a lot about R and adegenet package and I
> find really interesting to assess weak genetic population structure.
>
>
>
> Kind regards,
>
>
>
> ‘Angela
>
>
>
> *From:* Caitlin Collins [mailto:caitiecollins at gmail.com]
> *Sent:* Friday, 17 October 2014 1:28 a.m.
> *To:* Angela Merino
> *Cc:* Collins, Caitlin; Jombart, Thibaut
> *Subject:* Re: Question about how to interpret Cross validation in my
> analysis. Thanks!
>
>
>
> Hi Angela,
>
>
> Well, I have two pieces of good news for you, and one piece of mediocre
> news.
>
> First, there’s nothing to worry about with respect to the “NULL” that you
> are seeing. It just gets printed when xval.plot=TRUE as an artefact of one
> of the lines of the printing function. It has no meaning, and certainly
> does not imply that your model is not valid. (Given the stress that I now
> realise this glaring “NULL” may cause, I’ve changed the way the plots print
> now, so in the next release of adegenet this won’t happen.)
>
> Second, you are absolutely correct in your interpretation of the results
> of xvalDapc (which are stored in whatever object you assigned the results
> to, in your case, “xval”).
>
>
>
> This brings me to the mediocre news: given that your interpretation is
> correct, it seems that the best model you can achieve with DAPC, where
> n.pca=25, is only able to predict the group membership of validation set
> individuals in 63% of the cases, with a 32% root mean squared error.
> Arguably, this is not great. Your final comment on the matter, though, is
> quite insightful. The fact that you can achieve the same modest level of
> success with 20-80 PCs indicates that the optimisation procedure has not
> been particularly successful. Ideally, one would like to see an arch, with
> a maximum success point somewhere in the middle. In your case, there is a
> bit of an arch, but it isn’t particularly striking.
>
>
>
> The only thing I might add to your interpretation of this result is that
> it’s not so much that the model is poor because a similar level of success
> can be achieved with variable numbers of PCs. If mean success was virtually
> constant, but varying around 90%, the interpretation would not be that the
> model is poor, but rather that most levels of PC retention can compose a
> model that effectively discriminates between groups.
>
> I hope this has helped answer some of your questions. If you have any
> more, please feel free to ask.
>
> Best,
> Caitlin.
>
>
>
>
>
> On Mon, Oct 13, 2014 at 11:48 PM, Angela Merino <
> Angela.Merino at cawthron.org.nz> wrote:
>
> Hi Caitlin Collins and Thibaut Jombart,
>
>
>
> My name is Angela Parody-Merino and I am a PhD student at Massey
> University (New Zealand). I am studying the population genetic structure of
> a migratory bird (the New Zealand Godwit) with 23 microsatellites. Anyway,
> maybe this is a very simple question, but I really want to understand and
> be sure about the meaning and interpretation of the output when doing
> cross-validation. I have spent some days looking on the internet and
> reading explanations etc. without being able to really understand what’s
> going on with my analysis. Could you help me please? :)
>
>
>
> This is the script of the analysis:
>
> > x <- ELpop
>
> > mat <- as.matrix(na.replace(x, method="mean"))
>
>
>
> Replaced 371 missing values
>
> > grp <- pop(x)
>
> > xval <- xvalDapc(mat, grp, n.pca.max = 40, training.set = 0.9,
>
> + result = "groupMean", center = TRUE, scale = FALSE,
>
> + n.pca = NULL, n.rep = 500, xval.plot = TRUE)
>
> NULL *>>> What does this NULL mean? Does it mean that the model is not
> valid?*
>
> *$`Median and Confidence Interval for Random Chance`*
>
> *     2.5%       50%     97.5% *
>
> *0.4294840 0.4928747 0.5962807 *
>
>
>
> *$`Mean Successful Assignment by Number of PCs of PCA`*
>
> *        5        10        15        20        25        30        35        40 *
> *0.5871429 0.6000000 0.5819048 0.6014286 0.6952381 0.6747619 0.6333333 0.6109524 *
>
>
>
> *$`Number of PCs Achieving Highest Mean Success`*
>
> *[1] "25"*
>
>
>
> *$`Root Mean Squared Error by Number of PCs of PCA`*
>
> *        5        10        15        20        25        30        35        40 *
> *0.4301795 0.4141872 0.4389381 0.4131429 0.3241735 0.3531491 0.3885084 0.4145894 *
>
>
>
> *$`Number of PCs Achieving Lowest MSE`*
>
> *[1] "25"*
>
>
>
> *From the screenshot and the output results of the cross-validation (in
> blue), I would say that my model (retaining 25 PCs) can predict with a mean
> success of 63%, but it is not such a good model because most of the models
> that can be obtained by retaining 20, 40, 60, 80 PCs are about equally
> successful. Is my interpretation correct?*
>
>
>
>
>
>
>
> Thanks in advance,
>
>
>
> Kind regards,
>
>
>
> ‘Angela Parody-Merino

