[adegenet-forum] adegenet-forum Digest, Vol 70, Issue 16

Thu Jun 19 14:27:39 CEST 2014

Hi Caitlin.

Good point!

In fact, I hadn't noticed this subtle nuance in the rationale behind
cross-validation: it uses a stratified sample of 10% of individuals (the
validation set) in the well-exemplified nancycats dataset, and cycles
through PC retention, sampling and DAPC for each number of PCs retained,
BUT not with the same set of individuals in each round.
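
To make that concrete for myself, here is a minimal sketch in base R of
drawing a fresh stratified 10% validation set in every round. It is only an
illustration of the sampling idea, not the code used inside adegenet; the
group labels and the helper name stratified_holdout() are hypothetical.

```r
# Hypothetical group labels: 3 populations of 40, 30 and 50 individuals.
grp <- factor(rep(c("popA", "popB", "popC"), times = c(40, 30, 50)))

# Draw a stratified validation set: `prop` of the individuals in each group.
stratified_holdout <- function(groups, prop = 0.1) {
  idx_by_group <- split(seq_along(groups), groups)
  picked <- lapply(idx_by_group, function(idx) {
    n_out <- max(1, round(prop * length(idx)))
    idx[sample.int(length(idx), n_out)]  # safe even if a group has one member
  })
  sort(unlist(picked, use.names = FALSE))
}

# No set.seed() here: each round gets its own random validation set.
rounds <- replicate(5, stratified_holdout(grp), simplify = FALSE)

# Number of validation individuals taken from each group in round 1.
table(grp[rounds[[1]]])
```

Every round holds out the same number of individuals per group, but which
individuals are held out changes from round to round.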

That differs from the second case, which is based on supplementary
individuals used for predicting results. The way they were selected was
also different: they result from splitting the original sample to set aside
a stratified "testing sample" of X individuals, BUT using a non-random
(fixed) sample as defined by the set.seed() function.
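
As a rough sketch of that second case (base R only; the sizes n_ind and
n_sup are hypothetical, and the split is a simple one rather than a
stratified one), fixing the seed before the split gives the same
supplementary individuals every time:

```r
n_ind <- 100   # hypothetical number of individuals in the dataset
n_sup <- 20    # hypothetical size of the "testing sample" (X)

set.seed(2)
sup_first <- sample(n_ind, n_sup)

set.seed(2)
sup_again <- sample(n_ind, n_sup)

# Same seed, same call: the two draws contain the same individuals.
identical(sup_first, sup_again)

# Everyone not drawn as supplementary forms the training set.
train_id <- setdiff(seq_len(n_ind), sup_first)
```

Dropping the two set.seed(2) lines would give a different split on each
run, which is what one wants when the sampling is repeated.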

Later, I'll send you a new set of questions about clines, and about how
thoroughly they can be evaluated when modelling with DAPC.

Cheers,
M.

2014-06-19 11:00 GMT+01:00 <
adegenet-forum-request at lists.r-forge.r-project.org>:

>
> Today's Topics:
>
>    1. Re: set.seeds in DAPC (Caitlin Collins)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 Jun 2014 01:32:04 +0100
> From: Caitlin Collins <caitiecollins at gmail.com>
> To: Manuela <manuelacorreia2 at gmail.com>
> Cc: "adegenet-forum at lists.r-forge.r-project.org"
>         <adegenet-forum at lists.r-forge.r-project.org>
> Subject: Re: [adegenet-forum] set.seeds in DAPC
> Message-ID:
>         <
> CAMon0MDGDDZmFji6_T2McFtsqTzNmr7ENTE0Fj1rXiFYP_P_9g at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Manuela,
>
> Glad to hear I could help a bit!
>
> I should stress that our use of set.seed() in the tutorial has been mainly
> for the purpose of making the tutorial, as a document, consistent and
> identically reproducible. In an experimental context, however, eg. in the
> case of selecting supplementary individuals, if you are truly attempting to
> test a concept (for example, in validating a model), you would actually
> *want* random behaviour (ie. an effectively random sample). This is
> particularly the case if you are performing repeated sampling, as one often
> does with supplementary individuals. So be careful to only set the seed
> when you do NOT want a random sample; otherwise, just leave out set.seed()
> from the process and let the computer pick a sample at random.
>
> Best,
> Caitlin.
>
>
> On Thu, Jun 19, 2014 at 12:17 AM, Manuela <manuelacorreia2 at gmail.com>
> wrote:
>
> > Dear Caitlin,
> >
> >
> > Thank you for such a clear and knowledgeable response. It was quite
> > interesting to get a glimpse of how the adegenet team decided to use
> > set.seed() to obtain consistent results, as well as (that was just
> > brilliant!) to control lab.jitter.
> >
> > As you point out with the 3 examples, it's better to try several seeds
> > in order to find the best label positions for our dataset. And when we
> > reach the final stage of cross-validation, we ought to choose one seed
> > to ensure that the training set of supplementary individuals (no matter
> > the number (10%, 20%)) will always be made up of the same set of
> > individuals.
> >
> > Thank you. I've learnt so much with this long response.
> >
> > Cheers,
> > M.
> >
> >
> > 2014-06-18 19:48 GMT+01:00 Caitlin Collins <caitiecollins at gmail.com>:
> >
> > Hi,
> >>
> >> Glad to see you've been reading the tutorial in such detail!
> >>
> >> These are great questions, and the way you have asked them actually
> >> hints at the answer: set.seed() is not inherently linked to
> >> multivariate techniques or datasets, but rather to random number
> >> generation (more specifically, to getting *reproducible* results from
> >> "random" processes). This is probably why you have seen set.seed() come
> >> up in the context of bootstrap Monte Carlo procedures!
> >>
> >> Essentially, when R is asked to generate a "random" number, it actually
> >> generates a pseudo-random number by taking some input (the "seed") and
> >> generating an output that seems random. Without being given a seed, R
> >> creates one from the current time (together with the process ID) and
> >> uses that as its starting point. You would not get the same seed at a
> >> different time, so we find this adequate to call the process "random"
> >> number generation, BUT two generators started from exactly the same
> >> seed would produce exactly the same "random" numbers. (Note: I have
> >> glossed over a lot of really interesting things about this process, so
> >> if you want to know more about random number generation, please read on
> >> here:
> >> http://cran.r-project.org/web/packages/randtoolbox/vignettes/fullpres.pdf
> >> ).
> >>
> >> This potential problem with random number generation can occasionally
> >> be quite useful in cases where we want to run something that requires
> >> random number generation but where we would also like to get the same
> >> result each time.
> >> set.seed() is the way we control this. With set.seed(), the "seed" is
> >> used as the input to our random number generation (instead of the
> >> clock), which allows you to get *reproducible* "random" numbers.
> >>
> >> Try this example:
> >>
> >> rnorm(3)
> >> rnorm(3)
> >>
> >> set.seed(1)
> >> rnorm(3)
> >>
> >> # note: set.seed() fixes the whole sequence that follows, so to repeat a
> >> # result you need to reset the seed immediately before the call again.
> >> set.seed(1)
> >> rnorm(3)
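
A small follow-up sketch in base R (the variable names are mine, chosen for
illustration) suggests that the seed fixes the whole stream of numbers, not
just the next call: two calls of three draws after one set.seed(1) give the
same values as a single call of six draws after resetting the seed.

```r
set.seed(1)
first_three <- rnorm(3)
next_three  <- rnorm(3)   # continues the same stream, so it differs from first_three

set.seed(1)
all_six <- rnorm(6)       # one call of six draws from the same starting seed

# The two short calls together reproduce the single long call.
identical(c(first_three, next_three), all_six)
```

So the second rnorm(3) is still reproducible, as long as everything between
set.seed() and that call is run in the same order.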
> >>
> >> Neat! Having established this, we can now answer your questions about
> >> why we use set.seed() where we do in the DAPC tutorial.
> >>
> >> On page 20, we use it before creating a loading plot. This is just
> >> because we use the argument lab.jitter to move the labels around a bit.
> >> Jitter works by adding random noise, so we can control it with
> >> set.seed(). We have chosen to use set.seed(4) simply because it
> >> "randomly" put the labels in a nice enough place. Arguably, set.seed(6)
> >> would have done a better job (next time!), but it's a good thing we
> >> didn't use set.seed(2).
> >>
> >> If you would like, you can see for yourself:
> >>
> >> library(adegenet)  # provides H3N2, dapc() and loadingplot()
> >> data(H3N2)
> >> pop(H3N2) <- factor(H3N2$other$epid)
> >> dapc.flu <- dapc(H3N2, n.pca=30,n.da=10)
> >>
> >> set.seed(4)
> >> contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07,
> >> lab.jitter=1)
> >>
> >> set.seed(6)
> >> contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07,
> >> lab.jitter=1)
> >>
> >> set.seed(2)
> >> contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07,
> >> lab.jitter=1)
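
The same effect can be seen without any plotting. The sketch below uses
base R's jitter() as a stand-in for the label jittering (an assumption on
my part; I have not checked how loadingplot() applies lab.jitter
internally), with hypothetical label positions x:

```r
x <- 1:5                        # hypothetical label positions

set.seed(4)
pos_a <- jitter(x, factor = 1)

set.seed(4)
pos_b <- jitter(x, factor = 1)  # same seed: identical offsets

set.seed(6)
pos_c <- jitter(x, factor = 1)  # different seed: a different layout

identical(pos_a, pos_b)
```

Trying a few seeds and keeping the one with the most readable layout is
then only a cosmetic choice; it does not change the DAPC results.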
> >>
> >> Finally, we use set.seed(2) on page 39 to get a "random" sample of 20
> >> individuals (you were right about that) to serve as our "supplementary
> >> individuals" for that exercise. Here, the use of set.seed(2) just
> >> ensures that no matter how many times we edit and re-build that
> >> tutorial, we will always get the same set of 20 individuals, which is
> >> useful for consistency's sake.
> >>
> >> All in all, I apologise for the long response that was possibly less
> >> related to DAPC than you might have expected, but I hope that helped
> >> answer your question!
> >>
> >> Best,
> >> Caitlin.
> >>
> >>
> >>
> >>
> >> On Wed, Jun 18, 2014 at 6:51 PM, Manuela <manuelacorreia2 at gmail.com>
> >> wrote:
> >>
> >>> Hi there,
> >>>
> >>>
> >>> I'd like to understand the role of set.seed() and the criteria behind
> >>> its use in the two examples presented in the latest version of the
> >>> DAPC tutorial.
> >>>
> >>> I am used to seeing set.seed(N) in the context of significance testing
> >>> as well as bootstrap Monte Carlo procedures, but not within
> >>> multivariate techniques or even with datasets.
> >>>
> >>> On page 20 of the DAPC tutorial there is a set.seed(4) before getting
> >>> the loadingplot. There is another example on page 39, before splitting
> >>> the microbov dataset in two parts. And by the way, what is 20 in the
> >>> sample(e,20....)? 20 individuals picked at random from all microbov
> >>> populations?
> >>>
> >>>
> >>> So, I do have two questions.
> >>> The first one is: why use them here, in these particular examples?
> >>> The second one is: what criteria were behind the choice of the number
> >>> 4 in the former case, and the number 2 in the latter?
> >>>
> >>> How do I know which seed will be the best one for my dataset in case I
> >>> need to have the loadingplot?
> >>>
> >>> Thanks in advance,
> >>> M.
> >>>
> >>> _______________________________________________
> >>> adegenet-forum mailing list
> >>> adegenet-forum at lists.r-forge.r-project.org
> >>>
> >>>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
> >>>
> >>
> >>
> >
>
> ------------------------------
>
>
> End of adegenet-forum Digest, Vol 70, Issue 16
> **********************************************
>