[adegenet-forum] df2genind never stops

Thibaut Jombart thibautjombart at gmail.com
Wed Aug 3 18:41:36 CEST 2016

Hi Julien,

this may be pushing the limits of genind objects, as they really weren't
designed for more than a few hundred to a couple of thousand loci. As a
sanity check, I would still try converting a small subset to check all is
fine, e.g.:

my_genind=df2genind(tab[,1:1000], ploidy=2, sep="", NA.char = "N")

If you wrap this within a 'system.time' call, you'll get an approximate idea
of how long the conversion of 1,000 loci takes. Extrapolating linearly to the
full 868,146 loci will only give you a lower bound for the actual time to
expect for the entire dataset, as the algorithm does not scale linearly.
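
A minimal sketch of that timing check, assuming adegenet is installed. The table 'tab_demo' is synthetic stand-in data (it is not the real 'tab'), and the subset sizes are arbitrary; the df2genind arguments are the ones used above. Timing several subset sizes shows how quickly the cost grows, and the last line is the naive linear extrapolation, to be read as a lower bound only.

```r
library(adegenet)

## Synthetic stand-in for 'tab' (an assumption, not the real data):
## 31 individuals, two-character genotypes, "NN" for missing calls.
set.seed(1)
n_ind <- 31
n_loc <- 400
geno <- c("AA", "AC", "CC", "NN")
tab_demo <- matrix(sample(geno, n_ind * n_loc, replace = TRUE,
                          prob = c(0.45, 0.30, 0.20, 0.05)),
                   nrow = n_ind,
                   dimnames = list(paste0("Sample_", seq_len(n_ind)),
                                   paste0("rs", seq_len(n_loc))))

## Time the conversion for an increasing number of loci.
sizes <- c(100, 200, 400)
elapsed <- sapply(sizes, function(k) {
  system.time(
    df2genind(tab_demo[, seq_len(k)], ploidy = 2, sep = "", NA.char = "N")
  )[["elapsed"]]
})
timings <- data.frame(loci = sizes, seconds = elapsed)
print(timings)

## Naive linear extrapolation from the largest subset to the full dataset.
## Since the conversion does not scale linearly, treat this as a lower bound.
full_loci <- 868146
lower_bound_sec <- elapsed[length(elapsed)] * full_loci / sizes[length(sizes)]
lower_bound_sec
```

With the real data, replace 'tab_demo' by 'tab' and use larger subsets (e.g. 1,000, 2,000 and 4,000 loci) so that the timings are not dominated by noise.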

As for the further steps, this will not be straightforward. genlight and
genind objects cannot be combined as they are structurally very different:
the first codes SNPs as binary variables (where 0 and 1 have no specific
meaning other than differentiating two alleles), while the second stores data
as allele counts. As for repool, it does handle differences in alleles but
the loci have to be the same. If you are to combine the two datasets, the
best course of action would be to either:
- combine them beforehand (mapping everything against a reference?)
- combine them at the analysis stage, e.g. adding distances (possibly after
some scaling), or using 2-table methods in the case of factorial analysis
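
To illustrate the second option, here is a rough sketch in base R of adding two distance matrices after scaling. Everything in it is an assumption made for illustration: the two tables are synthetic allele counts, both describe the same individuals in the same order, and dividing each distance by its mean is just one possible scaling among others.

```r
## Two synthetic marker tables for the SAME 5 individuals, in the same order
## (allele counts coded 0/1/2; hypothetical data, not from either dataset).
set.seed(42)
ind <- paste0("ind", 1:5)
x1 <- matrix(sample(0:2, 5 * 20, replace = TRUE), nrow = 5,
             dimnames = list(ind, NULL))
x2 <- matrix(sample(0:2, 5 * 200, replace = TRUE), nrow = 5,
             dimnames = list(ind, NULL))

## One Euclidean distance matrix per dataset.
d1 <- dist(x1)
d2 <- dist(x2)

## Scale each distance by its mean so that the dataset with more loci
## does not dominate the sum, then add the two.
d_combined <- d1 / mean(as.vector(d1)) + d2 / mean(as.vector(d2))

## The combined distance can then feed a distance-based analysis,
## e.g. a principal coordinates analysis.
pco <- cmdscale(d_combined, k = 2)
pco
```

This only makes sense when both datasets were typed on the same individuals; otherwise the datasets have to be merged beforehand, as in the first option.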


Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology
Imperial College London
Twitter: @TeebzR <https://twitter.com/TeebzR>

On 2 August 2016 at 18:44, VARALDI JULIEN <Julien.Varaldi at univ-lyon1.fr> wrote:

> Dear adegenet users,
> I have two datasets that I would like to combine into a single one,
> ideally a genlight one. The first dataset is a VCF file from the 1000
> Genomes project. I can read it using the package vcfR and then convert it
> to a genlight object. This takes a while (a few minutes) but works fine:
> vcf=read.vcfR(vcf_file)
> my_genlight <- vcfR2genlight(x=vcf, n.cores = 8)
> The other dataset is a data frame containing genotypes obtained from a
> genome-wide SNP array. It contains the genotypes for 31 individuals at
> 868146 loci. The initial file is only 90Mb. I tried to use df2genind but
> without success (I stopped it after 20 minutes or so; it was still running
> without any apparent error). Here is what I did:
> >tab=read.table(my_data, head=T, sep=",")
> >head(tab)
> >loci=tab$rs_number
> >tab=t(tab)
> >tab=tab[-1,]
> >colnames(tab)=loci
> > tab[1:5, 1:4]
>          rs10458597 rs9629043 rs11510103 rs12565286
> Sample_4 "CC"       "CC"      "AA"       "CC"
> Sample_5 "CC"       "NN"      "AA"       "CC"
> Sample_6 "CC"       "CC"      "AA"       "CC"
> Sample_7 "CC"       "CC"      "AA"       "CC"
> Sample_8 "CC"       "CC"      "AA"       "CC"
> > dim(tab)
> [1]     31 868146
> my_genind=df2genind(tab, ploidy=2, sep="", NA.char = "N")
> This last command runs forever.
> I would appreciate any suggestion. The next step is to combine the two
> datasets, with the difficulty that one will be a genlight, the other a
> genind, AND the 1000 Genomes dataset contains many more loci than the SNP
> array dataset (does repool deal with this situation?). I would also
> appreciate any input on that.
> I am running R 3.3.1 on Mac OS X 10.11.4
> thanks a lot,
> cheers,
> Julien