Post by Admin on Dec 8, 2019 18:52:38 GMT
Samples and sequencing
We analyzed five DE* samples described previously (Weale et al. 2003), in the context of published worldwide Y-chromosomal sequences including Japanese D and many E Y chromosomes (Mallick et al. 2016). We also included four haplogroup D samples from Tibet (Xue et al. 2006), which were newly sequenced for this study; the Japanese and Tibetan D chromosomes represent the deepest known split within D, since Andamanese D chromosomes lie on the same branch as the Japanese (Mondal et al. 2017).
Sequencing of the Nigerian samples was carried out at the Wellcome Sanger Institute on the Illumina HiSeq X Ten platform (paired-end read length 150 bp) to a Y-chromosome mean coverage of ∼16×. Sequences were processed using biobambam version 2.0.79 to remove adapters, mark duplicates, and sort reads. bwa-mem version 0.7.16a was used to map the reads to the hs37d5 reference genome. We found that two pairs of individuals were likely duplicates (Figure S2), so one of each pair was removed, leaving three Nigerian individuals for further analysis. The four individuals from Tibet were sequenced in the same way to a Y-chromosome mean coverage of ∼18×.
For comparative data from other haplogroups, we obtained Y-chromosome BAM files for 173 males representing worldwide populations from the Simons Genome Diversity Project (Mallick et al. 2016).
Data analysis
Y-chromosome genotypes were called jointly from all 180 samples using FreeBayes v1.2.0 (Garrison and Marth 2012) with the arguments “--report-monomorphic” and “--ploidy 1”. Calling was restricted to 10.3 Mb of the Y chromosome previously determined to be accessible to short-read sequencing (Poznik et al. 2013). Sites with depth across all samples <1900 or >11,500 (corresponding to DP/2 or DP*3), or missing in >20% of the samples, were then filtered out. In individuals, alleles with DP <5 or GQ <30 were excluded, and if multiple alleles were observed at a position, the fraction of reads supporting the called allele was required to be >0.8.
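The site-level and individual-level filters above can be sketched in Python as follows. This is only an illustration of the stated thresholds, not the authors' code: the `Call` record and the function names are hypothetical, missingness is assumed to be counted on no-calls before the per-individual filters are applied, and "multiple alleles observed" is approximated as "not every read supports the called allele".

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds quoted in the text.
SITE_DP_MIN = 1900            # minimum depth summed across all samples
SITE_DP_MAX = 11_500          # maximum depth summed across all samples
MAX_MISSING_FRACTION = 0.20   # sites missing in >20% of samples are dropped
SAMPLE_DP_MIN = 5             # per-individual minimum depth
SAMPLE_GQ_MIN = 30            # per-individual minimum genotype quality
MIN_ALLELE_FRACTION = 0.8     # called allele must have >0.8 of the reads


@dataclass
class Call:
    """One haploid call for one sample at one site (hypothetical record)."""
    allele: Optional[str]   # None means no call was made
    depth: int              # reads covering the site in this sample (DP)
    gq: float               # genotype quality (GQ)
    allele_reads: int       # reads supporting the called allele


def keep_site(total_depth: int, calls: List[Call]) -> bool:
    """Site-level filter: total depth window and missingness."""
    if not calls:
        return False
    if total_depth < SITE_DP_MIN or total_depth > SITE_DP_MAX:
        return False
    missing = sum(1 for c in calls if c.allele is None)
    return missing / len(calls) <= MAX_MISSING_FRACTION


def keep_call(call: Call) -> bool:
    """Individual-level filter: DP, GQ and read support for the allele."""
    if call.allele is None:
        return False
    if call.depth < SAMPLE_DP_MIN or call.gq < SAMPLE_GQ_MIN:
        return False
    # Assumption: reads not supporting the called allele mean that
    # multiple alleles were observed at this position.
    if call.allele_reads < call.depth:
        return call.allele_reads / call.depth > MIN_ALLELE_FRACTION
    return True
```

A site would be retained only if `keep_site` passes, after which each sample's call at that site is kept or masked according to `keep_call`.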
Genome-wide genotypes from the Nigerian samples were called using BCFtools version 1.6 (bcftools mpileup -C50 -q30 -Q30 | bcftools call -c), then merged with data from ∼2500 people genotyped on the Affymetrix Human Origins array (Patterson et al. 2012; Lazaridis et al. 2016). Principal Component Analysis (PCA) using genome-wide SNPs was performed using EIGENSOFT v7.2.1 (Patterson et al. 2006) and plotted using R (R Core Team 2017).
We inferred a maximum likelihood phylogeny of Y chromosomes using RAxML v8.2.10 (Stamatakis 2014) with the arguments “-m ASC_GTRGAMMA” and “--asc-corr=stamatakis”, using only variable sites with QUAL ≥1, and selecting the tree with the best likelihood from 100 runs, then replicating the tree 1000 times for bootstrap values. The tree was plotted using Interactive Tree Of Life (iTOL) v3 (Letunic and Bork 2016) and annotated with haplogroup names assigned using yHaplo (Poznik 2016) from SNPs reported by the International Society of Genetic Genealogy (ISOGG v11.01).
The ages of the internal nodes in the tree were estimated using the ρ statistic (Forster et al. 1996), the standard approach for the Y chromosome. We defined the ancestral state of a site by assigning alleles as ancestral when they were monomorphic in the nine samples belonging to the A and B haplogroups in our data set. We then determined the age of a node as follows: for an ancestral node leading to two clades, we select one sample from each clade and divide the number of derived variants found in the first sample but absent from the second by the total number of sites having the ancestral state in both samples. We compare all possible pairs under a node and report the average divergence time in years, applying a point mutation rate of 0.76 × 10⁻⁹ mutations per site per year (Fu et al. 2014). We report 95% confidence intervals of the divergence times based on the 95% highest posterior density of the mutation rate estimate (0.67–0.86 × 10⁻⁹) (Fu et al. 2014).
This model assumes that mutations accumulate at similar rates in the different lineages, and thus expects all individuals in our data set to have comparable branch lengths from the AB root. However, we found considerable differences among individuals in the number of derived mutations from the root. This heterogeneity in the accumulation of mutations has been reported previously (Scozzari et al. 2014; Barbieri et al. 2016) and appears to be haplogroup-specific (Figure S3). In our divergence time estimates, we therefore calibrate all lineages to have identical branch length from the root, equal to the average branch length estimated from all individuals in our data set. We first calculated the average number of mutations that accumulated on the branches of all individuals in our data set and found 768.59 derived mutations on average from the root (corresponding to ∼100,000 years).
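As a rough illustration of the uncalibrated node-age calculation described above, the following Python sketch computes the pairwise ratio and converts its average into years. The data layout (a dict from site position to `True` for derived or `False` for ancestral) and the function names are hypothetical. The text does not say whether each cross-clade pair is counted in one direction or both; the sketch assumes both, and it assumes that only sites called in both samples are compared.

```python
from itertools import product
from statistics import mean

# Mutation rate and its 95% interval, as quoted in the text (per site per year).
MU = 0.76e-9
MU_LOW, MU_HIGH = 0.67e-9, 0.86e-9


def pairwise_divergence(first, second):
    """Derived variants in `first` but absent from `second`, divided by
    the number of sites that are ancestral in both samples.

    Each sample maps site position -> True (derived) / False (ancestral).
    Assumption: only sites called in both samples are compared.
    """
    shared = first.keys() & second.keys()
    derived_private = sum(1 for s in shared if first[s] and not second[s])
    ancestral_both = sum(1 for s in shared if not first[s] and not second[s])
    if ancestral_both == 0:
        raise ValueError("no sites are ancestral in both samples")
    return derived_private / ancestral_both


def node_age(clade_a, clade_b):
    """Average divergence over all cross-clade pairs, converted to years.

    Returns (point estimate, lower bound, upper bound). The bounds come
    only from the interval on the mutation rate, as in the text.
    Assumption: every pair is evaluated in both directions.
    """
    pairs = list(product(clade_a, clade_b)) + list(product(clade_b, clade_a))
    rho = mean(pairwise_divergence(x, y) for x, y in pairs)
    return rho / MU, rho / MU_HIGH, rho / MU_LOW
```

Because the interval reflects only uncertainty in the mutation rate, a faster rate gives the lower bound on the age and a slower rate gives the upper bound.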
We then derived a calibration coefficient α for each individual by dividing 768.59 by the number of derived mutations that individual has accumulated from the root, normalized per 10,000,000 bp. To calibrate the branch lengths between any two samples when calculating split times, we multiply α by the number of derived variants found in the first sample but absent from the second.
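The calibration step can be sketched as below. This is an interpretation rather than the authors' implementation: the function names are hypothetical, and "normalized (in 10,000,000 bp)" is assumed to mean that an individual's count of derived mutations is rescaled to a per-10-Mb figure using the number of sites called in that individual. The last function checks that 768.59 mutations per 10 Mb at 0.76 × 10⁻⁹ mutations per site per year is consistent with the ∼100,000 years quoted above.

```python
MEAN_ROOT_MUTATIONS = 768.59     # average derived mutations from the root
NORMALIZATION_BP = 10_000_000    # counts are expressed per 10 Mb
MU = 0.76e-9                     # mutations per site per year


def calibration_coefficient(derived_from_root, called_sites):
    """alpha = average root-to-tip count / this individual's normalized count.

    Assumption: normalization rescales the raw count to a per-10-Mb value
    using the number of sites called in the individual.
    """
    normalized = derived_from_root * NORMALIZATION_BP / called_sites
    return MEAN_ROOT_MUTATIONS / normalized


def calibrated_private_count(alpha, derived_private):
    """Scale the count of derived variants private to one sample by alpha."""
    return alpha * derived_private


def mean_root_age_years():
    """Time implied by the average root-to-tip count at the quoted rate."""
    return MEAN_ROOT_MUTATIONS / NORMALIZATION_BP / MU
```

An individual with more derived mutations than average gets α < 1, which shortens its branch, and one with fewer gets α > 1, so all lineages end up at the same distance from the root.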