|
Post by Admin on Sept 11, 2020 20:41:58 GMT
Results Whole-Genome Sequencing Whole-genome sequencing from 15 African hunter-gatherers was performed by Complete Genomics (Drmanac et al., 2010). We obtained ∼60-fold coverage per genome, resulting in high-confidence calls of variants from 95% of each genome (Table 1 and Supplemental Information available online). These genomes were compared to publicly available high-coverage genomes sequenced and analyzed using the same technology and software in a diverse panel of 69 individuals (including 4 Luhya from Kenya, 4 Maasai from Kenya, 10 Yoruba from Nigeria, and 51 non-Africans; Table S1), allowing the genomes of African hunter-gatherers to be placed within a global context. Table 1 Summary Statistics for Whole-Genome Sequences Statistic Pygmy Hadza Sandawe Number of individuals 5 5 5 Median coverage per genome 63.4x ± 3.4× 67.3× ± 17.8× 61.6× ± 6.2× Called genome fraction 94.8% ± 0.7% 96.1% ± 0.8% 95.1% ± 1.0% Ti/Tv ratio 2.16 ± 0.01 2.14 ± 0.01 2.16 ± 0.01 Variant SNPs per genome 3.75 × 106 ± 0.04 × 106 3.54 × 106 ± 0.06 × 106 3.53 × 106 ± 0.05 × 106 Homozygous variant SNP fraction 34.04% ± 0.39% 37.7% ± 2.7% 33.2% ± 0.60% Exonic SNP fraction 0.621% ± 0.007% 0.612% ± 0.008% 0.617% ± 0.011% Insertions 1.26 × 105 ± 0.02 × 105 1.23 × 105 ± 0.03 × 105 1.19 × 105 ± 0.3 × 105 Deletions 1.49 × 105 ± 0.34 × 105 1.43 × 105 ± 0.03 × 105 1.39 × 105 ± 0.04 × 105 Coding indel fraction 0.116% ± 0.003% 0.115% ± 0.006% 0.118% ± 0.005% Mean values ± one standard deviation are listed for each population. See also Figure S1. After applying stringent quality control filters (Supplemental Information and Figure S1), we identified 13,407,517 variants (SNPs, insertions, and deletions that differ from the human genome reference sequence, GRCh37/hg19) in the 15 hunter-gatherer genomes. In addition, we sequenced two Hadza genomes as technical replicates and found ∼28,000 discordant calls per genome pair, corresponding to less than one error per 100 kb, consistent with previous estimates (Drmanac et al., 2010; Lam et al., 2012). We also assessed sequence accuracy by comparing calls between whole-genome sequencing and the Illumina1M-Duo BeadChip genotyping array. More than 99.96% of calls were identical between platforms, and a large proportion (>34%) of highly discordant SNPs involved known triallelic loci. Figure S1 Quality Control, Related to Table 1 The majority of variants are SNPs (90.9%), and we observe a greater number of deletions than insertions relative to the human reference genome (Table 1). We operationally define novel variants as those that are absent from build 131 of dbSNP. A considerable proportion of hunter-gatherer variants are absent from dbSNP 131 (41.1%, or 5,516,366 variants). Cross-referencing with the October 2011 release of the 1000 Genomes Project reveals that 3,062,541 variants remain absent, and thus our sequence data substantially expand the catalog of human genetic variation. Approximately half (50.8%) of the variants that we identify are shared among multiple hunter-gatherer populations, and genomic regions that contain a large number of variants in one population also contain a large number of variants in other populations (Figures 1B and S2). However, despite shared language and geographic proximity, we do not observe an excess of shared variants between Hadza and Sandawe Khoesan speakers, consistent with an ancient divergence and diverse mtDNA haplogroups (Table S2). Figure S2 Variants from Whole-Genome Sequencing, Related to Figure 1 (A) Variants per sequenced genome for different populations. Non-hunter-gatherer populations analyzed are Yoruba (YRI), Asian (ASN, i.e., CHB and JPT), and European (CEU). “Novel” refers to variants absent from dbSNP131. (B) Power-law parameters for the number of variants observed in each sequenced genome. (C, E, G, and I) Histograms showing the number of variants per Mb. Each histogram contains 150 bins. (D, F, H, and J) Genomic distribution of variants per Mb. (C) and (D) refer to the pooled set of all 15 hunter-gatherer genomes.
|
|
|
Post by Admin on Sept 12, 2020 0:01:48 GMT
Functional Classification of Variants Of the 13.4 million variants, 3.69 million are intronic, 37,797 are nonsynonymous, and 35,747 are synonymous variants. Moreover, 674,808 variants were located in DNase I footprints (Rosenbloom et al., 2012), and 149,072 variants occurred in cis-regulatory motifs within footprints. At the individual level, each hunter-gatherer genome contains ∼11,500 nonsynonymous variants, 12,400 synonymous variants, and 256,000 variants in DNase I footprints. Using PolyPhen-2 classifications (Adzhubei et al., 2010), we find that ∼60% of amino-acid-changing variants identified in the hunter-gatherer genomes are classified as benign, ∼25% as possibly damaging, and ∼15% as probably damaging. In addition, benign amino acid changes are >2.7 times more likely to be found in all three hunter-gatherer populations than possibly damaging or probably damaging changes (p < 10−10, Z test). Figure S3 Signatures of Purifying Selection in Geographically Diverse Populations with Different Subsistence Patterns, Related to Table 2 Comparing African and non-African genomes, we confirm that non-African populations contain a slight excess in the proportion of probably damaging nonsynonymous variants (Figure S3), consistent with population bottlenecks due to migration out of Africa and with prior studies (Lohmueller et al., 2008). By contrast, proportions of synonymous and nonsynonymous SNPs and the predicted number of probably damaging sites are similar for African hunter-gatherers and other African populations (Figure S3). Thus, at this broad genomic scale, natural selection appears to shape the genomes of hunter-gatherers similarly to the genomes of other African populations. However, analyses of larger sample sizes will be required to distinguish subtle differences that may exist. Population Genetics of Functionally Important Regions of the Human Genome The proportion of polymorphic sites, as estimated by θ, are lowest for exons in all three populations, suggesting that natural selection acts on coding sequences to reduce genetic variation (Table 2). θ is lower for introns than intergenic regions, a finding that is consistent with both background selection and positive selective sweeps. In all three hunter-gatherer populations, the mean value of Tajima's D is lower for genic regions. This observation reflects allele frequency distributions that are shifted toward rare alleles in genic regions, a pattern that can be explained by both selective sweeps and purifying selection. Consistent with findings from HapMap Phase II data (Barreiro et al., 2008), we find that mean values of FST, a measure of allele frequency differentiation between populations, are lower for exons and higher for introns and intergenic regions. We also find that 5′ and 3′ untranslated regions have intermediate derived allele frequencies (DAF), θ, Tajima's D, and FST, consistent with evolutionary constraint on regulatory regions. Furthermore, consistent with purifying selection against slightly deleterious mutations, the distribution of derived allele frequencies for exon SNPs are skewed toward low-frequency alleles relative to intergenic SNPs, and the magnitude of this shift was similar for African and non-African populations (Figure S3 and Supplemental Information). Table 2 Population Genetic Statistics for Different Types of Variants Statistic Intergenic 10 kb Upstream 5′ UTR Exon Intron 3′ UTR 10 kb Downstream Overall Within Population θper base pair (Pygmy) 0.001132 0.000783 0.000722 0.000503 0.000860 0.000800 0.000845 0.000966 θper base pair (Hadza) 0.000921 0.000630 0.000564 0.000390 0.000691 0.000639 0.000681 0.000782 θper base pair (Sandawe) 0.001056 0.000728 0.000663 0.000463 0.000797 0.000743 0.000789 0.000899 Mean DAF (Pygmy) 0.2785 0.2756 0.2701 0.2527 0.2728 0.2648 0.2748 0.2760 Mean DAF (Hadza) 0.3239 0.3204 0.3092 0.3015 0.3191 0.3198 0.3222 0.3219 Mean DAF (Sandawe) 0.2912 0.2883 0.2791 0.2700 0.2855 0.2814 0.2889 0.2888 Tajima's D (Pygmy) −0.4124 −0.4350 −0.4977 −0.5955 −0.4411 −0.5010 −0.4328 -0.4272 Tajima's D (Hadza) −0.0175 −0.0145 −0.0347 −0.0918 −0.0126 −0.0300 −0.0129 -0.0148 Tajima's D (Sandawe) −0.3285 −0.3587 −0.4549 −0.5030 −0.3627 −0.4172 −0.3567 -0.3453 Between Population FST (Pygmy, Hadza) 0.0727 0.0723 0.0713 0.0681 0.0727 0.0714 0.0722 0.0726 FST (Pygmy, Sandawe) 0.0485 0.0482 0.0465 0.0448 0.0483 0.0473 0.0478 0.0483 FST (Hadza, Sandawe) 0.0659 0.0654 0.0643 0.0627 0.0653 0.0647 0.0625 0.0657 Statistics are for fully called autosomal variants. Mean DAF refers to mean derived allele frequency after correcting for CpG hypermutability and biased gene conversion. See also Table S4 and Figures S3, S6, and S5. In addition, we investigated functional constraint for different site types by calculating the neutrality index (NI), which contrasts the levels of polymorphism and divergence of a putatively neutral class of sites to a class of sites that may be subject to selection (Rand and Kann, 1996). We used intergenic sites that were at least 50 kb from protein-coding genes as our neutral class and calculated the NI with respect to nonsynonymous, synonymous, intronic, and DNase I footprint sites. Weak purifying selection was found for all of these sites (as defined by a NI significantly greater than 1; Table S4). Strikingly, sites classified as DNase I footprints were more constrained (NI = 1.302, 95% CI = 1.298–1.306) than nonsynonymous sites (NI = 1.153, 95% CI = 1.140–1.167), consistent with the hypothesis that DNase I footprints are enriched for functionally important regulatory variants. Demographic History of African Hunter-Gatherers Principal component analysis (PCA) reveals both continental and population-specific patterns of genetic variation. PC1 distinguishes Africans from non-Africans (with East African populations being closer to non-Africans), and PC2 differentiates Asian and European populations (Figure 1D). The Hadza are differentiated by PC3, and subsequent principal components differentiate Pygmies (PC4) and Sandawe (PC5) from other African populations (Figure S4). Figure S4 PCA Plots, Related to Figure 1 To assess shared ancestry between diverse African hunter-gatherer populations, we examined the percentage of shared variants between Pygmy, Hadza, and Sandawe genomes and a previously sequenced San genome (Schuster et al., 2010). The percentage of San variants that are shared with one other hunter-gatherer population is similar for Pygmy-, Hadza-, and Sandawe-specific variants (5.6%–5.7%), suggesting that the San diverged before other hunter-gatherer populations. However, the D test of admixture (Green et al., 2010) indicates that the San genome shares more derived alleles with Pygmies than with the Hadza or Sandawe (p < 0.01; Table S3). This result suggests that the ancestors of the Tanzanian click-speakers (the Hadza and Sandawe) may have diverged more recently in the past than the Pygmy/San split. However, additional possibilities involve gene flow between the ancestors of Pygmies and the San or stochastic loss of shared derived alleles among the ancestors of the Hadza and Sandawe. A neighbor joining tree indicates that Pygmies diverged before the Hadza and Sandawe split (Figure 1F), and lack of monophyly among Pygmy genomes reveals population substructure involving Baka, Bakola, and Bedzan individuals. Hadza and Sandawe genomes are nested within a cluster that also includes the Maasai, possibly due to recent shared gene flow with neighboring East African populations. With the exception of Pygmies, clustering patterns reflect shared language families: Khoesan-speaking Hadza and Sandawe individuals cluster together, as do Niger-Kordofanian-speaking Yoruba and Luhya individuals. We also observe differences in the number and cumulative size of long runs of homozygosity in each population. Of the 15 hunter-gatherer genomes analyzed in this paper, the five genomes with the most runs of homozygosity all belong to the Hadza (Figure S5). Though some of these differences may be due to a population bottleneck in the Hadza (Henn et al., 2011), an additional cause may be cryptic inbreeding (Pemberton et al., 2012), as indicated by the large variance in cumulative size of runs of homozygosity within the Hadza (Figure S5). Indeed, cumulative runs of homozygosity in three Hadza genomes are more than double the size of other hunter-gatherers analyzed in this paper (Figure S5). Figure S5 Runs of Homozygosity, Related to Table 2 Consistent with an historic bottleneck and/or inbreeding in the Hadza, we find that the proportion of polymorphic sites, as quantified by θ, is lowest for the Hadza and highest for Pygmies (Table 2). Depending on mutation rates, this translates to effective population sizes of 11,300–25,700 (Pygmy), 9,200–20,900 (Hadza), and 10,600–24,000 individuals (Sandawe). Genome-wide estimates of Tajima's D are lower for Pygmies and Sandawe compared to the Hadza (mean values of Tajima's D are −0.4273 for Pygmies, −0.0148 for Hadza, and −0.3453 for Sandawe). These results are consistent with the observation that low-frequency-derived alleles (DAF ≤ 0.1) are overrepresented in Pygmy and Sandawe populations and underrepresented in the Hadza (Figure S6; p < 0.0001, χ2 tests of independence). Together, these results suggest that Pygmy and Sandawe populations have recently expanded in size, whereas the Hadza population has recently decreased in size.
|
|
|
Post by Admin on Sept 12, 2020 4:34:00 GMT
Figure S6 Derived Allele Frequency Distributions for Hunter-Gatherer Populations, Related to Table 2 Hunter-Gatherer Genomes Possess Signatures of Archaic Admixture Gene flow between anatomically modern humans and archaic species has been described for European, Melanesian, and African populations (Hammer et al., 2011; Plagnol and Wall, 2006; Wall et al., 2009; Reich et al., 2010). To detect putatively introgressed regions in the Pygmy, Hadza, and Sandawe hunter-gatherer populations, we modified the summary statistic S∗, which searches for clusters of population-specific SNPs in near complete LD, to be suitable for genome-scale analyses. S∗ has previously been used to detect archaic admixture in individuals of European and African descent (Hammer et al., 2011; Plagnol and Wall, 2006; Wall et al., 2009). We first verified that our implementation of S∗ could accurately identify introgressed regions by performing extensive coalescent simulations (Supplemental Information and Figure S7) and analyzing publicly available whole-genome sequences from nine CEPH and four Tuscan individuals sequenced by Complete Genomics. We calculated S∗ in 50 kb sliding windows and identified the top 350 regions (top ∼0.4%) in each population with unusually large values of S∗ as high-confidence candidates for introgression. TMRCA distributions for these regions are significantly larger than the distribution for all loci (p < 10−16), consistent with the hypothesis that they are enriched for introgression (Figure 2A). Moreover, non-African genomic regions with high values of S∗ were significantly enriched for Neanderthal-specific SNPs (p < 10−16, Figure 3B). Thus, S∗ can robustly detect genomic regions inherited from archaic ancestors. Figure S7S∗ Statistics Are Robust at Detecting Introgression, Related to Figures 2 and 3 Figure 3 Characteristics of S∗ in Real and Simulated Data We next used S∗ to identify putatively introgressed regions in the African hunter-gatherer samples. In all three African hunter-gatherer samples, we found evidence of introgression from at least one archaic population. Strikingly, the median TMRCA for putatively introgressed haplotypes in the hunter-gatherer samples is similar to the median TMRCA for introgressed haplotypes in Europeans (1.2–1.3 Mya versus 1.1–1.2 Mya, respectively; Figure 2A), suggesting that the archaic African population diverged from anatomically modern humans in the same time frame as Neanderthals (simulations suggest that relative time of split with archaic populations can be recovered via TMRCA; Figure 3C). Additionally, we performed a STRUCTURE analysis of the putatively introgressed regions and of 350 random regions. If candidate regions identified by unusually large values of S∗ are enriched for genuine introgressed sequence, then we would expect STRUCTURE to identify two populations, as introgressed regions primarily consist of individuals carrying one archaic and one anatomically modern haplotype. In contrast, we would expect STRUCTURE to identify three populations in the randomly selected regions corresponding to the Pygmy, Hadza, and Sandawe populations. Indeed, this is precisely what we find (Figures 2B and 2C), further demonstrating that top-ranked S∗ regions are enriched for putatively introgressed sequence. There is significant overlap (p < 10−16) among putatively introgressed regions in the three hunter-gatherer populations, consistent with either gene flow among the hunter-gatherer populations or introgression events that predate population splitting of these populations. In addition, the TMRCA of introgressed regions shared between all three populations is significantly older compared to introgressed regions observed in only one population (Wilcoxon rank-sum test, p = 2.2 × 10−5; Figure 2D), consistent with an introgression event predating the divergence of these populations. In contrast, we observed few introgressed regions that overlap with those observed outside of Africa. One exception is a 2 Mb window on chromosome 8 (Figure 2E; chr8:3–5 Mb) that contains introgressed regions in all global populations. However, we note that because the chr8:3–5 Mb region is enriched for CNVs, it may be more prone to false positives (Supplemental Information).
|
|
|
Post by Admin on Sept 12, 2020 22:17:41 GMT
Identification of Genomic Regions with Extreme Times to Most Recent Common Ancestry We scanned the genomes of African hunter-gatherers to identify regions with extremely long or short coalescence times, which are likely to be enriched for targets of natural selection. To this end, we calculated the time to most recent common ancestor (TMRCA) for 50 kb sliding windows in the 15 hunter-gatherer genomes. The mean autosomal TMRCA across all windows is 796 kya. As expected, windows spanning the HLA region, which exhibits strong signatures of balancing selection (Barreiro and Quintana-Murci, 2010), are the most ancient in the genome, with a maximum TMRCA of 5.1 million years for a 50 kb window encompassing HLA-G. The oldest genic regions outside of the HLA locus include NSUN4, HCG9, MYO3A, and APOBEC4. Conversely, we also found multiple genomic regions with short TMRCA times (<10 kya), including multiple tripartite motif-containing genes (TRIM53P, TRIM64, and TRIM64B), the SPAG11A gene that is involved in sperm maturation, and NCF1, which is a subunit of neutrophil NADPH oxidase. Evidence of Local Adaptation in African Hunter-Gatherer Populations To identify signatures of geographically restricted adaptation, we used two complimentary approaches. First, we identified genomic regions in each hunter-gatherer population that differ the most from agricultural (Yoruba) and pastoral (Maasai) populations. This involved calculating locus-specific branch lengths (LSBL; Shriver et al., 2004) for each polymorphic site in the genome and using sliding windows to identify 100 kb regions that are enriched for LSBL outliers (sites with the highest values of LSBL statistics). We then focused on the top 268 (1%) most divergent 100 kb windows in each population (Figures 4A–4C and Table S5). Many highly divergent 100 kb windows do not contain any genes (101/268 Pygmy, 105/268 Hadza, and 119/268 Sandawe windows), and these windows may contain regulatory sequences that are important targets of adaptation. Divergent 100 kb windows shared between pairs of populations include regions containing olfactory receptors (Pygmy and Hadza), major histocompatibility complex genes (Hadza and Sandawe), a gene that regulates lipid content in human breast milk (BTN1A1, Pygmy and Sandawe), and a gene involved in vascular injury repair (FLNB, Pygmy and Sandawe). We find only a single 100 kb window that is highly divergent in all three populations (located 41 kb downstream of PRDM5, a gene involved in bone development; Galli et al., 2012), suggesting that each African hunter-gatherer population has been subject to different local selective pressures. Figure 4 Divergent Genomic Regions between Hunter-Gatherers and Non-Hunter-Gatherers and Genomic Distributions of Ancestry Informative Markers Genes in the top Pygmy-divergent regions of potential interest based on function include TRHR (thyrotropin-releasing hormone receptor involved in regulation of thyroid function), IFIH1 (involved in viral immunity), HESX1 (anterior pituitary development), CYBRD1 (iron absorption), UGT2B10 (breakdown of toxic endobiotic and xenobiotic compounds), and RGS3 (a GTPase-activating protein gene that has been identified in other scans of Pygmy-specific selection; Jarvis et al., 2012; Pickrell et al., 2009). Additionally, multiple genes in Pygmy-divergent regions are involved in spermatogenesis and fertility (ODF3, FSHR, and SLC9A10). Functionally interesting genes in Hadza-divergent regions include IL18R1/IL18RAP (interleukin 18 receptor and accessory protein), CYCS (cytochrome c), CNR2 (cannabinoid receptor), and VWF (a blood glycoprotein involved in hemostasis and wound healing). Functionally interesting genes in Sandawe-divergent regions include ALDH2 (aldehyde dehydrogenase), EGLN1 (cellular response to hypoxia), HLA-DOB (MHC class II protein), and ZPBP (zona-pellucida-binding protein). To determine shared functional characteristics of genes within 200 kb of genomic regions enriched for high-LSBL SNPs, we performed pathway analysis using DAVID, which includes KEGG and PANTHER databases (Huang et al., 2009), for each hunter-gatherer population. These analyses revealed that immune-related pathways were overrepresented near Hadza-divergent and Sandawe-divergent regions (p < 0.05 after Bonferroni corrections), but not for Pygmy-divergent regions. However, these signals in the Hadza and Sandawe were almost entirely driven by the HLA region at 6p21 (Table S6). Additionally, both Pygmy- and Hadza-divergent regions are significantly enriched for genes involved in olfactory transduction. Pathway analysis also pointed toward overrepresented retinol and porphyrin/chlorophyl metabolism pathways in Pygmies (p < 0.02 after Bonferroni corrections) and a taste transduction pathway in the Sandawe (p < 3.3 × 10−5 after Bonferroni corrections). The observation that highly divergent regions in three different hunter-gatherer populations are enriched for genes involved in smell or taste suggests the potential evolutionary importance of these loci with respect to local dietary adaptations. Second, we identified high-frequency population-specific variants in hunter-gatherer genomes, referred to as ancestry informative markers (AIMs, defined here as sites with variant allele frequencies >50% in a single hunter-gatherer sample and absent from the other two hunter-gatherer samples and dbSNP131). We identified <15,000 variants per population that meet these stringent criteria (Figures 4D–4F). These AIMs are not randomly distributed throughout hunter-gatherer genomes (p < 10−5 for each population); instead, we observe multiple clusters of AIMs in each population. AIM clusters may result from population-specific adaptation as well as demographic factors (Bürger and Akerman, 2011; Falush et al., 2003). The number of AIMs is greatest for the Hadza (Figures 4D–4F), a pattern that is consistent with population bottlenecks and greater genetic isolation relative to the other hunter-gatherer populations. Pathway analyses of genes within 50 kb of AIMs suggest that there is enrichment for starch and sucrose metabolism in Pygmy genomes (p = 0.002, p = 0.247 after Bonferroni corrections) and enrichment for melanogenesis in Sandawe genomes (p = 0.0022, p = 0.168 after Bonferroni corrections, Table S6). Identification of Candidate Genes for Short Pygmy Stature Analyses of LSBL and AIM clusters in the Pygmies are particularly informative for identifying variants that play a role in population-specific traits, including short stature. The largest Pygmy AIM cluster spans 170 kb at 3p14.3 and contains a high (70%) frequency haplotype in our sample of 10 genomes, with 44 Pygmy-AIMs in 100% linkage disequilibrium (chr3:57,211,368–57,383,846; Figure 5A). Four genes lie within this cluster: HESX1 (which encodes a homeobox-containing transcriptional repressor that plays a critical role in development of the anterior pituitary, the site of growth hormone synthesis and secretion), APPL1 (which is involved in crosstalk between adiponectin and insulin-signaling pathways), ASB14 (which encodes a SOCS box protein), and the sperm motility gene DNAH12. Mutations within HESX1 in humans cause septo-optic dysplasia, combined pituitary hormone disease, and/or isolated growth hormone deficiency, resulting in short stature (Dattani, 2005). Although HESX1 is expressed early in embryonic development in mouse and plays a critical role in forebrain and pituitary development, it continues to be expressed in adult pituitary in humans, where it may play a role in the maintenance of anterior pituitary cell types and function (Mantovani et al., 2006). Analysis of the October 2011 release of the 1000 Genomes database (1000 Genomes Project Consortium, 2010) indicates that the 44 SNPs encompassing the Pygmy 3p14.3 AIM haplotype are in complete LD and at very low frequency (<5%) in Yoruba and Luhya populations and are absent from non-African populations. Additionally, the Pygmy 3p14.3 AIM haplotype includes a previously identified nonsynonymous SNP within HESX1 (rs9878928; Asn125Ser), which has been shown to be associated with pituitary developmental defects and growth hormone deficiency (Brickman et al., 2001; Gat-Yablonski et al., 2009). This nonsynonymous SNP is at moderate frequency (Pickrell et al., 2009) but on distinct haplotype backgrounds in other African populations. Interestingly, this Pygmy AIM cluster lies within a 15 Mb region that shows high levels of differentiation between Pygmies and neighboring Bantu populations and is associated with height in Pygmies (Jarvis et al., 2012). Figure 5 Pygmy AIMs, Allele Frequencies, and Height Associations An additional cluster of 11 Pygmy AIMs was identified at 3p11.2 (chr3:87,657,988–87,681,226; Figure 5A). The closest gene, located ∼330 kb upstream, is POU1F1 (also known as PIT1), which encodes a pituitary-specific transcription factor that plays an important role in anterior pituitary development and regulation of growth hormone gene expression (Hunsaker et al., 2012), and mutations within this gene cause growth hormone deficiency and short stature (Kiess et al., 2011). These SNPs, referred to as the Pygmy 3p11.2 AIM cluster, are at >60% frequency in sequenced Pygmies and at <12% frequency in African and non-African populations from the 1000 Genomes Project. Both the 3p14.3 (HESX1) and 3p11.2 (POU1F1) AIM clusters are embedded within 4.2 and 3.3 Mb blocks, respectively, where every 100 kb window contains an excess of LSBL outliers (the largest continuous runs of increased LSBL in Pygmy genomes; Figure 5A). To examine the geographic distribution and frequency of Pygmy AIMs, we genotyped a panel of 57 Pygmies (Baka, Bakola, and Bedzan ancestry) and 38 neighboring Bantu individuals (Tikar and Ngumba ancestry), as well as 5 Mbuti Pygmies and 5 Biaka Pygmies, at 15 AIM SNPs on chromosome 3 (Figure 5 and Table S7). We find that the HESX1-containing AIM cluster at 3p14.3 is common (>5% frequency) in Baka, Bakola, and Bedzan Pygmies from Cameroon and Mbuti Pygmies from the Democratic Republic of the Congo and is absent from Cameroonian Bantu populations and Biaka Pygmies from the Central African Republic. Among Baka, Bakola, Bedzan Pygmies, and a single Mbuti Pygmy, the 3p14.3 AIM haplotype extends 172 kb (complete LD), but in three other Mbuti Pygmies, this haplotype is broken down (with 100% LD extending 97 kb from chr3:57273811 to chr3:57370649). The presence of a highly divergent AIM haplotype in multiple Pygmy populations suggests that this haplotype predates the divergence of Eastern and Western Pygmies. Additionally, we tested for genetic associations between height and 15 SNPs, located in the 3p21.31, 3p14.3, and 3p.11.AIM clusters, in a panel of Cameroonian Pygmy and Bantu individuals (Figure 5A). Treating sex as a covariate, we find significant associations between height and two SNPs located at 48.7 Mb and three SNPs located at 87.6 Mb, ∼330 kb from POU1F1 (p < 0.03; Figure 5B and Table S7). Two of these SNPs remain significant after conservative Bonferroni corrections (chr3:48738472 and chr3:87681226). In addition, we find a significant association between the Pygmy HESX1 AIM haplotype at 3p14.3 and shorter height in males (p < 0.02; Figure 5B). When significant associations are found, the alleles associated with shorter height are all Pygmy AIM variants.
|
|
|
Post by Admin on Sept 13, 2020 21:01:21 GMT
Discussion The deluge of data from next-generation sequencing has begun, with massively large data sets of low-coverage whole-genome sequences (1000 Genomes Project Consortium, 2010) and high-coverage exome sequences (Tennessen et al., 2012) being reported in thousands of individuals. Here, we described high-coverage whole-genome sequencing of individuals from three African hunter-gatherer populations, who harbor a large amount of previously unknown genetic diversity that is inaccessible by studying individuals of non-African ancestry or by focusing only on protein-coding regions. Despite evidence of inbreeding and a population bottleneck in the Hadza, high levels of genetic diversity are maintained in all three hunter-gatherer populations. Additionally, we found significant genetic divergence among the three African hunter-gatherer populations, including between the Hadza and Sandawe, who are geographically close (∼150 km apart) and have languages that contain click consonants, demonstrating the continued need to broadly sample human populations in order to comprehensively assess the spectrum of human genomic diversity.
We find evidence of selective constraint near genes, and these patterns are replicated in each hunter-gatherer population. We also observe signatures of local adaptation in Pygmy, Hadza, and Sandawe populations, including high locus-specific branch lengths for genes involved in taste/olfactory perception, pituitary development, reproduction, and immune function. These genetic differences reflect differences in local diets, pathogen pressures, and environments. Thus, Pygmies, Hadza, and Sandawe have continued to adapt to local conditions while sustaining their own unique cultures of hunting and gathering.
Evidence of Archaic Introgression A striking finding in our data set is that compelling evidence exists that extant hunter-gatherer genomes contain introgressed archaic sequence, consistent with previous studies (Hammer et al., 2011; Plagnol and Wall, 2006; Reich et al., 2010; Shimada et al., 2007; Wall et al., 2009). We note that unambiguous evidence of introgression is difficult to obtain in the absence of an archaic reference sequence, which currently does not exist and may never be feasible given the rapid decay of fossils in Africa. Although we carefully filtered our data set in an attempt to analyze only high-quality sequences (Supplementary Information), it is possible that unrecognized structural variants or other alignment errors could generate a spurious signature similar to introgression. Encouragingly, we did not see an enrichment of structural variation calls in our candidate introgression regions.
Additionally, through extensive simulations and analysis of European whole-genome sequences (Supplementary Information), we have demonstrated that the signatures of introgression that we observed are unlikely to be entirely accounted for due to other aspects of population demographic history, natural selection, or sequencing errors. Moreover, we did not find strong evidence that introgressed regions were clustered in the genome more often than expected by chance (p > 0.05; Supplemental Information). Nor did we find significant evidence that introgressed regions were enriched in genic regions (p > 0.05); rather, genic regions were significantly depleted for introgression in several populations (Supplemental Information). Therefore, the simplest interpretation of these data is that introgressed regions in extant human populations represent neutrally evolving vestiges of archaic sequences. In short, we find that low levels of introgression from an unknown archaic population or populations occurred in the three African hunter-gatherer samples examined, consistent with findings of archaic admixture in non-Africans (Reich et al., 2010).
Short Stature, Pituitary Function, and Local Adaptation in Western African Pygmies Short stature in African Pygmies is thought to be an adaptation to a tropical forest environment. Several possible fitness advantages of short height have been proposed, including thermoregulation, early cessation of growth as a trade-off for early reproduction to compensate for shorter life expectancy, easier mobility in a dense forest environment, and reduced caloric requirements (Migliano et al., 2007; Perry and Dominy, 2009). Although stature in Europeans is a highly complex trait (Lango Allen et al., 2010), the genetic architecture of this trait in Pygmies may differ (Pygmy LSBL hits are not enriched for height genes found in largely European GWAS, p = 0.888 for the top 268 LSBL windows, confirming Jarvis et al. [2012]. AIMs within and near HESX1 and POU1F1 are strong candidates for the short stature phenotype in Pygmies, together with previously identified (chr3:45–60 Mb region; Jarvis et al., 2012) and other as yet undiscovered loci. The observation of long-range LD maintained in diverse populations at these loci raises the possibility that undetected inversions in these chromosome 3 regions play a role in population differentiation and adaptation. Additionally, the observation that third-chromosome AIM clusters exist at a very low frequency in other African populations suggests that, if selection has altered the frequency of AIM haplotypes in Pygmies, then it may have acted on standing variation, which existed prior to the divergence of Eastern and Western Pygmies from other African populations. Furthermore, AIM variants are not included in commercially available genome-wide SNP arrays, emphasizing the critical importance of whole-genome sequencing for identifying variants of potential functional significance that may be geographically or ethnically restricted due to distinct selection pressures and/or demographic histories.
In addition to the 3p14.3 (HESX1) and 3p11.2 (POU1F1) AIM clusters, we have identified other candidate loci that may play a role in local adaptation, height, and pituitary function in Pygmies. These loci include TRHR (thyrotropin-releasing hormone receptor), APPL1, FSHR, and genes associated with Williams Syndrome (Supplemental Information). Overall, we find that highly divergent regions of Pygmy genomes (as identified by LSBL scans) are enriched for genes that play a role in pituitary function (p = 0.0082, χ2 test of independence).
Together, these results point toward the possibility that development and expression of hormones produced by the anterior pituitary may play a central role in the Pygmy phenotype, potentially influencing a number of traits, including growth, reproduction, metabolism, and immunity. Further studies of pituitary function and development in vitro and using transgenic animal models will be necessary to elucidate the importance of this system in Pygmy development and physiology and to clarify the role of variants within the 3p14.3 and 3p11.2 Pygmy AIM clusters.
Conclusions In summary, this is one of the first population genomics analyses to use high-coverage whole-genome sequencing. Our results indicate the importance of whole-genome data for reconstructing human origins, identifying targets of local adaptation, reconstructing demographic history, and identifying functionally important variants for complex traits like height. We have identified many novel targets of natural selection that play a role in immunity, reproduction, metabolism, and height in diverse hunter-gatherer populations. As sequencing costs continue to decrease, it will become feasible to do whole-genome sequencing of increasingly larger sample sizes across ethnically diverse global populations and to integrate genomic data with functional studies using in vitro and in vivo models. Such studies will shed light on human evolutionary history and the origin of traits that make each of us unique.
|
|