|
Post by Admin on Jul 17, 2021 1:58:36 GMT
THE DISCOVERY — In the new study, the researchers construct an evolutionary family tree for Homo sapiens, based on the sequenced genomes of modern humans, Neanderthals, and Denisovans. They make two discoveries from their close analysis of these data: First, as little as 1.5 percent and as much as 7 percent of the modern human genome is unique to our species. In other words, at least 93 percent of the modern human genome is shared between our species and the two other ancient hominins. Fernando Villanea, a population geneticist at the University of Colorado, Boulder, who was not involved in the study, tells Inverse that the findings are impressive in what they tell us about how the three species may have mixed. “Their results support the most exciting new views on the interactions between humans, Neanderthals, and Denisovans,” Villanea says. Second, in the last 600,000 years, our genetic adaptations largely were to do with brain development and function only — perhaps when we think of human uniqueness, this is what we should consider as primal, Nathan Schaefer, a bioinformatician at the University of California, San Francisco, and the paper’s lead author, tells Inverse. “Maybe we’ve gotten a first glimpse at what to look at next for what might make humans special,” Schaefer says. Schaefer and his team used a genetic analysis technique called an ancestral recombination graph to make their discovery. Using publicly available data of modern human, Neanderthal, and Denisovan genomes, they then sequenced these genomes themselves and then used the ancestral recombination graph tool to sketch out a genetic tree encompassing the three species. “If you sequence a bunch of people, you could make a tree that would show how everyone's related, on average, across the whole genome,” Schaefer explains. Applied to the three species, it works in a similar way — mapping how they are related across the genome, finding shared regions and distinct ones. “They can look at all human populations at once under the same scope,” Villanea says. “WE ARE VERY SIMILAR TO NEANDERTHALS.” WHY IT MATTERS — By recreating these evolutionary trees, researchers today can pinpoint critical moments when H. sapiens adapted and diverged from our ancestors and other hominin lineages. But to understand how we relate to Neanderthals and Denisovans, it is important to look not only at genes — but also at what happens when those genes are translated into proteins. “When you ask that question, we are very similar to Neanderthals,” Schaefer says. “We have, you know, around 20,000 genes, and somewhere around 40 of them have these actual coding differences that all humans have one version, and the Neanderthals have the other version,” he says. “That’s already like, ‘Wow, we’re really, really close to them.’” As for the subsequent finding about cognitive adaptations, Schaefer says the genomes hint at the true difference between humans and other ancient peoples has to do with our brain development. The pressures that gave rise to these differences are likely incredibly ancient themselves. “These adaptation events which define our species possibly happened around [600,000] and [200,000] years ago in Africa,” Villanea says. This tree offers a peek at the “adaptation events” in the last 600,000 that hold the key to our cognitive development — and more metaphysically perhaps, human uniqueness. The tree hints that what could make us genetically unique from Neanderthals and Denisovans could be all in our heads — that is to say, our brains. “What I’m interested in doing now is trying to learn more about how genes work and what these genes do,” Schaefer says. For example, which mutations in the human genome are even functional, in that they have a measurable effect on our behavior. The study offers a little more light too on Denisovans as a species. As the study details, much more information is known about how modern human DNA reflects our species’ past interactions with Neanderthals purely because more Neanderthal remains have been discovered and more of their genomes are available to science. Currently, our knowledge of the Denisovan genome comes from a single pinkie bone found in a cave in Siberia, and Villanea says much of the work ahead will need to focus on understanding this elusive population of ancient hominin — and if there are more ancient peoples out there still to discover.
|
|
|
Post by Admin on Jul 17, 2021 19:59:08 GMT
An ancestral recombination graph of human, Neanderthal, and Denisovan genomes Nathan K. Schaefer1,2,3,†, Beth Shapiro1,2,3 and Richard E. Green3,4,*
Science Advances 16 Jul 2021: Vol. 7, no. 29, eabc0776 DOI: 10.1126/sciadv.abc0776
Abstract Many humans carry genes from Neanderthals, a legacy of past admixture. Existing methods detect this archaic hominin ancestry within human genomes using patterns of linkage disequilibrium or direct comparison to Neanderthal genomes. Each of these methods is limited in sensitivity and scalability. We describe a new ancestral recombination graph inference algorithm that scales to large genome-wide datasets and demonstrate its accuracy on real and simulated data. We then generate a genome-wide ancestral recombination graph including human and archaic hominin genomes. From this, we generate a map within human genomes of archaic ancestry and of genomic regions not shared with archaic hominins either by admixture or incomplete lineage sorting. We find that only 1.5 to 7% of the modern human genome is uniquely human. We also find evidence of multiple bursts of adaptive changes specific to modern humans within the past 600,000 years involving genes related to brain development and function.
INTRODUCTION Much of the current genetic variation within humans predates the split, estimated at 520 to 630 thousand years (ka) ago (1), between the populations that would become modern humans and Neanderthals. The shared genetic variation present in our common ancestral population is still largely present among humans today and was present in Neanderthals up until the time of their extinction. This phenomenon, which is known as incomplete lineage sorting (ILS), means that any particular human will share many alleles with a Neanderthal that are not shared with some other humans. Therefore, humans often share genetic variation with Neanderthals not by admixture but rather by shared inheritance from a population ancestral to us both. Because of this, any effort to map ancestry from archaic hominins in human genomes must disentangle admixture from ILS. Furthermore, a technique able to identify both admixture and ILS could produce a catalog of uniquely human genomic regions that is free of both and thereby shed light on the evolutionary processes that have been important in our origin as a unique species.
Ancestral recombination graph (ARG) inference (2) is a powerful starting point for such an analysis. An ARG can be conceptualized as a series of trees, mapped to individual sites, over phased haplotypes (chromosomes) in a panel of genomes. Ancestral recombination events, or sites at which chromosome segments with different histories were joined together by historical recombination, form boundaries between trees. Each ancestral recombination event manifests as a clade of haplotypes, all of which descend from the first ancestral haplotype to have it, moving from one position in the tree upstream of the event to a new position in the downstream tree (3). ARGs are complete descriptions of phylogenomic datasets and present for recombining genomes what single trees present for nonrecombining genomes, i.e., a complete description of their genetic relationships. As prior techniques for ancestry mapping can be thought of as summaries of the ARG, higher resolution ancestry maps could be produced if the ARG were known. In addition, the ARG can be used to estimate the time to most recent common ancestor (TMRCA) between admixed and admixer haplotypes, providing additional information about historical admixture between humans and archaic hominins.
Given the utility of an ARG, it is expected that several methods have been devised for estimating ARGs from genetic data. These published approaches all have different strengths and limitations. BEAGLE (3), ArgWeaver (4), and Rent+ (5) were designed for small datasets and require substantial time and/or memory to be used with large sequencing panels. Margarita (6) randomly samples histories at ancestral recombination event boundaries and does not seek to produce parsimonious recombination histories (6). ArgWeaver (4), which is widely considered the gold standard in ARG inference, requires prior knowledge of demographic model parameters. Relate (7) is a relatively new method that scales well to large datasets and produces trees without polytomies and with branch lengths but, in doing so, necessarily samples some relationships that are not directly inferred from the data, as do several other methods (4, 5). The most computationally efficient approach, tsinfer (8), also scales to large datasets but assumes that frequency of an allele is correlated with its age. Since this assumption is violated at loci undergoing either admixture or selection, tsinfer is not well suited for ARG inference using genetic data from Neanderthals, Denisovans, and modern humans.
Here, we present a heuristic, parsimony-guided ARG inference algorithm called SARGE (Speedy Ancestral Recombination Graph Estimator) and use it to build a genome-wide ARG of both modern human and archaic hominin genomes. SARGE can run on thousands of phased genomes, makes no prior assumptions other than parsimony, heuristically estimates branch lengths, and avoids making inferences about unobserved relationships by leaving polytomies in output trees. We validate SARGE using simulated data and demonstrate that it has high specificity compared to existing methods in reconstructing the topology of trees, making it suitable for identifying archaic admixture segments. To achieve this high specificity, SARGE avoids describing some relationships in output trees, resulting in lower sensitivity than existing methods.
We run SARGE on a panel of 279 modern human genomes, two high-coverage Neanderthal genomes, and one high-coverage Denisovan genome. Using the resulting ARG, we map Neanderthal and Denisovan ancestry, ILS, and the absence of both across modern human genomes. We find evidence of at least one wave of Neanderthal admixture into the ancestors of all non-Africans. We also identify several long and deeply divergent Neanderthal haplotype blocks that are specific to some human populations. We find support for the hypothesis that Denisovan-like ancestry is the result of multiple introgression events from different source populations (9, 10). We also detect an excess of Neanderthal and Denisovan haplotype blocks unique to South Asian genomes. Last, we pinpoint human-specific changes likely to have been affected by selection since the split with archaic hominins, many of which are involved in brain development.
|
|
|
Post by Admin on Jul 17, 2021 20:37:20 GMT
RESULTS ARG algorithm To build an ARG containing both modern human and archaic hominin genomes without the use of a demographic model or the need to infer ancestral haplotypes, we developed a parsimony-based ARG inference technique, SARGE. SARGE uses both shared derived alleles and inferred, shared ancestral recombination events to articulate trees (Fig. 1A and Supplementary Methods). SARGE uses the four-gamete test (11) to determine regions of recombination and the affected haplotypes. The crux of SARGE is a fast algorithm for choosing the branch movement(s) capable of explaining the highest number of discordant clades across a genomic segment that fails the four-gamete test. Once the branch movements, i.e., inferred ancestral recombinations, are determined, further definition of clades is possible. Thus, the trees are articulated by both shared alleles and shared ancestral recombination events (figs. S1 and S2 and Supplementary Methods). SARGE infers branch lengths via a heuristic method, compensating for mutation rate variation across the genome by comparing the number of mutations on each branch to the divergence to an outgroup genome in a fixed width region around each site (fig. S7 and Supplementary Methods). Fig. 1 Data structure and performance of SARGE on simulated data. (A) Schematic of data structure. Top: Rectangles are “tree nodes” representing clades in trees. Each clade has member haplotypes (shown with letters A to G) and a start and end coordinate (blue numbers in brackets) determined by coordinates of single-nucleotide polymorphism (SNP) sites tagging the clade (yellow numbers in braces), along with a propagation distance parameter (100 in this example). Parent/child edges (vertical arrows) also have start and end coordinates determined by the nodes. Ovals are candidates for clades sharing an ancestral recombination event that can explain four-gamete test failures; colored edges indicate potential paths between tree nodes through candidate nodes that could explain four-gamete test failures (colors indicate types of paths). The candidate node with the most edges (here, AB) is eventually chosen as the most parsimonious branch movement, allowing for the inference of new nodes. The two trees at the bottom show the “solved” ancestral recombination event with the branch movement marked in red and all clades inferred without SNP data marked with yellow stars (haplotypes A and B share an ancestral recombination event; their ancestry is shared with haplotypes C, D, and G upstream of the recombination event and haplotype E downstream of it). The coordinates of the recombination event (blue numbers in brackets) are taken to be midway between the highest-coordinate upstream site (left side) and the lowest-coordinate downstream site (right side) involved in recombination. For a more detailed overview of the data structure, see figs. S3 to S5. (B) Accuracy of SARGE on simulated data (defined as percent of all clades correct according to the true ARG in the simulation), with increasing numbers of human-like haplotypes from an unstructured population. Error bars are one SD across five replicates. (C) Number of nodes per tree with increasing number of haplotypes in simulated data. In the interest of parsimony, our method attempts to infer a set of ancestral recombination events that each explains as many four-gamete test failures as possible. Because the four-gamete test is known to underestimate the true number of ancestral recombination events (12, 13), SARGE will systematically underestimate the true number of ancestral recombination events in a dataset by design. Because of this, SARGE is not well suited to certain tasks, such as the creation of fine-grained recombination maps. We have attempted to mitigate cases where a clade in the ARG should be broken by an unobserved ancestral recombination event, however, by introducing a propagation distance parameter that limits the genomic distance over which each observed clade is allowed to persist (Fig. 1A and Supplementary Methods). SARGE is scalable to large datasets and achieves higher specificity than many other methods at the cost of lower sensitivity, by leaving uncertainty (polytomies) in the output data. Using simulated data, we find that SARGE runs quickly (figs. S8D and S10), requires little memory, and has 78.93% specificity [95% confidence interval (CI), 78.09 to 78.95%] on average across a range of simulated datasets that include between 50 and 500 haplotypes (see Supplementary Methods). SARGE is at least as specific as alternative techniques (fig. S9, A and C). Conversely, SARGE’s sensitivity (25.36%; 95% CI, 25.32 to 25.40%) is lower than that of other methods (fig. S9, B and D), as SARGE leaves an increasingly large number of polytomies in output trees as the number of input haplotypes increases (Fig. 1, B and C). As expected, SARGE (as well as similar techniques) performs best when the mutation to recombination rate ratio is high, as this makes clades easier to detect (figs. S11 and S12) and suffers slightly in accuracy with increasing amounts of population structure (fig. S13). We also find that the sensitivity of SARGE can be increased by increasing the propagation distance parameter (see Supplementary Text and fig. S15), that missing clades are likely to be small clades that are likely to be close to the leaves of trees (see Supplementary Text and fig. S16), and that incorrectly inferred clades tend to be within a few kilobases of sites at which those clades exist in truth (see Supplementary Text and fig. S17). We also find, using simulated data, that the SARGE’s branch lengths do not appear to be systematically biased upward or downward (see Supplementary Text and fig. S18). We ran SARGE on 279 phased human genomes from the Simons Genome Diversity Project (SGDP) (14), together with two high-coverage Neanderthal genomes (1, 15) and one high-coverage Denisovan genome (16). In our analyses, we relied on modern human population labels defined by the SGDP for many analyses, but we split sub-Saharan Africans into one population containing only the most deeply diverged lineages (Biaka, Mbuti, and Khomani-San), which we call “Africa-MBK,” and the remaining genomes (“Africa”). Using these data, we find that the completeness of trees in the ARG (the extent to which all possible clades are present rather than in polytomies) is positively correlated with the local mutation rate to recombination rate ratio (fig. S20A; Spearman’s rho = 0.40; P < 2.2 × 10−16) and that the number of inferred ancestral recombination events per genomic window agrees with a previously published population recombination map (17) (fig. S20B; Spearman’s rho = 0.46; P < 2.2 × 10−16), as expected. Estimates of the mean TMRCA of pairs of haplotypes, taken across all trees, were also concordant with prior knowledge (Fig. 2A). Fig. 2 Performance of SARGE on SGDP and archaic hominin dataset. (A) Pairwise coalescence times for randomly sampled sets of up to 10 pairs of phased genome haplotypes per population (every possible pair was considered for archaic hominins, since fewer genomes were available). Values are calibrated using a 13-Ma human-chimp divergence time (see Supplementary Methods) and averaged across every variable site in the dataset, error bars show one SD, and branch shortening values for archaic samples were incorporated into calculations using mean values reported in (1). The lower value for humans comes from removing archaic-admixed clades from trees. (B) UPGMA (unweighted pair group method with arithmetic mean) trees computed using nucleotide diversity from SNP data (top and left) against similarity matrix from shared recombination events inferred by SARGE. Light yellow boxes (similar groups) are Native Americans and Papuans. (C) Average similarity between Orcadian haplotypes in the SGDP panel and other European haplotypes calculated on the basis of the number of shared ancestral recombination events. The best matches are in England, Iceland, and Norway, as expected. Using these data, we found SARGE’s inferences of ancestral recombination events to be accurate. Because SARGE articulates tree clades using either shared allelic variation or shared inferred ancestral recombination, it is possible to test the concordance of trees made from each source. On average, 13.2% of clades in the ARG are known only from inference of shared ancestral recombination events and not by the presence of a shared, derived allele. We created a similarity score between every pair of phased human genome haplotypes in our dataset based on how often the haplotypes share ancestral recombination events. This score recapitulates relationships among humans known from single-nucleotide polymorphism (SNP) data alone (Fig. 2, B and C; Pearson’s r2 with scores from SNP data = 0.989; P < 2.2 × 10−16). We note that genomes with the poorest correlation between SNP-based and recombination-based similarity scores to other genomes are those most likely to contain phasing errors (table S1).
|
|
|
Post by Admin on Jul 17, 2021 21:06:13 GMT
Archaic hominin admixture We used our ARG to find regions of each phased human genome that derived from admixture with archaic hominins (see Supplementary Methods and fig. S24). If humans and the archaic hominins in our panel were in populations that had sorted their lineages, then this exercise would be simple with a complete and correct ARG. However, since human genome regions are often within a clade that includes hominin haplotypes due to ILS, finding admixed segments requires analysis beyond simply finding clades that unite some human and archaic hominin haplotypes. We started by selecting clades from ARG trees that united some modern humans with archaic hominins to the exclusion of some other modern humans. We then assigned each human genome haplotype in each such clade as putative Neanderthal, Denisovan, or ambiguous ancestry, depending on whether the clade contained Neanderthal, Denisovan, or both types of haplotypes. We then performed several filtering steps to remove these clades likely to result from ILS. First, we removed any clades that included more than 10% of the Africa-MBK haplotypes from the most basal human lineages, which are unlikely to be admixed. We then discarded clades that persisted for a short distance along the chromosome (which likely represent older haplotypes broken down over time by recombination) or in which the TMRCA between modern humans and archaic hominins was high (see Supplementary Methods and fig. S24). This ascertainment strategy was designed to identify haplotype blocks that we could confidently identify as archaic-introgressed and therefore likely underestimated the true extent of admixture across the genome. Because our method relies on both the haplotype block length and the TMRCA between admixed and introgressor haplotypes to identify admixed segments, we were able to identify some haplotypes that resemble archaic admixture in modern humans but that have relatively high sequence divergence to published archaic genomes (manifesting as high TMRCAs between archaic and modern genomes within these segments). Using the resulting maps, we calculated genome-wide percent admixture estimates across populations and compared them to fˆestimates based on the population-wide D statistic (18, 19) using basal Africa-MBK lineages as an outgroup. Since our alleles were polarized relative to the chimpanzee genome and only sites with derived alleles present in hominins were considered, our calculations were of the form D(Africa-MBK, Test, Introgressor, Chimpanzee) / D(Africa-MBK, Introgressor1, Introgressor2, Chimpanzee), where Introgressor1 and Introgressor2 were randomly chosen subsets of half of the introgressor (Neanderthal or Denisovan) haplotypes and the derived allele frequency in chimpanzee was set to 0 at all sites in our dataset. ARG-based estimates are similar to, but lower than, D statistic–based estimates in all non-African genomes, which we expected because of our aggressive filtering strategy for eliminating ILS (see Supplementary Methods). We detected slightly more admixture in sub-Saharan Africans (excluding Africa-MBK) than using the D statistic (Fig. 3A), even when considering the lower end of 95% confidence weighted block jackknife intervals (table S2). We note that a recent study that used an outgroup-free method to detect Neanderthal ancestry blocks in human genomes also found a higher average amount of Neanderthal ancestry in African genomes than has been previously reported (20). As another quality check, we compared our maps of Neanderthal ancestry to those published in prior studies (20–23). We found that maps produced by SARGE are about as concordant with the published maps as the published maps are with each other (fig. S25). Fig. 3 Neanderthal ancestry inferred in modern humans. (A) Genome-wide percent Neanderthal, Denisovan, and ambiguous (either Neanderthal or Denisovan) across SGDP populations, using the ARG and an estimator based on the D statistic. D statistic calculations considered only one archaic population at a time as introgressor and thus do not detect ambiguous ancestry and also might count some Denisovan ancestry as Neanderthal and vice versa. (B) For individual phased human genome haplotypes (points), mean TMRCA with Neanderthal in Neanderthal haplotype blocks (y axis) and mean Neanderthal haplotype block length (x axis) for all Neanderthal-introgressed haplotypes. TMRCA calculations assume a total of 6.5-Ma human-chimpanzee divergence and branch shortening values from (1), with a mutation rate of 1 × 10−9 per site per year. bp, base pair. (C) Overall number Neanderthal haplotype blocks versus geographically restricted (unique to a 3000-km radius) Neanderthal haplotype blocks. (D) Same as (B), but limited to geographically restricted (unique to a 3000-km radius) Neanderthal haplotype blocks. Only haplotypes with more than 10 geographically restricted segments are shown. We next looked for population-specific differences in archaic hominin ancestry in modern humans. Lengths of archaic haplotype segments and the TMRCA to admixer across those segments are both affected by the time of admixture and the divergence between the true admixers and available archaic hominin genomes. We therefore computed the mean of these two values for each ancestry type per phased genome haplotype and compared them across individuals from different populations to look for evidence of distinctive, population-specific admixture events. This analysis revealed distinctive population-specific patterns for Neanderthal and Denisovan ancestry, and many pairwise comparisons of these values between populations are significant (table S3). Segments of ambiguous ancestry produce a pattern resembling a mixture of Neanderthal and Denisovan ancestry, as expected (Fig. 3, B to D, and figs. S26, S28, S37, and S38). We caution, however, that our approach can artificially shorten haplotype block lengths (see Supplementary Methods and figs. S21 and S22), especially for populations such as Papuans and Australians that were absent from the 1000 Genomes Project panel (24) that was used for phasing (14). Nonetheless, Neanderthal haplotype block lengths in Oceania are not significantly shorter than in other populations (Fig. 3B), and incorrect phasing in archaic genomes does not appear to negatively affect results of admixture scans using simulated data (see Supplementary Text). As expected, the ARG classifies a smaller fraction (0.096 to 0.46%) of sub-Saharan African genomes (excluding Mozabite and Saharawi individuals) as resulting from Neanderthal admixture compared to non-African genomes (0.73 to 1.3%). The haplotype segments of African genomes that are grouped together in clades with Neanderthal haplotypes are distinctive from the haplotype segments found in the genomes of people with non-African ancestry (Fig. 3B and fig. S26A). Namely, the African haplotypes are more dissimilar to the Neanderthal haplotypes with which they are grouped and tend to be shorter. These observations are qualitatively consistent with the model wherein genetic drift may group Neanderthal and African haplotypes, independent of a specific admixture event. It is also possible that these haplotypes are the result of true introgression events from unknown archaic hominins distantly related to the Neanderthal/Denisovan lineage (25). Another recent study using an inferred ARG also found mysterious, divergent haplotypes within sub-Saharan Africans that resembled unknown archaic introgression (7). Unexpectedly, however, two of the SGDP African populations, Masai and Somali, are intermediate between non-African and African genomes when measuring lengths of archaic haplotype segments and TMRCA to admixers within them (Fig. 3B). These Neanderthal haplotype blocks may have originated in ancient European migrants to eastern Africa (26) and spread beyond eastern Africa through gene flow, which is known to have affected even the basal Africa-MBK lineages (27). To test this hypothesis, we recomputed the mean length and TMRCA of admixer genomes within archaic-introgressed haplotype segments across all individuals, using only geographically restricted segments. We defined these as any archaic haplotype segments found only in genomes that were sampled within a 3000-km radius of each other (using geodesic distance between sampling coordinates). This analysis showed Masai and Somali genomes to have a comparable number of geographically restricted Neanderthal haplotypes to most other African genomes (Fig. 3C), with similar haplotype block lengths and TMRCAs to admixers within these geographically restricted haplotypes (Fig. 3D). This observation is concordant with the idea that the unusual Neanderthal haplotypes in these populations originated in Eurasian migrants. Outside of Africa, our Neanderthal introgression maps largely agree with prior studies. We detect a mean TMRCA to Neanderthal of about 74 ka ago across all Neanderthal haplotype blocks in non-African populations, using published corrections for branch shortening in the archaic genomes (1). This value is consistent with published estimates of Neanderthal admixture times and the phylogenetic distance between the Vindija33.19 Neanderthal and the introgressing Neanderthal calculated by a recent study (28). The mean TMRCA between genomic segments detected as Neanderthal admixture segments and the Neanderthal itself is consistent within about 5000 years for all populations outside of Africa (Fig. 3B). We see slightly more Neanderthal ancestry in Central Asia, East Asia, and the Americas than in Europe, South Asia, and Southwest Asia (Fig. 3A). We also find more geographically restricted Neanderthal haplotype blocks in South Asia than elsewhere in mainland Eurasia, and the fewest geographically restricted Neanderthal haplotype blocks in the Americas (Fig. 3C and fig. S35). Humans in Central and East Asia are known to have elevated Neanderthal ancestry compared to other populations (22). However, there is debate over whether this elevated Neanderthal ancestry is due to smaller past population size relative to other groups and the resulting stronger effect of genetic drift (22) or to additional pulses of Neanderthal admixture specific to these populations (9, 29). Although we detect more Neanderthal ancestry in Central and East Asians than in West Eurasians, we detect a similar number of geographically restricted haplotype blocks (unique to a 3000-km radius) in both groups (Fig. 3C). Further, Neanderthal haplotype blocks are shorter on average and therefore older in Central and East Asians than in West Eurasians (Fig. 3B). This implies that the excess Neanderthal ancestry in Central and East Asians mostly comprises broadly shared haplotype blocks from introgression common to all non-Africans, consistent with the drift scenario. Another recent study (20) suggested that excess Neanderthal ancestry in Central and East Asians could be mostly due to underestimating Neanderthal ancestry in West Eurasians as a result of using sub-Saharan Africans, who share some Neanderthal ancestry with West Eurasians, as a model unadmixed outgroup. We reject this explanation because removing Neanderthal haplotypes from West Eurasians would likely increase the number of geographically restricted Neanderthal haplotypes in Asians, contrary to our observation. Furthermore, South Asians appear to have a comparable amount of Neanderthal ancestry to West Eurasians (Fig. 3, B and C), despite sharing few Neanderthal haplotype blocks with Africans relative to West Eurasians (fig. S37A). Last, because our strategy for ascertaining Neanderthal haplotype used only Mbuti, Biaka, and Khomani-San genomes as an outgroup and allowed up to 10% of these genomes to have Neanderthal ancestry, we do not believe our results were significantly biased by our choice of outgroup. This is further demonstrated by the fact that we detect a non-negligible amount of Neanderthal ancestry in all African groups (table S2). Aside from these broadly shared haplotype blocks, we also observe geographically restricted Neanderthal haplotype blocks in each non-African population in our panel. These population-specific haplotype blocks tend to be longer than the shared haplotype blocks and to have an older TMRCA to the Neanderthal genome than the broadly shared haplotype blocks (Fig. 3D, compared to all blocks shown in Fig. 3B). These observations suggest that the population-specific haplotype blocks may be the result of more recent population-specific Neanderthal admixture, as has recently been suggested (26–28, 30). We next investigated population-specific patterns within Denisovan ancestry segments and found that these segments probably originate from admixture with multiple, divergent individuals that were distantly related to the Denisovan genome. This implies that the Denisovan genome is not a good model for the actual population that admixed with humans with “Denisovan” ancestry. Prior studies have suggested that Denisovan-like haplotype blocks in humans have two or three distinct sources with different levels of divergence to the Denisovan genome, with the best-matching haplotype blocks in East Asia (9, 10). We uncover the same signal: Geographically restricted Denisovan haplotype blocks have the lowest TMRCA to the Denisovan genome in East Asian genomes (mean TMRCA to Denisovan of 125 ka ago) (figs. S26 and S27). Unexpectedly, we detected many Neanderthal and Denisovan-like haplotype blocks that are unique to South Asia (Fig. 3C and figs. S26C, S45, S35, and S36) and many Neanderthal haplotype blocks that are unique to Oceania (Fig. 3C and figs. S44 and S35). These geographically restricted Neanderthal haplotype blocks are no more divergent to the Neanderthal genome than those specific to other populations (Fig. 3D), complicating any interpretation of these regions.
|
|
|
Post by Admin on Jul 17, 2021 21:38:41 GMT
Genomic regions free of admixture and ILS Our ARG strategy allows us to bin the human genome into regions containing archaic admixture in at least some humans, regions of ILS, and regions free of both archaic admixture and ILS in all humans (hereafter archaic “deserts”). We find that approximately 7% of the human autosomal genome is human-unique and free of both admixture and ILS. Roughly 50% of the human genome contains regions where one or more humans has archaic ancestry obtained through admixture. If deserts are further restricted to regions that contain a high-frequency, human-specific derived allele, i.e., a substitution that can be assigned to the human lineage (hereafter “human-specific regions”), then these comprise only 1.5% of the assayed genome (Fig. 4A). Despite comprising very little of the genome, however, human-specific regions are significantly enriched for genes, exons, and regulatory element binding sites, while deserts are enriched for both genes and regulatory element binding sites (table S4). In line with previous studies (21, 31), we find admixed regions to be depleted of genes. Regions of ILS are enriched for overlap with genes but significantly depleted of exons (table S4). Fig. 4 Extent of deserts of archaic hominin ancestry. (A) Left: In the SGDP dataset, fractions of the genome where any individual has archaic ancestry (admixture), where any individual is incompletely sorted with archaic hominin lineages but where there is no archaic admixture (ILS), where there is no evidence of either admixture or ILS with archaic hominins (deserts), and where there is a fixed derived allele private to and shared by all humans (human-specific regions). Right: The same values from a simulated dataset with single pulses of Neanderthal and Denisovan admixture, both with an admixture proportion (2%) that produced reasonable amounts of archaic ancestry per individual genome (fig. S47C). (B) For random subsamples of the SGDP dataset, along with the full dataset, fractions of the genome comprising deserts, deserts filtered for candidate archaic alleles using another dataset, and human-specific regions are shown. The points on the far right (full dataset) correspond to the desert and ILS bars in the left panel of (A). To obtain an expectation of the extent of these different types of genomic regions, we ran a series of coalescent simulations with different amounts of archaic hominin admixture, as well as with no admixture (see Supplementary Methods and Supplementary Text). Our observation in the real data, which only 7% of the autosomal genome is free of both archaic admixture and ILS, is consistent with the results of a simulation that resulted in both higher archaic ancestry estimates per individual and more variability in individual archaic ancestry estimates than we observed in real data (Fig. 4A, Supplementary Text, and fig. S47). Two nonmutually exclusive explanations for this difference are (i) the existence of more, geographically limited, archaic hominin admixture events than the two we modeled (see Supplementary Text) and (ii) purifying selection acting against deleterious alleles found within archaic admixed segments (32). Under the model including selection, epistasis may be important, as the strength of selection against archaic alleles would vary depending on which archaic alleles are present in each human population. The power to detect deserts, i.e., regions in which no human carries a haplotype shared with an archaic hominin by ILS or admixture, can be expected to be affected by the number of human genomes available for analysis. To be certain that we have found the true extent of archaic deserts, we inferred ARGs over random subsamples of the human panel, computing the extent of deserts and human-specific regions for each (Fig. 4B). We were able to recover the full extent of deserts using a subsample of 100 haplotypes, less than half the size of the full panel, suggesting that the panel is sufficiently large. Given the limited sensitivity of our method compared to alternatives, we also compared the ability of SARGE, Relate, and tsinfer to recover true desert regions in demographic simulations with different numbers of haplotypes (see Supplementary Methods). We found that all methods, including SARGE, tended to overestimate the extent of desert regions, with this tendency increased in the larger dataset (figs. S48 and S49), further highlighting the unexpectedly small fraction of the human genome found to be in these deserts in the actual human genome data. Timing of human-specific mutations Given a clade of interest, mutations shared by all members of the clade must have arisen between that clade’s TMRCA and its parent clade’s TMRCA. Using this logic, and using the local divergence to the chimpanzee genome in the surrounding region of each tree to calibrate the local mutation rate, assuming human-chimpanzee divergence 13 Ma ago (see Supplementary Text) (18), we estimated ages of all human-specific mutations within deserts. For a schematic of how branch lengths were estimated, see fig. S7. Because the order of mutations along any given branch is unknowable, we took the midpoint of each branch, in years, to be the approximate age of each mutation. Combining these dates with a catalog of high-frequency, human-specific mutations and other annotation data (see Supplementary Methods) allowed us to construct a picture of human-specific evolutionary changes through time. We note that our date estimates rely on the assumption of a constant local mutation rate around each locus across all lineages since human/chimpanzee divergence. Some studies have found evidence for mutation rate heterogeneity in different hominoid lineages, with mutations accumulating more slowly in humans than in other apes, e.g., (33). In this case, some of our TMRCA estimates could be biased downward, as more time would have been required to accumulate the mutations tagging recently formed clades than those that occurred branches closer to the root. We first examined whether there were one or multiple bursts of human-specific adaptive changes since divergence with Neanderthals and Denisovans. We compiled the ages of all fixed or nearly fixed human-specific derived mutations within archaic hominin deserts that either were annotated as nonsynonymous substitutions (34) or fell within annotated regulatory element binding sites. The age distribution of these mutations is centered around 600 ka ago (Fig. 5A). Fig. 5 Approximate ages and consequences of fixed human-specific derived mutations. (A) Ages of candidate human-specific functional mutations (nonsynonymous substitutions and mutations within regulatory element binding sites) inferred by SARGE within desert regions (free of ILS and admixture between archaic hominins and modern humans). (B) For interacting pairs of genes in the STRING database (35) for which candidate human-specific functional mutations affect both genes, the ages of the two mutations are shown. (C) For each candidate human-specific functional mutation, the length of the surrounding desert region (x axis) and inferred mutation age (y axis) are shown. Mutations for which SARGE did not infer a date (mutations within CpG sites or for which the ancestral allele was unknown) are shown in the bottom panel. Mutations were scored on the basis of length of desert and age; genes with regulatory mutations in the top 99.5th percentile of this distribution, or nonsynonymous mutations in the top 95th percentile, are shown, with nonsynonymous changes in bold. We then compared the ages of mutations affecting pairs of genes that interact, according to the STRING database (35), to see whether any clustered around specific time points (see Supplementary Methods). We find two distinct bursts of these mutations, one concentrated around 600 ka ago and another around 200 ka ago (Fig. 5B). We note that, because many of our human-specific genes are likely functionally important and purifying selection can decrease genetic diversity, some of the time estimates for these mutations may be biased downward. Estimating how and when the modern human lineage arose remains controversial. Dating the oldest population split within modern humans using genetic data has suggested times as recent as 200 to 100 ka ago (36, 37). Archeological evidence paints a more complex and older story; however, a recent study reported human remains with many modern features but archaic cranial morphology dated to about 315 ka ago (38), suggesting that not all human-specific traits arose at the same time. Other studies have found that accumulation of derived morphological features in humans occurred in approximately three periods, whose boundaries correspond roughly to the timing of mutational bursts we found (39). Functional consequences of human-specific mutations Comparison of the human and extinct hominin genomes could reveal instances of positive selection that are undetectable via allele frequency or haplotype-based analyses within modern humans or through comparative genomics between humans and other primate genomes (40). The ARG framework is an attractive approach as it pinpoints truly human-specific genomic regions unaffected by either admixture of ILS. We performed a gene ontology (41) enrichment analysis on the human-specific haplotype regions that accounted for the lengths of candidate genes (see Supplementary Methods) and found these regions to be heavily enriched for genes related to neuron growth, synapse assembly and function, and cell adhesion (table S5). We note that this may occur because of positive selection in the ancestors of all humans, strong purifying selection that reduces the TMRCA within humans so that it excludes extinct hominins, or a combination of both. Further, this test does not suggest specific functional consequences of specific mutations. We ordered human-specific derived mutations within desert regions found in our ARG by the strength of evidence that they were targeted by selection. To this end, we first sought to limit analysis to potentially functionally relevant mutations, defined as mutations that either caused a nonsynonymous substitution relative to archaic hominin genomes or fell within annotated binding sites for regulatory elements known to affect specific genes (see Supplementary Methods). We developed a simple score for each mutation based on its inferred age, where available, and the length of the surrounding desert region (Fig. 5C). The rationale for this approach is that older human-specific substitutions should be in shorter haplotypes as they would have undergone more generations of recombination. Mutations in haplotype regions that run counter to this expectation are a priori more likely to have been affected by positive selection. This approach is similar to the recently described extended lineage sorting (ELS) scan (42), which prioritized long genomic intervals where modern human and archaic hominin lineages are completely sorted. Unlike the ELS method, however, our method only considers alleles that are fixed in modern humans, Our model also does not use a hidden Markov model to smooth transitions between sorted and unsorted haplotypes. Several patterns emerge when considering genes with high-scoring human-specific mutations, and we highlight some of these key findings. Gene ontology terms related to mRNA splicing, processing, and export are enriched in genes with high-scoring mutations (table S7). Of these, we find a regulatory mutation affecting one—LUC7L3—that is somewhat tissue specific (tau, a measure of tissue specificity scaled from 0 to 1 = 0.713), most highly expressed in cerebellar tissue, and annotated to be involved in splice site selection. Its paralog LUC7L and the gene KHDC4, both of which also have high-scoring (top 50th percentile) regulatory mutations, are involved in the same process. In addition, the gene NOVA1, which harbors a nonsynonymous mutation in the top 95th percentile of our score distribution (Fig. 5C), is a neuronal splicing factor that regulates splicing of genes involved in synapse formation within the brain (43). Other types of genes, largely related to brain function and development, appear to be affected by high-scoring mutations. Many genes localized to the centrosome and mitotic spindle are involved in maintaining the polarity of dividing neuroblasts, and some mutations affecting these genes are thought to be critical for the development of the human neocortex (44). We find the term “asymmetric neuroblast division” to be enriched in high-scoring genes (table S7). Among individual genes, we find a high-scoring nonsynonymous mutation affecting the centrosomal protein RABL6, which is highly expressed in cerebellar tissue and overexpressed in cancer (45) and a high-scoring regulatory mutation affecting INCENP, a protein crucial for cytokinesis that localizes to the mitotic spindle and centromere (46). Axon pathfinding is another process suggested to have been targeted by human-specific changes; the gene PIEZO1 is involved in this process (47) and harbors a high-scoring nonsynonymous mutation. In addition, the protocadherin PCDHGB7, which contains a nonsynonymous substitution within a long desert region but which we could not date, is a member of a gene family that generates neuronal cell surface identity and is thought to help guide growing neurites (48). In addition to these, we find a number of other mutations potentially affecting genes involved in histone acetylation, neural cell migration, and the clearing of toxic substances from the brain (see Supplementary Text).
|
|