Figure 7
Correlations of Papuan Ancestry with Archaic and S∗ Components
Conclusions
The discovery and characterization of archaic hominins has typically begun with the analysis of fossil remains (Meyer et al., 2012, Prüfer et al., 2014, Prüfer et al., 2017, Slon et al., 2018). However, as Denisovan admixture has its center of gravity in ISEA and Papua where DNA rarely survives more than a few thousand years in the humid tropical environment (Lipson et al., 2018, McColl et al., 2018), studying the genetic record from modern humans remains the sole way to shed light on the substructure and phylogeography of archaic hominins in this important but understudied region.
Here, we use a statistical approach on new genomes from ISEA and Papua to identify two new Denisovan groups (D1 and D2) and describe the relationships between these archaic hominins long before they first interacted with anatomically modern humans. Both groups branched off early from the Altai Denisovan clade at 283 and 363 kya and were reproductively isolated from the individuals at Denisova cave in Siberia and from each other. Yet both groups bred with modern humans, contributing around 4% of the genomes of Papuans, including over 400 gene variants enriched for traits involving immunity and diet. Some of this introgression is restricted to modern New Guinea and its surrounding islands and may have occurred as late as the very end of the Pleistocene, making the admixing D1 Denisovan population among the last surviving archaic hominins in the world.
The genetic diversity within the Denisovan clade is consistent with their deep divergence and separation into at least three geographically disparate branches, with one contributing an introgression signal in Oceania and to a lesser extent across Asia (D2), another apparently restricted to New Guinea and nearby islands (D1), and a third in East Asia and Siberia (D0). This suggests that Denisovans were capable of crossing major geographical barriers, including the persistent sea lanes that separated Asia from Wallacea and New Guinea. They therefore spanned an incredible diversity of environments, from temperate continental steppes to tropical equatorial islands. The emerging picture suggests that far from moving into sparsely inhabited country, modern humans experienced repeated and persistent interactions as they expanded out of Africa into this highly structured archaic landscape across Eurasia. This genetic contact yielded a rich legacy, including hundreds of gene variants that continue to contribute to the adaptive success of anatomically modern humans today.
STAR★Methods
Method Details
A schematic overview of the analytical pipeline presented here is shown in Figure S1A (STAR Methods S1–5) and Figure S1B (STAR Methods S6–13). Datasets used are shaded in green; analyses and inferences in yellow; and key steps are outlined in bold.
S1 - Sequencing and SNP calling
Sequencing libraries were prepared using TruSeq DNA PCR-Free and TruSeq Nano DNA HT kits depending on DNA quantity. 150 bp paired-end sequencing was performed on the Illumina HiSeq X sequencer.
Individuals were sequenced to expected mean depth of 30x, with an achieved median depth of raw reads across samples of 43x.
These newly generated whole genome sequences were combined with the following published genomes (raw reads):
a)
292 genomes from the Simons Genome Diversity Project (SGDP) (Mallick et al., 2016)
b)
25 Papuan genomes from the Malaspinas et al., 2016 study
SNP calling was performed on the combined dataset, with published genomes analyzed from raw reads exactly as for the new sequence data.
Trimmomatic v. 0.38 (Bolger et al., 2014) was used to cut adapters and low-quality sequences from the reads. After trimming, the vast majority of reads were longer than 145 bp; those below 60 bp were excluded. We aligned the reads to the ‘decoy’ version of the GRCh37 human reference sequence (hs37d5) using BWA MEM (Li, 2013). We removed duplicate reads with picard-tools v. 2.12.0 (http://broadinstitute.github.io/picard) and performed local realignment around indels with GATK v. 3.5 (Poplin et al., 2017).
After alignment, and keeping only properly paired reads that mapped to the same chromosome, the sequencing depth across the samples used in downstream analyses was as follows: min = 18x, Q1 = 35x, median = 38x, Q3 = 43x, max = 48x. Only three samples had median coverage rates below 30x: CBL34, RAM005 and RAM067.
Base calling was undertaken with GATK v. 3.5 following GATK best practices. Per-sample gVCF files were generated using GATK HaplotypeCaller (using only reads with mapping quality ≥ 20). Single sample gVCFs were combined into multisample files using CombineGVCFs, and joint genotyping was performed using GATK GenotypeGVCFs, outputting all sites to a multisample VCF. Exactly the same base calling steps were applied to new and published samples, and the joint genotyping included all samples in this study.
Using BCFtools v. 1.4 (Li, 2011), the following filters were applied to each genotype call: base depth (DP) ≥ 8x and ≤ 400x, and genotype quality (GQ) ≥ 30. We then kept only biallelic SNPs and invariable reference sites. For the majority of our analyses, we kept only sites that had high quality variant calls in at least 99% of samples. (Specifically, all analyses in STAR Methods S5-S9 and S11, and all analyses in S10, apart from two result robustness checks that assessed phasing and archaic haplotype topologies. Additionally, we did not apply the call rate filter in the motif-counting analysis in STAR Methods S12). Applying this 99% call-rate filter yielded a total of 36,462,963 SNPs in the combined dataset. We removed sites within segmental duplications, repeats and low complexity regions. These masks were downloaded from the UCSC and Broad Institute genome resources:
hgdownload.soe.ucsc.edu/goldenPath/hg19/database/genomicSuperDups.txt.gzsoftware.broadinstitute.org/software/genomestrip/node_ReferenceMetadata.htmlIn the filtered and masked VCF files, we examined several statistics across the samples: the percentages of no-calls and singletons; the average depth; transition/transversion ratio; the number of variants; and heterozygosity. One highly heterozygous sample from the SGDP (LP6005441-DNA_A09, Naxi-2) was excluded based on these metrics, as well as on the basis that the original authors determined that this sample had been contaminated (Mallick et al., 2016).
S2 - Kinship and outlier analysis
We performed sample kinship analysis using KING v. 2.1 (Manichaikul et al., 2010). Of the 161 new genomes, 6 were excluded due to the presence of a first-degree relative in the dataset, leaving a total of 155 genomes for downstream analysis. This relatedness is a consequence of the village-scale sampling strategy employed in this study. In addition, 7 sample pairs display second-degree relatedness (BNA05 / BNA12-F, BNA21-F / BNA26-F, CBL018 / CBL019, RAM045-F / RAM067, RAM022 / RAM039-F, NIAS08 / NIAS12, NIAS01 / NIAS10). These samples were kept for further analyses, and the final dataset comprised 471 genomes: 155 newly generated complete genomes, 25 genomes from the Malaspinas et al., 2016 study and 291 genomes from SGDP.
Principal component analysis (PC) was used to detect sample outliers and characterize regional diversity. First, we applied LD pruning of our SNP set using PLINK v. 1.9 (Chang et al., 2015). Pruning was performed in 1 Kb sliding windows with a step size of 100 bp, and SNPs with R2 > 0.1 were removed. Next, PCA was performed in EIGENSOFT v. 7.2.0 (Patterson et al., 2006, Price et al., 2006) without the outlier removal step. The results of a PCA without African samples (N = 429) is shown in Main Text Figure 1B.