Genomic Data Reveal a Complex Making of Humans

new

Admin
Administrator

Posts: 72,791

Genomic Data Reveal a Complex Making of Humans Mar 2, 2021 3:41:11 GMT

Quote

Post by Admin on Mar 2, 2021 3:41:11 GMT

Genotyping

We applied PanGenie (42), a method designed to leverage a panel of assembly-based reference haplotypes threaded through a graph representation of genetic variation that takes advantage of the linkage disequilibrium inherent in the phased genomes. We initially performed this genotyping step using a reference set of 15.5M SNVs, 1.03M indels (1-49 bp), and 96.1k SVs (where there was <20% allelic dropout; fig. S1, table S29) and genotyped these variants into the 1000GP WGS dataset (18) observing expected patterns of diversity (15) (Fig. 5A, figs. S27, S28)

Fig. 5 SV genotyping and eQTL analysis.

(A) Distribution of heterozygous SV counts per diploid genome broken down by population, based on PanGenie genotypes passing strict filters. (B) Concordance of allele frequency (AF) estimates from the assembly-based PAV discovery callset and AF estimates from genotyping unrelated Illumina genomes (n=2,504) with PanGenie (strict genotype set of 24,107 SVs); marginal histograms are in linear scale. (C) Count of short- and long-read SVs across variant class, size distribution, and genomic sequence localization. Blue bars represent the proportion of SVs genotyped by PanGenie with AF>0 and green stripes represent concordant SVs between technologies. SD: segmental duplications; SR: simple repeats; RM: repeat masked (not SD or SR); US: unique sequence. (D) Length distribution of common SVs sites (AF>5%) represented in assembly-based callset, including variants genotyped using PanGenie and all common variants from population-scale studies from the Genome Aggregation Database (gnomAD-SV) and the Centers for Common Disease Genetics (CCDG; insertions from CCDG omitted due to lack of data). Length distributions for all variants (not restricted to common) are provided in fig. S23. (E-G) Examples of lead SV-eQTLs (large symbols) in context of their respective genes, overlapping regulatory annotation, and other variants (small symbols). (E) An 89 bp insertion (chr10-133415975-INS-89) is linked to decreased expression of MTGI (q-value = 4.10e-11, Beta = -0.55 [-0.51 — -0.59]). (F) A 186 bp insertion (chr5-50299995-INS-186), overlapping an ENCODE enhancer mark (orange), is the lead variant associated with decreased expression of EMB (q-value = 2.92e-06, Beta = -0.44 [-0.39 — -0.49]). (G) A 1,069 bp deletion (chr21-14088468-DEL-1069) downstream of LIPI is linked to increased expression of LIPI (q-value = 0.0022, Beta = 0.44 [0.38 — 0.50]).

As one measure of genotyping quality, we compare the allele frequencies derived from assembly-based PAV calls across the 64 reference haplotypes to short-read-based allele frequencies obtained from PanGenie for the 2,504 unrelated individuals. From the raw output of PanGenie, we observe an allele frequency correlation (Pearson’s) of 0.98 for SNVs, 0.95 for indels, and 0.85 for SVs. To further improve SV genotyping, we filter the variants by assessing Mendelian consistency, the ability to detect the non-reference allele, genotype qualities, and concordance to assembly-based calls in a leave-out-one experiment into account (18). Using these criteria, we define a subset of strict and lenient SVs for genotyping containing 24,107 SVs (25%) and 50,340 SVs (52%), respectively, with excellent allele frequency correlation of 0.99 (strict, Fig. 5B) and 0.95 (lenient, fig. S29). Performance metrics for deletions and insertions are comparable (strict set: SV deletions, r=0.98; SV insertions, r=0.99; Fig. 5B), highlighting the value of sequence-resolved insertion alleles being part of our reference panel, as well as the algorithm’s ability to leverage it (fig. S30). Beyond SVs, 12,283,650 SNVs (79%) and 705,893 indels (68%) met strict filter criteria (note: given this larger fraction, we did not define a lenient set for these variant classes).
Added value from graph-based genotyping into short read WGS data

To determine the value added by PanGenie genotyping, we next focused on an integrated comparison of long-read SV discovery (PAV), state-of-the-art short-read SV discovery, and the set of genotypable SVs by PanGenie. Consistent with our previous analyses (43), we observed that most SVs specific to long-read discovery localized to highly repetitive sequences, which collectively harbored 95.8% of long-read-specific deletions, and 85.7% of long-read-specific insertions (table S30). We also discovered variation that was uniquely detected (although not sequence-resolved) and genotyped by sequencing read-depth from short reads. On average, there are 167 large CNVs (>5 kbp) per sample – 88.2% of which are not captured by long-read assemblies (Fig. 5C, figs. S11, S31). A large fraction of these calls maps to large repetitive regions such as segmental duplications that are not fully sequence-resolved. Remarkably, we find that 42.5% (strict) and 59.9% (lenient) of PanGenie-genotypable SVs are absent from the short-read callset. We examined the distribution of common long-read SVs genotyped at >5% AF across all the 3,202 Illumina genomes against the short-read SVs from large population studies, including the Centers for Common Disease Genomics (CCDG, (6)) and Genome Aggregation Database (gnomAD, (5)) (Fig. 5D, fig. S12). The ability to genotype variation typically not detected in Illumina callsets is reflected in increased numbers of common SVs (AF>5%), particularly deletions below 250 bp and insertions under 1 kbp, genotyped by PanGenie but not seen in CCDG and gnomAD-SV, while also emphasizing the overall value of large-scale short-read datasets to capture rare variation and large CNVs in the population (fig. S31).

Admin
Administrator

Posts: 72,791

Genomic Data Reveal a Complex Making of Humans Mar 2, 2021 22:02:22 GMT

Quote

Post by Admin on Mar 2, 2021 22:02:22 GMT

QTL analyses

We applied PanGenie genotypes (strict set) to systematically discover quantitative trait loci (eQTL) associated with structural variation. First, we performed deep RNA-seq (>200M fragments) of the corresponding 34 lymphoblastoid cell lines and integrated these data with 397 transcriptomes of 1000GP samples from GEUVADIS (44). We pursued cis expression quantitative trait loci (eQTL) and cis splicing quantitative trait loci (sQTL) mapping across the merged set of 427 donors, using a window of 1 Mbp centered around the gene or splice cluster, respectively, testing all variants with a minor allele frequency of ≥1% and at Hardy-Weinberg equilibrium (HWE exact test p-value ≥ 0.0001). We considered 23,953 expressed genes (15,504 of which were protein-coding) and 36,100 splicing clusters (linked to 11,278 genes).

Using this design, we identify 58,152 indel-eQTLs (linked to 6,748 unique genes) and 2,109 SV-eQTLs (linked to 1,526 unique genes; table S31) at an FDR of 5%. The set includes 819 lead indel-eQTLs and 38 lead SV-eQTLs at distinct genes, respectively (table S31). In the sQTL analysis we identified 3,382 SV-sQTLs (FDR 5%, linked to 758 unique genes; table S32) of which 65 SV-sQTLs at distinct genes were the lead association at the locus (18). In line with prior studies (23, 45), the lead variants are enriched for SVs (Fisher’s exact eQTL p-value = 1.0e-6, OR = 1.2; sQTL p-value = 1.6e-4, OR = 1.2) as well as smaller indels (Fisher’s exact eQTL: p-value = 8.8e-113, OR = 1.2; sQTL: p-value = 3.5e-72, OR = 1.2), whereas they are depleted for SNVs (Fisher’s exact eQTL p-value = 1.8-e118, OR = 0.84; sQTL: p-value = 1.2e-75, OR = 0.84). Among SVs, deletions show the greatest effect when compared to insertion events (table S33, (18)).

We overlapped lead SV-eQTLs with our Illumina-based discovery callset (18) and a recent large-scale SV study of 17,795 genomes (6) and find that 42% (16 out of 38 SVs) of the lead eQTL associations reported here are novel. Of these previously inaccessible SVs, 12 (75%) correspond to insertions (2 Alu MEIs, 3 tandem duplications, and 7 repeat expansions)—SV classes typically under-ascertained in short-read datasets (1). For example, one of our top novel lead SVs is an 89 bp VNTR insertion in the terminal intron of the mitochondrial ribosome-associated GTPase 1 gene (MTG1; Fig. 5E) and is seen in conjunction with decreased expression. Similarly, we identify a 186 bp insertion in an ENCODE enhancer for B cell lymphomas, which is associated with reduced expression of the immunoglobulin superfamily gene embigin (EMB; Fig. 5F). In contrast, we sequence resolve a 1,069 bp deletion located in an SD region downstream of the Lipase I gene (LIPI; Fig. 5G) and find that it is associated with increased gene expression of LIPI. Single-nucleotide polymorphisms at this locus have been linked to heart rate in patients with heart failure with reduced ejection fraction in a previous genome-wide association study (GWAS, p-value 9.0e-06 reported in (46)).

Admin
Administrator

Posts: 72,791

Genomic Data Reveal a Complex Making of Humans Mar 3, 2021 5:13:25 GMT

Quote

Post by Admin on Mar 3, 2021 5:13:25 GMT

Ancestry and population genetic analyses

The availability of haplotype-phased assemblies provides an opportunity to explore the ancestry and population genetic properties of the genomes and SVs at multiple levels. We applied a machine-learning method (47) and developed a hidden Markov model to identify ancestry-informative SNVs and to assign ancestral segments per block based on population genetic data from the Simons Genome Diversity Project (SGDP, (48)) (18). The two methods, as well as the different sequencing platforms, produce highly concordant results (>90%, fig. S32). At the family level, we can accurately assign paternal and maternal haplotypes and distinguish recombination crossover events in the child compared to parental haplotypes (Fig. 6A).

Fig. 6 Ancestry and population differentiation inferences using haplotype-phased diploid assemblies.

(A) Inferred local ancestries (18) for maternal (upper) and paternal (bottom) haplotypes of HG00733 are compared to parental haplotypes (maternal: HG00732, paternal: HG00731). Ancestral segments are colored (African: yellow, Native American: red, and European: blue) and are consistent with the recent demographic history of the island (18). HG0733 SVs (≥50 bp; insertion: green, deletion: purple), inferred recombination breakpoints (triangles), and transmission of recombinant parental haplotypes (dashed lines) are shown. (B) Length distribution (log10) of ancestry tracts among the 64 genomes assigned to five superpopulations shows evidence of recent (Admixed American) and more ancient (South Asian) admixture. (C) Top population-specific Fst variants (dark color) and top superpopulation-specific Fst variants (light color). The number of stratified SVs differs by orders of magnitude depending on population. (D) Top SV PBS (population branch statistic) values within 5 kbp of genes identify SV candidates for selection and disease. A high PBS statistic suggests AF differences among populations are a result of selection.

At the population level, on average 87.2% of the assembled sequence can be assigned ancestry. 1000GP samples originating from the African continent show the largest tracts of uniform ancestry (mean length = 23.6 cM, Fig. 6B, fig. S33) in contrast to North and South American populations (mean length=2.65 cM, Fig. 6B, fig. S33) and South Asians (mean length=4.38 cM, Fig. 6B), consistent with recent and more ancient admixture. For example, the African American, African Caribbean, and Admixed American 1000GP samples show the greatest diversity of ancestral segments (Fig. 6B, figs. S33, S34) most likely as a result of the transatlantic slave trade and colonial era migration (49).

Focusing on our more comprehensive genotyping of SVs into WGS data, we searched for population-stratified variants since these are potential candidates for local adaptation (50, 51) that could not have been characterized in the original study of 1000GP populations (15). Using Fst as a metric, we find that the number of such population-stratified variants varies widely among different groups likely as a consequence of ancestral diversity (Africans), population bottlenecks (East Asians), and admixture (South Asians) (Fig. 6C). Restricting our analysis to SVs located within 5 kbp of genes and applying population branch statistics (PBS) (51), we identify 117 stratified SVs (PBS >3 s.d., tables S34, S35) and further characterize these by the number of base pairs deleted or inserted per locus (Fig. 6D). The greatest outlier is a 4.0 kbp insertion within the first intron of LCT (lactase gene) originally reported based on fosmid sequencing from European samples (52). We determine that the corresponding insertion is ancestral (i.e., the human reference genome carries the derived deleted allele), the insertion harbors 11 predicted transcription factor binding sites, and the deletion likely occurred as a result of an Alu-mediated NAHR event ~520,000 years ago (fig. S35).

LCT variation is one of the most well-known genes under adaptive evolution among Europeans. Notably, the reported causal, derived allele of lactase persistence in Europeans (-13910*T; rs4988235) is in complete linkage disequilibrium (D′=1) with the reference allele of this SV, and it will be interesting to determine the functional roles of these two mutations in lactase persistence (53). In other cases, the population-stratified variants are nested among known regulatory elements or intersect them directly, such as a 76 bp tandem repeat expansion in a PLEC intron, a cytoskeleton component, seen only in Africans (AF=0.82) and Admixed Americans (AF=0.06). Similarly, we identify a 2.8 kbp insertion mapping near potential repressor-binding sites in a CLEC16A intron, a gene associated with type 1 diabetes when disrupted (54). This variant shows a high frequency in American populations (AF=0.28), with the highest PBS signal among Peruvians (AF=0.39), but is rarely observed in other populations (AF≤0.04). Further studies are needed to confirm functional effect; however, it is interesting to note that type 1 diabetes in Peruvians is among the highest in the world (55).

Admin
Administrator

Posts: 72,791

Genomic Data Reveal a Complex Making of Humans Mar 3, 2021 6:39:21 GMT

Quote

Post by Admin on Mar 3, 2021 6:39:21 GMT

Discussion

We have generated a diversity panel of phased long-read human genome assemblies that has significantly improved SV discovery and will serve as the basis to construct new population-specific references. Previous large-scale efforts have largely been inferential and biased when it comes to the detection of SVs. Here, we develop a method to discover all forms of genetic variation (PAV) directly by comparison of assembled human genomes. In contrast, SV discovery from the 1000GP was indirect and limited given the frequent proximity of SVs to repeat sequences inaccessible to short reads (15, 23). The 1000GP, for example, reported 69,000 SVs based on the analysis of 2,504 short-read sequenced genomes. In contrast, our analysis of 32 genomes (64 unrelated haplotypes) recovers 107,136 SVs, more than tripling the rate of discovery when compared to short-read Illumina SV analyses on the same samples (Fig. 2D). Recent large-scale short-read sequencing studies (5, 6), interrogating tens of thousands of samples, show even lower SV sensitivity reporting 5,000 to 10,000 SVs per sample, when compared to our phased-assembly approach, which identifies 23,000 to 28,000 SVs per sample. This lack of sensitivity for SV discovery from short reads also affects common variation (AF>5%) and we increase the amount of common SVs by 2.6-fold. The predominant source of this increase in sensitivity was among small SVs (<250 bp) localized to SDs and simple repeat sequences, where we observed a dramatic 8.4-fold increase in variant discovery (12,109 SVs per genome from long-read assembly, 1,444 per genome from Illumina short-read alignment; Fig. 5C). Notably, all discovered genetic variation is physically phased and therefore SVs are fully integrated with their flanking SNVs.

Compared to previous reports based on short-read sequencing (25–27), a surprising finding has been the larger fraction of SVs (63%) now assigned to homology-based (>50 bp) mutation mechanisms, including HDR, NAHR and VNTR. Breakpoint characterization with short-read data apparently biased early reports toward relatively unique regions concluding that <30% of SVs were driven by homology-based mutational mechanisms (25–27). Since a majority of unresolved structural variation still maps to large repeats, including centromeres and SDs subject to NAHR, we conclude that homology-based mutational mechanisms will contribute even further and are, therefore, the most predominant mode shaping the SV germline mutational landscape. Notwithstanding, access to fully assembled retrotransposons and their flanking sequence provides the largest collection of annotated source elements for both L1 and SVA mobile elements. We find that 14% of SVA insertions are associated with transductions compared to 8% of L1s—a difference driven in part by the proclivity of SVAs to transduce sequences at their 5ʹ and 3ʹ ends. We find a surprisingly large number of L1 source elements (19%) with defective ORFs suggesting either trans-complementation (56) or polymorphisms leading to the recent demise of these active source elements. Of note, some of the youngest L1 copies (e.g., 6p22.1-1 and 2q24.1) have been reported to be rare polymorphisms able to mediate massive bursts of somatic retrotransposition in cancer genomes (57). This suggests that recently acquired hot L1s, which have not yet reached an equilibrium with our species, contribute disproportionately to disease-causing variation (58).

Genome-wide QTL scans can bridge the gap between molecular and clinical phenotypes and serve as a proxy for functional effects mediated by genetic variant classes (23, 44, 59). Taking advantage of the fully phased sequence-resolved genetic variation, we demonstrate this by applying PanGenie, a new pangenome-based genotyping method, to 3,202 1000GP genomes, resulting in reliable genotype calls for 705,893 indels and up to 50,340 SVs (lenient genotype set). Of these, 59.9% are presently missed in multi-algorithm short-read discovery callsets and the majority (68.2%) of these novel SVs are insertions. Our work, thus, provides a framework for the discovery of eQTLs and disease-associated variants with the potential to discriminate among SNVs, indels, and SVs as the most likely causal variants (lead variants) associated with human genetic traits. The fact that 31.9% of SV-eQTLs and 48% of lead SV-eQTLs are rendered accessible to short reads only through the availability of our panel of haplotype-resolved assemblies testifies to the importance of this resource for future GWAS. Once again, among the lead SV-eQTLs, 75% are insertions, although there are also promising deletion eQTLs. For example, we identify a 1,069 bp deletion eQTL near LIPI, a GWAS disease locus for cardiac failure (46). Indeed, Summary-data-based Mendelian Randomization analysis (SMR, (60)) suggests that this SV-eQTLs of LIPI may be driving this association (SMR p-value adj.: 5.6e-4).

Haplotype-resolved SVs with accurate genotypes will also facilitate evolutionary and population genetic studies of SVs, including estimations of the rates of recurrent mutation, population stratification, and selective sweeps. As part of this analysis, we identify 117 loci associated with genes where allele frequencies differ radically between populations and are candidates for local adaptation (50, 51). Ancestral reconstructions of haplotype-resolved SVs can be further extended to identify introgressed SVs from Neanderthals and Denisovans (61). While archaic SNV haplotypes have been identified in modern-day humans, little is known regarding SV content given the degraded nature of ancient DNA. Combined with coalescent estimates of evolutionary age, it should now be possible to systematically identify associated introgressed SVs and assess them for signatures of adaptive evolution as was recently demonstrated (62). Even though we estimate that 96% of SVs with an allele frequency above 2% have been theoretically discovered (63), a greater diversity of human genomes are required to adequately account for population differences, effects of selection, as well as archaic introgression. Our findings clearly indicate that genomes of African ancestry represent the deepest reservoir of untapped structural variation. Ongoing efforts from the HGSVC, All of Us, and the Human Pangenome Reference Consortium (HPRC, exploring the normal pattern of structural variation using long-read sequences over the next few years will be critical to better understand human genetic variation.

Currently, our understanding of the full spectrum of structural variation is not yet complete, despite the advances presented here. There are two important limitations. First, comparison with optical mapping data identifies hundreds of gene-rich regions near and within SDs harboring more complex forms of SVs that are still not fully resolved by long-read assembly. The remaining gaps in human genomes cluster and a subset represent complex SV differences between human haplotypes. Second, only ~50% of our long-read discovery set of SVs can, at present, be reliably genotyped in short-read data using PanGenie. Expanding the number of assembly-based haplotypes available as pangenomic reference will likely mitigate this, but multiallelic VNTRs/STRs as well as SVs embedded in larger repeats such as SDs and centromeres are particularly problematic and novel methods are needed to characterize these. Recent advances coupling both HiFi and ultra-long-read Oxford Nanopore data show promise in resolving the sequence of these more complex regions from both haploid (64) and diploid human genome assemblies (65). Once a larger number of such complex regions are haplotype resolved across diversity panels of human genomes—and algorithms continue to evolve to exploit this information—we expect larger portions (fig. S36) of the human genome to become amenable to genotyping and association with human traits.
Methods summary

Libraries were prepared from high-molecular-weight DNA from lymphoblast lines (Coriell Institute). Long-read CLR and HiFi sequencing data (25-50X) were generated on the Sequel II platform (Pacific Biosciences) using 15-hour (CLR) or 30-hour (HiFi) movie times. Strand-seq data were produced from the same samples and used to identify and phase heterozygous SNVs (LongShot (66) and DeepVariant (67)) from the squashed genome assemblies (Peregrine (68) or Flye (69)). StrandphaseR (70), SaaRclust (71) and WhatsHap (72, 73) partitioned long reads into haplotypes to generate phased genome assemblies (PGAS (3)). MAPQ60 phased assembly contig coverage is estimated for autosomes (chr 1-22) and the X chromosome to balance male and female comparisons, excluding regions of heterochromatin (Giemsa pos./var. staining) and unresolved reference sequence (N-gaps). We generated optical maps for 30 of the 32 samples based on DLE1 digestion (Bionano Genomics).

PAV was used to characterize SNVs, indels, and SVs compared to the human reference GRCh38. Inversions were detected using Strand-seq (1, 9, 38), optical mapping data (Bionano Solve v3.5) and PAV (88), which detects inversion signatures using a novel k-mer density approach to identify inner and outer breakpoints of flanking repeats without relying on alignment truncation. The diploid callset is created by merging two independent haploid callsets. We removed variants in collapses by SDA (74) and misaligned contig clusters, then merged variants from all samples to create a nonredundant callset that was subsequently filtered by additional support (18). SVs required support from at least one of seven other sources, including read-based callers (MELT, PBSV, PALMER) (33, 75), optical mapping data, breakpoint k-mer analysis, and PAV replication with LRA (76). Indels required support from at least two of four sources and SNVs required support from at least two of five sources. MEIs were primarily discovered using PAV which were then annotated using MEIGA-PAV (89). In addition, Illumina and PacBio alignments were processed using MELT and PALMER, respectively, in order to increase sensitivity for MEI discovery. Finally, MEI calls across different platforms were merged into an integrated callset.

We estimated functional element depletion for SVs by simulation permuting SVs within their 1 Mbp bin 100,000 times and recording functional element hits for insertions and deletions for each functional category (CDS, 5ʹ UTR, 3ʹ UTR, promoter, proximal enhancer, distal enhancer, CTCF, and intron). SV hotspots were defined by searching for regions of increased SV density using kernel density estimation implemented with the ‘hotspotter’ function from the primatR package (38, 77). Illumina WGS short reads (250 bp paired end) were generated (34.5-fold) (18) from 1000GP samples (2,504 unrelated individuals and additional samples from children to form 602 trios). SVs were called from an ensemble of three methods: GATK-SV (5), SVTools (6) and Absinthe (88) and detailed comparisons between long- and short-read data were performed for the 31 matched samples (18).

We genotyped all 3,202 genomes using PanGenie (42), which determines k-mer abundances from an input set of unaligned short reads and infers the genotypes of this short-read sample at all loci represented in the reference set. The method exploits both the linkage disequilibrium structure inherent to the reference haplotypes and the sequence resolution they provide and, hence, makes full use of the haplotype resource provided. RNA-seq data QC was conducted with Trim Galore! (78) and mapped to the reference genome using STAR (79), followed by gene-level quantification using FeatureCounts (80) and quantification of splice events using leafCutter (81). We mapped the effect of genetic variation on both expression levels and splicing ratios using a QTL mapping pipeline based on a linear mixed model implemented in LIMIX (82–84). We combined our QTL statistics with published GWAS results to assess the link among genetic variation, GWAS traits, and either gene expression or splicing ratios using SMR (60). To identify population-stratified SVs in the 26 populations, we computed the FST-based PBS (18). For each focal population, we constructed population triplets by choosing sister- and out-groups inside and outside the continent where the focal population resides, respectively. For each focal population, we selected the maximum PBS per gene for all possible PBS triplets and selected the subset that are at least three standard deviations (Z transformation) beyond the PBS mean as potential targets of selection. Detailed descriptions of materials and methods are available in the supplementary materials (18).

Supplementary Materials

science.sciencemag.org/cgi/content/full/science.abf7117/DC1

Admin
Administrator

Posts: 72,791

Genomic Data Reveal a Complex Making of Humans Jul 13, 2021 20:54:14 GMT

Quote

Post by Admin on Jul 13, 2021 20:54:14 GMT

Genome-scale sequencing and analysis of human, wolf and bison DNA from 25,000 year-old sediment
Pere Gelabert, Susanna Sawyer, Anders Bergström, Thomas C. Collin, Tengiz Meshveliani, Anna Belfer-Cohen, David Lordkipanidze, Nino Jakeli, Zinovi Matskevich, Guy Bar-Oz, Daniel M. Fernandes, Olivia Cheronet, Kadir T. Özdoğan, Victoria Oberreiter, Robin N. M. Feeney, Mareike C. Stahlschmidt, Pontus Skoglund, Ron Pinhasi

doi: doi.org/10.1101/2021.01.08.425895

Summary
Archaeological sediments have been shown to preserve ancient DNA, but so far have not yielded genome-scale information of the magnitude of skeletal remains. We retrieved and analysed human and mammalian low-coverage nuclear and high-coverage mitochondrial genomes from Upper Palaeolithic sediments from Satsurblia cave, western Georgia, dated to 25,000 years ago. First, a human female genome with substantial basal Eurasian ancestry, which was an ancestry component of the majority of post-Ice Age people in the Near East, North Africa, and parts of Europe. Second, a wolf genome that is basal to extant Eurasian wolves and dogs and represents a previously unknown, likely extinct, Caucasian lineage that diverged from the ancestors of modern wolves and dogs before these diversified. Third, a bison genome that is basal to present-day populations, suggesting that population structure has been substantially reshaped since the Last Glacial Maximum. Our results provide new insights into the late Pleistocene genetic histories of these three species, and demonstrate that sediment DNA can be used not only for species identification, but also be a source of genome-wide ancestry information and genetic history.

Highlights

We demonstrate for the first time that genome sequencing from sediments is comparable to that of skeletal remains

A single Pleistocene sediment sample from the Caucasus yielded three low-coverage mammalian ancient genomes

We show that sediment ancient DNA can reveal important aspects of the human and faunal past

Evidence of an uncharacterized human lineage from the Caucasus before the Last Glacial Maximum

∼0.01-fold coverage wolf and bison genomes are both basal to present-day diversity, suggesting reshaping of population structure in both species

Introduction
Ancient genome sequencing from bone (Hagelberg et al., 1989), teeth (Höss et al., 1996) and hair (Gilbert et al., 2007) has revolutionised our understanding of natural history and the human past (Hofreiter et al., 2001; Racimo et al., 2020). When skeletal material is not available, sediment ancient DNA has sometimes been used to determine the presence or absence of different species. Several studies have demonstrated the presence of ancient DNA in sediments, initially with PCR-based methods (Willerslev et al., 2003) and more recently using high throughput sequencing techniques (Pedersen et al., 2016; Søe et al., 2018; Willerslev et al., 2014). Sediment DNA has also been used to track the presence of absence of species across a range of environments and time periods, primarily through targeted amplification or capture of single genetic regions (Pedersen et al., 2015). A ground-breaking study showed DNA preservation in clay-rich sediments for at least 240 ky. (Slon et al., 2017), and used targeted enrichment to recover sufficient numbers of fragments to reconstruct mtDNA phylogenies of Neanderthals and Denisovans. A recent study similarly recovered Denisovan mitochondrial DNA from sediments deposited ∼100 kya and ∼60 kya from Baishiya Karst Cave (BKC) on the Tibetan Plateau (Zhang et al., 2020).

However, the use of sediment DNA has thus far remained limited to species identification or mitochondrial capture. A major question, which we address here, is whether it can also be leveraged to provide genome-wide data informative about overall genetic ancestry and other biological properties of the organisms it derives from such as: sex, presence of multiple individuals or differnetial ancestry clusters. We report results from shotgun sequencing of a sediment sample from the Upper Paleolithic site of Satsurblia Cave, southern Caucasus, that dates to the Last Glacial Maximum (LGM, 25 thousands years ago [kya]). In most of the Caucasus and particularly in western Georgia, karst systems hold low and stable year-round temperatures and low acidity (no guano deposits in most systems), conditions which are particularly favourable to ancient DNA survival. A single sample yielded up to several million sequence reads from human, wolf (Canis lupus) and bison (Bison), corresponding to genome-wide data comparable to low-coverage sequencing obtained from skeletal remains.s