Figure 3:
Copy Number Expansions and Runaway Duplications. Red bar illustrates the location of the expansion. Additional examples are shown in Figure S11. A: Expansion in HPR in African and Middle Eastern samples. B: Expansions upstream of OR7D2 that are mostly restricted to East Asia. The observed expansions in Central & South Asian samples are all in Hazara samples, an admixed population carrying East Asian ancestry. C: Expansions within HCAR2, which are particularly common in the Kalash population. D: Expansions in SULT1A1, which are pronounced in Oceanians (median copy number, 4; all other non-African continental groups, 2; Africa, 3). E: Expansions in ORM1/ORM2. This expansion was reported previously in Europeans (Handsaker et al., 2015); however, we find it in all regional groups and particularly in Middle Eastern populations. F: Expansions in PRB4, which are restricted to Africa and to Central & South Asian samples with significant African admixture (Makrani and Sindhi).
We discover multiple expansions that are mostly restricted to African populations. The hunter-gatherer Biaka are notable for a private expansion downstream of TNFRSF1B that reaches up to 9 copies (Figure S11). We replicated the previously identified HPR expansions (Figure 3A), and find that they are present in almost all African populations in our study (Handsaker et al., 2015; Sudmant et al., 2015b). HPR encodes a haptoglobin-related protein associated with defense against trypanosome infections (Smith et al., 1995). The populations with the highest copy numbers are Central and West African, consistent with the geographic distribution of the infection (Franco et al., 2014). In contrast to previous studies, we also find the expansion at lower frequencies in all Middle Eastern populations, which we hypothesize is due to recent gene flow from African populations.
We identified a remarkable expansion upstream of the olfactory receptor OR7D2 that is almost entirely restricted to East Asia (Figure 3B), where it reaches up to 18 copies. Haplotype phasing demonstrates that many individuals carry the expansion on just one chromosome, illustrating that these alleles have mutated repeatedly on the same haplotype background. However, we identify a Han Chinese sample that has a particularly high copy number. This individual has nine copies on each chromosome, suggesting that the same expanded runaway haplotype is present twice in a single individual. This could potentially lead to an even further increase in copy number due to non-allelic homologous recombination (Handsaker et al., 2015).
We discovered expansions in HCAR2 (encoding HCA2) in Asians, which are especially prominent in the Kalash group (Figure 3C), with almost a third of the population displaying an increase in copy number. HCA2 is a receptor highly expressed on adipocytes and immune cells, and has been proposed as a potential therapeutic target due to its key role in mediating anti-inflammatory effects in multiple tissues and diseases (Offermanns, 2017). Another clinically-relevant expansion is in SULT1A1 (Figure 3D), which encodes a sulfotransferase involved in the metabolism of drugs and hormones (Hebbring et al., 2008). Although the copy number is polymorphic in all continental groups, the expansion is more pronounced in Oceanians.
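As an illustration of how per-group summaries such as the median copy numbers quoted for SULT1A1 could be derived from per-sample calls, the following is a minimal sketch. The function name, the input layout (pairs of regional group and integer copy number) and the toy values are assumptions made for illustration; they are not the pipeline or data used in this study.

```python
from collections import defaultdict
from statistics import median


def median_copy_number_by_group(calls):
    """Return the median copy number per regional group.

    `calls` is an iterable of (regional_group, copy_number) pairs.
    This layout is a hypothetical one chosen for the example.
    """
    by_group = defaultdict(list)
    for group, copy_number in calls:
        by_group[group].append(copy_number)
    return {group: median(values) for group, values in by_group.items()}


# Toy values for illustration only; NOT data from this study.
toy_calls = [
    ("GroupA", 2), ("GroupA", 2), ("GroupA", 3),
    ("GroupB", 4), ("GroupB", 4), ("GroupB", 5), ("GroupB", 3),
]
medians = median_copy_number_by_group(toy_calls)
```

The median is used rather than the mean because copy number calls are integers and a few samples with very high copy number would otherwise dominate the group summary.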
De novo assemblies and sequences missing from the reference
We sequenced 25 samples from 13 populations using linked-read sequencing at an average depth of ∼50x and generated de novo assemblies using the Supernova assembler (Weisenfeld et al., 2017) (Table S2). By comparing our assemblies to the GRCh38 reference, we identified 1631 breakpoint-resolved, unique, non-repetitive insertions across all chromosomes, which in aggregate account for 1.9 Mb of sequence missing from the reference (Figure 4A). A San individual contained the largest number of insertions, consistent with the high divergence of the San from other populations. However, we note that the number of identified insertions is correlated with the assembly size and quality (Figure S18), suggesting there are still additional insertions to be discovered.
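The relationship between assembly contiguity and the number of insertions found per sample can be quantified with a Pearson correlation coefficient. The sketch below shows one way to compute it; the function name and the toy inputs are assumptions for illustration and do not reproduce the values behind Figure S18.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values (contig N50 in kb, insertions per assembly);
# NOT measurements from this study.
toy_contig_n50_kb = [50, 80, 120, 200]
toy_insertion_counts = [400, 520, 610, 790]
r = pearson_r(toy_contig_n50_kb, toy_insertion_counts)
```

A coefficient close to 1 would indicate that more contiguous assemblies yield more insertion calls, which is the pattern that motivates the caveat about undiscovered insertions.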
Figure S17:
Top: IGV screenshot of a small deletion (63 bp) in ZNRF1 which is present at 34% frequency in Oceanian populations. Top track Altai Neanderthal, middle track Altai Denisova, bottom track Vindija Neanderthal. The deletion is present in all 3 archaic genomes. Bottom: Loupe screenshot of the region in HGDP00542 showing the two haplotypes resolved using 10x linked-reads, with one carrying the deletion.
Figure S18:
Correlation between Contig N50 and Number of identified NUIs (r = 0.91). Colours refer to the regional group of the samples.
Figure 4:
Non-Reference Unique Insertions (NUIs). A: Ideogram illustrating the density of identified NUI locations across different chromosomes using a window size of 1 Mb. Colours on chromosomes reflect chromosomal bands, with red for centromeres. B: Size distribution of NUIs using a bin size of 500 bp. C: PCA of NUI genotypes showing population structure (PC3-4). Earlier PCs (PC1-2) potentially reflect variation in size and quality of the assemblies.
We find that the majority of insertions are relatively small, with a median length of 513 bp (Figure 4B). Some are of potential functional consequence, as 10 appear to reside in exons. The affected genes are involved in diverse cellular processes, including immunity (NCF4) and regulation of glucose (FGF21), and include a potential tumour suppressor (MCC). Although many insertions are rare (41% are found in only one or two individuals), we observe that 290 are present in over half of the samples, suggesting the reference genome may harbour rare deletion alleles at these sites. These variants show population structure, with Central Africans and Oceanians showing the most differentiation (Figure 4C), reflecting the deep divergences within Africa and the effect of drift, isolation and possibly Denisovan introgression in Oceania. While the number of de novo assembled genomes using linked or long reads is increasing, they are mostly representative of urban populations. Here, we present a resource containing a diverse set of assemblies with no access or analysis restrictions.
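The summary statistics reported above (median insertion length, the fraction of insertions seen in only one or two individuals, and the number present in over half of the samples) can all be read off a table of insertion lengths and carrier lists. The following is a minimal sketch under that assumption; the record layout, field names and toy values are hypothetical and are not the format or data of this study.

```python
from statistics import median


def summarise_insertions(insertions, n_samples):
    """Summarise insertion lengths and carrier counts.

    `insertions` is a list of dicts with keys "length_bp" (int) and
    "carriers" (set of sample IDs); this layout is hypothetical.
    """
    if not insertions:
        raise ValueError("no insertions to summarise")
    carrier_counts = [len(ins["carriers"]) for ins in insertions]
    # Rare: found in only one or two individuals.
    n_rare = sum(1 for c in carrier_counts if 1 <= c <= 2)
    # Common: present in more than half of the samples.
    n_common = sum(1 for c in carrier_counts if c > n_samples / 2)
    return {
        "median_length_bp": median([ins["length_bp"] for ins in insertions]),
        "fraction_rare": n_rare / len(insertions),
        "n_common": n_common,
    }


# Toy records for illustration only; NOT data from this study.
toy_insertions = [
    {"length_bp": 300, "carriers": {"s1"}},
    {"length_bp": 500, "carriers": {"s1", "s2"}},
    {"length_bp": 700, "carriers": {"s1", "s2", "s3"}},
    {"length_bp": 900, "carriers": {"s1", "s2", "s3", "s4"}},
]
summary = summarise_insertions(toy_insertions, n_samples=4)
```

Insertions that fall in the common category are the ones for which the reference allele is plausibly the minor (deletion) allele.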
Discussion
In this study we present a comprehensive catalogue of structural variants from a diverse set of human populations. Our analysis illustrates that a substantial amount of variation, some of which reaches high frequency in certain populations, has not been documented in previous sequencing projects. The relatively large number of high-coverage genomes in each population allowed us to identify and estimate the frequency of population-specific variants, providing insights into potentially geographically-localized selection events, although further functional work is needed to elucidate their effect. Our finding of common clinically-relevant regionally private variants, some of which appear to be introgressed from archaic hominins, argues for further efforts generating genome sequences without data restrictions from under-represented populations. We note that despite the diversity found in the HGDP panel, considerable geographic gaps remain in Africa, the Americas and Australasia.
The use of short reads in this study restricts the discovery of complex structural variants, as demonstrated by recent reports that uncovered a substantially higher number of variants per individual using long-read or multi-platform technologies (Audano et al., 2019; Chaisson et al., 2019). Additionally, comparison with a mostly linear human reference formed from a composite of a few individuals, and mainly from just one person, limits the accurate representation and analysis of the diversity of human structural variation (Schneider et al., 2017). The identification of considerable amounts of sequence missing from the reference, in this study and others (Wong et al., 2018; Sherman et al., 2019), argues for the creation of a graph-based pan-genome that can integrate structural variation (Garrison et al., 2018). Such computational methods and further developments in long-range technologies will allow the full spectrum of human structural variation to be investigated.