Genomic Data Reveal a Complex Making of Humans

Admin
Administrator

Posts: 73,281

Genomic Data Reveal a Complex Making of Humans Feb 28, 2021 7:05:09 GMT

Post by Admin on Feb 28, 2021 7:05:09 GMT

64 Human Genomes Sequenced Will Serve as New Reference for Genetic Variation and Predisposition to Human Diseases

Researchers at the University of Maryland School of Medicine (UMSOM) co-authored a study, published today in the journal Science, that details the sequencing of 64 full human genomes. This reference data includes individuals from around the world and better captures the genetic diversity of the human species. Among other applications, the work will enable population-specific studies on genetic predispositions to human diseases as well as the discovery of more complex forms of genetic variation.

Twenty years ago this month, the International Human Genome Sequencing Consortium announced the first draft of the human genome reference sequence. The Human Genome Project, as it was called, required 11 years of work and involved more than 1000 scientists from 40 countries. This reference, however, did not represent a single individual, but instead was a composite of humans that could not accurately capture the complexity of human genetic variation.

Building on this, scientists have conducted several sequencing projects over the last 20 years to identify and catalog genetic differences between an individual and the reference genome. Those differences usually focused on small single base changes and missed larger genetic alterations. Current technologies now are beginning to detect and characterize larger differences – called structural variants – such as insertions of new genetic material. Structural variants are more likely than smaller genetic differences to interfere with gene function.

The new study in Science presents a significantly more comprehensive reference dataset, obtained using a combination of advanced sequencing and mapping technologies. The new reference dataset reflects 64 assembled human genomes, representing 25 different human populations from across the globe. Importantly, each of the genomes was assembled without guidance from the first human genome composite. As a result, the new dataset better captures genetic differences from different human populations.

“We’ve entered a new era in genomics where whole human genomes can be sequenced with exciting new technologies that provide more substantial and accurate reads of the DNA bases,” said study co-author Scott Devine, PhD, Associate Professor of Medicine at UMSOM and faculty member of IGS. “This is allowing researchers to study areas of the genome that previously were not accessible but are relevant to human traits and diseases.”

The Genome Resource Center (GRC) at the Institute for Genome Sciences (IGS) was one of three sequencing centers, along with Jackson Labs and the University of Washington, that generated the data using a new sequencing technology recently developed by Pacific Biosciences. The GRC was one of only five early access centers that were asked to test the new platform.

Dr. Devine helped to lead the sequencing efforts for this study and also led the sub-group of authors who discovered the presence of “mobile elements” (i.e., pieces of DNA that can move around and get inserted into other areas of the genome). Other members of the Institute for Genome Sciences (IGS) at the University of Maryland School of Medicine are among the 65 co-authors. Luke Tallon, PhD, Scientific Director of the Genomic Resource Center, worked with Dr. Devine to generate one of the first human genome sequences on the Pacific Biosciences platform that was contributed to this study. Nelson Chuang, a graduate student in Dr. Devine’s lab, also contributed to the project.

“The landmark new research demonstrates a giant step forward in our understanding of the underpinnings of genetically-driven health conditions,” said E. Albert Reece, MD, PhD, MBA, Executive Vice President for Medical Affairs, UM Baltimore, and the John Z. and Akiko K. Bowers Distinguished Professor and Dean, University of Maryland School of Medicine. “This advance will hopefully fuel future studies aimed at understanding the impact of human genome variation on human diseases.”

Reference: “Haplotype-resolved diverse human genomes and integrated analysis of structural variation” by Peter Ebert, Peter A. Audano, Qihui Zhu, Bernardo Rodriguez-Martin, David Porubsky, Marc Jan Bonder, Arvis Sulovari, Jana Ebler, Weichen Zhou, Rebecca Serra Mari, Feyza Yilmaz, Xuefang Zhao, PingHsun Hsieh, Joyce Lee, Sushant Kumar, Jiadong Lin, Tobias Rausch, Yu Chen, Jingwen Ren, Martin Santamarina, Wolfram Höps, Hufsah Ashraf, Nelson T. Chuang, Xiaofei Yang, Katherine M. Munson, Alexandra P. Lewis, Susan Fairley, Luke J. Tallon, Wayne E. Clarke, Anna O. Basile, Marta Byrska-Bishop, André Corvelo, Uday S. Evani, Tsung-Yu Lu, Mark J.P. Chaisson, Junjie Chen, Chong Li, Harrison Brand, Aaron M. Wenger, Maryam Ghareghani, William T. Harvey, Benjamin Raeder, Patrick Hasenfeld, Allison A. Regier, Haley J. Abel, Ira M. Hall, Paul Flicek, Oliver Stegle, Mark B. Gerstein, Jose M.C. Tubio, Zepeng Mu, Yang I. Li, Xinghua Shi, Alex R. Hastie, Kai Ye, Zechen Chong, Ashley D. Sanders, Michael C. Zody, Michael E. Talkowski, Ryan E. Mills, Scott E. Devine, Charles Lee, Jan O. Korbel, Tobias Marschall and Evan E. Eichler, 25 February 2021, Science.
DOI: 10.1126/science.abf7117

Post by Admin on Feb 28, 2021 21:13:57 GMT

Haplotype-resolved diverse human genomes and integrated analysis of structural variation

Science 25 Feb 2021:
eabf7117
DOI: 10.1126/science.abf7117

Abstract

Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent–child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average contig N50: 26 Mbp) integrate all forms of genetic variation even across complex loci. We identify 107,590 structural variants (SVs), of which 68% are not discovered by short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterize 130 of the most active mobile element source elements and find that 63% of all SVs arise by homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1,526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.

Introduction

Advances in long-read sequencing, coupled with orthogonal genome-wide mapping technologies, have made it possible to fully resolve and assemble both haplotypes of a human genome (1–3). While such phased human genome assemblies generally improve variant discovery compared to Illumina or “squashed” long-read genome assemblies (4), the largest gains in sensitivity have been among structural variants (SVs)—inversions, deletions, duplications, and insertions ≥50 bp in length. Typical Illumina-based discovery approaches identify only 5,000–10,000 SVs (1, 5, 6) in contrast to long-read genome analyses that now routinely detect >20,000 SVs (1, 3, 4, 7). Among the different classes of SVs, the greatest gains in sensitivity have been noted specifically for insertions where >85% of the variation has been reported as novel (1). In addition, repeat-mediated alterations within SV classes, such as variable number of tandem repeats (VNTRs) and short tandem repeats (STRs), have been challenging to delineate from short-read sequencing technologies and are underrepresented in the reference genome and often collapsed in unphased genome assemblies (8). The integration of long-read sequencing with new technologies such as single-cell template strand sequencing (Strand-seq) has further catalyzed the unambiguous confirmation of both heterozygous- and homozygous-inverted configurations in a genome (1, 9). Long-read phased genome assemblies (1) also better resolve larger full-length mobile element insertions (MEIs), providing an opportunity to systematically investigate their origins, distribution, and the mutational processes underlying their mobilization within more complex regions of the genome, including transductions (10, 11).

The Human Genome Structural Variation Consortium (HGSVC) recently developed a method for phased genome assembly that combines long-read PacBio whole-genome sequencing (WGS) and Strand-seq data to produce fully phased diploid genome assemblies without dependency on parent–child trio data (Fig. 1A) (3). These phased assemblies enable a more complete sequence-resolved representation of variation in human genomes.

Fig. 1 Trio-free phased diploid genome assembly using Strand-seq (PGAS).

(A) A schematic of the PGAS pipeline (3): (a) generation of a non-haplotype-resolved (“squashed”) long-read assembly; (b) clustering of assembled contigs into “chromosome” clusters based on Strand-seq Watson/Crick signal; (c) calling of single-nucleotide variants (SNVs) relative to the clustered squashed assembly; (d) integrative phasing combines local (SNV) and global (Strand-seq) haplotype information for chromosome-wide phasing; (e) tagging of input long reads by haplotype; (f) phased genome assembly based on haplotagged long reads and subsequent variant calling (18). (B) Genomic coverage (y-axis) as a function of the long-read length (x-axis). (C) Fraction of reads that can be assigned (“haplotagged”) to either haplotype 1 (semitransparent) or haplotype 2 for HiFi (hatched) and CLR (solid) datasets. (D) Contig-level N50 values for squashed (x-axis) and haploid assemblies (y-axis) for CLR (black diamonds) and HiFi (red circles) samples. (E) Haploid assembly QV estimates computed from unique and shared k-mers (x-axis) based on homozygous Illumina variant calls (y-axis). Samples colored according to the 1000GP population color scheme (15) with exception of the added Ashkenazim individual NA24385/HG002 (Coriell family ID 3140) (ASK/dark blue).

Here, we present a resource consisting of phased genome assemblies, corresponding to 70 haplotypes (64 unrelated and 6 children) from a diverse panel of human genomes. We focus specifically on the discovery of novel SVs performing extensive orthogonal validation using supporting technologies with the goal of comprehensively understanding SV complexity, including in regions that cannot yet be resolved by long-read sequencing (fig. S1). Further, we genotype these newly defined SVs using a pangenome graph framework (12–14) into a diversity panel of human genomes now deeply sequenced (>30-fold) with short-read data from the 1000 Genomes Project (1000GP) (15, 16). These findings allow us to establish their population frequency, identify ancestral haplotypes, and discover new associations with respect to gene expression, splicing, and candidate disease loci. The work provides fundamental new insights into the structure, variation, and mutation of the human genome providing a framework for more systematic analyses of thousands of human genomes going forward.

Post by Admin on Mar 1, 2021 2:51:54 GMT

Results
Sequencing and phased assembly of human genomes

We initially selected 34 unrelated individual genomes for de novo sequencing, with the goal of at least one representative from each of the 26 1000GP populations, of which 30 samples passed initial QC (tables S1 and S2). We additionally sequenced three previously studied child samples completing three parent–child trios, and we included for analysis publicly available sequencing data for two samples, NA12878 and HG002/NA24385, generated as part of the Genome in a Bottle effort (17). The complete set of 35 genomes includes 19 females and 16 males of African (AFR, n=11), Admixed American (AMR, n=5), East Asian (EAS, n=7), European (EUR, n=7) and South Asian (SAS, n=5; table S1) descent. All genomes were sequenced (Methods) using continuous long-read (CLR) sequencing (n=30) to an excess of 40-fold coverage or high-fidelity (HiFi) sequencing (n=12) to an excess of 20-fold coverage (Fig. 1B, table S1, (18)).

As a control for phasing and platform differences, we sequenced nine overlapping samples with both CLR as well as HiFi sequence data corresponding to the three parent–child trios (tables S1, S2) that had been studied for SVs previously by the HGSVC (1). For the purpose of phasing, we generated corresponding Strand-seq data (74-183 cells, fig. S2) for each of the samples. We used these data to successfully produce 70 (64 unrelated) phased and assembled human haplotypes (5.7 to 6.1 Gbp in length for the diploid sequence, table S1) using a reference-free assembly approach (Fig. 1A) (3), which works in the absence of parent–child trio information.

We find that the phased genomes are accurate at the base-pair level (QV > 40) and highly contiguous (contig N50 > 25 Mbp, Fig. 1C-E, table S1) with low switch error rates (median 0.12%, table S3), providing a diversity panel of physically resolved and fully phased single-nucleotide variant (SNV) and indel (insertion/deletion) haplotypes flanking sequence-resolved SVs (table S4). Using two different metrics from variant calling and k-mer content methods, respectively (Fig. 1E), we find that sequence accuracy is higher for human genome assemblies generated by HiFi (median QV = 54 [hom. var.] / 43 [k-mer], Fig. 1E) when compared to CLR (median QV = 48 [hom. var.] / 39 [k-mer], Fig. 1E) sequencing. Considering only accessible regions of the genome (18), the MAPQ60 contig coverage of HiFi and CLR genomes is similar (95.43% and 95.12%, table S5). CLR assemblies, however, are more contiguous (median contig N50 of 19.5 Mbp for HiFi vs. 28.6 Mbp for CLR; p-value <10e-9, t test). Fifteen of our assembled haplotypes exceed a contig N50 of 32 Mbp, all of which were based on CLR sequencing where insert libraries are much larger and sequence coverage is higher with half the number of single-molecule, real-time (SMRT) cells (Fig. 1D, fig. S3, table S6).
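For readers less familiar with the two assembly metrics quoted above, the following is a minimal Python sketch of how a contig N50 and a QV are usually computed. It assumes that QV is reported on the standard Phred scale (QV = -10 log10 of the per-base error probability), which the text does not state explicitly. The function names and the toy contig lengths are illustrative only and are not taken from the paper.

```python
import math


def n50(contig_lengths):
    """Return the contig N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


def qv_to_error_rate(qv):
    """Convert a Phred-scaled QV into a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability into a Phred-scaled QV."""
    return -10 * math.log10(error_rate)


# Toy contig lengths in Mbp (hypothetical, not from the paper).
toy_contigs = [40, 30, 20, 5, 5]
print(n50(toy_contigs))

# QV 40 corresponds to about one error per 10,000 bases on the Phred scale.
print(qv_to_error_rate(40))
```

On this scale, the threshold QV > 40 quoted above means fewer than one expected error per 10,000 assembled bases, and each additional 10 QV units lowers the error rate tenfold.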

Comparing Strand-seq phasing accuracy for six samples where parent–child trio data are available (table S3, figs. S4, S5; see Methods in (3)), we estimate on average 99.86% of all 1 Mbp segments are correctly phased from telomere-to-telomere (average switch error rate of 0.18% and Hamming distance of 0.21%, table S3). Predictably (3), remaining assembly gaps are enriched (18) in regions of segmental duplications (SDs) and acrocentric and centromeric regions of human chromosomes (figs. S6, S7, table S7). As a final QC of assembly quality, we analyzed Bionano Genomics optical mapping data for 32 genomes and found a median concordance of >97% between the optical map and the phased genome assemblies (figs. S8, S9, table S8).
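The two phasing metrics mentioned above can be illustrated with a short Python sketch. It uses one common definition of each: the switch error rate as the fraction of adjacent heterozygous-site pairs where the phase flips relative to a truth haplotype, and the Hamming rate as the fraction of sites assigned to the wrong haplotype after choosing the better of the two possible haplotype labelings. These definitions are an assumption here; the exact procedure used in the study is described in its Methods (3). The input strings are toy data.

```python
def switch_error_rate(truth, called):
    """Fraction of adjacent heterozygous-site pairs where the called phase
    flips relative to the truth haplotype."""
    if len(truth) != len(called):
        raise ValueError("haplotypes must cover the same sites")
    if len(truth) < 2:
        return 0.0
    # agree[i] is True when the called allele matches the truth at site i.
    agree = [t == c for t, c in zip(truth, called)]
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(agree) - 1)


def hamming_rate(truth, called):
    """Fraction of sites on the wrong haplotype, taking the better of the
    two possible haplotype labelings."""
    if len(truth) != len(called):
        raise ValueError("haplotypes must cover the same sites")
    mismatches = sum(t != c for t, c in zip(truth, called))
    n = len(truth)
    return min(mismatches, n - mismatches) / n


# Alleles of one haplotype at ten heterozygous sites (toy data).
truth = "0101010101"
called = "0101001010"  # a single phase switch after the fifth site

print(switch_error_rate(truth, called))
print(hamming_rate(truth, called))
```

In the toy example a single switch in the middle of the block gives a low switch error rate but a Hamming rate of 0.5, which is why the two metrics are reported together: the first captures local errors, the second captures how much of the chromosome ends up on the wrong haplotype.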

Post by Admin on Mar 1, 2021 4:44:47 GMT

Phased variant discovery

Unlike previous population surveys of structural variation (1, 4, 19–21), which mapped reads or unphased contigs to the human reference genome, we developed the Phased Assembly Variant (PAV) caller (88) to discover genetic variants on the basis of a direct comparison between the two sequence-assembled haplotypes and the human reference genome, GRCh38 (18). In the end, each human genome is rendered into two haplotype-resolved assemblies (each 2.9 Gbp) where all variants are physically linked (table S4). We classify variants as SNVs, indels (1-49 bp), and SVs (≥50 bp), which include copy number variants (CNVs) and balanced inversion polymorphisms. After filtering (18), our nonredundant callset of unrelated samples contains 107,590 insertion/deletion SVs, 316 inversions, 2.3 million indels, and 15.8 million SNVs.
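The size thresholds in the paragraph above can be written down as a small Python sketch. This is only an illustration of the size classes (SNV, indel of 1-49 bp, SV of 50 bp or more) and is not the PAV caller. The function name and its allele-string interface are hypothetical, and balanced inversions, which do not change sequence length, are deliberately left out.

```python
SV_MIN_BP = 50  # size threshold given in the text


def classify_variant(ref_allele, alt_allele):
    """Assign a variant to SNV, indel or SV using only its size."""
    if len(ref_allele) == len(alt_allele):
        if len(ref_allele) == 1 and ref_allele != alt_allele:
            return "SNV"
        # Equal-length events (e.g. inversions) are not handled in this sketch.
        return "other"
    size = abs(len(alt_allele) - len(ref_allele))
    kind = "insertion" if len(alt_allele) > len(ref_allele) else "deletion"
    label = "indel" if size < SV_MIN_BP else "SV"
    return f"{label} ({kind})"


print(classify_variant("A", "G"))
print(classify_variant("A", "A" + "T" * 49))  # 49 bp gained
print(classify_variant("A" + "C" * 50, "A"))  # 50 bp lost
```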

We observe a 2 bp periodicity for indels (dinucleotide repeats) and modes at 300 bp and 6 kbp for Alu and L1 MEIs, respectively (Fig. 2A), with only a small fraction intersecting functional elements (22) (Fig. 2B). PAV readily flags all reference-based artefacts or minor alleles by pinpointing regions where the 64 phased human genomes consistently differ from GRCh38 (1,573 SVs, 18,630 indels, and 91,537 SNVs, “shared variants”) (Fig. 2C, (18)). The greater haplotype diversity allows us to reclassify 50% of previously annotated shared SVs (4) as minor alleles and correct the coding sequence annotation of five genes with tandem repeats (RRBP1, ZNF676, MUC2, STOX1) or extreme GC content (SAMD1) (table S9). We estimate a false discovery rate (FDR) of 5-7% for SVs on the basis of support from sequence-read-based callers, as well as an independent alignment method (18). A comparison against SVs called from the benchmark Genome in a Bottle sample (HG002), including orthogonal datasets, suggests an FDR of ~4% although this estimate is restricted to a subset of the genome where events could be more reliably called (18).

Fig. 2 Variant discovery and distribution.

(A) Size distribution of indels and SVs from 64 unrelated reference genomes shows a 2 bp periodicity for indels, 300 bp peak for Alu insertions (second row), and 6 kbp peak for L1 MEIs. (B) The number of SVs intersecting functional elements (horizontal axis) compared to randomly permuting SV locations (box plots). Gray bars depict percent depletion (right axis scale). ELS: Enhancer-like signature. CTCF: CCCTC-binding factor. (C) Cumulative number of unique SVs when adding samples one-by-one, from left to right. The rate of SV discovery slows with each new haplotype (regression lines); however, the addition of haplotypes of African origin (dashed line) increases SV yield. Colors indicate SVs shared among all haplotypes and not present in GRCh38 (red), major allele variants (AF≥50%, purple), polymorphisms (≥2 haplotypes, blue) and singletons (teal). Asterisks indicate samples sequenced using PacBio HiFi. (D) Overlap between SVs detected by PacBio long-read assemblies and Illumina short-read alignments on 31 matched samples (NA24385, HG00514, HG00733 and NA19240 excluded). Top bar shows overall SV sites across 31 samples, while the bottom bar displays the average count of SVs per sample, with green stripes representing concordant SV calls between technologies. (E) Length distribution of SVs detected by PacBio long-read assemblies and Illumina short-read alignments across all 31 matched samples. (F) Genome-wide distribution of SV hotspots divided in three categories: last 5 Mbp of chromosomes (yellow), overlapping (light blue), and novel (red) when compared to short-read SV analysis of 1000GP (23). The total sequence length is represented by each hotspot category (inset). (G) Heatmap of seven selected SV haplotypes for 4 Mbp MHC region (chr6:28,510,120-33,480,577 dashed lines) comparing regions of high SNV (red) and low diversity (blue) regions based on the number of alternate SNVs compared to the reference (GRCh38; alignment bin size 10 kbp, step 1 kbp). Phased SV insertions (blue arrows) and deletions (red arrows) are mapped above each haplotype. The most diverse regions correspond to SV hotspots (red/blue bars top row) and cluster with HLA genes (red bottom track).

Similarly, we estimate a 6% FDR for indels and 4% for SNVs based on an assessment of Mendelian transmission error from the HiFi and CLR parent–child trios (table S10, (18)). We find that 42% of the SVs are novel when compared to recent long-read surveys of human genomes (1, 4, 19–21) (fig. S10). The addition of African samples more than doubles the rate of new variant discovery when compared to non-Africans for all classes of variation (2.21× SVs [809 vs. 366], 3.70× indels [11,514 vs. 3,109], and 2.97× SNVs [160,232 vs. 54,006] for the 64th haplotype; Fig. 2C, table S11, (18)). On average, we detect 24,653 SVs, 794,406 indels, and 3,895,274 SNVs per diploid human genome (table S4).
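The fold differences quoted for the 64th haplotype follow directly from the paired counts in the text. A minimal Python check, using only those counts (the variable and function names are illustrative):

```python
# (African, non-African) counts of new variants for the 64th haplotype,
# as quoted in the text above.
new_variants = {
    "SVs": (809, 366),
    "indels": (11_514, 3_109),
    "SNVs": (160_232, 54_006),
}


def fold_increase(african, non_african):
    """Ratio of new variants found when adding an African haplotype
    versus a non-African haplotype."""
    return african / non_african


for name, (african, non_african) in new_variants.items():
    print(f"{name}: {fold_increase(african, non_african):.2f}x")
```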

Post by Admin on Mar 1, 2021 20:09:07 GMT

Structural variant distribution and mechanisms

SVs are known to be clustered (4, 15) and we identify 278 SV hotspots on the basis of our PAV callset (Fig. 2F, fig. S13, table S14, (18)) spanning ~279 Mbp of the genome (Fig. 2F inset). We find that 30.6% (32,222/105,327) of SVs on autosomes and chromosome X map within the last 5 Mbp of chromosome arms, corresponding to a ~4-fold enrichment (p=0.001, z-score=301.3, permutation test), with few notable exceptions—the long arm of the X chromosome and the short arms of chromosomes 3 and 20 (Fig. 2F, fig. S14A). Focusing on SVs >5 Mbp from chromosome ends (73,105), we identify 221 hotspots (fig. S14B). Of these, 49% (109/221) have not been previously identified by short-read analyses of the 1000GP data (23). These interstitial hotspots are enriched 6.6-fold (p=0.001, z-score=26.6, permutation test) for SDs consistent with homologous recombination and frequently correspond to gene-rich regions of exceptional diversity among human populations. For example, we identify three distinct hotspots mapping to the major histocompatibility complex (MHC) region that distinguish seven selected structural haplotypes (Fig. 2G, fig. S15, table S15). Our analysis indicates that a majority (98.85%) of this 4 Mbp region has been sequence resolved at the base-pair level (29 of the assemblies are a single assembled contig and 18 have a single gap; 17/19 individual HLA genes are fully sequence resolved in all assemblies; tables S15, S16).
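The split between terminal and interstitial SVs can be reproduced from the counts above. The Python sketch below also shows one simple way to flag an SV as lying in the last 5 Mbp. It treats the two ends of a chromosome as the ends of its arms and uses 1-based positions, both simplifying assumptions, and the helper name and the example chromosome length are hypothetical. The ~4-fold enrichment itself comes from a permutation test that is not reproduced here.

```python
TERMINAL_WINDOW_BP = 5_000_000  # "last 5 Mbp" from the text


def in_terminal_window(position, chrom_length, window=TERMINAL_WINDOW_BP):
    """True if a 1-based position lies within `window` bp of either
    chromosome end (simplification: chromosome ends stand in for arm ends)."""
    return position <= window or position > chrom_length - window


# Counts quoted in the text (autosomes and chromosome X).
total_svs = 105_327
terminal_svs = 32_222
interstitial_svs = total_svs - terminal_svs

print(f"terminal fraction: {terminal_svs / total_svs:.1%}")
print(f"interstitial SVs: {interstitial_svs:,}")

# Hypothetical 100 Mbp chromosome.
print(in_terminal_window(1_000_000, 100_000_000))
print(in_terminal_window(50_000_000, 100_000_000))
```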

A detailed analysis of the SVs with unambiguous breakpoint locations provided an opportunity to examine mechanisms of SV formation. Excluding MEIs and SVs with ambiguous breakpoints, we assessed 52,974 insertions and 30,467 deletions (table S17). We find 58% of insertions and 70% of deletions, including SVs in VNTRs, are flanked by at least 50 bp of homologous sequence suggesting formation by homology-directed repair (HDR) processes or non-allelic homologous recombination (NAHR). Amongst those, 15% of insertions and 25% of deletions showed >200 bp flanking homology and are more likely mediated by NAHR. VNTRs with short repeat units (<50 bp) account for a smaller number of events (1.6% insertions and 0.4% deletions) and suggest replication slippage-mediated expansion and contraction. Additionally, 40% of insertions and 29% of deletions show blunt-ended breakpoints or microhomology (<50 bp flanking sequence identity), consistent with nonhomologous end joining, microhomology-mediated end joining, or microhomology-mediated break-induced replication (24). Homology-associated SVs are twofold more frequent than expected from reports using short reads (25–27), and when considering Illumina sequencing-based SV calls from the same samples, only 2% of insertions and 19% of deletions appear to be NAHR-mediated SVs with ≥200 bp flanking homology (p-value <2.2e-16; Fisher’s exact test; table S17).
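The breakpoint categories described above amount to thresholds on the length of flanking homology. The Python sketch below encodes those thresholds and combines the rounded insertion and deletion percentages into a single homology-associated fraction. It is an illustration only: the function name and category labels are hypothetical, the 200 bp boundary is taken as inclusive (the text uses both ">200 bp" and "≥200 bp"), and the separate category of VNTRs with short repeat units is not modeled.

```python
def classify_by_flanking_homology(homology_bp):
    """Map flanking homology length (bp) to the mechanism class suggested
    in the text. Thresholds: >=200 bp and >=50 bp."""
    if homology_bp >= 200:
        return "homology-mediated, NAHR more likely"
    if homology_bp >= 50:
        return "homology-mediated (HDR or NAHR)"
    return "blunt end or microhomology (NHEJ, MMEJ or MMBIR)"


# Counts and rounded percentages quoted in the text.
insertions, deletions = 52_974, 30_467
homology_fraction_ins, homology_fraction_del = 0.58, 0.70

homology_svs = (insertions * homology_fraction_ins
                + deletions * homology_fraction_del)
overall = homology_svs / (insertions + deletions)

print(classify_by_flanking_homology(350))
print(classify_by_flanking_homology(75))
print(classify_by_flanking_homology(3))
print(f"homology-associated fraction: {overall:.1%}")
```

Because the inputs are rounded percentages, the combined value of roughly 62% is only approximate; it is close to the 63% of SVs attributed to homology-mediated mechanisms in the abstract.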

SVs and their breakpoints are generally depleted within protein-coding sequences and other functional elements, with the exception of specific gene families where variability in the length of amino acid sequences relates to the function of the molecule: lipoproteins (e.g., LPA), mucins (MUC1, MUC3A, MUC4, MUC12, MUC20, MUC21), and zinc finger genes (ZNF99, ZNF285, ZNF280), among others (table S18). We find that 9.4% of all SV breakpoints intersect functional elements, such as exons (n=993), untranslated regions (UTRs; n=1,097), promoters (n=466), and enhancer-like elements (n=6,796) (Fig. 2B, table S19).

When we consider structural polymorphisms that arise from perfect triplet repeats, expansions outnumber contractions 3 to 1 (271 expansions, 88 contractions) consistent with such regions being systematically underrepresented in the original reference (8, 28). Over the 64 haplotypes, there are six such SVs per haplotype and we identify a total of 106 nonredundant loci (tables S20, S21). Of note, 5/7 of the largest uninterrupted CTG or CGG repeat insertions mapping within exons correspond to genes already associated with triplet repeat instability diseases or fragile sites. For example, we identify a 21-copy CTG repeat expansion in ATXN3 (Machado-Joseph disease), a 17-copy gain of CAG in HTT (Huntington’s disease), a 21-copy gain of a CGG repeat in ZNF713 (Fragile site 4A), and a 36-copy CGG gain in DIP2B (Fragile site 12A) (18). The discovery of these perfect repeat insertion alleles with respect to the human reference provides an important reference for future investigations of triplet repeat instability.
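A "perfect" or uninterrupted triplet repeat, as used above, is a run of identical three-base units with no intervening bases. A minimal Python sketch of how the longest such run could be counted in a sequence is given below. The function name and the toy sequence are hypothetical, and this is not the method used in the study.

```python
import re


def longest_triplet_run(sequence, unit):
    """Return the largest number of consecutive, uninterrupted copies of
    `unit` (e.g. "CAG") found anywhere in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit.upper())})+")
    runs = [len(match.group(0)) // len(unit)
            for match in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


# Toy sequence: five CAG copies, one interrupting CAA, then three more CAG.
toy = "AA" + "CAG" * 5 + "CAA" + "CAG" * 3 + "TT"
print(longest_triplet_run(toy, "CAG"))

# Ratio of expansions to contractions quoted in the text.
print(f"{271 / 88:.2f}")
```

The interrupting CAA splits the toy tract into two perfect runs, so only the longer run of five copies is reported.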

Mobile element insertions

On the basis of the phased genome assemblies, we identified a collection (n=9,453) of fully sequence-resolved non-reference MEIs, including 7,738 Alus, 1,175 L1Hs, and 540 SVAs (18) and used sequence content of the elements and their flanking sequences to provide insight into their origin and mechanisms of retrotransposition. Retroelement insertions typically display the classic hallmarks of integration via target-site primed reverse transcription. These include endonuclease cleavage motifs at insertion breakpoints, polyadenylate tracts at their 3ʹ end, target site duplications ranging from 3 to 52 bp (mode = 14 bp), in addition to frequent inversion and truncation for L1 elements (fig. S16). Full-length L1 (FL-L1) elements are an especially relevant source of genetic variation since they can mutagenize germline and somatic cells and can lead to gene disruptions that cause human disease (29, 30). While a minority of non-reference L1s are full length (fig. S16, table S22), we find that 78% of FL-L1s possess two intact open reading frames (ORF1 and ORF2), encoding the proteins that drive L1, Alu, SVA, and processed pseudogene mobilization. Indeed, 23% of these sequences show evidence of activity as they are part of a database of 198 FL-L1s known to be active in vitro (31, 32), in human populations (33), and in cancers (34–36). Most active copies (72%; 142/198) are either in our callset or present in the reference genome and are now fully sequence resolved (table S23). We note that 19% of the active FL-L1s have at least one ORF disrupted, which includes a hot element at 9q32 reported to be highly active in diverse tumors (34).
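One of the hallmarks listed above, the target site duplication (TSD), can be illustrated with a short Python sketch. In a haplotype that carries the insertion, the duplicated target sequence appears immediately before and immediately after the inserted element, so a simple approach is to look for the longest string that ends the left flank and begins the right flank. This is a simplified illustration that assumes exact matches and precisely known insertion boundaries. The function name and the toy flanks are hypothetical, and the 52 bp cap is the upper end of the range quoted in the text.

```python
def target_site_duplication(left_flank, right_flank, max_len=52):
    """Return the longest sequence that is both a suffix of the left flank
    and a prefix of the right flank (empty string if there is none)."""
    limit = min(len(left_flank), len(right_flank), max_len)
    for k in range(limit, 0, -1):
        if left_flank[-k:] == right_flank[:k]:
            return left_flank[-k:]
    return ""


# Toy example: a 14 bp TSD on both sides of a (not shown) inserted element.
tsd = "AAGAATTCCTGAGT"
left = "GGCTA" + tsd
right = tsd + "CCGTA"

found = target_site_duplication(left, right)
print(found, len(found))

# The MEI classes quoted in the text add up to the reported total.
print(7_738 + 1_175 + 540)
```

Real insertions often have imperfect TSDs or uncertain breakpoints, so a production caller would need to tolerate mismatches and shifted boundaries, which this sketch does not attempt.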

Using L1 Pan troglodytes as an outgroup, we construct a phylogeny (85) of active human L1s and estimate their age in millions of years (Myr) (Fig. 3A, fig. S17). As expected, copies of the Ta-1 subfamily are the youngest (mean = 1.00 [95% CI: 0.88-1.13]), followed by Ta-0 (mean = 1.63 [95% CI: 1.49-1.77]) and pre-Ta (mean = 2.15 [95% CI: 1.91-2.40]) (fig. S18). Notably, the evolutionary age correlates with L1 features such as subfamily, level of activity, and allele frequency (Fig. 3B, fig. S19), with the youngest FL-L1s typically corresponding to highly polymorphic and active Ta-1 sequences. Indeed, three out of the four youngest active FL-L1s, namely 2q24.1, 6p24.1 and 6p22.1-2, are Ta-1 copies reported to be extremely active in cancer genomes (34). In contrast, 1p12 is a fixed pre-Ta insertion that, despite integrating into the human genome approximately 1.8 Myr ago, remains highly active both in the germline (33) and somatically in association with tumors (34–36). This indicates that a small set of pre-Ta representatives possibly remain very active in the human genome.

Fig. 3 Mobile element insertions.

(A) Maximum-likelihood phylogenetic tree (85) for highly active sequence-resolved FL-L1s annotated by subfamily designation, presence/absence on the reference, ORF content, and hot activity profile (34–36) (bootstrap values ≥80% shown). Tree branch lengths are scaled according to the average number of substitutions per base position. Dashed lines map each L1 cytoband identifier to its corresponding branch on the tree. Pan troglodytes (L1Pt) is included as an outgroup. Heatmaps represent allele frequency (AF) based on the assembly discovery set, activity estimates based on in vitro assays (31, 32) and the number of transduction events detected in human populations (33) or cancer studies (34–36). (B) Enrichment and depletion in the number of FL-L1s belonging to the Ta-1 subfamily at age quartiles (Q1-Q4) compared with a random distribution. Same applies for the other features, including the number of FL-L1s with low allele frequency (MAF<5%), with two intact ORFs, or with evidence of activity. (C) Size distribution and number of 5ʹ and 3ʹ SVA-mediated transductions (td) based on the analysis of flanking sequences. (D) Schematic and circos representation for serial SVA-mediated transduction events. Dashed arrows indicate SVA transcription initiation and end. Transduced sequences are shown as colored boxes with their length proportional to transduction size. (E) Distributions of VNTR length (x-axis: the minimum, y-axis: the maximum) of reference and non-reference SVA elements. Reference SVAs are shown as blue dots and non-reference SVAs as red dots. The dot size represents the sample frequency of SVAs among discovery samples in the HGSVC.

SVA source elements are able to produce 5ʹ and 3ʹ transductions through alternative transcription start sites or bypassing of normal poly(A) sites during retrotransposition (10, 11). We detected 77 transduced non-repetitive DNA sequences at SVA insertion ends (table S24). Interestingly, 5ʹ transductions are more abundant (58%, 45/77) than 3ʹ transductions (Fig. 3C), as opposed to L1s, which primarily mediate 3ʹ transduction events (95%, 89/94). We used these unique transduced sequences to trace the origin of all 77 SVAs to 56 source SVA elements (fig. S20, table S25). A majority of source loci (84%) belong to the youngest human-specific SVA-E and SVA-F subfamilies (37), and only 11 source elements generate 38% of the offspring insertions.

SVA transductions can occasionally shuffle coding sequences as illustrated by the mobilization of a complete exon of HGSNAT by an intronic SVA in antisense orientation (fig. S21). In addition, one SVA source element appears to have caused three sequential mobilization events as indicated by nested transductions flanked by poly(A) tails (Fig. 3D, fig. S22). Finally, SVA elements harbor CpG-rich VNTRs in their interior regions that can expand and contract; we find that non-reference SVAs show significantly greater variability in VNTR copy number compared to those present in the reference (p-value < 10e-5, Student’s t test, two-sided, Fig. 3E).