Ethnic minority groups in China

Admin
Administrator

Posts: 72,993

Ethnic minority groups in China Jul 5, 2023 18:33:32 GMT

Post by Admin on Jul 5, 2023 18:33:32 GMT

Short-read mapping with the CPC reference
To evaluate how well the Giraffe mapper processes graph genomes of different complexities, we applied gradient filtering on the CPC reference according to the path depth of nodes (Methods). When a more stringent filter was applied, we observed a decrease in the number of nodes and edges in the graph reference (Supplementary Fig. 11a), the graph complexity (Supplementary Fig. 11b) and the diversity (Supplementary Fig. 11c). Next, ten samples from the East Asian population of the 1000 Genomes Project were aligned to these CPC references with different complexities through vg Giraffe. The results showed that the mapping rate increased and reached a peak value with a simplified version of the graph (Supplementary Fig. 12a), probably owing to the limitation of the current version of the Giraffe mapper in managing locally complex regions. We also observed that the proportion of reads with perfect matching continued to decline with the simplification of the graph (Supplementary Fig. 12b), reflecting the decline in the diversity of the graph (Supplementary Fig. 13). Therefore, there was a trade-off between mapping rate and mapping quality, and a compromise is needed to determine the size of the graph reference.
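
The gradient filtering described above can be illustrated with a minimal sketch. It assumes a toy graph in which each node carries a path depth (the number of haplotype paths that traverse it). The function name `filter_by_path_depth`, the toy data and the thresholds are all hypothetical, and this is not the procedure from the paper's Methods; a real graph filter would also need to keep the haplotype paths connected.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def filter_by_path_depth(
    depth: Dict[str, int], edges: List[Edge], min_depth: int
) -> Tuple[Set[str], List[Edge]]:
    """Keep nodes traversed by at least `min_depth` paths, plus edges between kept nodes."""
    kept = {node for node, d in depth.items() if d >= min_depth}
    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return kept, kept_edges


# Hypothetical toy graph: node -> number of haplotype paths through it.
toy_depth = {"n1": 10, "n2": 1, "n3": 4, "n4": 10, "n5": 2}
toy_edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4"), ("n4", "n5")]

# Increasingly stringent thresholds give progressively smaller graphs.
for threshold in (1, 2, 5):
    nodes, kept_edges = filter_by_path_depth(toy_depth, toy_edges, threshold)
    print(f"min_depth={threshold}: {len(nodes)} nodes, {len(kept_edges)} edges")
```

Raising the threshold removes rarely traversed nodes first, which is why both the node count and the diversity represented in the graph fall together.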

Compared with the HPRC graph reference, the CPC graph reference had fewer nodes and edges and lower diversity, probably because it includes only Chinese samples, whereas the HPRC covers both African and European samples (Supplementary Table 8). However, the CPC graph reference achieved better alignments than the HPRC graph reference when aligning the East Asian genomes (Supplementary Fig. 14). By contrast, the HPRC graph performed better in processing African samples (Supplementary Fig. 14). These results indicate that using population-specific graph references improves the alignment quality of short reads.

To carry out variant calling, we converted the alignments from the GAM file in graph reference coordinates to a BAM file in linear reference coordinates. The results showed that the mapping rate of all samples decreased by an average of 0.58% (0.54–0.61%). We speculate that the advantages of the graph reference are lost when a traditional linear reference is used to call or record variation, because the novel sequences in the graph reference are missing from the linear coordinate system.

Comparison with the HPRC pangenome graph
To investigate the previously unidentified components contributed by the East Asian populations in the CPC pangenome graph, we constructed a merged Minigraph-Cactus graph including all 116 assemblies in CPC and 94 assemblies in HPRC (ref. 1; Methods). We identified 5,850,863 (18.4%) small variants and 34,223 (17.1%) SVs that were found only in the CPC assemblies (Fig. 3c), of which each sample included 170,307 (s.d. = 10,904) small variants and each haplotype carried 543 (s.d. = 39) SVs, and more than half of the CPC-specific variants were singletons or doubletons (Fig. 3d). In both ‘easy’ and ‘difficult’ regions of the GRCh38 reference defined in GIAB 3.0 (ref. 30), approximately 39% of the CPC-specific small variants could not be annotated in gnomAD v0.1.8 (ref. 31; Supplementary Table 9), suggesting that the East Asian-specific small variants identified with long-read-based methods remain a potent supplement to the current short-read-based genetic resources. We found that 16,898 (49.4%) of the CPC-specific SVs overlapped the nearby regions (100 kb upstream and downstream of the gene coding regions) of 6,426 protein-coding genes, of which 4,344 genes were disrupted by SVs spanning more than 1 kb and had the most frequent functional enrichments related to immunological functions, such as humoral immune response (GO:0002455, OR = 5.11, BH-adjusted P = 8.50 × 10^−14; and GO:0006959, OR = 2.91, BH-adjusted P = 1.64 × 10^−11; Supplementary Table 10). These CPC-specific SVs also showed an overrepresentation of the laryngitis-related genes according to the disease ontology annotation (DOID:3437 and DOID:786, OR = 16.66, BH-adjusted P = 0.007).

Furthermore, we estimated the location distribution of CPC-specific SVs using a sliding-window-based analysis along the autosomes (Methods). Similar to HPRC-specific SVs and common SVs, most of the CPC-specific SVs were located at the centromeric and telomeric regions of chromosomes (Fig. 3e and Supplementary Fig. 15). We next applied a one-tailed Fisher’s exact test between the number of CPC-specific SVs and SVs that were also found in HPRC assemblies in different regions, and found 223 hotspots where CPC-specific SVs were significantly enriched compared with other SVs (FDR-adjusted P < 0.05), involving 807 protein-coding genes (Fig. 3e) overrepresenting biological functions such as oxygen transport (GO:0015671, OR = 22.66, BH-adjusted P = 0.008; and GO:0005344, OR = 24.91, BH-adjusted P = 0.001) and haemoglobin structure (GO:0031838, OR = 28.58, BH-adjusted P = 0.003; GO:0005833, OR = 24.21, BH-adjusted P = 0.003; and GO:0031720, OR = 33.15, BH-adjusted P = 0.002; Supplementary Table 11).
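
The hotspot test described above (a one-tailed Fisher's exact test per window, followed by FDR adjustment) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the window counts are invented, the 2x2 table layout is one reasonable reading of the text, and the Benjamini-Hochberg procedure is assumed as the FDR method. It is not the authors' pipeline.

```python
from math import comb
from typing import List


def fisher_one_tailed(a: int, b: int, c: int, d: int) -> float:
    """Upper-tail P value for the 2x2 table [[a, b], [c, d]].

    a: CPC-specific SVs inside the window, b: other SVs inside the window,
    c: CPC-specific SVs elsewhere,         d: other SVs elsewhere.
    """
    in_window, cpc_total, n = a + b, a + c, a + b + c + d
    upper = min(in_window, cpc_total)
    # Hypergeometric probability of seeing a or more CPC-specific SVs in the window.
    favourable = sum(
        comb(cpc_total, k) * comb(n - cpc_total, in_window - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n, in_window)


def benjamini_hochberg(pvals: List[float]) -> List[float]:
    """Return BH-adjusted P values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-window counts: (CPC-specific SVs, other SVs).
windows = [(12, 3), (2, 20), (5, 5)]
total_cpc = sum(w[0] for w in windows)
total_other = sum(w[1] for w in windows)
pvals = [fisher_one_tailed(a, b, total_cpc - a, total_other - b) for a, b in windows]
qvals = benjamini_hochberg(pvals)
hotspots = [i for i, q in enumerate(qvals) if q < 0.05]
print("hotspot windows:", hotspots)
```

Only windows whose adjusted P value falls below 0.05 are reported as hotspots, mirroring the threshold quoted in the text.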

Long-read sequencing technologies and pangenome graph-based analysis methods allow us to explore large and complex SVs that were previously difficult to locate in NGS data, thus providing the genetic basis for association studies of these complex loci with physiological function or disease. We found that some of the CPC-specific enriched SVs mentioned above were closely related to the prevalent diseases in East Asia. A remarkable example is the α-globin gene cluster located near the telomere of the short arm of chromosome 16, including five functional genes and two pseudogenes (ref. 32), 5′-zeta–pseudozeta–mu–pseudoalpha-1–alpha-2–alpha-1–theta-3′ (Fig. 4a). We identified six major haplotypes based on the copy number variations of α-globin genes (HBA1 or HBA2) and ζ-globin genes (HBZ or pseudogene HBZP1; Fig. 4b) from the pangenome graph (Supplementary Table 12). In addition to a deletion (Z2A1) and duplication (Z2A3) involving a copy number change of α-globin found in both CPC and HPRC, we also identified two CPC-specific large SVs: a 20-kb deletion (Z2A0) involving five globin genes and a 10-kb duplication (Z3A2 and Z3A3) involving ζ-globin genes (Fig. 4c). The long deletion in which both α-globins are lost has been widely reported as the Southeast Asian deletion (--SEA, A0 in our haplotype; ref. 33), and is mainly distributed in southern China and Southeast Asia. As previously reported (ref. 34), the heterozygous SEA deletion (A2/A0) as well as the loss of one copy of the α-globin gene (A2/A1) is phenotypically silent. The homozygous loss of one α-globin gene (A1/A1) leads to mild anaemia; losing three copies (A1/A0) leads to haemoglobin H disease, and homozygous SEA deletion leads to severe hydrops fetalis. The precise localization of the complex SVs on the α-globin gene cluster in the CPC pangenome graph could provide a potential reference for future anaemia-related studies.

Another example is the RASA4 gene located on chromosome 7 (Fig. 4d). Compared with the two copies in the reference genome, we discovered a high diversity of copy numbers in East Asian populations (Supplementary Table 13), including a six-copy variant that is not found in HPRC samples (Fig. 4e). CNVs of this gene have not yet been described. The aberrant expression of RAS p21 protein activator 4, encoded by RASA4, has been widely reported to be closely associated with the development of a variety of human cancers (ref. 35), and we observed differences in the dosage frequency distribution among populations (Supplementary Tables 14 and 15), which may contribute to the variation of disease incidence.

Fig. 4: Visualization of novel and complex SVs in the CPC pangenome graph.

a, The locations of α-globin genes on the CPC pangenome subgraph. b, Allele counts and linear structural visualization of all structural haplotypes from the Minigraph-Cactus graph among 116 CPC haploid assemblies and 94 HPRC haploid assemblies. The size and spacing of genes on the diagram do not represent the actual size of the chromosome. c, Paths of different α-globin gene haplotypes through the joint subgraph. The arrows indicate the direction of the paths. d, The locations of genes in the RASA4 region on the CPC subgraph. e, Paths of different structural haplotypes with diverse copy numbers of RASA4B. ‘partial’ represents a 14.9-kb fragment of RASA4B.

We next investigated to what extent the novel SVs identified in the CPC assemblies may increase our insights into disease genetics. On the basis of the 243,465 phenotype-associated variants collected from the latest release of the GWAS catalogue, in which 62,393 variants were reported or replicated in the East Asian populations, we found that 75.95% of the novel SVs >1 kb in size (spanning 83.17% of the total novel sequence length) were located <50 kb from the GWAS loci, and in particular, 55.49% (spanning 72.95% of the total novel sequence length) were around the variants associated with East Asian phenotypes. We observed that, when comparing reported variants across traits, height-associated variants were more likely to be associated with larger proportions of essentially independent novel loci (50.7% and 47.7% for the height-associated variants reported in global and East Asian populations, respectively) than other traits, possibly owing to the remarkable polygenicity of height (Extended Data Fig. 9 and Supplementary Table 16). Moreover, these novel SVs were significantly enriched at disease-associated variants identified in East Asians, including urolithiasis, nephrolithiasis and goitre (BH-adjusted P value = 0.043 for each), which are highly prevalent in some Asian areas (ref. 36; Extended Data Fig. 9).
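
The proximity statistics above (the share of SVs, and of SV sequence length, lying less than 50 kb from a GWAS locus) can be sketched as follows. This is a minimal illustration for a single chromosome with made-up coordinates. The function name `near_locus_fractions` and the interval convention (distance measured from either SV end) are assumptions, not details taken from the paper.

```python
from bisect import bisect_right
from typing import List, Tuple


def near_locus_fractions(
    svs: List[Tuple[int, int]], loci: List[int], max_dist: int = 50_000
) -> Tuple[float, float]:
    """Return (fraction of SVs, fraction of SV length) lying < max_dist bp from any locus.

    `svs` holds (start, end) intervals and `loci` holds positions on one chromosome.
    """
    if not svs:
        return 0.0, 0.0
    ordered = sorted(loci)
    near_count = near_length = total_length = 0
    for start, end in svs:
        length = end - start
        total_length += length
        # First locus strictly to the right of (start - max_dist).
        i = bisect_right(ordered, start - max_dist)
        if i < len(ordered) and ordered[i] < end + max_dist:
            near_count += 1
            near_length += length
    length_fraction = near_length / total_length if total_length else 0.0
    return near_count / len(svs), length_fraction


# Hypothetical SV intervals and GWAS positions (bp) on one chromosome.
example_svs = [(100_000, 102_000), (500_000, 501_500), (900_000, 905_000)]
example_loci = [120_000, 700_000, 955_000]
count_frac, length_frac = near_locus_fractions(example_svs, example_loci)
print(f"{count_frac:.1%} of SVs, {length_frac:.1%} of SV length")
```

Reporting both fractions matters because a few long SVs can dominate the sequence-length figure even when they are a minority by count.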

Despite the fact that all of the novel SVs detected in the CPC assemblies collectively showed a similar level of nucleotide diversity measured by Tajima’s D to the rest of the genome, the CPC-specific novel SVs absent in the HPRC assemblies exhibited a significantly higher Tajima’s D than the latter (P = 4.65 × 10^−9, one-tailed Wilcoxon rank-sum test; Supplementary Fig. 16a). The most outstanding signals encompassed two protein-coding genes, STARD7 and ITPRIPL1, in chromosome 2 (Tajima’s D = −3.06, FDR-adjusted P = 7.54 × 10^−12; Supplementary Fig. 16b). Again, we highlight the SPATA31 genes as they could confer evolutionary significance in hominoids not only with the copy number variation (Supplementary Fig. 9) but also with the underrepresented sequence variants (Tajima’s D = −3.01, FDR-adjusted P = 2.38 × 10^−5). All of these results imply a great potential for the novel SVs in the CPC assemblies to provide new insights into the human adaptive evolution in East Asia.
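
For readers unfamiliar with the statistic quoted above, here is a minimal sketch of the standard Tajima's D calculation for biallelic sites. It assumes haplotypes encoded as equal-length strings of '0' and '1'; the encoding and the function name `tajimas_d` are illustrative choices, and the paper's own computation on SV regions may differ in detail.

```python
from math import nan, sqrt
from typing import List


def tajimas_d(haplotypes: List[str]) -> float:
    """Tajima's D for equal-length haplotypes encoded as strings of '0'/'1' alleles."""
    n = len(haplotypes)
    if n < 2:
        return nan
    n_sites = len(haplotypes[0])
    # Count of '1' alleles at each site; keep only segregating sites.
    counts = [sum(h[j] == "1" for h in haplotypes) for j in range(n_sites)]
    seg = [k for k in counts if 0 < k < n]
    s = len(seg)
    if s == 0:
        return nan  # undefined without segregating sites
    # Mean number of pairwise differences between haplotypes.
    pi = sum(2 * k * (n - k) for k in seg) / (n * (n - 1))
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return nan
    return (pi - s / a1) / sqrt(variance)


# Toy data: an excess of rare variants gives D < 0, balanced variants give D > 0.
rare = ["1000", "0100", "0010", "0001"]
balanced = ["1111", "1111", "0000", "0000"]
print(tajimas_d(rare), tajimas_d(balanced))
```

Strongly negative values, such as those reported for STARD7, ITPRIPL1 and SPATA31, correspond to the first toy case: many low-frequency variants relative to the number of segregating sites.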

Post by Admin on Jul 8, 2023 20:01:09 GMT

Archaic introgression and annotation
We applied ArchaicSeeker 2.0 (refs. 37,38) and identified 5,338 archaic introgression segments (AISs) in all 58 CPC samples, spanning 703.87 Mb in total, and on average 84.67 Mb per sample. Of these AISs, 2,450 were located in the coding sequence of 5,531 genes (Supplementary Table 17). We found that 4,126 genes were detected with AISs in at least two samples. In particular, 2,617 genes were detected with AISs in at least five samples, and were substantially enriched in functional categories such as keratinization (GO:0031424, OR = 4.19, BH-adjusted P = 1.23 × 10^−5), type I interferon receptor binding (GO:0005132, OR = 130.05, BH-adjusted P = 7.62 × 10^−12), positive regulation of peptidyl-serine phosphorylation of STAT protein (GO:0033141, OR = 48.22, BH-adjusted P = 1.23 × 10^−10), RIG-I-like receptor signalling pathway (hsa04622, OR = 4.04, BH-adjusted P = 2.44 × 10^−4) and neuronal cell body (GO:0043025, OR = 1.75, BH-adjusted P = 5.96 × 10^−3; Supplementary Fig. 17 and Supplementary Table 18). We obtained similar results when analysing all AIS-affected genes and those detected with higher-frequency AISs (1,510 genes with AISs carried by at least 10 samples; Supplementary Tables 19 and 20 and Supplementary Figs. 18 and 19). The extremely high-frequency AISs (>40 samples) affected the following genes (annotated with the GeneCards Suite (ref. 39) online): KRT6C, KRT6A, KRT6B and KRT75, which are all keratin gene family members; CACNA2D2, CYB561D2, EEF1A2 and KCNQ2, which are associated with developmental and epileptic encephalopathy; GNAT1, which functions as a signal transducer in normal rod photoreceptor (RHO)-mediated light perception by the retina, and is associated with autosomal recessive congenital stationary night blindness (ref. 40); and USH2A, which is involved in cell development and maintenance of the inner ear and retina, and is associated with Usher syndrome and retinitis pigmentosa in the Chinese population (ref. 41).

The CPC assemblies are enriched with archaic hominin sequences compared with the African samples in the HPRC dataset (Supplementary Fig. 20). We further compared the AISs detected in the CPC samples and those in the American samples that constitute the largest continental group in the HPRC dataset second only to the African group. We found that the proportion of the Altai Neanderthal-like AISs was higher in the Americans (76.44 ± 15.61 Mb) than in the CPC East Asians (74.39 ± 4.48 Mb) on both the individual and population levels given comparable sample size (525.81 Mb for American, and 434.54 ± 8.42 Mb for East Asian). The Denisovan-like AIS proportion was higher in the East Asian genome (16.92 Mb, covering 0.59% of the American genome, and 2.10 Mb ± 0.71 Mb (0.07%) for each sample; 26.04 ± 1.34 Mb, covering 0.90% on average of the East Asian genome, and 2.77 Mb ± 0.70 Mb (0.10%) for each sample), indicating greater AIS diversity inherited from the Denisovan in the East Asian genomes than in the American genomes (Supplementary Fig. 20). In addition, the archaic hominin introgression in East Asians was largely underrepresented by the CHS samples in HPRC. Each population in the CPC assembly on average added 15.45 Mb of AISs (14.16 Mb of Altai Neanderthal-like sequences and 1.29 Mb of Denisovan-like sequences) to the archaic sequence pool of the present-day East Asians (Supplementary Fig. 21), and each CPC genome contributed 9.56 Mb of archaic-like sequences. In particular, the Turkic-speaking populations (for example, Uyghur, Kazakh and Kyrgyz) showed the least Altai Neanderthal-like AIS sharing with other East Asian populations, possibly owing to the European genetic ancestry in these populations (Extended Data Fig. 10a); some southern Chinese linguistic groups (for example, Tai–Kadai and Austro-Asiatic) added to the Denisovan-like AIS diversity at the highest level (Extended Data Fig. 10b).

We further investigated genes affected by the CPC-specific AISs and their potential functions. We found 1,575 AISs spanning 72.41 Mb in the CPC assembly that were absent in the HPRC assembly. These CPC-specific AISs encompassed 3,629 genes in total. We highlighted 1,211 genes affected by potentially functional AISs located in the coding sequence regions (Supplementary Table 21), which had roles in xenobiotic glucuronidation (GO:0052697, OR = 77.51, BH-adjusted P = 1.22 × 10^−6), flavonoid metabolic processes (GO:0009812, OR = 28.73, BH-adjusted P = 3.98 × 10^−6) and ascorbate and aldarate metabolism (hsa00053, OR = 8.66, BH-adjusted P = 8.88 × 10^−4; Supplementary Fig. 22 and Supplementary Table 22). According to the GeneAnalytics (ref. 42) annotation, these genes are associated with multiple diseases (for example, colorectal cancer, breast cancer, schizophrenia and nervous system disease) (Supplementary Table 23). We found that a CPC-specific AIS affecting BOD1 was carried by 71 (61.2%) haploid assemblies. This gene was reported to be involved in cerebellar motor dysfunction (ref. 43), and is crucial for human cognitive function (ref. 44). Another AIS-affected gene, IL17RA (n_haplotype = 52; 44.8%), has a pathogenic role in many inflammatory and autoimmune diseases. In particular, polymorphism at this gene is related to atopic dermatitis, autoimmune type 1 diabetes and asthma in East Asian populations (ref. 45). TWIST2 (n_haplotype = 41; 35.3%) and CHFR (n_haplotype = 34; 29.3%) have critical roles in cancer metastasis, and are commonly used biomarkers for various cancers (ref. 46). All of the population-specific AISs >150 kb in size are listed in Supplementary Table 24, and almost all were present only in one single individual (n_haplotype = 1). The Uyghur population contributed the largest proportion (0.51%, 14.68 Mb) of the CPC-specific archaic introgression among all of the ethnicities studied (Supplementary Table 25). One notable example of the Uyghur-specific AIS affects QPCT, which codes for glutaminyl-peptide cyclotransferase. This gene is associated with schizophrenia in both European and Han Chinese populations (ref. 47), and may also affect bone mineral density in adult women, resulting in susceptibility to osteoporosis (ref. 48). Moreover, we found a well-recognized oncogene, JUN, affected by archaic introgression in the Zhuang population, which could be responsible for the present-day differential prevalence and association with cancers for the JUN variants.
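
The carrier percentages quoted above follow directly from the haplotype counts and the 116 haploid assemblies mentioned earlier in the thread. A small sketch of that arithmetic is given here; the dictionary and the helper name `carrier_percent` are illustrative only.

```python
# 116 haploid assemblies, as stated in the text for the CPC dataset.
N_HAPLOID_ASSEMBLIES = 116

# Haplotype counts for the AIS-affected genes quoted in the text.
carriers = {"BOD1": 71, "IL17RA": 52, "TWIST2": 41, "CHFR": 34}


def carrier_percent(n_carriers: int, n_total: int = N_HAPLOID_ASSEMBLIES) -> float:
    """Percentage of haploid assemblies carrying a segment, rounded to one decimal."""
    return round(100 * n_carriers / n_total, 1)


for gene, n in carriers.items():
    print(f"{gene}: {n}/{N_HAPLOID_ASSEMBLIES} = {carrier_percent(n)}%")
```

The results reproduce the 61.2%, 44.8%, 35.3% and 29.3% given in the paragraph, which confirms that these frequencies are counted per haplotype rather than per sample.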

We found that 6.68% of the AISs identified in the CPC assemblies were attributed to genes affected by the SVs, and 17.68% of the SVs were affected by archaic introgression (Supplementary Table 26). In addition, 0.10% of these AISs were detected in 141 CNV genes in the CPC assemblies (Supplementary Table 4); in particular, 0.09% were detected in 135 CPC-specific CNV genes. These results imply that the CPC data hold great potential to advance our understanding of human evolutionary history in Asia.

Post by Admin on Jul 12, 2023 19:50:18 GMT

Discussion
In this study, we sequenced 58 CPC core samples to an average depth of 30.65× using PacBio HiFi long-read sequencing. With an average contiguity N50 > 35.63 Mb and an average total size of 3.01 Gb, the 116 high-quality and haplotype-phased de novo assemblies have good coverage of the Telomere-to-Telomere Consortium haploid assembly T2T-CHM13. Our analysis showed that the CPC assemblies largely matched or exceeded the continuity and base-level accuracy of the current reference human genome sequence (GRCh38). The CPC core assemblies also have good coverage of GRCh38, and added 189 million base pairs of euchromatic polymorphic sequences and 1,367 protein-coding gene duplications to GRCh38. The CPC Phase I data thus serve as a comprehensive pangenome reference for Chinese populations and are expected to better capture genomic diversity in populations of Asian ancestry. Our further analysis confirmed the necessity of high-quality population-specific assemblies for genetic and medical applications (ref. 2). Indeed, we identified variations showing considerable differentiation among different ethnic groups, probably resulting from divergent ancestral backgrounds. Our results also suggest that the use of population-specific references in sequence alignment improves the alignment quality. Compared with the HPRC graph reference, using the CPC graph reference improved the perfect alignment rate of short reads in East Asian samples.

The CPC pangenome reference undoubtedly provides a more comprehensive understanding of genomic variation in Asian populations, particularly those of Chinese ancestry. For example, about 18.4% of the small variants and 17.1% of the SVs identified were specific to the CPC assemblies compared with the HPRC data, although most of the CPC-specific SVs were located at the centromeric and telomeric regions of chromosomes. More than half of the variants showed an extremely low frequency, such as singletons or doubletons, and they were specifically identified in either CPC or HPRC data. Therefore, our results indicated the necessity of a more comprehensive sampling effort for both CPC and HPRC. Meanwhile, we have also generated a joint CPC–HPRC pangenome reference with both CPC and HPRC assemblies so that it can be more widely applied to various enterprises.

The CPC data also demonstrated a remarkable increase in the discovery of novel sequences when individuals were included from genetically divergent ethnic groups. A notable example is the α-globin gene cluster, in which we identified a 20-kb deletion that has been widely reported as a cause of anaemia specifically in the southern Chinese and Southeast Asian populations, and a 10-kb duplication specific to CPC assemblies. Therefore, our analyses demonstrated great potential in discovering novel or missing sequences in underrepresented Asian populations, especially minority ethnic groups.

Although not surprising, we identified a substantial proportion of sequences of archaic origins. In particular, every ethnic group contributed on average about 15 Mb and every sample contributed about 9.5 Mb of sequences of archaic ancestry, indicating the potential for discovering novel archaic sequences that were missing in previous studies. Moreover, the novel archaic sequences identified in the CPC assemblies were largely underrepresented in the HPRC data, which again suggests the necessity of including more diverse samples of Asian ancestry in further efforts of the HPRC. An interesting observation was that the least Altai Neanderthal-like sequences were shared between the Turkic-speaking populations from northwestern China (for example, Uyghur, Kazakh and Kyrgyz) and other East Asian populations, probably owing to the genetic admixture with west Eurasian populations, which diluted the Altai Neanderthal-like ancestry in northwestern Chinese populations.

We showed previously that individuals of Chinese or Asian ancestry harbour great genomic diversity (ref. 6). China is populated with multiple ethnic groups with high cultural and language diversities, including 55 officially recognized minority ethnic groups in addition to the Han Chinese majority and a considerable number of unrecognized ethnic groups. As the first effort (Phase I) of the CPC, the current pangenome reference constructed by the CPC was based on 58 CPC core samples representing 36 of the 55 minority ethnic groups and 8 linguistic groups. In its plans, the CPC aims to produce high-quality, phased, chromosome-level haplotype sequences of 500 individuals, which will cover the 56 ethnic groups as officially defined as well as a number of unidentified ethnic groups that have never been well covered by any previous work, such as Sherpa (ref. 49), Dolan, Keriyan, Deng and Lop Nur. The fully phased T2T diploid genomes will represent a paradigm shift and the new standard in population-level genomic studies (ref. 7). In parallel with the effort to document genomic diversity, considerable efforts have been invested in comprehensively annotating the elements in the CPC genomes that confer function, such as genes, control elements and transcript isoforms. Annotating the CPC genomes resulted in discovering genes that confer essential functions and underlying natural selection, which are probably associated with phenotypic diversity of disease susceptibility specific to Asian populations. In particular, a considerable proportion of the CPC sequences are of archaic origins and enriched in genes related to keratinization, inflammation and autoimmune diseases. Moreover, the novel sequences specifically discovered in the CPC pangenome encompassing 6,426 protein-coding genes confer phenotypic diversity or disease susceptibility, including immunological functions.

Taken together, the CPC Phase I data have already demonstrated a great potential to shed new light on human evolution and recover missing heritability in complex trait and disease mapping. We expect the CPC, as an important part of the global force of human genomics, to make a considerable contribution to building high-quality pangenome references and applying them for various basic and clinical research projects.