Post by Admin on May 18, 2023 19:51:35 GMT
Material and methods
Selection criteria of the probands
To ensure a high genotype calling rate, consistency across individuals, and clear phylogenetic positioning relative to anatomically modern humans, we excluded archaic genomes that were contaminated, admixed, low-depth, or had abundant uncalled positions at the loci under study. We therefore retained only high-quality genomes from one Denisovan (Denisova 3) and three Neanderthal individuals, i.e., Altaï Neanderthal (Denisova 5), Vindija 33.19, and Chagyrskaya 8 [13–16] (S1 Table). These four probands represent the two archaic human-related species, which spanned over 50,000 years of the Late Pleistocene and approximately 5,000 km of Eurasia.
Presentation of the blood groups under study
We studied 7 blood group systems covering 11 genes: ABO including H system and Secretor status (ISBT 001 and 018, ABO, FUT1 and FUT2 genes), Rh (ISBT 004, RHD and RHCE genes), Kell and the covalently linked Kx protein (ISBT 006, KEL and XK genes), Duffy (ISBT 008, ACKR1 gene), Kidd (ISBT 009, SLC14A1 gene), MNS (ISBT 002, GYPB gene), and Diego and its Band 3-Memphis variant (ISBT 010, SLC4A1 gene) (S2 Table).
Exploration procedure for blood group alleles
For the probands and blood groups under study, we downloaded the already published [13–16] and curated *.vcf and *.bam(.bai) chromosome files available at the Genome Projects website of the Max Planck Department of Evolutionary Genetics (https://www.eva.mpg.de/genetics/genome-projects.html, S1 Table). For genotype calling filters, see the Supplementary Information of [13–16] and the readme files at cdna.eva.mpg.de/neandertal/. Briefly, the filters included a coverage filter stratified by GC content, minimum coverage of 10, Heng Li’s Mappability 35, Mapping Quality (MQ) of 25, no tandem repeats and no indels.
We then screened the blood group loci in two steps. First, we gathered the genotypes at the key functional changes, together with depth, allele counts, quality and Phred-scaled probability scores, using vcftools [17] (S2 Table, S1 File). Second, we browsed all exonic regions within the coding bounds (i.e. from the initiation ATG to the stop codon) in search of additional variation from hg19 (S3 Table, S1 File). While doing so, we paid specific attention to the following five points.
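The first screening step can be illustrated with a minimal sketch. It is not the vcftools workflow used in the study: it is a plain-Python stand-in that pulls the genotype (GT), depth (DP) and genotype quality (GQ) out of single-sample VCF lines at a list of target positions. The function names, the example positions and the example values are all hypothetical, and the FORMAT keys present in the released files may differ.

```python
def parse_vcf_line(line):
    """Parse one single-sample VCF data line into the fields of interest."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual = fields[:6]
    # FORMAT (column 9) names the colon-separated values in the sample column.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "qual": None if qual == "." else float(qual),
        "GT": sample.get("GT"),
        "DP": sample.get("DP"),
        "GQ": sample.get("GQ"),
    }


def genotypes_at(lines, targets):
    """Return {(chrom, pos): record} for the requested (chrom, pos) pairs."""
    wanted = set(targets)
    hits = {}
    for line in lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        record = parse_vcf_line(line)
        key = (record["chrom"], record["pos"])
        if key in wanted:
            hits[key] = record
    return hits


# Hypothetical records, for illustration only (not real calls).
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample",
    "9\t1000\t.\tC\tT\t50.0\t.\t.\tGT:DP:GQ\t1/1:32:99",
    "9\t2000\t.\tG\t.\t40.0\t.\t.\tGT:DP:GQ\t0/0:28:81",
]

for key, rec in genotypes_at(EXAMPLE_VCF, [("9", 1000), ("9", 2000)]).items():
    print(key, rec["GT"], rec["DP"], rec["GQ"])
```

A position that is absent from the returned dictionary was either not requested or not present in the file, which is the situation that motivates checking the BAM alignments directly.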
Consideration of the reference sequence.
We aligned the reference blood group gene sequences used by the ISBT against GRCh37 (hg19), labelling nucleotides according to the sense (5’ → 3’) strand (S2 Table). We noticed that for six of the loci we studied (ABO, KEL, GYPB, RHCE, SLC4A1 and FUT1), the hg19 reference sequence opens by default on the antisense strand (3’ → 5’) in the NCBI graphic window and the ancient genome browser, although their cDNA is conventionally given on the sense strand (5’ → 3’). This has two consequences: the chromosomal coordinates decrease as we progress along the coding strand (5’ → 3’) of these genes (from exon ’n’ to exon ’n+1’), and the nucleotides must be reverse-complemented.
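Both consequences can be made concrete with a short sketch. The helper names and the exon coordinates used below are hypothetical; only the two rules themselves (reverse-complement the allele, count coordinates downwards) come from the text.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def sense_allele(genomic_allele, gene_on_minus_strand):
    """Express an allele read on the hg19 forward strand on the gene's sense strand."""
    if gene_on_minus_strand:
        return reverse_complement(genomic_allele)
    return genomic_allele


def exon_offset_to_genomic(exon_start, exon_end, offset, gene_on_minus_strand):
    """Map a 0-based offset along the coding strand of one exon to a genomic coordinate.

    exon_start and exon_end are the lowest and highest genomic coordinates of
    the exon. For a minus-strand gene the first coding base of the exon sits
    at exon_end, so coordinates decrease as the offset grows.
    """
    if gene_on_minus_strand:
        return exon_end - offset
    return exon_start + offset


# Hypothetical exon spanning genomic positions 100-200.
print(sense_allele("G", gene_on_minus_strand=True))
print(exon_offset_to_genomic(100, 200, 5, gene_on_minus_strand=True))
```

For a minus-strand gene, a G read on the forward strand of hg19 therefore corresponds to a C in the cDNA numbering.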
ABO genotype calls.
We inferred the ABO alleles according to the functional approach developed by [18, 19] for pure and chimeric A-B transferase cDNAs. In conformity with this approach, we labelled each ABO allele with four letters corresponding to the four main amino acid changes in the catalytic site of the glycosyltransferase that distinguish pure A from pure B alleles, preceded by the presence or absence of the G at position c.261 (rs8176719) (e.g. G-AAAA denotes the four SNPs of the A allele, generating the A phenotype). We used the deletion or not of the C at position c.1061 (rs56392308) to differentiate the A1 and A2 alleles. We completed the ABO allele identification by screening all exons and collected the genotypes at 39 additional loci previously identified as responsible for various ABO alleles [20] (S2 Table; Fig A in S1 File). Special attention was paid to FUT2, whose amino acid numbering in NCBI and hg19 must be adjusted to obtain the amino acid changes as reported by the ISBT, because the initiation codon is the third ATG at the beginning of exon 2 (19: 49,206,247) [21].
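The labelling scheme can be sketched as a small function. This is only an illustration under stated assumptions: the function names are hypothetical, the four catalytic-site calls are passed in as already-resolved "A" or "B" states rather than derived from nucleotides, and the "del" prefix for alleles lacking the G at c.261 is a placeholder of ours, not a notation taken from [18, 19].

```python
def abo_label(has_g_at_261, site_calls):
    """Build a label such as 'G-AAAA' from the c.261 state and four site calls.

    site_calls holds, in order, the A-like or B-like state ('A' or 'B') at the
    four main catalytic-site positions.
    """
    calls = list(site_calls)
    if len(calls) != 4 or any(call not in ("A", "B") for call in calls):
        raise ValueError("site_calls must be four values, each 'A' or 'B'")
    # 'del' is a hypothetical placeholder for the c.261delG state.
    prefix = "G" if has_g_at_261 else "del"
    return prefix + "-" + "".join(calls)


def a_subtype(has_c_at_1061):
    """Distinguish A1 from A2: the A2 allele carries the c.1061delC deletion."""
    return "A1" if has_c_at_1061 else "A2"


print(abo_label(True, "AAAA"), a_subtype(True))
```

A chimeric transferase would show up as a mixed string such as G-AABB, which is why the four sites are kept as separate calls instead of being collapsed into a single A or B state.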
RHD and RHCE genotype calls.
For RHD and RHCE, while browsing the exons in search of variation from hg19, we gathered the genotypes at the key changes of the RH*Ce, *CE, *ce, and *cE alleles. Any variation from hg19 was cross-checked against The Human Rhesusbase.com [22], Erythrogene.com [11] and screenshots of the bam sequences (Figs B-D in S1 File).
Whenever a variant was identified, we searched for it in all four archaic genomes. In addition, for any call at two key variants of our findings, namely c.733G>C (RHD) and c.712A>G (RHCE), we searched for the other polymorphisms that usually make up the haplotypes carrying them, respectively RHD*DBU, *DLX, *DV, *DVI, *DBS, *DBT, *DUC2 and *ceAR, *ceEK, *ceBI, *ce*SM, and *ceHAR. For this, we browsed both the vcf files and the bam alignments while varying the MQ threshold (S2 Table; Fig D in S1 File).
Identification of indels.
Because indels could have been filtered out when the vcf files were made, all ABO, RHD and RHCE indels, notably the ABO c.261delG and c.1061delC and the RHCE 209bp insert, were double-checked against the dedicated indel vcf files (http://ftp.eva.mpg.de/neandertal/Vindija/VCF/indels/) and the bam alignments using Integrative Genomics Viewer (IGV [23]) (S2 Table and Fig A in S1 File).
Low-mapped variants.
Screenshots of the bam alignments, viewed simultaneously in the four archaic individuals, highlighted a difference in depth and MQ between reads carrying the reference and the alternate alleles. This is especially true for variants with very low frequency in the modern human reference panel, such as rs17418085 (RHD), rs150073306 (RHD), rs1132763 (RHCE), and rs1132764 (RHCE), for which the three Neanderthals carry the alternate allele whereas Denisova 3 is homozygous for the reference allele (Figs C and D in S1 File). These loci may therefore suffer from reference bias, which is known to strongly reduce the depth and mapping of reads carrying the alternate allele [24, 25] and, consequently, the genotype accuracy indexes at these loci. Hence, where variants were called in the released VCFs of some probands but filtered out in others, we took screenshots of the indexed alignments against hg19 in IGV [23] and called the genotypes manually, reporting the allele read counts and the MQ cut-off (i.e. the value below which reads are not displayed) (S1 File).
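The effect of the MQ cut-off on a manual call can be illustrated with a toy example. The calls in the study were made by eye from IGV screenshots, not with code. The sketch below only shows how the read counts for each allele change as the cut-off is raised, assuming a read is kept when its mapping quality is at least the cut-off. The reads, the cut-off values other than 25, and the function names are hypothetical.

```python
from collections import Counter


def count_alleles(reads, mq_cutoff):
    """Count bases at one position among reads whose MQ is at least mq_cutoff.

    reads is an iterable of (base, mapping_quality) pairs.
    """
    return Counter(base for base, mq in reads if mq >= mq_cutoff)


def compare_cutoffs(reads, ref, alt, cutoffs=(0, 25, 37)):
    """Return (cutoff, ref_count, alt_count) for each MQ cut-off."""
    rows = []
    for cutoff in cutoffs:
        counts = count_alleles(reads, cutoff)
        rows.append((cutoff, counts.get(ref, 0), counts.get(alt, 0)))
    return rows


# Hypothetical pile-up: reads with the alternate allele map less well.
example_reads = [("G", 37), ("G", 37), ("C", 10), ("C", 20), ("C", 30)]

for cutoff, n_ref, n_alt in compare_cutoffs(example_reads, ref="G", alt="C"):
    print(f"MQ>={cutoff}: ref={n_ref} alt={n_alt}")
```

In this toy pile-up the alternate allele is in the majority when all reads are kept but disappears entirely at the strictest cut-off, which is the pattern expected under reference bias.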