Post by Admin on Aug 21, 2021 5:44:19 GMT
Materials and Methods
Ethics Statements, Sample Collection, and Genotyping
This study was approved by the Ministry of Health Malaysia under National Medical Research Registry MNDR ID #09-23-3913, JAKOA (Department of Orang Asli Development, Government of Malaysia) and Monash University Human Research Ethics Committee.
Following consultation with JAKOA officers in the various districts of different states, courtesy visits were made to Orang Asli (OA) community elders, and the rationale of the study and the sample collection procedure were explained. Once the elders had agreed and informed their communities, field visits were carried out. Individuals who provided informed consent and answered the questionnaires were included.
Peripheral blood samples were collected from 169 individuals belonging to the Negrito (Jehai, Bateq, Kintaq, and Mendriq subgroups), Senoi (MahMeri and CheWong subgroups), and Proto-Malay (Seletar, Jakun, and Temuan subgroups) groups (fig. 1). Genotyping was performed using the Illumina Human Omni 2.5 array (Illumina Inc., San Diego, CA).
Fig. 1. Geographical location of Orang Asli communities recruited in this study.
Quality Control and Data Integration
Quality controls were applied to the data obtained from each OA community separately to exclude problematic samples and single nucleotide polymorphisms (SNPs). SNPs that failed the Hardy–Weinberg exact (HWE) test (P < 10^-6) and SNPs with missing rates >0.05 across all samples in each population were removed. Additionally, samples with call rate <0.99 were excluded. Gender concordance was examined using PLINK v1.07 (Purcell et al. 2007), and samples whose genotype results were inconsistent with the questionnaire-reported sex were excluded. To avoid analysis of close relatives, unknown relatedness was measured between all pairs of individuals within each population using the Identity-by-Descent estimate (PI_HAT) in PLINK v1.07. An upper cut-off threshold of 0.375 was set to exclude first-degree relatedness within each population. Finally, a principal component analysis (PCA) using EIGENSOFT v3.0 (Patterson et al. 2006) was performed to remove outliers from each population across the first ten eigenvectors. In the final stage, all OA populations were merged into one data set, and SNPs that failed the HWE test (P < 10^-6) and SNPs with missing rates >0.05 across all samples were pruned.
The OA genotype data were merged with data from the Human Genome Diversity Project (HGDP) (Li et al. 2008), 89 Malay individuals from the Singapore Genome Variation Project (SGVP) (Teo et al. 2009), and Onge and Jarawa Negritos from the Andaman Islands, who were genotyped using the Illumina Human 1.2M array (SNP population data courtesy of P. Majumder and A. Basu). After merging the data sets (supplementary table S1, Supplementary Material online), a total of 291,096 overlapping autosomal SNPs remained for downstream analysis.
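The sample- and SNP-level filters above can be illustrated with a small Python sketch. It is not the PLINK implementation: the genotype encoding (0/1/2, with None for a missing call), the function names, and the rule for which member of a related pair is dropped are assumptions made here for illustration, and the HWE exact test is left out.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed encoding: 0, 1, or 2 copies of an allele; None marks a missing call.
Genotype = Optional[int]


def snp_missing_rates(genotypes: List[List[Genotype]]) -> List[float]:
    """Fraction of missing calls per SNP (rows are samples, columns are SNPs)."""
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    return [
        sum(1 for row in genotypes if row[j] is None) / n_samples
        for j in range(n_snps)
    ]


def sample_call_rates(genotypes: List[List[Genotype]]) -> List[float]:
    """Fraction of non-missing calls per sample."""
    return [sum(1 for g in row if g is not None) / len(row) for row in genotypes]


def related_samples_to_drop(
    pi_hat: Dict[Tuple[str, str], float], threshold: float = 0.375
) -> Set[str]:
    """Drop one sample from each pair whose PI_HAT exceeds the threshold.

    Pairs are visited from most to least related. Dropping the second member
    of the pair is an arbitrary choice made for this sketch.
    """
    dropped: Set[str] = set()
    for (a, b), value in sorted(pi_hat.items(), key=lambda kv: -kv[1]):
        if value > threshold and a not in dropped and b not in dropped:
            dropped.add(b)
    return dropped


# Toy data: 4 samples x 3 SNPs.
toy_genotypes = [
    [0, 1, None],
    [0, None, None],
    [2, 1, 1],
    [1, 1, None],
]
missing = snp_missing_rates(toy_genotypes)
snps_to_remove = [j for j, rate in enumerate(missing) if rate > 0.05]
call_rates = sample_call_rates(toy_genotypes)
samples_below_call_rate = [i for i, rate in enumerate(call_rates) if rate < 0.99]
toy_pi_hat = {("s1", "s2"): 0.5, ("s2", "s3"): 0.45, ("s3", "s4"): 0.1}
relatives_dropped = related_samples_to_drop(toy_pi_hat)
```

With real data the same thresholds (missing rate >0.05, call rate <0.99, PI_HAT 0.375) would be applied within each population before merging, as described above.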
Population Structure Analysis
PCA was used to identify population structure across indigenous Malaysians. It was performed with EIGENSOFT v3.0 on the genotype data of OA combined with Andamanese Negritos; Oceanian, South Asian, and East Asian populations in the HGDP; and Malays from SGVP. To balance sample sizes across populations, 30 Malay individuals were randomly sampled from the SGVP data set (which contains 89 individuals). SNPs with r^2 > 0.5 were pruned out to avoid the effects of excessive linkage disequilibrium (LD) between SNPs. After this pruning, a total of 204,426 SNPs remained for analysis. Pairwise Fst distances between populations in the same data set were calculated using EIGENSOFT v3.0, and a Neighbor-net tree was constructed with SplitsTree v4 (Huson and Bryant 2006). ADMIXTURE v1.22, a clustering algorithm, was used on the pruned SNPs to estimate ancestral population clustering (Alexander et al. 2009).
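As a rough illustration of the r^2-based pruning step, the Python sketch below computes r^2 as the squared Pearson correlation of genotype counts and keeps a SNP only if its r^2 with every SNP already kept is at most 0.5. This is a simplified stand-in rather than the procedure of any specific tool: real implementations compare SNPs only within a genomic window, and the function names and toy data are assumptions.

```python
from typing import List


def genotype_r2(x: List[int], y: List[int]) -> float:
    """Squared Pearson correlation between two SNPs' genotype counts (0/1/2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A monomorphic SNP carries no LD information; treat r^2 as 0 here.
        return 0.0
    return cov * cov / (var_x * var_y)


def prune_by_r2(snps: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Greedy pruning: return indices of SNPs kept, in input order.

    Each SNP is compared against all SNPs kept so far (no window), which is
    only practical for small toy inputs.
    """
    kept: List[int] = []
    for i, snp in enumerate(snps):
        if all(genotype_r2(snp, snps[k]) <= threshold for k in kept):
            kept.append(i)
    return kept


# Toy data: each inner list is one SNP genotyped in four samples.
toy_snps = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],  # identical to the first SNP, so r^2 = 1 and it is pruned
    [2, 0, 1, 1],
    [1, 1, 1, 1],  # monomorphic
]
kept_indices = prune_by_r2(toy_snps, threshold=0.5)
```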
PLINK v1.07 was used to estimate runs of homozygosity (ROH) in selected populations. PLINK slides a 5,000 kb (50 SNP) window across the genome and allows 1 heterozygous and 5 missing calls in each window. To minimize the effects of LD on ROH, the minimum ROH length was set to 500 kb, because it is unusual for LD to extend beyond 500 kb. LD decay for each population was calculated as r^2 using PLINK. Pairwise LD between all possible SNP pairs was calculated, and mean LD was measured in bins of 5 kb.
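The binning step of the LD decay calculation can be sketched in Python as follows. The input is assumed to be a list of (distance in bp, r^2) pairs already computed for SNP pairs; the function name and the toy values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_r2_by_distance(
    pairs: Iterable[Tuple[int, float]], bin_size_bp: int = 5000
) -> Dict[int, float]:
    """Average r^2 within distance bins; keys are bin start positions in bp."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for distance_bp, r2 in pairs:
        bin_start = (distance_bp // bin_size_bp) * bin_size_bp
        sums[bin_start] += r2
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy input: (inter-SNP distance in bp, r^2) for five SNP pairs.
toy_pairs = [(1000, 0.8), (4000, 0.6), (5000, 0.4), (9999, 0.2), (12000, 0.1)]
ld_decay = mean_r2_by_distance(toy_pairs, bin_size_bp=5000)
```

Plotting the bin start against the mean r^2 for each population gives the LD decay curve.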
TreeMix v1.12 (Pickrell and Pritchard 2012) was used to explore population relationships and migration events. The same data set described above was used to estimate the maximum likelihood tree with Yoruba as the outgroup. We used blocks of 200 SNPs (-k 200) to account for LD, and migration edges were added sequentially until the model explained 99% of the variance. We estimated D statistics using ADMIXTOOLS (Patterson et al. 2012) to examine gene flow between OAs and surrounding populations. Divergence time between OA and East Asians (EA) was estimated using 399,971 SNPs shared between our data and HapMap 3 (The International HapMap Consortium 2005). Effective population size (Ne) and divergence time between OAs and the Yoruba in Ibadan (YRI), Han Chinese in Beijing (CHB), and Japanese in Tokyo (JPT) samples were estimated according to the method suggested by McEvoy et al. (2011). To estimate LD, pairwise LD was calculated as r^2 using PLINK v1.07. To minimize the effects of small sample size, all individuals were pooled together in their respective OA groups. Admixture time between OAs and EA was estimated with the rolloff package using the 399,971 SNPs shared between HapMap 3 and OAs.
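For readers unfamiliar with the D statistic, a minimal Python sketch of the point estimate is given below, following the allele-frequency form described by Patterson et al. (2012): the sum over SNPs of (w - x)(y - z), divided by the sum of (w + x - 2wx)(y + z - 2yz), where w, x, y, z are the frequencies of the same allele in four populations. The function name and toy frequencies are assumptions, and the block jackknife that ADMIXTOOLS uses for standard errors and Z-scores is not included.

```python
from typing import Sequence


def d_statistic(
    w: Sequence[float], x: Sequence[float], y: Sequence[float], z: Sequence[float]
) -> float:
    """Point estimate of D(W, X; Y, Z) from per-SNP allele frequencies."""
    numerator = 0.0
    denominator = 0.0
    for wi, xi, yi, zi in zip(w, x, y, z):
        numerator += (wi - xi) * (yi - zi)
        denominator += (wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
    if denominator == 0:
        raise ValueError("no informative SNPs: denominator is zero")
    return numerator / denominator


# Toy frequencies for two SNPs whose patterns cancel out, so D is 0.
d_balanced = d_statistic([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0])

# Toy frequencies where W always matches Y and X always matches Z, so D is 1.
d_skewed = d_statistic([1.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0])
```

A value near zero is consistent with no excess allele sharing between the tested populations, while a significant deviation from zero suggests gene flow; the sign depends on the order in which the populations are given.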