Post by Admin on Mar 19, 2020 4:34:22 GMT
Results
To characterize the population structure of Turkic-speaking populations in the context of their geographic neighbors across Eurasia, we genotyped 322 new samples from 38 Eurasian populations and combined them with previously published data (see S1 Table and Materials and Methods for details) to yield a total dataset of 1,444 samples genotyped at 515,841 markers. The novel samples introduced in this study geographically cover previously underrepresented regions such as Eastern Europe (Volga-Ural region), Central Asia, Siberia, and the Middle East. We used a STRUCTURE-like [27] approach implemented in the program ADMIXTURE [28] to explore the genetic structure in the Eurasian populations by inferring the most likely number of genetic clusters and mixing proportions consistent with the observed genotype data (from K = 3 through K = 14 groups) (S1 Fig). As shown in previous studies [15, 20, 29], East Asian populations commonly contained alleles that find membership in two general clusters, shown here as k6 and k8, in a model assuming K = 8 “ancestral” populations (Fig 2). Geographically, the spread zones of these two components (clusters) were centered on Siberia and East Asia, respectively. Their combined prevalence declined as one moves west from East Asia (correlation with longitude, p = 8.8×10^−16, R = 0.77, 95% CI: 0.66–0.85). Overall, alleles from the Turkic populations sampled across West Eurasia showed membership in the same set of West Eurasian genetic clusters, k1–k4, as did their geographic neighbors. In addition, the Volga-Uralic Turkic peoples (Chuvashes, Tatars, and Bashkirs) also displayed membership in the k5 cluster, which contained the Siberian Uralic-speaking populations (Nganasans and Nenets) and extended to some of the European Uralic speakers (Maris, Udmurts, and Komis). However, in most cases the Turkic peoples showed a higher combined presence of the “eastern components” k6 and k8 than did their geographic neighbors.
Fig 2
Population structure inferred using ADMIXTURE analysis.
Three-population test
The “eastern components” k6 and k8 inferred among Turkic and non-Turkic peoples across West Eurasia, as well as the “western components” k1, k2, and k3 present among Siberian populations, can originate through past gene flow episodes in opposite directions. This population mixture history can be statistically tested using f3-statistics [30, 31]. In order to evaluate the admixture scenarios suggested by the ADMIXTURE analysis, we tested all possible three-population combinations in our dataset using the three-population test (f3-statistics) [30, 31]. We reported only population trios f3(target, source1, source2) with the most negative f3-statistics (S2 Table) and considered populations to be significantly admixed when their Z-score was smaller than −1.64 (i.e. p-value was less than 0.05, for a one-tailed test). Our three-population tests showed that almost all the West Eurasian Turkic peoples (15 out of 16) and their non-Turkic neighbors (49 out of 61) (see S2 Table for geographic subdivision) were admixed with East Asian- and Siberian-related populations. Similarly, all the Siberian Turkic populations, as well as some (11 out of 27) East Eurasian non-Turkic populations, showed an admixture signal with West Eurasian-related populations. In interpreting f3-statistics results, it is important to point out that the reported source populations do not necessarily represent the true admixing populations [31]. Although the exact source populations were uncertain, significantly negative f3-statistics provided strong evidence for admixture in most of the Turkic and non-Turkic populations in our dataset. In order to test whether these admixture signals resulted from recent gene flow events, we next explored the distribution of long chromosomal tracts shared between populations in our dataset.
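To make the logic of the three-population test concrete, here is a minimal Python sketch. It is not the implementation used in the study: the function names and the allele frequencies are hypothetical, and the estimator is the naive average of (c − a)(c − b) over markers, without the finite-sample bias correction that real tools apply. In practice the Z-score is the f3 estimate divided by a standard error obtained from a block jackknife over the genome, which is omitted here. The second function shows only why a Z-score of −1.64 corresponds to a one-tailed p-value of about 0.05.

```python
import math

def f3_naive(target, source1, source2):
    """Average of (c - a) * (c - b) over markers.

    Inputs are per-marker allele frequencies. This naive form omits the
    finite-sample bias correction used by real implementations.
    """
    products = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(products) / len(products)

def lower_tail_p(z):
    """One-tailed p-value P(Z <= z) for a standard normal Z-score."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Hypothetical frequencies: the target lies between the two sources
# at every marker, as expected for an admixed population.
target = [0.5, 0.4, 0.6]
source1 = [0.1, 0.2, 0.9]
source2 = [0.9, 0.7, 0.2]

f3 = f3_naive(target, source1, source2)  # negative for this toy input
p_at_threshold = lower_tail_p(-1.64)     # just above 0.05
```

A negative f3 arises when the target's allele frequencies tend to fall between those of the two sources, which is the signature of admixture that the test looks for.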
Geographic distribution of recent shared ancestry
A recent study shows that even a pair of unrelated individuals from the opposite ends of Europe share hundreds of chromosomal tracts of IBD from common ancestors that lived over the past 3,000 years. The amount of such recent ancestry declines exponentially with geographic distance between population pairs, and such a distance-dependent pattern can be distorted due to population expansion or gene flow [32]. We observed a reasonably high correlation (Pearson’s correlation coefficient = 0.77, 95% CI: 0.76–0.79, p < 2.2×10^−16) between the rate of IBD sharing decay and geographic distance in our set of Eurasian populations. This distance-dependent pattern is likely shaped by both isolation-by-distance and gene flow: many of the populations are admixed (the negative f3-statistics in S2 Table) and there is a longitude-dependent decrease in the prevalence of “eastern components” k6 and k8. Some populations might stand out in this distance-dependent pattern due to isolation, greater gene flow, or genetic drift. For example, when we removed the West Eurasian Turkic populations (sampled in the Middle East, Caucasus, Eastern Europe, and Central Asia) from our dataset, we observed a better correlation between IBD sharing decay and geographic distances between populations (Pearson’s correlation coefficient = 0.83, 95% CI: 0.82–0.85, p < 2.2×10^−16). To identify populations for which IBD sharing with Turkic populations departs from a distance-dependent decay pattern, we first computed IBD sharing (the average length of genome IBD measured in centiMorgans) for each of the 12 western Turkic populations with all other populations in the dataset (S3 Table) and then subtracted the same statistic computed for their geographic neighbors (see the Materials and Methods section for details and S2 Fig for a schematic representation of this analysis).
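For readers unfamiliar with how a Pearson correlation and its 95% confidence interval are obtained, the following Python sketch may help. It uses the Fisher z-transformation, one common way to get such an interval; the text does not say which method the authors used, so treat this as an assumption. The function names and the five data points are hypothetical and far too few for a real analysis.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation (needs n > 3)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical toy data: pairwise distance (km) and a decay measure.
distance_km = [1000, 2000, 3000, 4000, 5000]
decay_measure = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(distance_km, decay_measure)
ci_low, ci_high = fisher_ci(r, len(distance_km))
```

With only five points the interval is very wide; the narrow intervals quoted above reflect the much larger number of population pairs in the actual dataset.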
When the differences were overlaid for all 12 Turkic populations, we detected an unusually high signal of accumulated IBD sharing (samples indicated by a “plus symbol” on Fig 3A–3C) for populations outside West Eurasia. The correlated signal of IBD sharing for these distant populations exceeded the expectation based on a distance-dependent decay pattern. Most of these distant populations are located in South Siberia and Mongolia (SSM) and Northeast Siberia, except the two samples in Eastern Europe (Maris) and the North Caucasus (Kalmyks). In principle, when we compare the IBD sharing pattern in this way between neighboring Turkic and non-Turkic populations, we might observe a high IBD sharing signal with some Siberian populations due to drift in one of the populations compared, but the chance that such random signals would correlate between multiple Turkic populations and accumulate in a single region is negligible. Indeed, the null hypothesis for this analysis assumed no systematic difference between any of the Turkic populations and their respective geographic neighbors. Therefore, the null hypothesis predicted that random differences accumulated across the entire geographic range of the western Turkic populations. To demonstrate this null expectation, we replaced each of the western Turkic populations by populations randomly drawn from the sets of respective non-Turkic neighbors, and repeated this subtraction/accumulation analysis, as shown in S2 Fig. When the sets of random non-Turkic samples were tested, the accumulated signal was restricted to populations (indicated by the “plus symbol” on S3 Fig) within West Eurasia, as expected by the null hypothesis. There are, however, two exceptions (Nganasans and Nenets) that, when examined closely, suggest an interesting finding consistent with our ADMIXTURE results.
These two Siberian populations, Nganasans and Nenets (S3A, S3B, S3E, S3I and S3J Fig), speak Uralic languages and demonstrated a high accumulated signal only when our tested sets contained the western Uralic speakers (Maris, Komis, Vepsas, and Udmurts). This was in line with our ADMIXTURE results (Fig 2), as the k5 ancestry component was shared specifically between these western Uralic speakers and the two Siberian Uralic-speaking Nganasans and Nenets. We now return to the overall difference between the accumulated IBD sharing signal under the null hypothesis (see S3 Fig) and that observed for the set of western Turkic populations (Fig 3). Some of the populations in SSM and Northeast Siberia demonstrated a strong IBD sharing signal with the western Turkic populations and this pattern most likely indicates recent gene flow from Siberia. To narrow down the source of this gene flow it is important to know which of the Siberian populations are indigenous to their current locations. We show in the Discussion section that only Tuvans, Buryats, and Mongols from the SSM area are indigenous to their current locations (at least within the known historical time) and therefore this area is the best candidate for the source of recent gene flow into the western Turkic populations. It should be noted that this east-west directionality is implied by the fact that 12 populations sampled across different West Eurasian locations are unlikely to show a correlated signal of high IBD sharing with a single region unless they received gene flow from it. Indeed, when we repeated our analysis by randomly choosing non-Turkic populations (S2 Fig), we could not reproduce a similar correlated signal.
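The subtraction/accumulation analysis discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact procedure is in the paper's Materials and Methods and S2 Fig, which are not reproduced here, so taking the mean over neighbors and summing raw differences are guesses at the general idea. The population labels and sharing values (in cM) are hypothetical.

```python
def excess_sharing(turkic_sharing, neighbor_sharing):
    """Per reference population: Turkic IBD sharing minus the neighbors' mean.

    turkic_sharing maps reference population -> average IBD length (cM).
    neighbor_sharing is a list of such maps, one per non-Turkic neighbor.
    """
    excess = {}
    for ref, value in turkic_sharing.items():
        neighbor_values = [n[ref] for n in neighbor_sharing]
        excess[ref] = value - sum(neighbor_values) / len(neighbor_values)
    return excess

def accumulate(excess_maps):
    """Sum the per-reference excesses over all tested populations."""
    total = {}
    for excess in excess_maps:
        for ref, value in excess.items():
            total[ref] = total.get(ref, 0.0) + value
    return total

# Hypothetical values for two Turkic populations and their neighbors.
t1 = excess_sharing(
    {"ref_east": 3.0, "ref_west": 1.0},
    [{"ref_east": 1.0, "ref_west": 1.5}, {"ref_east": 2.0, "ref_west": 0.5}],
)
t2 = excess_sharing(
    {"ref_east": 2.5, "ref_west": 2.0},
    [{"ref_east": 0.5, "ref_west": 2.5}],
)
accumulated = accumulate([t1, t2])
```

In this toy input the excess for `ref_east` is positive in both populations and therefore adds up, while the differences for `ref_west` stay near zero. That is the sense in which a correlated signal accumulates in one region and random differences do not.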
Fig 3
Populations with high and correlated signals of IBD sharing with western Turkic peoples.
Our previous analysis suggests that the western Turkic populations (Fig 3 and S3 Table) experienced stronger gene flow from the SSM area than their non-Turkic neighbors, but it is not clear whether this signal is statistically significant. To test this, we computed IBD sharing between the group of SSM populations (Tuvans, Mongols, and Buryats, as well as a known migrant population, Evenkis) and each of the western Turkic populations. Then, for each of the western Turkic populations, we pooled their non-Turkic neighbors, and generated 10,000 permuted samples to see whether a comparable amount of IBD sharing (observed in tested Turkic populations) with the four Siberian populations is obtained by chance. IBD sharing was estimated separately for different classes of chromosomal tracts (1–2 cM, 2–3 cM, 3–4 cM, etc.), and permutation tests were performed. In most of the cases, higher IBD sharing between the western Turkic populations (compared to non-Turkic neighbors) and the Siberian populations was statistically significant (Fig 4 and S4 Fig; numbers in red show how many Siberian populations have p-values ≤ 0.01). Some of the non-Turkic neighbors, such as Adyghe, Maris, Udmurts, and North Ossetians, also shared a relatively high number of IBD tracts (Fig 4 and S4 Fig) with the SSM populations. We conclude that the recent gene flow from the SSM area inferred in our previous analysis was not restricted to the western Turkic peoples, and the higher IBD sharing is evidence that Turkic populations are distinct from their non-Turkic neighbors.
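The permutation test described above might look roughly like the Python sketch below. How the permuted samples were actually drawn is specified in the paper's Materials and Methods, not here, so the scheme used (drawing individuals without replacement from the pooled neighbors, matching the Turkic sample size) is an assumption. So are the function name, the data values, and the add-one correction in the p-value, which is a common convention and is not taken from the source.

```python
import random

def permutation_p_value(observed_mean, neighbor_values, sample_size,
                        n_permutations=10_000, seed=0):
    """Share of random neighbor samples whose mean IBD sharing reaches
    the observed mean (one-sided, with an add-one correction)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    hits = 0
    for _ in range(n_permutations):
        sample = rng.sample(neighbor_values, sample_size)
        if sum(sample) / sample_size >= observed_mean:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Hypothetical per-individual IBD sharing (cM) with one Siberian population
# for the pooled non-Turkic neighbors of a single Turkic population.
pooled_neighbors = [0.2, 0.5, 0.1, 0.4, 0.3, 0.6, 0.2, 0.1]

p_extreme = permutation_p_value(5.0, pooled_neighbors, sample_size=3)
p_trivial = permutation_p_value(0.0, pooled_neighbors, sample_size=3)
```

In the study this kind of test would be repeated for each Turkic population, each of the four Siberian populations, and each segment length class.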
Fig 4
Pairwise IBD sharing based on 1–2 cM long segments.
A spatial pattern in IBD sharing was noted when IBD tracts of different length classes were considered separately. For segment classes of 1–2 cM and 2–3 cM, higher IBD sharing was statistically significant for most Turkic speakers, except Gagauzes and Chuvashes (and Tatars in the case of 2–3 cM). For longer IBD tracts of 3–4 cM, statistical evidence for higher IBD sharing became weaker in some Middle Eastern and Caucasus (Azeris, Kumyks, and Balkars) samples. By weaker evidence, we mean that a statistically significant excess of IBD sharing was restricted to a subset of the four candidate ancestors tested. In the Volga-Ural region, for the same class of segments (3–4 cM), only Bashkirs continued to show strong evidence for gene flow, while Tatars and Chuvashes did not. For these two Turkic populations, not all tests were statistically significant because the background group, from which permuted samples are drawn, contained the Finnic-speaking Mari population, which shows comparable levels of Asian admixture (Fig 2) and IBD sharing (S4 Fig). When we considered even longer segments (4–5 cM and 5–6 cM), we no longer observed a systematic excess of IBD sharing for Turkic peoples in the Middle East, the Caucasus, or in the Volga-Ural region. In contrast, populations closer to the SSM area (Uzbeks, Kazakhs, Kyrgyz, and Uygurs, and also Bashkirs from the Volga-Ural region) still demonstrated a statistically significant excess of IBD sharing. This spatial pattern can be partly explained by a relative rarity of longer IBD tracts compared to shorter ones and recurrent gene flow events into populations closer to the SSM area.
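The length classes used here (1–2 cM, 2–3 cM, and so on) amount to binning shared segments by their genetic length. A minimal Python sketch of such binning follows. The function names and segment lengths are hypothetical, and treating the lower bound of each class as inclusive is an assumption, since the text does not say how boundary values were assigned.

```python
import math
from collections import Counter

def length_class(length_cm):
    """Return the 1 cM-wide class (lower, upper) containing a segment.

    The lower bound is inclusive, so a 2.0 cM segment falls in (2, 3).
    """
    lower = math.floor(length_cm)
    return (lower, lower + 1)

def count_by_class(segment_lengths_cm, min_cm=1.0):
    """Count segments per length class, ignoring those shorter than min_cm."""
    return Counter(
        length_class(x) for x in segment_lengths_cm if x >= min_cm
    )

# Hypothetical IBD segment lengths (cM) shared between two populations.
counts = count_by_class([1.2, 1.9, 2.0, 3.5, 0.4, 5.99])
```

In this toy input most segments fall in the shortest classes, which mirrors the rarity of longer tracts mentioned above.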