|
Post by Admin on Mar 7, 2024 0:00:06 GMT
South Asia is home to one of the most diverse assemblages of people in the world. A mélange of different ethnic identities, languages, religions, castes, and customs makes up the 1.5 billion humans who live here. Now, scientists have revealed the most detailed look yet at how this population took shape. In the largest ever modern whole-genome analysis from South Asia (published as a preprint last month on bioRxiv), researchers reveal new details about the origin of India’s Iranian ancestry and when ancient hunter-gatherers settled the region. The study also turns up a surprise: an unexpectedly rich diversity of genes from Neanderthals and their close evolutionary cousins, the Denisovans. Because no fossils of these ancient human relatives have been found in India, researchers are speculating about how these genes got there, and why they stuck around.
Global genetic sequencing efforts have largely ignored India, says population geneticist Kelsey Witt of Clemson University, who wasn’t involved with the work. So, “We’re learning a lot about populations that we didn’t know much about.”
Previous genetic work has shown that most Indians are primarily a mixture of three ancestral populations: hunter-gatherers who lived on the land for tens of thousands of years, farmers with Iranian ancestry who arrived sometime between 4700 and 3000 B.C.E., and herders from the central Eurasian steppe region who swept into the region sometime after 3000 B.C.E., perhaps between 1900 and 1500 B.C.E.
In the new study, University of California, Berkeley population geneticist Priya Moorjani, who also co-led the previous work, and her colleagues confirm the identities of those ancestral groups. They also add fresh wrinkles by using a much larger sample of modern Indians than previous analyses.
Working with data from the Longitudinal Aging Study in India–Diagnostic Assessment of Dementia (LASI-DAD), Moorjani’s team sequenced more than 2700 modern Indian genomes (hundreds more than in past studies), including people from nearly every geographic region, speakers of every major language group, and all tribes and castes.
To find out more about the identity of the Iranian-related farmers who entered the region thousands of years ago, the researchers analyzed previously extracted ancient DNA from groups with Iranian ancestry who predated the genetic pulse into India. They then played out simulations to see whose genes best matched the patterns seen in present-day Indians. The best fit came from farmers from an ancient agricultural center called Sarazm in the northwest of what today is Tajikistan. Farmers here grew wheat and barley and kept cattle, and traded extensively throughout Eurasia. Interestingly, one ancient individual from Sarazm also carried traces of Indian ancestry; another was buried with ceramic bracelets similar to those made in ancient India. “That really helped directly connect the two cultures, and it showed that it wasn’t just one-way mixing,” Moorjani says.
Michael Frachetti, an archaeologist at Washington University in St. Louis, who wasn’t involved with the new work, says he is “highly compelled” by the finding. He has long argued that Sarazm would have been a key outpost for spreading farming and domestic animals, as well as human genes, south into Kashmir and northwestern India. “There’s a very significant story being told here,” he says. “Societies were far more connected in deep time than most have given them credit for.” Still, other ancestral source populations, such as those from the steppe, remain somewhat “vague,” says biological anthropologist Gyaneshwer Chaubey at Banaras Hindu University. He says the relative paucity of ancient DNA samples from India means other, ancient source populations could be missing from the mix.
Even deeper in time, Moorjani and colleagues uncovered unexpected details about prehistoric migration and mingling. Scholars have debated over the years whether modern humans were responsible for stone tools found in India and dated to approximately 80,000 years ago, and if so, whether they left a genetic legacy in modern populations. But with no remains associated with these tools, researchers haven’t been able to pin down their makers. The new study suggests those early toolmakers left little, if any, trace in living people. By estimating how much genetic mutation occurs between generations and calculating how long it would have taken India’s modern population to reach its current state of variation, Moorjani and her colleagues argue that the settlers who gave rise to contemporary Indians were part of a single migration out of Africa about 50,000 years ago.
In addition, the scientists found that the modern individuals sampled derive 1% to 2% of their ancestry from Neanderthals and their close cousins, the Denisovans, on par with Europeans. But Indians collectively carry a stunning variety of these archaic genes compared with other worldwide populations. About 90% of all known Neanderthal genes that have made their way into human populations turned up in the 2700 Indian genomes. That’s about 50% more than was recovered in a similar study of Neanderthal DNA in Icelanders that analyzed more than 27,000 genomes. The researchers also identified several new candidates for Neanderthal- and Denisovan-inherited genes that may have given their descendants some evolutionary advantage, though it’s too early to say what those boons might have been. Moorjani says ancient humans might have encountered and mated with a relatively large, genetically diverse population of our archaic cousins living on the subcontinent, although no fossils of those archaic cousins have been found.
Another possibility is that India’s vast geographical boundaries and close kin–marrying traditions preserved different segments of Neanderthal DNA than on other continents. Researchers need more genetic and archaeological studies to put those mysteries to rest, Witt says. “There are so many different possibilities, so many populations coming together. It’s a really complex problem to solve.” Correction, 5 March, 3:35 p.m.: This article originally incorrectly noted that herders from the steppe entered South Asia around 3000 B.C.E. www.biorxiv.org/content/10.1101/2024.02.15.580575v2.full
|
|
|
Post by Admin on Mar 10, 2024 2:49:22 GMT
Introduction
With more than 1.5 billion people and approximately 5,000 anthropologically well-defined ethno-linguistic and religious groups, India is a region of extraordinary diversity [1]. Yet, Indian populations are often underrepresented in genomic studies. Recent sequencing endeavors such as the 1000 Genomes Project (1000G) [2], UK Biobank [3], TopMed [4], Simons Genome Diversity Panel [5] and GenomeAsia [6,7] have incorporated Indian populations. However, with the exception of GenomeAsia [6,7], these efforts have either included very few individuals or primarily sampled expatriate communities outside of India, leading to a limited (and biased) representation of the genetic variation seen in India. As a result, many open questions remain about the population history of India: When did people first migrate to India from Africa, as part of the major migration out of Africa or at earlier times along the southern coastal route of migration? What is the contribution and legacy of archaic gene flow from Neanderthals and Denisovans to Indians? How have recent technological innovations like Neolithic farming and the spread of languages impacted variation in India?
To obtain a more complete picture of human diversity in India, we generated deep coverage genome sequences of ∼2,700 individuals from 18 states in India. Our samples are part of the Longitudinal Aging Study in India - Diagnostic Assessment of Dementia (LASI-DAD) [8], a population-based prospective cohort study that has collected nationally representative data on individuals aged 60 years or older. These data contain individuals from diverse geographic regions (including rural and urban areas), speakers of many language families (e.g., Indo-European, Dravidian and Tibeto-Burman languages) and various ethno-linguistic and caste groups (e.g., self-reported castes recognized by the Indian government), providing the most comprehensive snapshot of genetic diversity in India.
Data and catalog of novel variants
A total of 2,762 LASI-DAD participants, including 22 trios (mother-father-child), were sequenced at MedGenome, Inc. (Bangalore, India) at an average read depth of 30x. Individuals were sampled from 18 different states across India (Fig 1A), with a median sample size of 157 individuals per state (Supplementary Note S1). The raw whole genome sequences were sent to the Genome Center for Alzheimer’s Disease (GCAD) at the University of Pennsylvania for joint calling and quality control. A total of 2,679 samples and 73.2 million autosomal bi-allelic variants passed quality control filters, including 67.1 million single nucleotide variants (SNVs) and 6.04 million insertion-deletions (indels) (Supplementary Note S2). We identified 24 million novel SNVs and 2.2 million novel indels, underscoring the limitations of existing human genetic variation databases like the 1000G and Genome Aggregation Database (gnomAD) [9] in representing diverse populations. The vast majority (>99%) of the newly identified variants are rare, including 68% singletons, and less than 1% are common variants (with greater than 1% frequency) (Table S2.1). Genome phasing was conducted using SHAPEIT4 [10], and we estimated a low phase switch error rate of less than 1.15% in trios (Table S3.1).
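As a rough illustration of what a phase switch error rate measures, the sketch below compares a statistically phased haplotype with a trio-derived reference at heterozygous sites and counts how often the relative phase flips between neighbouring sites. The function name, the 0/1 encoding and the toy haplotypes are hypothetical; this is not the SHAPEIT4 pipeline or the authors' evaluation code.

```python
def switch_error_rate(inferred, truth):
    """Fraction of neighbouring heterozygous-site pairs whose relative phase differs.

    `inferred` and `truth` list the allele (0 or 1) carried on the first
    haplotype at each heterozygous site (hypothetical encoding).
    """
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    if len(inferred) < 2:
        return 0.0
    # For every site, record whether the inferred haplotype agrees with the truth.
    matches = [a == b for a, b in zip(inferred, truth)]
    # A switch is a change in agreement status between two neighbouring sites.
    switches = sum(m1 != m2 for m1, m2 in zip(matches, matches[1:]))
    return switches / (len(matches) - 1)


# Toy example: the inferred phase flips once, after the fourth site.
truth = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
inferred = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
rate = switch_error_rate(inferred, truth)
print(f"switch error rate: {rate:.2%}")
```

A haplotype that is flipped at every site counts as zero switches, because only changes in relative phase between neighbouring sites are penalised.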
|
|
|
Post by Admin on Mar 11, 2024 4:47:54 GMT
Figure 1 Population structure and admixture in India. (A) We show the sampling locations of individuals in the LASI-DAD study. States are colored by region (North, North-east, Central, South, East and West) used for analysis. (B) Principal component analysis (PCA) for Indians in LASI-DAD and 1000G individuals of European (EUR), East Asian (EAS) and South Asian (SAS) ancestry. We show the projection of the first two principal components, colored by [...] of birth. (C) Using qpAdm, we inferred the ancestry proportions for each individual on the ‘Indian cline’ using Sarazm_EN as a proxy for Iranian farmer-related, Central_Steppe_MLBA as a proxy for Steppe pastoralist-related and AHG (Onge) as a proxy for AASI-related ancestry. We compared AHG-related ancestry proportion by region (left), language family (middle), and caste group (right) of each individual.
Our dataset is representative of the population diversity in India. It includes individuals born in 23 different states from both rural (63%) and urban (37%) areas. It comprises speakers of around 26 different languages and individuals from diverse caste groups as recognized by the Indian government: 4% from Scheduled Tribes, 18% from Scheduled Castes, and 44% from Other Backward Classes (OBC). Nearly equal numbers of males and females were recruited in the study, with females constituting 52% of our dataset. For many analyses, we categorized individuals based on their birth location into six major geographic regions: North (n=555), West (n=385), Central (n=373), South (n=715), North-East (n=73), and East (n=530). After performing quality control checks and excluding first-degree relatives, we used a sample of 2,620 individuals for most of our analyses described below, unless specified otherwise (see Methods, Supplementary Note S1-2).
|
|
|
Post by Admin on Mar 12, 2024 3:03:21 GMT
Population structure and admixture
To study population relationships of Indians to other worldwide populations, we combined the LASI-DAD dataset with the 1000G [11] and applied principal component analysis (PCA) [12], ADMIXTURE [13] and f-statistics [14]. Consistent with previous reports [15,16], we find that the population structure in India is related to individuals of West Eurasian-related ancestry (1000G EUR), with limited or no recent gene flow from populations related to sub-Saharan Africans (Fig 1B, Fig S4.1). The population structure in India is correlated with geography (state of birth) and linguistic affiliation, with three main clusters. One cluster, referred to as the ‘Indian cline’, includes the majority of the individuals from the North and South of India who speak Indo-European and Dravidian languages and represents varying relatedness to West Eurasians (Fig 1B, Fig S4.2-3). The Indian cline has previously been shown to reflect variable proportions of ancestry from two ancestral groups: the Ancestral North Indians (ANI), who harbor large proportions of ancestry related to West Eurasians, and the Ancestral South Indians (ASI), who are distantly related to West Eurasians [15,16]. Recent ancient DNA analyses have shown that both ANI and ASI are admixed and, in turn, have ancestry from groups related to ancient Iranian farmers, ancient Eurasian Steppe pastoralists, and unsampled indigenous South Asians (Ancient Ancestral South Indians (AASI)) distantly related to Andamanese hunter-gatherers (AHG) [17].
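For readers unfamiliar with how PCA is applied to genotype data, here is a minimal sketch, assuming a matrix of 0/1/2 alternate-allele counts. It centres each SNP, scales it by the expected binomial standard deviation and takes the top components from an SVD. The function name and the two simulated populations are hypothetical; the preprint's own analysis uses dedicated software and real LASI-DAD and 1000G genotypes.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project individuals onto the top principal components.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                 # sample allele frequency per SNP
    keep = (p > 0.0) & (p < 1.0)             # drop SNPs that do not vary
    g, p = g[:, keep], p[keep]
    # Centre each SNP and scale by its expected binomial standard deviation.
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy data: two simulated populations whose allele frequencies have drifted apart.
rng = np.random.default_rng(0)
n_snps = 1000
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, n_snps))
pop_b = rng.binomial(2, freq_b, size=(20, n_snps))

pcs = genotype_pca(np.vstack([pop_a, pop_b]))
print("PC1 mean, population A:", pcs[:20, 0].mean())
print("PC1 mean, population B:", pcs[20:, 0].mean())
```

In this toy setting the two groups separate along the first component; a cline such as the one described above would instead appear as individuals spread continuously between the two ends.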
Beyond the Indian cline, we find two primary clusters of individuals (n=494): a cluster towards the ASI-end of the cline, and another found closer to the center exhibiting clear relatedness to East Asian-related groups (1000G EAS) in PCA (Fig 1B). The former mainly includes individuals from Central and East India, with the majority from the state of Odisha where predominantly Indo-European and Austro-asiatic languages are spoken. The East Asian-related cluster includes individuals from East and North-East regions of India. West Bengal is the most represented state in this cluster, with almost 10% ancestry related to East Asians. Using ALDER [18], we estimated the admixture-related linkage disequilibrium associated with EAS ancestry and inferred that this gene flow occurred 50 generations ago, or around 520 AD, possibly related to the invasions of the Huna people to India after the collapse of the Gupta Empire (Fig S4.11) [19,20]. Another predominant group in the East Asian-related cluster is from Assam. This group exhibits significant heterogeneity, as individuals have varying degrees of relatedness to EAS, indicative of recent gene flow, possibly related to the migration of East Asian tea plantation workers to India in the last two centuries [21] (Fig 1B). Our ADMIXTURE [13] analysis mirrors the patterns seen in PCA (Fig S4.6).
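The dating step rests on the fact that admixture-induced linkage disequilibrium decays roughly exponentially with genetic distance, at a rate equal to the number of generations since mixing. The sketch below fits that decay to a simulated curve and converts generations to a calendar year. It is a simplified illustration, not the ALDER software: the function names, the noise model, the 30-year generation time and the reference year of 2020 are all assumptions made here for illustration, not values stated in the text.

```python
import numpy as np


def fit_admixture_time(distance_morgans, weighted_ld):
    """Estimate generations since admixture from an LD decay curve.

    Fits weighted_ld ~ A * exp(-n * d) with a straight-line fit to log(LD).
    Returns (n, A).
    """
    d = np.asarray(distance_morgans, dtype=float)
    y = np.asarray(weighted_ld, dtype=float)
    slope, intercept = np.polyfit(d, np.log(y), 1)
    return -slope, np.exp(intercept)


def generations_to_year(generations, years_per_generation=30.0, reference_year=2020):
    """Convert generations before present to a calendar year (assumed defaults)."""
    return reference_year - generations * years_per_generation


# Simulated decay curve for an admixture event 50 generations ago.
rng = np.random.default_rng(1)
d = np.linspace(0.005, 0.05, 40)                  # 0.5 to 5 cM, in Morgans
ld = 0.01 * np.exp(-50 * d) * rng.normal(1.0, 0.02, d.size)

n_hat, amplitude = fit_admixture_time(d, ld)
print(f"estimated generations: {n_hat:.1f}")
print(f"approximate year: {generations_to_year(n_hat):.0f} AD")
```

Under these assumed conversion values, 50 generations corresponds to 520 AD; a different generation time would shift the date accordingly.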
Ancestry Composition and Sources
To model the ancestry in India, we used qpAdm, which compares allele frequency correlations between a population of interest and a set of reference and outgroup populations [14,22]. First, we examined how well the three-way model with ancient Iranian farmer-related, Eurasian Steppe pastoralist-related, and AHG-related groups describes the ancestry of individuals on the Indian cline (Fig 1B). Following Narasimhan et al. 2019 [17], we used Indus Periphery West, which is part of the Indus Periphery Cline (a heterogeneous group of 11 outlier samples from Bronze Age cultures of Shahr-i-Sokhta and Bactria Margiana Archaeological Complex), as the proxy for Iranian farmer-related ancestry, Central Steppe Middle to Late Bronze Age (Central_Steppe_MLBA) as the source for Yamnaya Steppe pastoralist-derived ancestry and AHG-related individuals to represent AASI ancestry [17]. We find the three-way model provides a good fit for the majority (>90%) of the individuals on the Indian cline, with some exceptions (we define ‘good fit’ as models with qpAdm p-value > 0.01, see Methods). Notably, we find 22 individuals that can be fitted as a two-way mixture between ancient Iranian farmer-related and AHG-related ancestries without Steppe pastoralist-related ancestry (referred to as ASI henceforth).
The archaeological context of the Indus Periphery Cline and their relationship to ancient Indian civilizations (e.g., Indus Valley Civilization) is unclear as these were migrant samples from Bronze Age Central Asian cultures [17]. Thus, we examined fifteen ancient Iranian-related groups from the Neolithic to Iron Age as the potential source of the Iranian farmer-related ancestry for the 22 ASI individuals and Indus Periphery West. We obtain good fits for all 22 ASI individuals when the Iranian-related ancestry derives from early Neolithic and Copper Age individuals from Central Asian cultures of either Sarazm_EN or Namazga_CA or a group containing Sarazm_EN and Parkhai_Anau_EN that was previously suggested as the source for Indus Periphery Cline [17]. The latter two models also provide good fits for Indus Periphery West, though using Sarazm_EN alone as the source does not yield a good fit (Table S4.2). Furthermore, a model with Sarazm_EN, AHG-related and Central_Steppe_MLBA also provides a good fit for the vast majority (>95%) of individuals on the Indian cline (p-value in qpAdm > 0.01). In contrast, models with Namazga_CA fail for >15% of individuals on the Indian cline, contrary to previous claims based on fewer samples [23]. Similarly, models with Sarazm_EN and Parkhai_Anau_EN do not work well for modern Indians and yield negative coefficients for Parkhai_Anau_EN ancestry (Table S4.3).
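To make the idea of mixture coefficients more concrete, the sketch below treats a target's vector of summary statistics as a weighted combination of the source populations' vectors, solves for weights that sum to one, and flags any negative weight as an infeasible model. This is only a simplified analogue of the kind of fit described above: it is not the qpAdm implementation, it omits standard errors and the test that yields the p-value, and every number in it is simulated. The function names are hypothetical.

```python
import numpy as np


def mixture_weights(source_stats, target_stats):
    """Least-squares weights w, constrained to sum to 1, with target ~ w @ sources.

    source_stats: (n_sources, n_stats) array; target_stats: (n_stats,) array.
    """
    s = np.asarray(source_stats, dtype=float)
    t = np.asarray(target_stats, dtype=float)
    # Enforce sum(w) = 1 by writing the last weight as 1 - sum(of the others).
    base = s[-1]
    design = (s[:-1] - base).T               # shape (n_stats, n_sources - 1)
    free, *_ = np.linalg.lstsq(design, t - base, rcond=None)
    return np.append(free, 1.0 - free.sum())


def is_feasible(weights, tol=1e-9):
    """A model with a negative ancestry coefficient is treated as failing."""
    return bool(np.all(np.asarray(weights) >= -tol))


# Simulated statistics for three hypothetical sources (rows) across six statistics.
rng = np.random.default_rng(2)
sources = rng.normal(size=(3, 6))

# A target that really is a 50/30/20 mixture of the three sources.
target = np.array([0.5, 0.3, 0.2]) @ sources
w = mixture_weights(sources, target)
print("weights:", np.round(w, 3), "feasible:", is_feasible(w))

# A target that can only be matched with a negative weight on the third source.
bad_target = np.array([0.8, 0.4, -0.2]) @ sources
w_bad = mixture_weights(sources, bad_target)
print("weights:", np.round(w_bad, 3), "feasible:", is_feasible(w_bad))
```

The second case mirrors the situation in which a candidate source receives a negative coefficient: the arithmetic still produces weights, but they cannot be read as ancestry proportions.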
Turning to the individuals that fall outside the Indian cline, we tried three models including Sarazm_EN, AHG-related, and either (a) Steppe pastoralist-related (as the Indian cline model), (b) Austro-asiatic-related (using Nicobarese), or (c) East Asian-related (using EAS) ancestries. We also tested four-way models with addition of Central_Steppe_MLBA if models (b-c) failed. We obtain good fits for 91% of the individuals that fall outside the cline (Table S4.4). Notably, there are 91 individuals that can be modeled without Steppe pastoralist-related ancestry, including ∼96% of the Austro-asiatic-related individuals (using model b). This suggests Iranian farmer-related ancestry likely did not come through Steppe pastoralist-related groups to India.
Archaeological studies have also documented trade connections between Sarazm and South Asia, including connections with agricultural sites of Mehrgarh and early Indus Valley Civilization [24]. Indeed, one of the two Sarazm_EN individuals (Sarazm_EN_1) was found with shell bangles that are identical to ones found at sites in Pakistan and India such as Shahi-Tump, Makran and Surkotada, Gujarat [25] (J. Mark Kenoyer, personal communication). Surprisingly, when we applied qpAdm, we discovered that Sarazm_EN_1 has substantial AHG-related ancestry (∼15%), unlike the other individual from the Sarazm_EN group (Sarazm_EN_2). Application of the three-way model with Sarazm_EN_2, AHG-related and Central_Steppe_MLBA continues to provide a good fit for most individuals (>96%) on the Indian cline, as well as off-cline individuals (Table S4.7-8). Moreover, the two-way model without Steppe pastoralist-related ancestry works well for the 22 ASI individuals and Indus Periphery West (without need for additional ancestry from Parkhai_Anau_EN). Together, our data are consistent with a common source for the ancient Iranian-related ancestry in ANI, ASI, Austro-asiatic-related and East Asian-related individuals in India, suggesting that the Iranian-related gene flow occurred well before the arrival of Steppe pastoralist-related ancestry in the Bronze Age (∼1900–1500 BCE [17]).
Using AHG-related, Sarazm_EN and Central_Steppe_MLBA as reference populations, we inferred the genetic composition of individuals on the Indian cline. We find marked variation in ancestry proportions across India, with Iranian farmer-related ancestry varying between ∼27–68%, AHG-related between ∼19–69% and Central_Steppe_MLBA between ∼0–45%. Among the three ancestry components, variation in AHG-related shows the strongest correlation to the ANI-ASI cline in PCA (Fig S4.10). AHG-related ancestry proportion is significantly associated with geography (e.g., highest in South and lowest in North of India), language (i.e., higher in Dravidian vs. Indo-European language speakers) and caste affiliation (highest in Scheduled Castes, Scheduled Tribes and OBC compared to other groups) (Fig 1C, Extended Data Fig 1). This highlights that the ancient admixture events are related to the spread of languages and the history of the traditional caste system in India.
|
|
|
Post by Admin on Mar 14, 2024 0:02:19 GMT
|
|