Genetic Ancestry of the Andamanese

Admin
Administrator

Posts: 72,897

Genetic Ancestry of the Andamanese Dec 1, 2021 2:14:29 GMT

Post by Admin on Dec 1, 2021 2:14:29 GMT

Clades and closest neighbour analysis
The phylogenetic tree was used to define and calculate the age
of the clades of interest, and for the closest
neighbour analysis. A clade was defined when all the
sequences in a subtree came from individuals from the
same (super) population (Fig. 2). In our case, a clade must
contain at least two sequences. Super-populations
were defined as three different categories: North Indian
[Indo-European Speakers] non-tribal, South Indian [Dravidian
Speakers] non-tribal and Indian in general. For
the Indian super-population, we included both North
and South Indian non-tribal individuals with Dravidian
(Irula) and Austro-Asiatic (Birhor) individuals, but not
the Andamanese and Tibeto-Burman individuals as they
have a different ancestry from other Indian populations
(Mondal et al. 2016). We searched for all the largest
clades in the phylogenetic tree for each super-population,
regardless of the haplogroup classification. The algorithm
stops the search when one individual does not belong to
that super-population. Then we calculated the Time to
the Most Recent Common Ancestor (TMRCA) of such
clades (Fig. 2), which were used to calculate the divergence
time of internal clusters in Fig. 4a.
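
The clade search described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code (the paper used the "ape" R package). The nested-tuple tree, the sample names, the labels and the `maximal_clades` helper are all hypothetical.

```python
def maximal_clades(tree, label_of, accepted, min_size=2):
    """Return the largest subtrees whose tips all carry an accepted label.

    tree     -- nested tuples; a tip is a string (hypothetical sample ID)
    label_of -- dict mapping each tip to its population label
    accepted -- set of labels that make up the super-population
    """
    found = []

    def visit(node):
        # Returns (tips below node, True if every tip is accepted).
        if isinstance(node, str):
            return [node], label_of[node] in accepted
        results = [visit(child) for child in node]
        tips = [tip for child_tips, _ in results for tip in child_tips]
        if all(pure for _, pure in results):
            return tips, True
        # The node is mixed, so each pure child cannot grow any further.
        for child_tips, pure in results:
            if pure and len(child_tips) >= min_size:
                found.append(sorted(child_tips))
        return tips, False

    tips, pure = visit(tree)
    if pure and len(tips) >= min_size:
        found.append(sorted(tips))
    return found


# Toy tree: two North Indian tips next to a European tip, and three South Indian tips.
tree = ((("N1", "N2"), "E1"), (("S1", "S2"), "S3"))
label_of = {"N1": "North", "N2": "North", "E1": "Europe",
            "S1": "South", "S2": "South", "S3": "South"}

north = maximal_clades(tree, label_of, {"North"})
india = maximal_clades(tree, label_of, {"North", "South"})
```

Passing a set of labels lets the same function serve the combined Indian super-population, which pools North and South Indian individuals. Computing the TMRCA of each clade would additionally need branch lengths, which this toy tree omits.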

The closest neighbours were those sequences which
were closest to a specific clade of a super-population (just
outside of the clade) and which included at least one Y-chromosome from
a different super-population. In this case, we
also identified the specific population to which the closest
neighbour belongs, and the time depth of the joint cluster
(TMRCA of the clade and the closest neighbour together).
Depending on the tree structure, the closest neighbour of
a single clade can consist of a single or multiple individuals
(Fig. 2). The divergence times of such neighbours were
calculated from the average TMRCA of the joint cluster
to every individual of that cluster (essentially the average
height of the joint cluster). The analysis of closest neighbours
provides information about the time and location
of the most recent migrations between the target populations and
other populations represented in the tree. In
Fig. 4a, the blue distribution shows the divergence time of
all such neighbours from specific super-population clades.
Figure 4b shows the time depth of the closest neighbour
for each sequence, separated by population of origin (horizontal
axis), and differentiating the three super-populations
where the closest neighbour is found (North India, South
India and India). In Fig. 4c, we concentrated only on
Europeans that are the closest neighbours of the Indian
super-population (essentially a subset of Fig. 4b).
All the phylogenetic and clade analyses were done with
the “ape” R package (Paradis et al. 2004). As clade-specific
analysis can be biased because of sampling effects, we also
looked for the closest European for every Indian individual
regardless of their clade or haplogroup, with similar results
(not shown).
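
As a rough illustration of the closest-neighbour step, the sketch below finds the joint cluster of a clade (the smallest subtree that holds the clade plus at least one other tip) and reports the average height of that cluster. It is written in Python under assumptions of our own: the `Node` class, the branch lengths and the sample names are hypothetical, and the original analysis was done in R.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Node:
    name: str = ""        # sample ID for tips, empty for internal nodes
    length: float = 0.0   # branch length to the parent, in kya
    children: list = field(default_factory=list)


def tip_depths(node):
    """Map every tip below `node` to its path length from `node`."""
    if not node.children:
        return {node.name: 0.0}
    depths = {}
    for child in node.children:
        for tip, depth in tip_depths(child).items():
            depths[tip] = depth + child.length
    return depths


def joint_cluster(root, clade_tips):
    """Descend to the smallest subtree holding the clade plus other tips."""
    clade_tips = set(clade_tips)
    node = root
    while True:
        # A child qualifies only if it holds the clade AND something else.
        bigger = [c for c in node.children if clade_tips < set(tip_depths(c))]
        if not bigger:
            return node
        node = bigger[0]


def closest_neighbours(root, clade_tips):
    """Return (neighbour tips, average height of the joint cluster)."""
    depths = tip_depths(joint_cluster(root, clade_tips))
    neighbours = sorted(set(depths) - set(clade_tips))
    return neighbours, mean(depths.values())


# Toy tree: clade (N1, N2) joins E1 at 8 kya; S1 branches off at the root.
clade = Node(length=5.0, children=[Node("N1", 3.0), Node("N2", 3.0)])
joint = Node(length=10.0, children=[clade, Node("E1", 8.0)])
root = Node(children=[joint, Node("S1", 18.0)])

neighbours, depth = closest_neighbours(root, ["N1", "N2"])
```

Averaging the node-to-tip path lengths mirrors the "average height of the joint cluster" wording, and it also works when the tree is not perfectly ultrametric.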

Last Edit: Dec 1, 2021 2:17:14 GMT by Admin

Post by Admin on Dec 1, 2021 5:24:04 GMT

Whole‑genome sequence analysis
A total of ten Japanese (JPT) autosomal whole-genome
BAM files (five haplogroup D and five haplogroup O) were
downloaded from the 1000 Genomes Project site. Andamanese
and Dai BAM files were accessed from a previous
project (Mondal et al. 2016). Variant calling was done in a
similar way to that of the Y-chromosome, except that we
changed the ploidy option to 2 for HaplotypeCaller and
did the variant calling for only the polymorphic SNPs in
our data set. We used the VariantRecalibrator from GATK
using dbsnp137, HapMap 3.3, the 1000 Genomes Project
Omni 2.5 and the 1000 Genomes Project Phase 1 SNPs
with high confidence downloaded from the Broad Institute
ftp site (ftp.broadinstitute.org, 11/05/2013). After VariantRecalibrator,
we kept only SNPs that passed the filter
and had no missing information. We added ancestral
information from the 1000 Genomes Project website
(http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/). We then
used ADMIXTOOLS 1.1 (Patterson et al. 2012) to calculate Dstat
for autosomal data.
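
For readers unfamiliar with the statistic, the sketch below shows the common ABBA-BABA form of D computed from derived-allele frequencies, with the outgroup fixed for the ancestral allele. It is a Python illustration only and is not claimed to reproduce the exact estimator or sign convention used by ADMIXTOOLS. The population labels P1, P2, P3 and the toy sites are hypothetical.

```python
def d_statistic(sites):
    """ABBA-BABA D from per-site derived-allele frequencies.

    sites -- iterable of (p1, p2, p3) frequencies in populations P1, P2, P3;
             the outgroup is assumed to carry only the ancestral allele.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2, p3 in sites:
        abba = (1 - p1) * p2 * p3   # derived allele shared by P2 and P3
        baba = p1 * (1 - p2) * p3   # derived allele shared by P1 and P3
        numerator += abba - baba
        denominator += abba + baba
    if denominator == 0:
        raise ValueError("no informative sites")
    return numerator / denominator


# Toy data with single haplotypes: two ABBA sites, one BABA site, and one
# site that is derived only in P3 and therefore uninformative.
sites = [(0, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 1)]
d_value = d_statistic(sites)
```

A positive value here means an excess of allele sharing between P2 and P3, a negative value an excess between P1 and P3, and a value near zero is what a tree without gene flow would give.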

Simulations
We built a simple model where JPT-O (Japanese with haplogroup O
Y-chromosomes) separated from AND (Andamanese) and JPT-D (Japanese
with haplogroup D) 77 kya,
and AND and JPT-D separated from each other 53 kya,
following the Y-chromosome analysis. We used a mutation
rate (μ) of 1.25 × 10^-8 per site per generation (Scally and
Durbin 2012), a recombination rate (r) of 1.3 × 10^-8
per site per generation and a generation time of 29 years. As Dstat
is neither affected by the effective population size nor by
time of admixture (Patterson et al. 2012), we set the effective
population sizes of all these populations to 10,000 and
the time of admixture from JPT-O to JPT-D around 400
generations ago. We simulated 30,000 regions of 50 kb
each using ms (Hudson 2002):
where <VAR> = 0–0.99 in steps of 0.01.
Dstat values were calculated from the simulated data.
The fitting of the data was done with the “lm” function
in R.
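
The model parameters above are given in years and per-site rates, whereas ms expects scaled quantities: times in units of 4N generations, and per-locus θ = 4Nμ and ρ = 4Nr. The Python sketch below works through that conversion using the values stated in the text. It assumes the standard ms scaling applies and approximates the locus length as 50,000 sites for both θ and ρ. It does not attempt to reconstruct the ms command line, and the variable names are our own.

```python
N = 10_000          # effective population size of every population
MU = 1.25e-8        # mutation rate per site per generation
R = 1.3e-8          # recombination rate per site per generation
GEN_YEARS = 29      # generation time in years
LOCUS_BP = 50_000   # length of each simulated region


def years_to_ms_time(years):
    """Convert years to ms time units (4N generations)."""
    return years / GEN_YEARS / (4 * N)


def generations_to_ms_time(generations):
    """Convert generations to ms time units (4N generations)."""
    return generations / (4 * N)


theta = 4 * N * MU * LOCUS_BP              # about 25 per region
rho = 4 * N * R * LOCUS_BP                 # about 26 per region
t_outer_split = years_to_ms_time(77_000)   # JPT-O vs (AND, JPT-D), about 0.066
t_inner_split = years_to_ms_time(53_000)   # AND vs JPT-D, about 0.046
t_admixture = generations_to_ms_time(400)  # JPT-O into JPT-D, 0.01

# Grid of values swept for <VAR>: 0, 0.01, ..., 0.99.
var_grid = [i / 100 for i in range(100)]
```

With N = 10,000, the 400-generation admixture event sits at 0.01 coalescent units, well after both splits, which is consistent with the ordering of events in the model.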

Accession numbers
Sequences have been deposited in the European Nucleotide Archive.
PRJEB11455: Andamanese whole-genome
sequences (BAM and FASTQ files); PRJEB16019: continental Indian
whole-genome sequences; PRJEB19598:
Y-chromosome VCF files.
Files are also accessible through an ftp server. User:
US3Pdczl (case sensitive)
Password: KMRmDQQv (case sensitive). Server: sitweb.upf.edu
Transfer mode: active
A high-quality copy of the Y-chromosome tree in
Fig. 3a and b is accessible in: .
org/jbertranpetit/scientific-publications/ and also in
www.dropbox.com/sh/ldpkqtci6yxnqsk/AAA2062ovKULGmsWbvbYDzXVa?dl=0.

Last Edit: Dec 1, 2021 5:24:45 GMT by Admin

Post by Admin on Dec 1, 2021 22:08:35 GMT

Results and discussion
Indian non‑tribal Y‑chromosome ancestry
The haplogroup frequency distribution in South Asia
(the 42 new Indian sequences and the 263 from 1000
Genomes Project) (Fig. 1) is similar to that documented
earlier (Poznik et al. 2016) in Indian populations. It is
interesting to note that the well-recognized genetic cline
from North to South for non-tribal Indian populations
using autosomal data (Reich et al. 2009; Juyal et al. 2014;
Basu et al. 2015) is not evident in the Y-chromosome
haplogroup distribution (Fig. 1; Table 2): there is no specific
haplogroup frequency correlated with the North to
South distribution, a cline that could have been produced
by a single population migration and admixture between
two main original populations (represented now by Indo-European
and Dravidian speakers) as proposed by some
researchers (Reich et al. 2009). It should be pointed out,
however, that the total number of populations analysed
here is small, some sample sizes are very small, and three
of the populations [Gujarati Indian from Houston, Texas
(GIH), Indian Telugu from the UK (ITU) and Sri Lankan
Tamil from the UK (STU)] from the 1000 Genomes Project data
set were collected from emigrant individuals.
Nevertheless, previous studies that focused on classical
Y haplogroup frequencies (Kivisild et al. 2003; Cordaux
et al. 2004; Sengupta et al. 2006) have similarly found
that such a cline is not present.
We next analysed whole Y-chromosomal sequences
according to their position in the calibrated phylogeny
(Fig. 3a and b). The initial analysis considered all
clades (made up of at least two adjacent sequences) that
belong to the same super-population (North India [mainly
Indo-European-speaking populations] or South India
[mainly Dravidian-speaking populations]), including
only the caste and not the tribal populations (see “Methods”).
When calculating the oldest TMRCA of such clades
for the two groups, we found that North Indian and South
Indian-specific clades have a similar time of divergence
(Fig. 4a in red, with strong overlap). The South does not show
the older distribution that could be expected if it
harboured an older population structure than the North. More
specifically, under the simple model in which Ancestral North
Indian (ANI) and Ancestral South Indian (ASI) populations
mixed in different proportions, creating a north to south
cline of ANI-ASI ancestry, the South Indian cluster would
be expected to show a higher number of older TMRCAs than the
North Indian cluster (as the latter would derive from a more recent
migration with a likely bottleneck, and thus would have
lower TMRCAs), which we failed to detect in our data set.
Interestingly, both of these populations have their oldest

Post by Admin on Dec 2, 2021 0:32:55 GMT


Table 2 Haplogroup counts in Indian population samples
ti is the oldest time to the most recent common ancestor for a population-specific clade, and t0 is the
divergence time of the closest neighbour of such clades, both in kya. Sample sizes are given in parentheses after each name.

Name      C  D  H  J  L  N  O  R  Mean (t0)/mean (ti)
PUN (1)   0  0  0  0  0  0  0  1  NA
UBR (7)   0  0  1  2  0  0  0  4  5.85/2.9 (2.0)
RAJ (7)   1  0  2  2  0  0  0  2  NA
BEN (1)   0  0  0  1  0  0  0  0  NA
VLR (8)   0  0  1  3  4  0  0  0  8.4/6.5 (1.3)
ILA (3)   0  0  2  0  1  0  0  0  NA
BIR (4)   0  0  4  0  0  0  0  0  6.6/4.5 (1.5)
AND (5)   0  5  0  0  0  0  0  0  53.0/8.9 (6)
RIA (6)   0  0  0  0  0  0  6  0  11.1/11.2 (1.0)

TMRCA at ~18 kya, as reported earlier although using
microsatellites (Sengupta et al. 2006). Nonetheless, it has
to be stressed that the time depth of a cluster may not be
a good estimator of the population expansion time, as the
Y-chromosomes in the migrant population will have pre-existing
diversity that may remain in the new location (Barbujani et al.
1998). This is the reason for analysing not just
the age of the clades, but also including the closest neighbour,
as explained below and in “Methods”.
From the same tree (Fig. 3a), it is possible to calculate
the divergence time of the closest neighbour of each specific
clade that has at least one individual not belonging
to the same super-population (North or South India); this
analysis will tell us where the closest sequence has been
found outside the region, and the time depth of such neighbours.
This analysis may thus provide interesting information about the
neighbouring clade that was “left behind” in
the migratory process, and so inform about the common
time and maybe place of origin of the ancestors of both
populations; nonetheless, caution has to be taken because
this analysis depends on sample size. Results
(Fig. 4a in blue) show that the distributions for North and
South India are similar, stressing that the neighbours of
North and South Indian Y-chromosomes are mostly of similar age,
although in South India there is a small proportion
of deeper clusters with ages >30 kya, the only difference
detected between the two super-populations.
We then concentrated on the origin and distribution
of such neighbours in different worldwide populations
(Fig. 4b). When looking at these closest neighbours, we
found that all Indian populations (both North and South)
are the closest neighbours of each other until very recently,
~5 kya (Fig. 4b; note the effect of the smaller sample size of
the new Indian sequences as compared to the
1000 Genomes Project). No differences between North
and South could be discerned within India, which had a
similar reciprocal sharing of neighbours except that some
South Indian neighbours are found with large time depth
(in Fig. 4b, four points above 30 kya, one each in ILA and
STU, and two in BEB) as the only tendency towards more
ancient times, as seen also in Fig. 4a.
Interestingly, none of the closest neighbours of South
Indian non-tribal clades was found outside India (except
some American populations from the 1000 Genomes Project, which
are known to be admixed), while North India
(and India, of course) had a high frequency of neighbours
among Europeans, suggesting that South Indian-specific
clades have deeper roots in the Indian subcontinent.
Nearly no neighbours were detected among East Asians
or Africans. Surprisingly, the two South European populations
(Toscani in Italia, TSI, and Iberian Population in
Spain, IBS) are the closest neighbours of North Indian
populations outside India (Fig. 4b); unfortunately in this
data set there are no data available for West Asia to indicate
a more plausible place where the two groups (India
and South Europe) could have some partial common origin;
future work in the regions will allow a more precise
analysis. The distribution of time depths for the closest
neighbours of Indians demonstrated two different clusters
for these two South European populations (Fig. 4c). One
is common to all Europeans and close to 38.6 kya (±7.4
kya), while the second is more specific to South Europeans
(TSI and IBS) and around 13.9 kya (±4.6 kya).
However, we need to stress that the absence of a relevant
sample (likely from Western Asia) in the closest neighbour
analysis can lead to a higher estimated time of divergence than
the true divergence.
When looking at the haplotype composition of the
Y-chromosomes that have their closest neighbours in
Europe in recent times (less than 25 kya), a high proportion
(~72%) belongs to haplogroup J2, which is well
distributed in India and of clear ancestry in West Asia,
likely related to the demic diffusion during the Neolithic
(Singh et al. 2016), and a much lower proportion to haplogroup
R1a (~10%) which is reported to be associated
with the Indo-European language migration (Semino
2000). Interestingly, the divergence times of these closest
European neighbours are ~17.3 (±3.8) kya for J2 and 7.3
(±2.2) kya for R1a.

Last Edit: Dec 2, 2021 19:08:14 GMT by Admin

Post by Admin on Dec 2, 2021 2:25:56 GMT

Fig. 4 Phylogenetic neighbours of Indian Y-chromosomes. a In red,
distribution of the time to the most recent common ancestor of North
Indian and South Indian-specific clades (internal clusters). In blue,
distribution of the divergence time of the closest neighbours of such
clusters that do not belong to the specific super-population (external
clusters). Left: North Indian-specific clades; right: South Indian-specific clades.
Y axis is kya. b Closest neighbours for North Indian-specific clades (red plus sign),
South Indian-specific clades (blue circle) and Indian-specific clades (black triangle)
that are external to each super-population. The X axis shows the population of these
external neighbours, whereas the Y axis shows the divergence time.
c Histogram of divergence time of the closest external neighbours of
Indian-specific clades that are found among Europeans (CEU, GBR,
FIN, TSI, IBS).

Last Edit: Dec 2, 2021 2:29:21 GMT by Admin