Origins and demographic dynamics of Tupí expansion: a genetic tale

Tupí linguistic groups display a wide geographical dispersion in South America and, as linguistic evidence indicates, probably originated in the Madeira-Guaporé region (MGR) in Brazil. The present study reviewed genetic data on Tupians for autosomal and uniparental (Y-chromosome and mtDNA) markers, using them to evaluate the geographic origin of Tupians as well as the demographic dynamics of their dispersion from a genetic point of view. Comparison of genetic variability and of mtDNA haplogroup D frequencies suggests a scenario where MGR is the Tupí homeland. The relationship between five estimators of genetic variability (thetas-S, -Pi, -m2, -H and -k) shows that Tupí groups from MGR and non-MGR experienced different patterns of demographic dynamics, with an ancient Tupí expansion in MGR, followed by dispersion to other South American regions, probably associated with depopulation/founder effect events. Furthermore, other recent depopulation events could also be detected in both regions. Finally, the dispersion seems to be related to patrilocality, as suggested by comparison of genetic differentiation in uniparental markers. This genetic model of dispersion dynamics may have an important impact on the interpretation of archeological and linguistic data, allowing one to test whether female-associated technologies, like ceramics, are more extensively shared between dispersed populations than those that are not female-exclusive.

INTRODUCTION
Amerindian linguistic composition in Brazil encompasses between 154 and 170 languages clustered in 20 major groups (Moore, 2005), most of them located in Amazonia. This region shows a huge linguistic diversity, with at least 52 linguistic families (Epps, 2009). Considering size and geographic dispersion, four main families can be highlighted: Tupí, Arawak, Carib and Macro-Je (Epps, 2009).
The Tupí linguistic family is widely dispersed in South America, branching into ten groups, most of them located in the state of Rondônia (Gabas, 2006), a region close to the Madeira and Guaporé rivers (Madeira-Guaporé region, MGR). Five branches (Arikém, Mondé, Puruborá, Ramaráma and Tuparí, besides some dialects of the Kawahíb complex) are mainly located in MGR, while the remaining branches dispersed to regions outside Amazonia, notably the Tupí-Guarani branch, which has occupied areas in the Brazilian South and Southeast and on the East coast (Gabas, 2006; Epps, 2009).
Linguistic and genetic data support the origin of the Tupí family between 2800 and 3000 years ago (Urban, 1998; Amorim et al., 2013). Its geographic origin is still controversial, although the Amazonian region has been considered the best candidate. Linguistic analyses point to MGR as the putative origin, while some archeological data suggest the confluence of the Madeira and Amazon rivers as a better alternative (Rodrigues, 1964; Migliazza, 1982; Urban, 1996, 1998; Heckenberger et al., 1998; Noelli, 1998). Morphological studies do not suggest any specific origin, but they are also consistent with an Amazonian origin (Neves et al., 2011).
The linguistics-based proposal of MGR as the geographic origin of the Tupí family is well accepted, mainly because of the higher Tupian linguistic variability of this region. Moreover, some archeological peculiarities of MGR also support this idea. This region has evidence of continuous human occupation for at least 9000 years, and has the most ancient "dark soil" site (anthropogenic darkened soil), dating to 4700 years before the present (BP) (Zimpel-Neto, 2009). Old ceramic patterns found in such "dark soils" were indirectly associated with Tupian ceramic patterns (Zimpel-Neto, 2009). In this context, the origin of the Tupí and their main dispersion events should have occurred during the last 3000 years (Marrero et al., 2007), and some authors propose agriculture as the most important factor for their wide dispersion across South America (Epps, 2009).
Regarding the use of genetic data to infer the past of Amerindians, the earliest genetic datasets were restricted to blood groups. In the 1970s, with the development of electrophoresis and serological methods, the number of known genetic markers increased, including variants of many proteins from serum and blood cells. However, the genetic variability detected by those markers, called classical markers, is low (Rogers and Jorde, 1996).
In the 1990s, it became possible to obtain molecular data from direct DNA analysis, which revealed a substantial number of markers. This breakthrough allowed researchers to detect high levels of genetic variability in human populations, permitting more detailed inferences about their histories. More specifically, mtDNA and Y-chromosome polymorphisms, called uniparental markers, were also extensively explored, being used to reconstruct male and female biological histories separately (Mulligan et al., 2004).
The main advantage of classical markers is that they were investigated in a large number of Amerindian tribes. However, they are mostly proteins and reflect only the variability of the coding regions of genes. It is reasonable to suppose that some of them are under natural selection pressure, and they therefore represent only a small part of human genome variability.
Although modern molecular markers have been extensively studied in Amerindian populations, the number of populations investigated is still low compared with that for the classical ones. Nonetheless, molecular markers remain the preferred option for complex analyses, helping to infer more details about the demographic history of populations.
The main types of molecular markers are single nucleotide polymorphisms (SNPs), short tandem repeats (STRs), Alu insertions and specific DNA sequences. These markers can be found either in autosomes (inherited equally from both parents), or in the Y-chromosome or mtDNA, which are inherited patrilineally and matrilineally, respectively.
Briefly, SNPs are mutations involving one nucleotide in DNA sequences; they are quite abundant in the human genome, mostly have two alternative forms (alleles) and show low variability. Alu insertions are sequences of approximately 300 bases scattered in the genome, where the insertion itself is the polymorphism, i.e., an individual may have an Alu insertion in a given position, whereas others might not have any insertion in the same position. Alu insertions are restricted to primate genomes, are easy to genotype and have been used in many population studies (Novick et al., 1995). STRs are tandemly repeated sequence motifs of one to six bases. Their variation is related to the number of times the motif is repeated. These markers are also abundant in the genome, easy to genotype and have many alleles, being highly polymorphic (Tautz, 1993). Some DNA sequences are also highly polymorphic; they are commonly composed of a few hundred bases, where each base is a potential SNP. Patrilineal markers (Alu insertions, SNPs, and STRs found in the Y-chromosome) are used to reconstruct male demographic history, while matrilineal markers (mtDNA SNPs and mtDNA sequences) reconstruct female history (Mulligan et al., 2004).
All mitochondria have a genome composed of a single circular DNA molecule. Some key positions, or SNPs, in the nucleotide sequence are used to define groups of sequences, called haplogroups. The combination of only four SNPs in mtDNA is enough to define the five haplogroups observed in Amerindian populations, called haplogroups A, B, C, D and X. Additionally, the sequencing of a specific region of mtDNA with approximately 300 base pairs, called hypervariable region I (HVR I), provides a highly variable marker that is very useful for inferring the past demographic history of human female lineages (Mulligan et al., 2004).
Genetic variability is the result of both selective pressures and stochastic events. Usually, the level of intrapopulation genetic variability can be measured by at least five different statistical measures or estimators. The first and simplest is the number of alleles (k), which can be estimated for all types of genetic markers (SNPs, STRs, Alu insertions and DNA sequences). Heterozygosity (H) is a very robust statistic that evaluates genetic variability for all kinds of markers, considering the number of alleles and their frequencies. There is an extra statistical measure exclusive to STR markers, called variance of the number of repeats (m2), which is based on the number of alleles, their frequencies and the degree of difference between alleles. Moreover, DNA sequence markers have two additional statistical measures, the nucleotide diversity (Pi) and the number of polymorphic sites (S) (Chakraborty et al., 1988; Slatkin, 1995; Shriver et al., 1997; Tajima, 1989).
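As a minimal illustration of the first three measures (not taken from the reviewed studies), the sketch below computes k, H and the variance in repeat number for a single STR locus. The function name `str_locus_summary`, the sample of repeat counts and the use of the sample-size-corrected gene diversity, n/(n-1) × (1 − Σp²), are assumptions made for the example.

```python
from collections import Counter


def str_locus_summary(repeats):
    """Summarise one STR locus.

    `repeats` is a list of allele sizes (number of motif repeats),
    one entry per sampled chromosome. Hypothetical helper for illustration.
    """
    n = len(repeats)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    counts = Counter(repeats)
    k = len(counts)  # number of distinct alleles
    freqs = [c / n for c in counts.values()]
    # Expected heterozygosity (gene diversity) with small-sample correction.
    h = n / (n - 1) * (1.0 - sum(p * p for p in freqs))
    # Variance of the number of repeats (m2), unbiased sample variance.
    mean = sum(repeats) / n
    m2 = sum((x - mean) ** 2 for x in repeats) / (n - 1)
    return k, h, m2


# Invented sample: four chromosomes carrying 10, 10, 11 and 12 repeats.
k, h, m2 = str_locus_summary([10, 10, 11, 12])
print(k, round(h, 4), round(m2, 4))
```

Note that k ignores frequencies, H ignores allele sizes and m2 uses both, which is why the three measures can react differently to the same demographic event.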
All five statistical measures can be independently used to estimate a summary statistic called theta. Theta can be calculated from any of the aforementioned genetic variability statistics and is equal to 4Nu (for diploid loci), where N is the effective size of the population and u is the mutation rate. It is noteworthy that effective size is not the same as census population size, but is related to the number of reproductively active individuals. However, this relation holds only if the population is in mutation-genetic drift equilibrium. Mutation is an evolutionary factor that introduces variability into the population, while genetic drift is the random variation of allele frequencies that leads to a loss of variability. Thus, if a given population, in the absence of selective pressure, retains its effective size for many generations and no substantial migration occurs, the population reaches this equilibrium state, and theta estimates obtained from different variability estimators should be very similar (Chakraborty et al., 1988; Slatkin, 1995; Shriver et al., 1997; Tajima, 1989).
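The text does not state which formulas convert each measure into theta. One common choice for STRs, assumed in the sketch below, is the strict single-step stepwise mutation model, under which the expected homozygosity is 1/sqrt(1 + 2·theta) and the expected variance in repeat number is theta/2, with theta = 4Nu. The function names are hypothetical and the input values are invented.

```python
def theta_from_h(h):
    """Theta implied by heterozygosity H under the stepwise mutation model.

    Assumes E[1 - H] = 1 / sqrt(1 + 2 * theta), with theta = 4Nu.
    """
    if not 0.0 <= h < 1.0:
        raise ValueError("H must lie in [0, 1)")
    return 0.5 * (1.0 / (1.0 - h) ** 2 - 1.0)


def theta_from_m2(m2):
    """Theta implied by the variance in repeat number, assuming E[m2] = theta / 2."""
    if m2 < 0.0:
        raise ValueError("variance cannot be negative")
    return 2.0 * m2


# Invented values chosen so that both estimators agree,
# as expected at mutation-genetic drift equilibrium.
print(theta_from_h(0.5), theta_from_m2(0.75))
```

Under these assumptions, a locus whose two theta estimates differ markedly would be a candidate for a population that is not at equilibrium.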
Populations are not always in mutation-genetic drift equilibrium; sometimes they are in a so-called transient state, caused by fluctuations in population size over time, such as bottlenecks and rapid expansions. These demographic events affect the variability estimators in different ways. Thus, in a transient state, theta estimates obtained from distinct variability estimators will not be similar. Moreover, each kind of demographic event, expansion or bottleneck, leaves a different signature in the relation between thetas from different estimators. This feature makes it possible to evaluate these differences and draw inferences about the past demographic history of populations (Slatkin, 1994, 1995).
The dynamics of the Tupí expansion was explored only recently by Ramallo et al. (2013), who suggested a radial expansion into new territories, starting putatively from MGR. However, no genetic studies have tested the hypothesis of a Tupí origin in MGR, except for one previous study from our group (Santos et al., 2013).
Genetic studies have accumulated data from Tupí populations for many decades. Here, a review of the available Tupí genetic data allows inferences about both their geographic origin and the demographic dynamics of their dispersion. The main results of the present study provide evidence for MGR as the Tupí homeland, and indicate that the dispersion was accompanied by depopulation and female-biased migration.

MATERIAL AND METHODS
Genetic data review: Genetic data on several types of loci, such as classical protein markers, STRs, Alu insertions, Y-STRs and mtDNA, were extensively gathered from the literature. The sources are cited where the data are used, and the results are presented in the next sections.
Study design: In order to evaluate the geographic origin of Tupians, two groups were defined, the MGR and non-MGR Tupians. Two approaches were used: the first compares the heterozygosity of populations from MGR and outside MGR for a number of groups of genetic markers; the second compares the frequencies of mtDNA haplogroups in both regions. All comparisons were performed by the Wilcoxon paired test and the Mann-Whitney test.
The inferences about demographic dynamics are based on theta estimates obtained from five different measures of genetic variability, H, k, m2, Pi and S, as described above. The first approach evaluates the relation between thetas k, H and m2 from STR data, as described in the results and discussion section; the comparison was performed by the paired Wilcoxon test. The second approach investigates the relationship between theta S and theta Pi from mtDNA sequences, since they provide clues about population expansion or depopulation (Tajima, 1989).
Migration processes during the expansion of Tupians were evaluated by comparing the uniparental markers, mtDNA and Y-chromosome, and testing the trend across a number of studies by the Wilcoxon paired test.
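For the small numbers of paired observations typical of this kind of review, the paired Wilcoxon test can be computed exactly by enumerating all sign assignments. The sketch below is a generic pure-Python illustration, not the software used in this study; the function name `wilcoxon_paired_exact` and the heterozygosity values are invented for the example.

```python
from itertools import product


def wilcoxon_paired_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W+, p-value). Zero differences are dropped and tied absolute
    differences receive average ranks. Enumerates 2**n sign patterns,
    so it is only practical for small n.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)
    # Count sign patterns at least as far from the null expectation.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Invented H values for six marker sets: MGR first, non-MGR second.
h_mgr = [0.70, 0.65, 0.72, 0.68, 0.75, 0.71]
h_non_mgr = [0.60, 0.58, 0.61, 0.66, 0.62, 0.60]
print(wilcoxon_paired_exact(h_mgr, h_non_mgr))
```

With six pairs that all differ in the same direction, the smallest attainable two-sided p-value is 2/64, which shows how strongly the number of marker sets limits the power of this test.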

GEOGRAPHIC ORIGIN OF TUPÍ GROUPS
As previously mentioned, the most accepted hypothesis about the demographic dynamics of the Tupí supports MGR as the homeland, followed by a wide and rapid dispersion of Tupí-Guarani languages across South America. Expansions of this kind are frequently associated with important technological or cultural innovations, and agriculture is putatively responsible for this one (Urban, 1998; Gabas, 2006; Epps, 2009; Walker et al., 2012). The idea that, after a dispersion, the original populations display more variability than the dispersed ones has proved true not only in linguistic but also in genetic studies.
Indeed, at a global scale, the sub-Saharan origin of modern humans is well established by paleontological and genetic data (Nei and Livshits, 1989). Inspection of genetic variability clearly demonstrates that Africans have higher heterozygosity than non-Africans (Rogers and Jorde, 1996; Jorde et al., 2000). Y-chromosome data showed that variability decreases with distance from East Africa, which is compatible with the sequential founder effect that is characteristic of population dispersion (Shi et al., 2010). The same effect was observed in Amerindian populations. Using 678 STRs in 29 populations, Wang et al. (2007) demonstrated that the decline of genetic variability correlated with distance from the Bering Strait region.
In this context, we estimated the variability of genetic markers previously described in the literature using heterozygosity (H) as the statistical measure. The main observation was that H is higher in Tupians from MGR than in those outside MGR, for all kinds of markers (Table 1), suggesting MGR as the Tupí homeland.
An additional approach to tracking dispersion and migration paths is the identification of clines in allele frequencies. We analyzed the frequencies of the major mtDNA haplogroups (A, B, C and D) in Tupí populations (Figure 1). MGR populations show the highest frequencies of haplogroup D (all over 60%), while Tupians outside MGR displayed frequencies lower than 30%, except for the Mundurukú (55%), who are geographically closer to MGR (Mann-Whitney test, Z(U)=3.12; p=0.0018). This scenario is compatible with a Tupí dispersion center in MGR: during the dispersion, contact with non-Tupí populations, which had higher frequencies of non-D haplogroups, lowered haplogroup D frequencies in Tupí outside MGR.

DEMOGRAPHIC DYNAMICS OF TUPÍ DISPERSION
Estimates of genetic variability have a direct relationship with the effective size of the population and the mutation rate of each marker. Hence, large populations should have high values of these estimators. Moreover, markers such as STRs may lead to higher values than SNPs or classical markers because of their higher mutation rates. Given the importance of population size for the degree of genetic variability, it is understandable that sudden demographic variations have a significant impact on those estimates and can be traced back by analysis of the degree of genetic variability in each marker. An important feature of this analysis is that each estimator responds differently to demographic events, like population bottlenecks (intense depopulation) and expansions.

Table 1 (notes).
(Callegari Jacques et al., 2011). H estimated from three MGR populations (Gavião, Zoró and Suruí) and eight non-MGR populations (Wayampi, Emerillon, Zoé, Urubú-Kaapór, Awá-Guajá, Parakanã, Guaraní and Aché).
3 Data from four Y-chromosome STRs (Tarazona-Santos et al., 2001; Palha et al., 2010). H estimated from four MGR populations (Karitiana, Zoró, Suruí and Gavião) and five non-MGR populations (Wayampi, Zoé, Urubú-Kaapór, Awá-Guajá and Parakanã).
4 Data from 12 Alu insertions (Battilana et al., 2006). H estimated from four MGR populations (Cinta Larga, Gavião, Suruí and Zoró) and two non-MGR populations (Guaraní and Aché).
5 H estimates from hypervariable region I of mtDNA (source: Ewerton et al., 2011). H from three MGR populations (Gavião, Suruí and Zoró) and six non-MGR populations (Wayampi, Zoé, Urubú-Kaapór, Awá-Guajá, Mundurukú and Guaraní).
Comparisons of H estimates between MGR and non-MGR populations were statistically significant (Wilcoxon paired test; p-value=0.04).
Aiming to answer some questions about the demographic dynamics of the Tupí, we reviewed autosomal STR data available for Tupí populations living in MGR and outside MGR. For this kind of marker it is possible to estimate three measures of genetic variation: k, H and m2. In a theoretical model of a stable population, theta k, theta H and theta m2 (thetas estimated from k, H and m2, respectively) are expected to be similar to each other. Nonetheless, given that actual populations are not stationary and dramatic alterations in population size may occur, theta values tend to disagree. For example, if a population experienced a severe bottleneck, the most expected scenario regarding theta values is: theta k<theta H<theta m2 (King et al., 2000; Cournet et al., 1997). This expectation fits the present findings for autosomal STRs well, as presented in Figure 2. Ewerton (2011) showed that theta k<theta H using the paired Wilcoxon test. This finding held not only for Tupians, but also for non-Tupian tribes, suggesting a general depopulation process. We also performed the paired Wilcoxon test and showed that theta H<theta m2 as well (p<0.01 for all comparisons). Since this pattern is able to detect recent depopulation events (Garrigan; Hedrick, 2003), we suggest that the observed pattern reflects a recent depopulation experienced by both MGR and non-MGR Tupí populations (and probably by non-Tupian populations). Recent events such as European colonization caused intense depopulation in Amerindians, and are likely to be the reason for this recent bottleneck scenario.
Of course, European colonization is not the only explanation for the depopulation signature. During the expansion process, the non-MGR Tupians probably experienced depopulation events. MGR tribes, in turn, have experienced strong reductions of their population sizes in recent decades due to the exploitation of the natural resources of the region and intense exposure to diseases. Hence, the signature of depopulation is likely to be a cumulative effect of many independent recent events, putatively not restricted to Tupians. Additionally, it is important to highlight that this signature reflects recent events. H, k and m2 take more than N (effective size) generations to return to equilibrium (Garrigan; Hedrick, 2003). Thus, even events that occurred a few generations ago are able to disrupt the mutation-genetic drift equilibrium.
Additionally, in agreement with the previous results and conclusion, genetic variability also showed higher values in MGR populations than in non-MGR ones. Restricting the comparison to theta H, the difference was statistically significant (Wilcoxon paired test, p=0.02).
As previously shown, mtDNA haplogroup D has a high frequency in MGR Tupians and a low frequency in Tupian populations outside MGR. Hence, considering haplogroup D as an important marker for Tupí groups, we reviewed mtDNA HVR I sequences and estimated two theta values from two genetic variability estimators (S and Pi), which are specific to DNA sequences. The pattern theta S<theta Pi, which is indicative of an ancient depopulation, was observed in Tupians outside MGR. On the other hand, the opposite pattern was observed in MGR (theta S>theta Pi), which is characteristic of an ancient population expansion (Figure 3). This approach is equivalent to Tajima's D statistic, which detected ancient expansion signals among all major haplogroups in South American Indians (Bonatto; Salzano, 1997). The present analysis was restricted to haplogroup D and compared both regions. The ancient expansion signal could be observed in MGR but seems to be lost outside MGR. This trend agrees with other results of the present meta-analysis and with those from other authors, as reviewed. The intense process of dispersion, characterized by sequential founder effects and bottlenecks, could explain the lack of an expansion signal, supporting the hypothesis of MGR as the Tupí homeland.
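The comparison of theta S and theta Pi can be sketched as follows, assuming the standard estimators behind Tajima's D: theta S = S / a_n, with a_n = 1 + 1/2 + ... + 1/(n-1), and theta Pi as the mean number of pairwise differences. The function name and the toy aligned sequences are invented; they are not data from the reviewed studies, and both estimates are per sequence rather than per site.

```python
from itertools import combinations


def theta_s_and_pi(seqs):
    """Return (theta_S, theta_Pi) for equal-length aligned sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # S: number of polymorphic (segregating) sites.
    s_sites = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)
    a_n = sum(1.0 / i for i in range(1, n))
    theta_s = s_sites / a_n
    # Pi: average number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    theta_pi = total_diffs / len(pairs)
    return theta_s, theta_pi


# Toy star-like sample: every variant occurs once, as after an expansion.
theta_s, theta_pi = theta_s_and_pi(["AAAA", "AAAT", "AATA", "ATAA"])
print(theta_s > theta_pi)
```

In this toy sample the excess of rare variants makes theta S exceed theta Pi, the pattern described above for MGR; a deficit of rare variants would reverse the inequality.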

MIGRATION PROCESSES DURING TUPÍ EXPANSION
Patrilocality is defined as lower male mobility compared to female mobility, meaning that females migrate (or are exchanged among distinct populations) at a higher rate than males. This phenomenon is widely studied in the literature using uniparental genetic data at regional and even worldwide scales, as reviewed below. The most common approach to assessing patrilocality is based on the comparison between interpopulation genetic differentiation estimated for matrilineal (mtDNA) and patrilineal (Y-chromosome) markers. The most common estimate of differentiation between populations is called F statistics or FST. Over time, two populations accumulate genetic differences by genetic drift, but this differentiation may be delayed by factors like migration. Since each uniparental marker informs separately on the history of females and males, differences between FST estimates from the two sets of markers should be attributable to differences in the intensity of migration between the sexes. This is because migration decreases the differentiation between populations, thus decreasing FST.
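One simple way to make this comparison concrete is a Nei-style estimate, FST = (HT − HS)/HT, computed from haplotype frequencies. The reviewed studies may have used other estimators; the function name `fst_from_frequencies`, the equal weighting of populations and the haplotype frequencies below are assumptions for illustration only.

```python
def fst_from_frequencies(pop_freqs):
    """Nei-style FST from per-population haplotype frequencies.

    `pop_freqs` is a list of dicts mapping haplotype -> frequency,
    one dict per population; populations are weighted equally.
    """
    n_pops = len(pop_freqs)
    haplotypes = set().union(*pop_freqs)
    mean_freq = {
        h: sum(p.get(h, 0.0) for p in pop_freqs) / n_pops for h in haplotypes
    }
    # HT: diversity of the pooled population; HS: mean within-population diversity.
    h_total = 1.0 - sum(f * f for f in mean_freq.values())
    h_within = sum(1.0 - sum(f * f for f in p.values()) for p in pop_freqs) / n_pops
    if h_total == 0.0:
        raise ValueError("no variation: FST is undefined")
    return (h_total - h_within) / h_total


# Invented frequencies for two populations.
y_chromosome = [{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.9}]
mtdna = [{"A": 0.6, "B": 0.4}, {"A": 0.4, "B": 0.6}]
print(fst_from_frequencies(y_chromosome) > fst_from_frequencies(mtdna))
```

In this invented case the Y-chromosome frequencies differ sharply between populations while the mtDNA frequencies are nearly homogenized, which is the direction of difference expected under patrilocality.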
The first study comparing mtDNA and Y-chromosome markers was performed at a global scale (Seielstad et al., 1998), suggesting that the female migration rate was eight times higher than the male one. Jorde et al. (2000) demonstrated the same trend in Asia and Europe, but not in Africa. However, further studies, analyzing more populations and uniparental markers, showed some controversial results worldwide, but did not rule out regional patrilocality (Wilder et al., 2004).
This controversy has been discussed in detail by Wilkins and Marlowe (2006). For these authors, inferences about male and female demographic histories drawn from uniparental markers are not simple and need more complex modeling. Sampling of neighboring populations reveals recent and local population dynamics, while worldwide sampling reflects more ancient events. In this context, studies using worldwide sampling will reveal earlier sex-biased migration patterns (Wilkins; Marlowe, 2006).
For Amerindian populations, initial studies were also controversial (Mesa et al., 2000; Goicoechea et al., 2001; Bortolini et al., 2002; Fagundes et al., 2002; Hunley et al., 2008), likely due to sampling limitations, a low number or inadequacy of markers, and even the use of populations with heterogeneous degrees of agricultural development. A more recent study, using larger sets of populations and uniparental markers, clearly demonstrates evidence of patrilocality across many regions of the American landmass (Yang et al., 2010).
Looking specifically at patterns of uniparental marker differentiation in Tupí populations, Mazieres et al. (2011) found higher differentiation in Y-chromosome markers than in mtDNA in Tupí groups from French Guiana, which indicates patrilocality (Table 2). It is important to test whether this pattern is also observed outside French Guiana; therefore, we performed a meta-analysis comparing Y-chromosome and mtDNA markers in populations from other regions of the Americas and found the same evidence (Table 2). In other words, independent of the region, the genetic differentiation between populations is always higher for Y-chromosome markers than for mtDNA. The trend shown in the table is statistically significant when both columns are compared using the paired Wilcoxon test (Z(U)=2.2; p=0.028). Interestingly, the largest differences were observed among the Tupians.
As pointed out previously, the development of agriculture is highly associated with both population size expansion and dispersion in many regions of the world, including those of the Tupí and Arawak linguistic groups (Epps, 2009), in addition to patrilocality (Wilkins; Marlowe, 2006). The integrated view of meta-analysis data and literature review strongly supports the idea that the Tupí expansion dynamics was associated with patrilocality practices. Since ethnohistorical and genetic data show contact between Tupí and non-Tupí groups (Aguiar, 1991), it is plausible to suppose that preferential female exchange was not restricted to Tupí tribes.

CONCLUDING REMARKS
The present study presents a comprehensive set of genetic evidence from meta-analysis data and literature review, leading to a proposed scenario (Figure 4) where MGR is the Tupí homeland. Moreover, in this scenario, Tupí groups from MGR and outside MGR showed different patterns of demographic dynamics, which are in accordance with the hypothesis of an ancient Tupí expansion in MGR, followed by dispersion of the Tupí-Guarani branch to other South American regions. This dispersion was accompanied by sequential founder effect and bottleneck events. Furthermore, depopulation could be detected in both regions but in different periods (ancient and recent). Finally, the dispersion seems to be related to patrilocality practices. This genetic model of dispersion dynamics may have an important impact on the interpretation of archeological and linguistic data. If some technologies and their products are more associated with the female gender, like ceramics and other artisanal skills, these technologies would be shared more extensively among different populations than those not associated with females, because they would be carried along with female migration. This hypothesis might be tested with a reappraisal of archeological materials. The same trend would also be observed in linguistics, because the borrowing of words related to these female-associated technologies might also be more extensive. Regarding the statistical analysis, it should be considered that more robust tests were not applied in the present study because most of the data were gathered from several studies, and more detailed data, like per-locus estimates of variability and their variances, are not available, which precludes more sophisticated approaches. Most importantly, many different and independent lines of genetic evidence agree with the main conclusions about origin, the demographic sequence of events and patrilocality as a cultural practice. Additionally, the main conclusions are open to future testing by archeological and linguistic approaches.

Table 2. FST-based estimates in Y-chromosome and mtDNA markers for Tupians and several regions of America.