Simpson’s paradox: a demographic case study of population dynamics, poverty, and inequality

Brazil is undergoing a demographic transition characterized by regional inequalities. It is reasonable to assume that aspects related to poverty, development and inequality might reverse the sign of the association of indicators of demographic transition, exemplifying a phenomenon known as Simpson’s Paradox. The aim of this study was to analyze the effect of inequality, poverty and social development on population dynamics in Brazil, verifying the occurrence of Simpson’s paradox in demographic transition. We used population data from the 1991, 2000 and 2010 national censuses, broken down by age and federative unit (FU). The correlation between demographic indicators was assessed by stratifying the FUs into groups according to their median social indicators. The findings show that all FUs have progressed against social indicators and are undergoing demographic transition; however, despite reductions in disparities over the study period, persistent gaps exist between regions. Simpson’s paradox was present when the analysis was carried out by census year and social indicators, and was particularly pronounced in 1991. The main challenge is to define how to analyze demographic dynamics in Brazil and understand how contextual factors alter the pace, quantum, and pattern of demographic transition.


Introduction
Demographic transition theory describes population change over time. It is based on an interpretation of changes in birth and death rates in industrialized societies over the last 200 years, first put forward in 1929 by the American demographer Warren Thompson 1 . According to the theory, an initial decline in mortality rates combined with stable birth rates raised the rate of natural increase, leading to population growth 2 .
With regard to births, Philippe Ariès' pioneering analysis of childhood history and the factors affecting fertility behavior spurred a set of microeconomic, and subsequently cultural, theories explaining the reasons for the fall in fertility, such as the cost of children, investment in their education, and self-fulfillment in adult life, where parenting has to some extent been replaced by other components of lifestyle 3 . For deaths, on the other hand, there is a set of theories associating trends in mortality with social determinants of health 4 and with social inequalities as fundamental causes of health inequalities 5 . Thus, various social theories emerged that were distinct from the theory routinely used to explain population dynamics, which reduces them to the quantitative demonstration of trends in births and deaths and growth in production 6 .
The first or "classic" demographic transition refers to the historic decline in birth and mortality rates, resulting in a stationary older population with replacement fertility, zero population growth and life expectancy over 70 years 7 . Within this context, it was considered that there was a balance between births and deaths and sustained migration was not necessary to maintain population size. It is also worth highlighting that at that time in history, nuclear families were the norm 8 .
Most regions of the world and countries have experienced unprecedented demographic changes over the last 200 years. However, projections show that these changes tend to result in countries with quite different profiles, with stagnation and potential population decline in parts of the developed world and continued rapid growth in less developed regions. Contemporary societies therefore find themselves at different stages of demographic transition 9 . In other words, while the transition has been relatively homogenous across industrialized countries, the theory and model are frequently imprecise when applied to peripheral countries, due to specific social, political, and economic factors affecting specific populations 10 .
The largest country in Latin America, both in size and in population, Brazil is currently undergoing this process. Most of the population is urban and the country has experienced a sharp fall in fertility, starting in the 1970s after a period of decline in mortality beginning in the 1930s 11 . A country of continental proportions, Brazil is characterized by stark regional inequalities. Although the country has reduced inequalities at the base of the pyramid, lifting part of the population above the poverty line, income and wealth remain highly concentrated at the top 12 . It is worth noting that the current context in Brazil threatens to undermine this progress, with the intensification of the fiscal crisis in 2014 and adoption of austerity measures after the coup in 2016 13 .
Evidently, the fact that demographic transition theory is a general model means that it can be relatively inaccurate when describing individual cases. In this regard, there are a few social and economic theories that seek to provide a more detailed analysis of the demographic transition and a wide range of operational models that use different techniques to measure demographic components and explain the transition phenomenon 14 . However, few studies have explored variables that may partially explain the effect of transition, be it patterns or the level at which changes occur. In this regard, it is reasonable to assume that aspects related to poverty, development and social inequality may be confounding factors that reverse the sign of the association between variables because they are directly related to patterns of birth or changes in probability of dying [15][16][17][18][19] . This situation exemplifies a phenomenon known as Simpson's Paradox.
Simpson's paradox is an extreme condition of confounding in which an apparent association between two variables is reversed when the data are analyzed within each stratum of a confounding variable. With Simpson's Paradox, the marginal correlation between cause and effect would be considered spurious. In other words, the association may be wrongly inferred to be causal when a third factor is in fact responsible for the correlation between the variables. This effect can lead to an erroneous conclusion that a given association is true, when in fact it is not 20 . For this paradox to occur, two conditions must be present: (a) an ignored or overlooked confounding variable that has a strong effect on the outcome variable; and (b) a disproportionate distribution of the confounding variable among the groups being compared 21 . This phenomenon has long been recognized as a theoretical possibility, but few real examples have been presented. In light of the above, the aim of this study was to analyze the effect of poverty and inequality on population dynamics in Brazil, verifying the occurrence of Simpson's Paradox in demographic transition.
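The sign reversal described above can be illustrated with a minimal sketch. The numbers below are made up purely for illustration (they are not the study's data), and the two strata stand for the levels of a hypothetical confounder. Within each stratum the correlation is negative, yet pooling the strata yields a positive correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def corr(pairs):
    """Correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Made-up (x, y) pairs for two strata of a hypothetical confounder.
low_stratum = [(1, 5), (2, 4), (3, 3)]
high_stratum = [(6, 10), (7, 9), (8, 8)]

r_low = corr(low_stratum)                    # negative within the stratum
r_high = corr(high_stratum)                  # negative within the stratum
r_pooled = corr(low_stratum + high_stratum)  # positive once strata are pooled

print(f"low: {r_low:.2f}, high: {r_high:.2f}, pooled: {r_pooled:.2f}")
```

The reversal arises because the confounder shifts both variables upward in the second stratum, which matches conditions (a) and (b) listed above.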

Background
We conducted an ecological study whose unit of analysis was Brazil's 27 federative units (FUs). FUs are subnational entities with a certain degree of autonomy (self-governing, self-legislating and self-funding) and their own government. Together, they form the Federative Republic of Brazil 22 . Brazil's FUs consist of 26 states and the Federal District, distributed across five major regions. Like many countries across all continents, the process of regionalization in Brazil was influenced by demographic, economic, political, and social factors. It is important to stress, however, that this process has been characterized by a rise in inequality between social classes and regions. Thus, despite social and economic progress, significant regional economic and social disparities remain 23 .

Data sources
The study uses population data from the 1991, 2000 and 2010 national censuses conducted by the Brazilian Institute of Geography and Statistics (IBGE). We also used the following social indicators from the Atlas of Human Development in Brazil 24 :
a) Human Development Index (HDI): a summary measure of long-term progress in key dimensions of human development (income, education, and health). The latest method of calculating the HDI considers the following dimensions: "a long and healthy life" (health), measured by life expectancy at birth; "a decent standard of living" (income), measured by gross national income (GNI) per capita, expressed as purchasing power parity (PPP) in US dollars and using 2005 as a baseline; and "knowledge" (education). The latter is measured by mean years of schooling for adults aged 25 years and over and expected years of schooling for children of school-entering age. Each dimension index is calculated as follows:
$$\text{Dimension index} = \frac{\text{observed value} - \text{minimum value}}{\text{maximum value} - \text{minimum value}}$$
The HDI is therefore calculated based on the following dimension indices: Life Expectancy Index (LE); Education Index (EI), calculated based on the "Mean Years of Schooling Indicator" (MYSI) and the "Expected Years of Schooling Indicator" (EYSI); and Income Index (II). The summary measure is the geometric mean of the three normalized indices:
$$\mathrm{HDI} = \sqrt[3]{\mathrm{LE} \times \mathrm{EI} \times \mathrm{II}}$$
b) Gini Coefficient: a measure of inequality based on the distribution of income across a population. It ranges from 0.0 (perfect equality, with every household earning the same income) to 1.0 (absolute inequality, with a single household earning a locality's entire income). Mathematically speaking, the Gini Coefficient is equivalent to half of the relative mean absolute difference in income between any two randomly selected households of a population, normalized for the mean.
The Gini Coefficient may be calculated using the Brown formula:
$$G = \left| 1 - \sum_{k=0}^{n-1} (X_{k+1} - X_k)(Y_{k+1} + Y_k) \right|$$
Where: G = Gini Coefficient; X = cumulative proportion of the population; and Y = cumulative proportion of income.
c) Proportion of the population living in extreme poverty: proportion of individuals with a household income equal to or less than one-quarter of the minimum wage.
d) Average income per capita: average income in a specific FU, comprising the sum of the income of all inhabitants divided by the total number of inhabitants.
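As a rough illustration of how the indicators defined above can be computed, the sketch below implements the dimension index, the geometric-mean HDI and the Brown formula for the Gini Coefficient. The function names, the goalposts (minimum and maximum values) and the income lists are assumptions made for this example only; the way MYSI and EYSI are combined into the Education Index is not specified in the text and is therefore left out.

```python
def dimension_index(observed, minimum, maximum):
    """Normalize an observed value to the 0-1 range between two goalposts."""
    return (observed - minimum) / (maximum - minimum)


def hdi(life_index, education_index, income_index):
    """Geometric mean of the three normalized dimension indices."""
    return (life_index * education_index * income_index) ** (1 / 3)


def gini_brown(incomes):
    """Gini Coefficient via the Brown formula for ungrouped incomes.

    Each unit is assumed to represent an equal share of the population.
    X is the cumulative population share, Y the cumulative income share.
    """
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    x_prev = y_prev = 0.0
    running_income = 0.0
    area = 0.0
    for k, value in enumerate(values, start=1):
        running_income += value
        x_curr = k / n
        y_curr = running_income / total
        area += (x_curr - x_prev) * (y_curr + y_prev)
        x_prev, y_prev = x_curr, y_curr
    return abs(1 - area)


# Hypothetical goalposts for life expectancy (in years), for illustration only.
life_index = dimension_index(70, 25, 85)
equal_gini = gini_brown([1, 1, 1, 1])       # everyone earns the same
unequal_gini = gini_brown([0, 0, 0, 1])     # one unit earns everything
print(life_index, hdi(0.5, 0.5, 0.5), equal_gini, unequal_gini)
```

With a finite number of units, the most unequal case yields (n - 1)/n rather than exactly 1.0, which is why the four-unit example stops at 0.75.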
The population data were used to calculate the following demographic indicators: i) Crude Birth Rate (CBR): total number of live births per thousand inhabitants; ii) Crude Mortality Rate (CMR): total number of deaths per thousand inhabitants.

Data analysis
The registration coverage of births in Brazil is incomplete 25 . For this reason, birth and mortality indicators are frequently taken from census data and standardized retrospective studies, such as national demographic and health surveys. It is therefore necessary to adjust the data using indirect demographic estimation techniques. For the purposes of this study, we used data from the 1991, 2000 and 2010 censuses and corrected the undercounting of births in Brazil's registry of births using the adjustment factors suggested by the IBGE and established data adjustment methods, such as Brass' P/F ratio technique and Gompertz's relational method 26 . To correct the adult mortality data, we used the adjusted Synthetic Extinct Generations (SEG-adj) method, proposed by Hill et al. 27 , while for infant mortality we used the Brass and Coale method 28 with a variation proposed by Trussell 29 .
The social indicators were ranked to assess differences in inequality, poverty, and social development across the FUs. We compared the indicators across FUs in each census year and estimated variation over the study period (1991 to 2010). To assess whether there was a time effect in the relationship between birth and mortality rates (which is indicative of demographic transition), we calculated the correlation between the two rates for each census year and compared the direction and magnitude of the relationship. In addition, we estimated the correlation between the demographic indicators (birth and mortality rates) and economic indicators (Gini coefficient, average income per capita, proportion of the population living in extreme poverty and HDI) for each year. We also applied the 1991 baseline values in Brazil to observe changes in each FU throughout the study period. Non-linear correlations were smoothed using LOWESS (locally weighted scatterplot smoothing) to identify relationships between the variables of interest.
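The ranking and relative-variation steps described above could be sketched as follows. The unit names and HDI values are hypothetical placeholders, not figures from the Atlas, and the helper names are chosen for this example only.

```python
def relative_variation(start, end):
    """Percentage change between the first and last census year."""
    return (end - start) / start * 100.0


def rank_units(values, descending=True):
    """Return unit names ordered by their indicator value."""
    return sorted(values, key=values.get, reverse=descending)


# Hypothetical HDI values for three placeholder federative units.
hdi_1991 = {"FU_A": 0.40, "FU_B": 0.50, "FU_C": 0.60}
hdi_2010 = {"FU_A": 0.60, "FU_B": 0.65, "FU_C": 0.75}

variation = {fu: relative_variation(hdi_1991[fu], hdi_2010[fu]) for fu in hdi_1991}
ranking_2010 = rank_units(hdi_2010)

for fu in ranking_2010:
    print(fu, hdi_2010[fu], f"{variation[fu]:.1f}%")
```

In this toy example the unit with the lowest starting value shows the largest relative gain while still ranking last, which is the kind of pattern the comparison across census years is meant to expose.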
Based on the results of the initial analysis, the correlation between demographic indicators was assessed by stratifying the federative units into two groups according to their median social indicators. Since, conceptually speaking, the demographic transition is characterized by a shift in birth and mortality patterns, we synthesized the relation between crude birth and mortality rates. The relationship was tested using the Pearson correlation coefficient and the statistical significance of the correlation was measured.
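A minimal sketch of the stratified correlation analysis is given below. The records are invented for illustration, the field names (`income`, `cbr`, `cmr`) are assumptions, and a permutation test stands in for the significance test, since the text does not state which test was applied to the Pearson coefficient.

```python
import random
from math import sqrt
from statistics import median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def median_split(units, indicator):
    """Split records into (at or below median, above median) groups."""
    cut = median(u[indicator] for u in units)
    low = [u for u in units if u[indicator] <= cut]
    high = [u for u in units if u[indicator] > cut]
    return low, high


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


def birth_death_corr(units):
    """Correlation between crude birth and crude mortality rates."""
    return pearson([u["cbr"] for u in units], [u["cmr"] for u in units])


# Invented records: income per capita, crude birth rate, crude mortality rate.
units = [
    {"income": 100, "cbr": 30, "cmr": 6}, {"income": 200, "cbr": 28, "cmr": 7},
    {"income": 300, "cbr": 27, "cmr": 8}, {"income": 400, "cbr": 25, "cmr": 9},
    {"income": 500, "cbr": 20, "cmr": 3}, {"income": 600, "cbr": 18, "cmr": 4},
    {"income": 700, "cbr": 17, "cmr": 5}, {"income": 800, "cbr": 15, "cmr": 6},
]

low, high = median_split(units, "income")
r_all, r_low, r_high = birth_death_corr(units), birth_death_corr(low), birth_death_corr(high)
p_all = permutation_p_value([u["cbr"] for u in units], [u["cmr"] for u in units])
print(f"all: {r_all:.2f} (p={p_all:.3f}), low: {r_low:.2f}, high: {r_high:.2f}")
```

With only a handful of units per stratum, any p-value is coarse, so the direction of the coefficients is the more informative output in a sketch like this.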

Results
In general, the states made progress on the social indicators over the study period. Figure 1 shows the ranking of the FUs in 1991, 2000 and 2010. Figure 2 presents the relative variation of indicators between 1991 and 2010 (the changes in these indicators between each census year can be seen in Table 1 and Figures 3 and 4). The increase in HDI values over the period shows that Brazil achieved overall progress in human development across the FUs, resulting in a reduction in inequality. It is important to highlight the formation of spatial clusters of similar HDI values. We used the classification used by the United Nations 17 (very low, up to 0.499; low, between 0.500 and 0.599; medium, between 0.600 and 0.699; high, between 0.700 and 0.799; and very high, 0.800 and above). The very low HDI group includes almost all the states in the Northeast, while the low HDI group includes the states in the North and Northeast not included in the very low HDI group. The medium HDI group is made up of the states in the Midwest and the two states in the Southeast with the greatest variation in social indicators, either due to the profile of the states or the fact that they have many municipalities with large structural and demographic differences. Finally, the high HDI group is made up of states from the South and Southeast, revealing disparities between the South-Southeast and North-Northeast.
A similar relationship was found for the proportion of the population living in extreme poverty and average income per capita, with a notable reduction in the former and general increase in the latter between 1991 and 2010. Despite overall improvement, inequality levels between the North-Northeast and South-Southeast remained relatively static. Regarding inequality, the findings show that trends are marked by disparities. The Gini coefficient fell across a large part of the FUs and there was a reduction in rate disparities across FUs. However, the gap between the FUs with the highest levels of inequality and those where inequality has decreased or remained static has widened. It is important to highlight that changes in income indicators do not necessarily accompany trends (level or pattern) in inequality indicators. This means that states with greater levels of wealth do not necessarily have lower levels of inequality. Figure 5 shows that economic and demographic indicators are correlated. In general, this correlation persists throughout the study period. This effect is more pronounced for average income per capita and proportion of the population living in extreme poverty, possibly because these two indicators do not have a normal distribution.
It is important to highlight that there is an association between birth and mortality rates, as shown in Figure 6. Using rates in 1991 as a baseline to compare trends across the FUs, we created quadrant charts that show four different scenarios in a clockwise direction: high birth rates and high mortality rates; high birth rates and low mortality rates; low birth rates and low mortality rates; and low birth rates and high mortality rates. These scenarios approximately describe the phases of the demographic transition. Figure 6a shows that the FUs experienced a transition over the study period. However, inequality and gaps between the North-Northeast and South-Southeast persist throughout the study period (Figures 6b, 6c and 6d). The findings show spatial clustering of the stages of demographic transition, with FUs in the North and Northeast being at earlier stages than the South and Southeast in all years of the study period. Figure 6e shows a positive correlation between birth and mortality rates. However, in the stratified analysis, the correlation is negative in each of the census years (Figure 6f). This change may be related to a change during the demographic transition in the country. Thus, the year of analysis is a confounding factor, revealing the presence of Simpson's Paradox.
It is important to note that the main indicators that describe the stages of the demographic transition in Brazil show a certain pattern in our findings (Table 2). In general, there was a shift in the relationship between birth and mortality rates in Brazil between 1991 and 2010, from a direct relationship in 1991 and 2000, characterizing an earlier stage of the demographic transition, to an inverse relationship in 2010. Although none of the relationships were statistically significant, it is worth mentioning that the correlation was stronger in 1991 than in 2000.
Finally, the stratified analysis of the correlations including the indicators of inequality, development, poverty, and income returned atypical findings, revealing the presence of Simpson's Paradox. When stratified, some of the correlations in the dataset that were previously not statistically significant changed, moving in an opposite direction to the general dataset and in some cases becoming statistically significant. This was particularly notable in 1991, when the correlation not only changed direction in the groups but was also statistically significant. In 2000, the correlation changed direction but was not statistically significant, suggesting that the transition was under way, narrowing the gap in birth and mortality rates between the FUs. In 2010, the correlations in the subgroups moved in the same direction as the general dataset and the only indicator that showed a statistically significant correlation was income. This phenomenon suggests that income, poverty, development, and inequality may act as confounding factors that reverse the sign of the association between the demographic components, with differences in the strength of interaction.
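The quadrant scheme used for the charts can be expressed as a small classification rule. The sketch below assumes that "high" means at or above the 1991 baseline rate; the baseline values and the example rates are hypothetical and serve only to show the four scenarios.

```python
def quadrant(cbr, cmr, cbr_baseline, cmr_baseline):
    """Classify a unit relative to baseline birth and mortality rates."""
    high_birth = cbr >= cbr_baseline
    high_mortality = cmr >= cmr_baseline
    if high_birth and high_mortality:
        return "high birth, high mortality"
    if high_birth:
        return "high birth, low mortality"
    if not high_mortality:
        return "low birth, low mortality"
    return "low birth, high mortality"


# Hypothetical 1991 baseline rates (per thousand inhabitants).
CBR_BASELINE = 24.0
CMR_BASELINE = 7.0

# Hypothetical trajectory of one unit across the three census years.
trajectory = {
    1991: quadrant(30.0, 9.0, CBR_BASELINE, CMR_BASELINE),
    2000: quadrant(26.0, 6.5, CBR_BASELINE, CMR_BASELINE),
    2010: quadrant(18.0, 6.0, CBR_BASELINE, CMR_BASELINE),
}
for year, label in trajectory.items():
    print(year, label)
```

Reading the labels in order traces the clockwise movement through the quadrants that the charts use to approximate the phases of the transition.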

Discussion
Population growth does not occur independently of social organization. The findings suggest that there is a relation between level of income and poverty (both associated with patterns of social and spatial inequality) and the country's population dynamics. Despite the relationship between poverty and inequality and the pace of demographic transition in Brazil, the findings show significant regional differences across Brazil, resulting in two regional blocs (North-Northeast and South-Southeast) that seem to be moving in opposite directions, reflected in the demographic indicators of transition and economic indicators 30 . The central discussion proposed by this study is based on one crucial fact: the direction of the causal arrows is determined by the causal structure of the problem, not statistical considerations. Thus, any treatment of Simpson's example that ignores the causal structure does not contain sufficient information to determine the appropriateness of the marginal vs the conditional association measure 31 . The assumption that Simpson's Paradox is simply confounding misses the main point of Simpson's original study: statistical reasoning is insufficient to choose between the marginal and conditional association measure 32 . In fact, there is an understanding that statistical information needs to be reinforced with adequate theoretical models for causal inference from observational data 33 .
It is possible that the main reason why the association was reversed is that the probabilities of giving birth and of dying vary depending on social strata and level of poverty. This relationship is documented in the literature [34][35][36] . This means that demographic transition is closely related to basic factors such as social class, social hierarchy, poverty and inequality 37 , and that these elements are part of the causal structure of the phenomena that describe the transition: births and deaths.
Moreover, the demographic transition model is not predictive. In fact, like all models, it has limitations. For example, the model assumes that over time all countries go through the same stages: near-zero population growth, due to high birth and mortality rates; rapid decline in mortality rates and a noticeably slower fall in birth rates; accelerating decline in birth rates; and low birth and mortality rates, pointing to a discrete rise in mortality 1,38 .
It is important to note that, from this perspective, the abovementioned stages depend on a continuous industrialization process and, consequently, urbanization. This process is unpredictable in low-income countries in Africa or those with stagnant economies due to the peripheralization of production processes 18 . In addition, the model's time scale, especially in various countries in Southeast Asia, is utterly distinct from that observed in European countries, insofar as the pace of development in the former is much faster than that of the first industrialized countries 39 . It is worth mentioning that the relationship between development and demographic transition is not unilateral: inequality, poverty and economic growth influence births and deaths and, in the opposite direction, birth rates and mortality also influence pace of development, insofar as they modify the population's age structure 9 .
Although public policies seek to address widening disparities and reduce socio-spatial inequalities, a study conducted by Arretche 40 highlights that, despite progress against certain indicators, regional inequalities remain stark in Brazil. This reality is also shown by studies in the field of public health 41,42 . Regional planning therefore needs to combine economic incentives and social policies to improve living conditions and reduce inequality.
Naturally, studies are subject to caveats. There is no consensus on whether there are single or multiple demographic transitions. This means that the country's population dynamics may be influenced by other aspects, such as changing of transition. Quite the opposite, there is a processual movement that should be assessed with care, so as not to disregard the contextual effects that mediate and/or influence the pace of demographic phenomena. Finally, using the FU ranking proposal as a benchmark for stage of development or social progress may prompt a certain amount of controversy. It is important to clarify, however, that this ranking was used for illustrative purposes only, to demonstrate the different dynamics between FUs, illustrating that Brazil is characterized by structural heterogeneity, based on the concept outlined above. That said, this initial diagnosis serves as a point of departure for a more robust analysis of demographic transition in Brazil.
In the presence of Simpson's Paradox, the results of the analyses of the general datasets contradict the findings from the subgroups of these same datasets. Data analysis methods that do not take confounding factors into account, including 2 × 2 table epidemiological analysis, the independent-samples t-test, Wilcoxon rank-sum test, chi-squared test, and univariate regression, cannot handle Simpson's Paradox and may therefore result in erroneous conclusions. The Mantel-Haenszel test and multivariate regression are examples of analysis methods that account for confounding and provide valid results 31,43 . However, one of the limitations of the method used by the present study is the small number of units of analysis (just 27 federative units). The study question should therefore be revisited using a more robust dataset and longer time interval.
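To illustrate why a stratified method can give a different answer from a crude 2 × 2 analysis, the sketch below compares the crude odds ratio with the Mantel-Haenszel pooled odds ratio. The counts are made up for this example and the function names are assumptions; this is not an analysis carried out in the study.

```python
def crude_odds_ratio(tables):
    """Odds ratio after collapsing all strata into a single 2 x 2 table."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    c = sum(t[2] for t in tables)
    d = sum(t[3] for t in tables)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel pooled odds ratio across strata.

    Each table is (a, b, c, d): exposed with and without the outcome,
    then unexposed with and without the outcome.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Made-up counts for two strata of a hypothetical confounder.
strata = [
    (90, 10, 800, 200),   # stratum 1: stratum-specific odds ratio 2.25
    (300, 700, 20, 80),   # stratum 2: stratum-specific odds ratio about 1.71
]

or_crude = crude_odds_ratio(strata)
or_mh = mantel_haenszel_odds_ratio(strata)
print(f"crude OR: {or_crude:.2f}, Mantel-Haenszel OR: {or_mh:.2f}")
```

In this toy example the crude odds ratio falls below 1 while both stratum-specific odds ratios and the pooled estimate lie above 1, which is the reversal that unstratified methods miss.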
In conclusion, the current study provides an initial analysis of the phenomenon of interest. We recommend that future studies perform a panel analysis on a different geographical scale using larger datasets.

Collaborations
All authors contributed to study conception, data collection and analysis, and writing and critically revising this manuscript.