Adult obesity in different countries: an analysis via beta regression models

Obesity is considered a serious public health problem: an epidemic disease with major global repercussions that is associated with the development of other chronic conditions such as hypertension, diabetes, and cardiovascular diseases. The current study examines the distribution of adult obesity in different countries using a beta regression model. This is a descriptive ecological study with a quantitative and inferential approach and a focus on beta regression analysis. The method was applied to real data from public sources on adult obesity in 78 countries in 2014. Descriptive analysis showed that 50% of the countries had an adult obesity prevalence greater than 20%. The distribution of prevalence by country showed lower adult obesity levels in countries of Asia and Africa and higher levels in countries of the Americas and Europe. Boxplot analysis also suggested a possible difference in the proportion of obese adults between the Americas and Europe on one side and Africa and Asia on the other. The fitted beta regression model with varying dispersion, at the 5% significance level, identified mean annual per capita alcohol intake, percentage of insufficient physical activity, percentage of the population living in urban areas, and life expectancy as variables associated with adult obesity.

Keywords: Obesity; Chronic Disease; Linear Models

Correspondence: S. A. Souza, Universidade Federal da Paraíba. Cidade Universitária s/n, João Pessoa, PB 58051-085, Brasil. saul_asouza@hotmail.com

1 Universidade Federal da Paraíba, João Pessoa, Brasil.

doi: 10.1590/0102-311X00161417

Cad. Saúde Pública 2018; 34(8):e00161417

METHODOLOGICAL ISSUES

This article is published in Open Access under the Creative Commons Attribution license, which allows use, distribution, and reproduction in any medium, without restrictions, as long as the original work is correctly cited.


Introduction

Adult obesity in the global scenario
Obesity is considered an epidemic disease with major global repercussions, affecting both developed and developing countries 1,2. Causes of obesity can include genetic, metabolic, environmental, social, cultural, economic, lifestyle, and demographic factors 3,4.
Body mass index (BMI), which assesses individual fat concentration, is defined as the ratio between body weight in kilograms (kg) and height squared (m²) 5. Persons with BMI ≥ 30 kg/m² are classified as obese.
The World Health Organization (WHO) defines obesity as excessive fat accumulation that presents harm to the person's health 5. Consumption of energy-dense foods and lack of physical activity are key factors in calorie gain and decreased energy expenditure over the course of the day, making the individual's energy balance positive and facilitating fat accumulation 6.
Obesity is classified in the group of chronic noncommunicable diseases (NCDs) and is considered one of the most important risk factors for other complications such as diabetes mellitus, hypertension, and cardiovascular diseases 7,8. The NCDs, especially those just cited, pose a serious public health problem as the leading causes of mortality in the world 9. In 2008, for example, NCDs accounted for 63% of deaths in the world, 80% of which occurred in low- and middle-income countries 10.
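As a minimal sketch of the definition above, the following Python snippet computes BMI and applies the 30 kg/m² cutoff. The function names and the example person (95 kg, 1.75 m) are hypothetical and serve only as an illustration.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height squared (m^2)."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_obese(weight_kg: float, height_m: float, cutoff: float = 30.0) -> bool:
    """Apply the BMI >= 30 kg/m^2 cutoff quoted in the text."""
    return bmi(weight_kg, height_m) >= cutoff


# Hypothetical adult: 95 kg and 1.75 m tall.
print(round(bmi(95.0, 1.75), 1))
print(is_obese(95.0, 1.75))
```

The country-level response analyzed in this study is the proportion of adults for whom such a check is true.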
Obesity is a disease with major social, family, and financial impact, especially for the families of affected individuals. Treatments for obese persons, which deal with the consequences of the condition, represent enormous expenditures for the health system. In Brazil, for example, the costs of procedures associated with overweight and obesity are an estimated 2.1 billion dollars per year 11. The United States is one of the countries suffering most from obesity-related problems, since more than a third (35%) of the American population is now obese, and the expenditures for treating the disease run to billions of dollars a year 12.
The Organisation for Economic Co-operation and Development (OECD) is an international organization consisting of 34 countries, both developed and developing, whose objective is to promote policies that improve the economy and people's social welfare around the world. The organization's 2014 report showed that in the previous five years, Canada, England, Italy, Republic of Korea, Spain, and the United States showed modest or practically stable annual growth in overweight and obesity. Meanwhile, Australia, France, Mexico, and Switzerland showed growth of 2% to 3%, with no evidence of a reduction or containment of this epidemic across the countries. It is estimated that countries' health sector expenditures related to obesity vary from 1% to 3% and are greater when associated with other complications 13.
Therefore, since obesity is a global problem that affects various countries, including Brazil, it is necessary to learn more about its global distribution and to identify possible factors related to its growth in recent years. Several authors have used logistic regression methods for this purpose, particularly in epidemiological studies, in order to identify associations between the independent variables and the response in a context where the response variable is dichotomous and individuals are the unit of interest 14,15. The current study aims to examine the distribution of adult obesity across different countries using a beta regression model. This approach is appropriate since the response variable is a proportion defined on the interval (0,1).

Traditional regression models and the beta regression model
The literature offers numerous statistical methods that can be used to model data. However, in most cases what one sees is the indiscriminate use of the logistic regression model. It is thus useful to know the different types of models proposed in the literature in order to improve the analysis of the associations between the independent variables and the response variable.
In various observational or experimental situations, researchers seek to understand and explain phenomena in different areas of science. Regression models can be used for this purpose, since they express the relationship between the response variable $Y_t$ and the p independent covariates $(X_1,\ldots,X_p)$ addressed in the study. Linear regression is one of the best-known methods, because its parameters are easy for researchers to interpret and it is available in various statistical packages. This regression model can be expressed as follows:

$$Y_t = \beta_0 + \beta_1 X_{t1} + \cdots + \beta_p X_{tp} + \varepsilon_t,$$

with t = 1,…,n, in which n is the total number of observations in the study. Here, $Y_t$ is the outcome or response variable, $(X_1,\ldots,X_p)$ are the independent covariates, and $(\beta_0,\ldots,\beta_p)$ are the unknown parameters to be estimated. The errors $\varepsilon_t$ are a sequence of independent random variables, normally distributed with mean zero and constant variance. Briefly, regression models seek to describe the relationship between variables using a mathematical equation 16.
Kieschnick & McCullough 17 studied the modeling of variables on the open interval (0,1) and identified seven types of models used in the literature to analyze such data: linear normal, logit, censored normal, non-linear normal, beta distribution, simplex distribution, and quasi-likelihood. The authors further discussed the inappropriate use of the ordinary least squares estimator in this setting. Finally, they recommended beta distribution regression or quasi-likelihood regression 18 for data with this type of restriction.
Ferrari & Cribari-Neto 19 proposed the beta regression model for asymmetrical data on the interval (0,1). This class of models assumes that the response variable follows a beta distribution, that is, the data must be expressed as rates or proportions, equivalent to prevalence rates in epidemiological models. Unlike in linear normal models, the usual estimator is maximum likelihood. It is thus possible to estimate the vector of unknown parameters based on the likelihood function. The normal linear model is not appropriate for such data because proportions on the interval (0,1) are not defined on all the real numbers, which is one of the assumptions of the normal distribution, the principal characteristic that the variable must have for the linear model to apply 20. The beta regression model, in turn, cannot be used when the data contain zeros and/or ones, that is, when some observation is equal to one of the interval's limits.
In this setting, the beta regression model's log-likelihood function becomes unbounded. In addition, it is no longer adequate to assume that the data come from an absolutely continuous distribution. Therefore, an adequate solution would be the zero- or one-inflated beta regression model, in which the response variable's distribution is a mixture of a Bernoulli distribution and a beta distribution 20.
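For reference, the beta density in the mean-precision parameterization commonly attributed to Ferrari & Cribari-Neto 19 can be written as below. The symbols $\mu$ (mean) and $\phi$ (precision) follow that standard parameterization and are an assumption here, since the density is not displayed in the text.

```latex
f(y;\mu,\phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma\bigl((1-\mu)\phi\bigr)}\,
                y^{\mu\phi-1}\,(1-y)^{(1-\mu)\phi-1}, \qquad 0<y<1,
\qquad
E(y)=\mu, \qquad
\operatorname{Var}(y)=\frac{\mu(1-\mu)}{1+\phi}.
```

The log-density contains the terms $\log y$ and $\log(1-y)$, which diverge at $y=0$ and $y=1$. This is one way to see why observations on the interval's limits cannot be handled by the plain beta model.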
In the regression structure for the mean, the mean response $\mu_t$ is related to a linear predictor $\eta_t$ by means of a link function as follows:

$$g(\mu_t) = \sum_{i=1}^{k} X_{ti}\,\beta_i = \eta_t,$$

where $\beta = (\beta_1,\ldots,\beta_k)^T$ is the vector of unknown parameters to be estimated and $X = (X_{t1},\ldots,X_{tk})$ are observations of k independent variables. Here, the mean response is obtained by applying the inverse of the link function g(.), that is, $\mu_t = g^{-1}(\eta_t)$. Importantly, this model assumes a constant precision parameter across the observations. Still, in certain situations this parameter may vary across the observations 21,22,23,24,25. In that case the precision parameter needs to be modeled with a regression structure similar to that of the mean response. The precision's regression structure is thus defined as:

$$h(\phi_t) = \sum_{j=1}^{q} Z_{tj}\,\gamma_j = \vartheta_t,$$

where $\gamma = (\gamma_1,\ldots,\gamma_q)^T$ is a vector of unknown parameters, $Z = (Z_{t1},\ldots,Z_{tq})$ are observations of q independent variables (k + q < n), $\vartheta_t$ is the linear predictor, and h(.) is a link function. There are several possible choices for the link functions g(.) and h(.). For example, for g(.), referring to the model of the mean, one can use the logit, $g(\mu) = \log\{\mu/(1-\mu)\}$, the log, $g(\mu) = \log(\mu)$, or the cloglog, $g(\mu) = \log\{-\log(1-\mu)\}$, link function. For the model of the precision, one can use $h(\phi) = \log(\phi)$ or $h(\phi) = \sqrt{\phi}$ for h(.) 26.

The concept of heteroscedasticity, or non-constant variance of errors, when applied to the beta regression model, differs from that applied to the normal model, which frequently uses variance as a measure of dispersion. In fact, even if the dispersion parameter is constant, the variance of the response variable is non-constant, since it depends on the unknown means, which vary according to the model. Dispersion is naturally treated as the inverse of precision, i.e., the greater the dispersion of the data across observations, the lower the precision of the mean response, and vice versa. In addition, correct modeling of the dispersion directly influences the estimates of the mean structure's parameters, which improves the inferential results.
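The point that the variance is non-constant even under fixed precision can be illustrated with a short Python sketch. It assumes the standard mean-precision variance formula Var(y) = μ(1 − μ)/(1 + φ). The function name and the numeric values are hypothetical.

```python
def beta_variance(mu: float, phi: float) -> float:
    """Variance of a beta response with mean mu and precision phi."""
    if not 0.0 < mu < 1.0 or phi <= 0.0:
        raise ValueError("need 0 < mu < 1 and phi > 0")
    return mu * (1.0 - mu) / (1.0 + phi)


# Hypothetical constant precision shared by all observations.
phi = 50.0
for mu in (0.05, 0.20, 0.40):
    # The variance changes with the mean even though phi is fixed.
    print(f"mu={mu:.2f}  var={beta_variance(mu, phi):.5f}")
```

For a fixed mean, a larger precision gives a smaller variance, which is the sense in which dispersion acts as the inverse of precision.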

Methodology
This is a descriptive ecological study with a quantitative and inferential approach and a focus on regression analysis. The data refer to adult obesity in 78 countries in 2014, with the observed proportion calculated from the adult population 18 years and older with BMI ≥ 30 kg/m². The sample consisted of 78 observations (proportions) from countries around the world: 25 (32%) in Africa, 11 (14%) in the Americas, 14 (18%) in Asia, 25 (32%) in Europe, and 3 (4%) in Oceania.
Data were collected from the online databases of the World Bank (http://databank.wordbank.org) and WHO (http://www.who.int). The World Bank comprises five institutions that aim to reduce poverty and provide technical and financial assistance to developing countries. The WHO works in more than 150 countries and relies on governments and other partners to guarantee the highest possible level of health for people.
The collected data were tabulated in an electronic spreadsheet and analyzed in the R software (The R Foundation for Statistical Computing; http://www.r-project.org), an open-source platform with various statistical data analysis methods already implemented. Importantly, the most up-to-date available data were collected, covering the largest possible number of countries. Furthermore, since these are public domain databases, it was not necessary to submit the project to the Institutional Review Board.
Initially, a descriptive analysis of the data was performed to extract important information on the study variables. The variables cited in this study are listed below with their respective descriptions:
OB2014: proportion of obese adults, 18 years or older, with BMI ≥ 30 kg/m² in 2014;
INAT: percentage of insufficient physical activity in adults in 2010, that is, the percentage of the target population with less than 150 minutes of moderate physical activity per week or less than 75 minutes of vigorous physical activity per week, or the equivalent;
EDUC: expenditures on education as a percentage of total government spending in 2010;
VIDA: life expectancy at birth (in years) in 2014;
ALC: mean annual per capita consumption of pure alcohol-equivalent, based on the population 15 years and older in 2008;
URB: percentage of the population living in urban areas in 2014.
Next, inferential procedures were carried out and goodness-of-fit measures were computed for the beta regression model, using the betareg package of the R software. As discussed, the beta regression model with varying dispersion has the advantage of modeling the data's variability, which improves the inferential results. The model was also chosen because the response variable is given as a proportion. The beta regression model has the further advantage of extending the conclusions on the study's topic by estimating the impact of a given covariate on the mean response.

Results and discussion
Table 1 shows the descriptive data analysis, presenting the minimum, first quartile ($Q_{1/4}$), median, mean, third quartile ($Q_{3/4}$), maximum, and coefficient of variation (CV) for the variables used in the beta regression model. The proportion of obese adults varies from 0.03 to 0.41, with approximately 25% of the 78 countries presenting OB2014 values greater than 0.26, or 26%.
Table 1
Descriptive data for the study variables.

In 50% of the countries, the prevalence of insufficient physical activity exceeded 23.8%, with a minimum of 4.10% and a maximum of 63.6%. The lowest life expectancy at birth was 49 years and the highest was 83 years, with a mean life expectancy at birth of 72 years. Expenditures on education as a percentage of total government spending varied from 5.53% to 26.3%. Furthermore, 25% of the 78 countries showed EDUC values less than 11.25%. As for the percentage of the population living in urban areas, 50% of the countries showed values less than 60%, with a minimum of 16.1% and a maximum of 100%. Approximately 25% of the 78 countries showed URB values greater than 74.82%. Mean annual per capita alcohol consumption varied from 0.10 to 15.40 liters, with a mean of 7.39 liters. The CV is defined as the ratio between the standard deviation and the mean and is a relative measure of dispersion. Based on the CV, the variable ALC shows the highest variability relative to the mean, with a CV of 0.597. Note that a CV of zero would indicate that the data for a given variable are homogeneous (i.e., all the observations would be equal to the mean).
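The coefficient of variation described above can be sketched in a few lines of Python. The sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which version was used. The input values are hypothetical and are not the study data.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical values, not the study data.
print(coefficient_of_variation([2.0, 4.0, 6.0]))
# Identical observations give a CV of zero (homogeneous data).
print(coefficient_of_variation([5.0, 5.0, 5.0]))
```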

Colombia, in South America, showed the highest proportion of adults practicing insufficient physical activity. Other countries came close to this proportion, such as Malaysia, South Africa, and Mauritania, the first located in Asia and the latter two in Africa. The highest life expectancy values were seen in Spain and Italy, in Europe, followed by Singapore in Asia.
Europe was the continent with the highest per capita alcohol intake. In order, Lithuania, Romania, and Hungary had the highest national alcohol consumption figures in Europe. Singapore and Qatar in Asia and Belgium in Europe were the countries with the highest percentages of people living in urban areas. Africa was the continent with the highest expenditures on education as a percentage of total government spending, led by Ethiopia, Namibia, and Benin. Finally, the highest proportion of obese adults was in Qatar, in Asia, followed by the United States, in North America, while the lowest proportions were in Cambodia and Nepal, in Asia.
As shown in Table 2, OB2014 correlates positively with the covariates, except for EDUC. The highest linear correlations with the response variable were for URB and VIDA. Although the correlation between these two covariates was 0.70, no multicollinearity problems arose in the subsequent regression analysis.
Figure 1 shows the histogram and the boxplot for the variable "proportion of obese adults in 2014". The response variable's distribution is asymmetrical, as is easily seen in the boxplot, since the median is closer to the third quartile. There are also no outliers, that is, discrepant values outside the boxplot's limits, which are defined by the quantities $Q_{3/4} + 1.5\,(Q_{3/4} - Q_{1/4})$ and $Q_{1/4} - 1.5\,(Q_{3/4} - Q_{1/4})$, referring to the upper and lower limits, respectively, where $Q_{1/4}$ and $Q_{3/4}$ denote the first and third quartiles.
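A minimal Python sketch of the outlier rule follows, assuming the conventional 1.5 × interquartile-range limits. The function names and the quartile values are hypothetical and are not taken from Table 1.

```python
def boxplot_limits(q1, q3, k=1.5):
    """Lower and upper boxplot limits from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, q1, q3, k=1.5):
    """Values falling outside the boxplot limits."""
    lower, upper = boxplot_limits(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical quartiles, chosen only to illustrate the rule.
print(boxplot_limits(0.10, 0.30))
print(outliers([0.03, 0.20, 0.41], 0.10, 0.30))
```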
Figure 2 shows the boxplot for the variable OB2014 by continent: Africa, the Americas, Asia, Europe, and Oceania. Countries with low OB2014 values are concentrated in Africa and Asia, while the Americas, Europe, and Oceania have the highest values. Note that the boxplots for Europe and Oceania do not overlap with those of Africa and Asia, suggesting a possible difference between the proportions of obese adults on these continents.

Figure 1
Histogram and boxplot for the proportion of obese adults in 78 countries in 2014.
The beta regression model considered the data set on adult obesity in the countries, totaling 78 observations. When fitting the beta regression model, it is essential first to examine the data's dispersion. Regression models with varying dispersion require a structure to model the precision parameter in order to improve the inferential results 27.
The likelihood ratio test was used for this purpose, testing the null hypothesis of fixed precision, i.e., $H_0: \phi_1 = \cdots = \phi_n = \phi$ 21,25,28. The result was a p-value less than 0.0001 (the p-value is the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed in the sample). That is, setting significance at 5%, we reject the null hypothesis of fixed precision. A regression structure is thus necessary to model the data's precision.
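For readers less familiar with the test, its usual form can be sketched as follows. The notation is an assumption rather than the authors' own: $\ell$ is the log-likelihood, hats denote the estimates under varying precision, tildes denote the estimates under constant precision, and the precision submodel is assumed to contain an intercept plus q − 1 covariates.

```latex
LR = 2\,\bigl\{\ell(\hat\beta,\hat\gamma) - \ell(\tilde\beta,\tilde\phi)\bigr\}
\;\overset{a}{\sim}\; \chi^2_{\,q-1} \quad \text{under } H_0 .
```

A large value of LR relative to the reference chi-square distribution gives a small p-value and leads to rejection of fixed precision.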

Figure 2
Boxplot for the variable OB2014 on the continents Africa, the Americas, Asia, Europe, and Oceania.

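The beta regression model with varying dispersion is as follows:

$$g(\mu_t) = \beta_1 + \beta_2\,\mathrm{INAT}_t + \beta_3\,\mathrm{URB}_t + \beta_4\,\mathrm{ALC}_t + \beta_5\,\mathrm{VIDA}_t,$$
$$h(\phi_t) = \gamma_1 + \gamma_2\,\mathrm{VIDA}_t + \gamma_3\,\mathrm{EDUC}_t + \gamma_4\,\mathrm{ALC}_t,$$

with t = 1,…,78. In this model, the precision parameter varies with the observations, thus displaying a heteroscedastic structure. However, even if the data's dispersion is fixed, the variance of the response variable is non-constant, since its value depends on unknown means that vary with the regression structure.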
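To make the structure of such a model concrete, the sketch below evaluates the log-likelihood of a beta regression with a loglog link for the mean and a log link for the precision, using only the Python standard library. It is an illustration under stated assumptions, not the authors' R code. The function names, toy data, and parameter values are all hypothetical. In practice the parameters would be chosen to maximize this function.

```python
import math


def inv_loglog(eta):
    """Inverse of the loglog link g(mu) = -log(-log(mu))."""
    return math.exp(-math.exp(-eta))


def beta_logpdf(y, mu, phi):
    """Log-density of the beta law with mean mu and precision phi."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))


def loglik(beta, gamma, X, Z, y):
    """Sum of log-densities; mean uses a loglog link, precision a log link."""
    total = 0.0
    for x_t, z_t, y_t in zip(X, Z, y):
        eta = sum(b * x for b, x in zip(beta, x_t))      # mean predictor
        theta = sum(g * z for g, z in zip(gamma, z_t))   # precision predictor
        total += beta_logpdf(y_t, inv_loglog(eta), math.exp(theta))
    return total


# Toy data (hypothetical): intercept plus one covariate in each submodel.
X = [(1.0, 0.2), (1.0, 0.5), (1.0, 0.8)]
Z = [(1.0, 0.2), (1.0, 0.5), (1.0, 0.8)]
y = [0.10, 0.22, 0.35]
print(loglik(beta=(-1.0, 1.0), gamma=(3.0, 0.5), X=X, Z=Z, y=y))
```

Because the precision has its own linear predictor, each observation can have a different $\phi_t$, which is what distinguishes this model from the fixed-precision case.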
Table 3 presents the estimates, standard errors, and p-values used to assess the significance of the proposed model's parameters. Here, the beta regression model with varying dispersion uses the loglog and log link functions to relate the linear predictors to the mean response and the precision, respectively. The Wald test 29 can be used to test the null hypothesis that $\beta_j = 0$, with j = 1,…,k, that is, that the variable associated with parameter $\beta_j$ has no significant effect on the mean response 30. Thus, at the 5% nominal level, the variables insufficient physical activity (INAT), persons living in urban areas (URB), alcohol consumption (ALC), and life expectancy (VIDA) are relevant for explaining the proportion of obese adults in countries, since their p-values are below 0.05.
In addition, these covariates show a positive effect, increasing the proportion of obese adults in the countries. This result is consistent with the linear correlations with the response variable obtained in the descriptive analysis and presented in Table 2. The positive effect of the INAT variable can be explained by the lower calorie expenditure over the course of the day due to insufficient physical activity. Meanwhile, the positive effect of the URB variable may be linked to the difficulty in eating meals at home due to growing problems with the urban transportation system caused by increasing urbanization. Thus, the fast pace of modern life encourages the consumption of meals away from home, especially energy-dense "fast foods" 31. Modernization and lifestyle changes due to technological progress also make people more sedentary and increase their odds of becoming obese. The positive effect of the ALC variable can be interpreted as reflecting the high calorie intake from alcohol consumption, which contributes to the increase in obesity in the countries. Population aging leads to various body changes, with a declining metabolic rate and increased weight gain 32. Thus, the positive effect of the VIDA variable may be related to the aging process, since the higher the life expectancy in the countries, the larger the proportion of elderly individuals.

Table 3
Estimates of the coefficients, standard errors, and p-values of the beta regression model with varying dispersion, considering the loglog and log link functions for modeling the mean and the dispersion, respectively.

For example, for countries with the covariates INAT, URB, and ALC fixed at their medians and with a life expectancy of 74 years, the estimated linear predictor $\hat\eta$ is obtained by substituting these values into the fitted mean submodel (Table 3). Since the link function used was loglog, $g(\mu) = -\log\{-\log(\mu)\}$, the inverse function applied to the linear predictor in order to obtain the expected value of the response variable is:

$$\hat\mu = g^{-1}(\hat\eta) = \exp\{-\exp(-\hat\eta)\}.$$

That is, for countries with 23.80% insufficient physical activity, 60% of the population living in urban areas, mean annual per capita alcohol consumption of 7.15 liters, and life expectancy of 74 years, the expected proportion of obese adults is 0.17, or 17%.
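The back-transformation in this example can be checked numerically with a short Python sketch. The linear predictor value used here is back-computed from the reported 17% rather than taken from the fitted coefficients, which are not reproduced in this text. The function names are hypothetical.

```python
import math


def loglog(mu):
    """loglog link: g(mu) = -log(-log(mu))."""
    return -math.log(-math.log(mu))


def inv_loglog(eta):
    """Inverse loglog link: mu = exp(-exp(-eta))."""
    return math.exp(-math.exp(-eta))


# Linear predictor implied by an expected proportion of 0.17 (back-computed).
eta_hat = loglog(0.17)
print(round(eta_hat, 3))
# Applying the inverse link recovers the expected proportion.
print(round(inv_loglog(eta_hat), 2))
```

A linear predictor of roughly −0.57 therefore corresponds to an expected proportion of about 17% under the loglog link.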
As for modeling the precision, Table 3 shows that the covariates life expectancy (VIDA), government spending on education (EDUC), and alcohol consumption (ALC) were statistically relevant at 5% significance. Note that the higher the VIDA and EDUC values in the countries, the lower the data's precision and thus the greater the dispersion. Meanwhile, the higher the ALC values, the higher the precision, that is, the lower the dispersion of the data, making the mean response more precise. In short, modeling the data's variability is an approach that improves the inferential results.
The model's goodness-of-fit was assessed using the adjusted coefficient of determination (pseudo-R²) and the RESET test 33,34. Pseudo-R² is a global measure of the explained variation, analogous to the coefficient of determination used in linear regression models. This measure is defined as the square of the sample correlation coefficient between $\hat\eta$ and g(y) 19. Thus, with pseudo-R² = 0.69, the covariates are said to explain about 69% of the total variability in the proportion of obese adults in the countries. In addition, this measure takes values in the interval (0,1), and the closer it is to one, the better the model's goodness-of-fit or explanatory power.
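The definition of pseudo-R² as a squared correlation can be sketched in Python as follows, here with the loglog link used for the mean. The helper names, the observed proportions, and the fitted linear predictors are hypothetical illustrations, not study values.

```python
import math


def loglog(mu):
    """loglog link: g(mu) = -log(-log(mu))."""
    return -math.log(-math.log(mu))


def pearson(u, v):
    """Sample Pearson correlation coefficient."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def pseudo_r2(eta_hat, y):
    """Squared correlation between the linear predictor and g(y)."""
    return pearson(eta_hat, [loglog(v) for v in y]) ** 2


# Hypothetical observed proportions and fitted linear predictors.
y = [0.05, 0.12, 0.20, 0.31]
eta_hat = [-1.10, -0.80, -0.45, -0.20]
print(round(pseudo_r2(eta_hat, y), 3))
```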
The RESET test for beta regression models was used to test the model's correct specification 21,25,33. The test consists of adding the squared estimated linear predictor, $\hat\eta^2$, as a covariate in the submodel of the mean. The underlying idea is that if this covariate has some power to explain the response variable, we reject the null hypothesis of absence of specification errors, that is, the hypothesis that the proposed model has a correct functional form, with no omitted variables 34. Therefore, with p-value = 0.0075, we lack sufficient evidence to reject the null hypothesis that the model is well specified at the 5% level of significance.
The normal probability plot with simulated envelope is a technique for identifying deviations from the model's assumptions and possible discrepant observations. Figure 3 shows that the observations are distributed randomly within the envelope's limits and close to the central line, with only a small number of observations slightly exceeding these limits. Thus, we do not have sufficient evidence to question the model's adequacy.
It is further possible to estimate a given covariate's impact, such as that of the percentage of insufficient physical activity on the proportion of obese adults in the countries, as follows 22:

$$\frac{\partial E(y_t)}{\partial X_{tj}} = \beta_j\,\frac{d\,g^{-1}(\eta_t)}{d\eta_t},$$

where E(.) is the expected value. That is, one differentiates the mean response, $E(y_t) = g^{-1}(\eta_t)$, with respect to the target covariate whose individual effect one wishes to estimate.
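Under the loglog link, the derivative of the inverse link is exp(−η)·exp(−exp(−η)), so the impact of a covariate can be sketched as below. The slope and baseline values are hypothetical and are not the fitted estimates. The function names are illustrative only.

```python
import math


def inv_loglog(eta):
    """Inverse loglog link: mu = exp(-exp(-eta))."""
    return math.exp(-math.exp(-eta))


def impact(beta_j, eta):
    """d mu / d x_j = beta_j * exp(-eta) * exp(-exp(-eta)) for the loglog link."""
    return beta_j * math.exp(-eta) * inv_loglog(eta)


# Hypothetical slope and baseline linear predictor (not the fitted estimates).
beta_inat = 0.8
eta_rest = -0.9  # intercept plus the covariates held fixed
for inat in (0.1, 0.3, 0.5, 0.7):
    eta = eta_rest + beta_inat * inat
    print(inat, round(impact(beta_inat, eta), 4))
```

Evaluating this function over a grid of covariate values, with the other covariates fixed at chosen quantiles, yields impact curves of the kind described in the text.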
Thus, to estimate the impact curves that describe the effect of insufficient physical activity on the proportion of obese adults in the countries, three situations were considered, as shown in Figure 4, in which the covariates URB, ALC, and VIDA are fixed at their first, second, and third quartiles. It is thus possible to vary the values of INAT to determine the resulting increase in the mean response. As a result, the impact is positive and increases slowly as the levels of insufficient physical activity increase. In addition, there are no major differences between the curves in quantiles 0.50 and 0.75, and they decrease as the INAT values increase. That is, starting at a given value of INAT close to 0.50, no major increases occur in the mean response.

Figure 3
Normal probability plot with simulated envelope. Source: study data.

Figure 4
Impact of insufficient physical activity on the proportion of obese adults in 78 countries in 2014.

Final remarks
Given the above, we conclude that 50% of the 78 countries present obesity values greater than 0.20.In addition, their mean life expectancy oscillates around 72 years.Importantly, the levels of insufficient physical activity exceed 23.8% in 50% of the countries.Based on the boxplot analysis, a possible difference was observed in the proportions of obese adults in the Americas and Europe as compared to Africa and Asia.
The beta regression model used here found that the covariables percentage of insufficient physical activity, percentage of the population living in urban areas, life expectancy, and mean annual per capita alcohol intake have a significant and positive effect on obesity.That is, they tend to increase the proportion of obese adults when each of these variables is increased individually while maintaining the others constant.

Submitted on 17/Sep/2017
Final version resubmitted on 13/Mar/2018
Approved on 23/Mar/2018