COMPOSITIONAL STATISTICAL MODELS UNDER A BAYESIAN APPROACH: AN APPLICATION TO TRAFFIC ACCIDENT DATA IN FEDERAL HIGHWAYS IN BRAZIL

. This study considers the use of a composicional statistical model under a Bayesian approach using Markov Chain Monte Carlo simulation methods applied for road trafﬁc victims ocurring in federal roads of Brazil in a speciﬁed period of time. The main motivation of the present study is based on a database with information on the injury severity of each person involved in an accident occurred in federal highways in Brazil during a time period ranging from January, 2018 to April, 2019 reported by the federal highway police ofﬁce of Brazil. Four types of events associated with each injured person (uninjured, minor injury, serious injury and death) are grouped for each state of Brazil in each month characterizing compositional multivariate data. Such kind of data requires speciﬁc modeling and inference approaches that differ from the traditional use of multivariate models assuming multivariate normal distributions.The proportion events associated to the accidents (uninjured, minor injuries, serious injuries and deaths) are considered as a sample of vectors of proportions adding to a value one together with some covariates such as pavement conditions in each province, regions of Brazil, months and years that may affect the severity of the injury of each person involved in an accident. From the obtained results, it is observed that the proportions of serious accidents and deaths are affected by some covariates as the different regions of Brazil and years.


INTRODUCTION
A major world public health problem is related to traffic accidents where the death toll reached 1.35 million in 2016. With the fast increase of vehicles in circulation and the lack of monitoring infrastructure especially in third world countries the situation tends to get worse. According to a report from the World Health Organization  as progress is made in the prevention and control of infectious diseases, the number of deaths from non-communicable diseases and injuries has increased significantly in recent years. *Corresponding author Traffic is already responsible for the eighth cause of death in all age groups, where traffic injuries are currently the leading cause of death for children and young adults aged 5 to 29 years. An improvement in traffic deaths reduction has already been observed in more developed countries, but the situation is catastrophic in most emerging and poor countries. There is a strong association between the risk of death in traffic and the income level of the countries. With an average rate of 27.5 deaths per 100,000 inhabitants, the risk of death in traffic is three times higher in lowincome countries than in high-income countries, where the average rate is 8.3 deaths per 100,000 inhabitants. In addition, the number of road traffic fatalities is disproportionately high among low and middle-income countries relative to the size of their populations and the number of motor vehicles in circulation compared to the rest of the world (see Table 1).

METHODOLOGY
This study considered a database related to the victims of road accidents (victims of land transport accidents ICD-10 headings V01 to V89, World Health Organization, 2004) reported by the federal police (PF) of Brazil regarding all federal highways in the period ranging from January 1, 2018 to April 30, 2019 covering all states of the federation (https://www.prf.gov.br/portal/ dados-abertos/acidents) where the federal police reported for each victim the type of injury (unharmed, minor injury, serious injury and death) and some important factors such as cause of the accident, type of accident, phase of the day, weather condition, type of track, road layout, age of the victim, gender of the victim and type of vehicle. This information is described in the accident reports prepared by the road police officers for each road accident. In this paper the data are grouped in the form of monthly compositional data (observed proportion of uninjured, lightly injured, severely injured and injured who died at the accident site) for each federative unit in Brazil. The data set is presented in Table A1 in an appendix at the end of the manuscript. Table  2 shows the total of casualties in each class (unharmed, mild, severe, death) from January 1, 2018 to April 30, 2019 for all units of the federation. Figure 1 shows the box-plots of each class (unharmed, mild injury, severe injury and death) considering all federative units of the Brazil federation. Figure 2 shows the time series for the proportions %unharmed, %mild, %severe and %death. Figure

Modeling of Compositional Data
Compositional data are vectors of proportions specifying G fractions of a total. Denoting x = (x 1 , x 2 , . . . , x G ) to be a compositional vector, we must have x i > 0, for i = 1, . . . , G and x 1 + x 2 + . . . + x G = 1. Compositional data often result when raw data is normalized or when data is obtained as proportions of a certain heterogeneous amount. These conditions are usual in geol-  Thus, the compositional data introduced in Table A.1 are denoted by x 1i = % unharmed, x 2i = % mild injuries, x 3i = % severe injuries and x 4i = % deaths. Let us assume a model with additive ratio 7 log (ALR) transformation given by y 1i = log(x 2i /x 1i ), y 2i = log(x 3i /x 1i ) and y 3i = log(x 4i /x 1i ) given by, where β β β 1 , β β β 2 and β β β 3 are vectors of regression parameters, z i is a covariate vector associated to the i th observation i = 1, 2, . . . , 432, w i is a random effect (latent unobserved variable) that captures the dependency between the proportions for each province/month and ε ji are errors (non-observed variables) assumed to be independent random variables with normal distributions N(0, σ 2 j ). Different distributions could be assumed for the random effects w i ; in study, it is assumed a normal distribution N(0, σ 2 w ). For a hierarchical Bayesian analysis of the model, it is assumed normal prior distributions for the regression parameters with known hyperparameter values. For the second stage of the hierarchical Bayesian analysis, it is assumed a gamma prior distribution for the inverse of the variance σ 2 w of the latent variable w i , that is, where G(a, b) denotes a gamma distribution with mean a/b and variance a/b 2 ; τ j = 1/σ 2 w ; a w and b w are known hyperparameters. Further, it is assumed prior independence among the parameters.
Posterior summaries of interest for model (1) are obtained using simulated samples of the joint posterior distribution for the model parameters using MCMC methods. The simulation algorithm to generate samples of the joint posterior distribution for the model parameters is obtained from the complete conditional posterior distributions for each parameter required in the MCMC simulation algorithm. A great simplification in the simulation procedure is to use some existing Bayesian simulation software. One such software is the Openbugs software (see, for example, Lunn et al., 2009), where it is only needed to specify the joint distribution for the observations and the prior distributions for the parameters of the assumed model.
Associated with the compositional data, there are some covariates such as month, year and region of Brazil where the accident occurred. In addition to these covariates, other independent variables of interest may also be associated with the compositional responses, such as road condition, road layout, weather condition, and accident time. An important covariate in the occurrence of road accidents is given by the condition of the pavement. Table 3  For the analysis of the compositional data given in Table A.1, it is assumed the following covariates: month, year, percentage of pavement good, fair, bad, very bad (the optimum percentage is not considered due to restriction %optimum + %good + regular% +%bad + %very bad = 1) and the dummy variables related to the northeast (1 for NE and 0 otherwise), midwest (1 for CO and 8 COMPOSITIONAL STATISTICAL MODELS UNDER A BAYESIAN APPROACH 0 otherwise), southeast (1 for SE and 0 otherwise) and south (1 for S and 0 otherwise) regions where the northern region (N) is considered as a reference.
In the data analysis, it is first assumed a regression model with compositional data (1) not considering the presence of the latent factor W denoted as "model 1", that is, assuming independence among the responses in the additive log-ratio (ALR) transformation y 1i = log(x 2i /x 1i ), y 2i = log(x 3i /x 1i ) and y 3i = log(x 4i /x 1i ) where x 1i = % unharmed, x 2i = % minor injuries, x 3i = % severe injuries and x 4i = % deaths. Thus, it is assumed the linear regression models: where, and ε ji are independent assumed errors with normal distributions N(0, σ 2 j ), j = 1, 2, 3. From the ALR transformations assuming the real proportions p 1i , p 2i , p 3i and p 4i where, p 1i + p 2i + p 3i + p 4i = 1, we have, y 1i = log(x 2i /x 1i ), y 2i = log(x 3i /x 1i ) and y 3i = log(x 4i /x 1i ), and the inverse estimated proportions in each class are easily obtained from the expressions, where y 1i , y 2i , y 3i and y 4i are predicted values based on the estimated model.
Assuming normal independent prior distributions N(0,1) for all regression parameters and gamma distributions G(1,1) for the variances of the errors ε 1i , ε 2i and ε 3i , Table 4 shows the posterior summaries of interest (Monte Carlo estimators given by the posterior parameter means, posterior standard deviations of the parameters and 95% credibility intervals) based on 1000 simulated Gibbs samples (every 100th simulated sample among 100,000 generated Gibbs samples  1704  161  869  374  56  3164  PI  1572  161  1542  36  79  3390  PR  2815  163  2718  516  118  6330  RJ  1486  198  486  313  71  2554  RN  561  133  807  252  103  1856  RO  574  133  532  491  155  1885  RR  657  0  391  44  10  1102  RS  3776  814  3449  724  92  8855  SC  1277  276  1104  486  91  3234  SE  320  15  54  165  94  648  SP  6851  849  1836  419  28  9983  TO  708  50  2154  61  546  3519 to get an approximately uncorrelated sample) of the joint posterior distribution for all model parameters obtained using the Openbugs software and considering a burn-in sample of size 11,000 discarded to eliminate the effect of the initial parameter values needed for the MCMC algorithm. Convergence of the MCMC simulated samples was monitored by traceplots of the generated Gibbs samples (see Gelman et al., 2013) From the results presented in Table 4, it is observed that the significative effects (zero not included in the 95% credibility intervals) are: • Response y 2 = log(x 3 /x 1 ) where x 1 = % unharmed and x 3 = % serious injury: poor pavement (regression parameter β 27 is estimated by a negative value) and NE (northeast) region where regression parameter β 28 is estimated by a positive value indicating that the difference between x 3 = % of serious injuries and x 1 = % unharmed increases in the NE region when compared to the N region (north considered as reference).
• Response y 3 = log(x 4 /x 1 ) where x 1 = % unharmed and x 4 = % deaths: covariate year (regression parameter β 33 is estimated by a negative value indicating a decrease in the death/unharmed difference in the year 2019); NE (northeastern) region where the regression parameter β 38 is estimated by a positive value indicating that the difference between x 3 = % deaths and x 1 = % unharmed increases in the NE region as compared to the N region (north considered as reference); CO region (midwest) where the regression parameter β 39 is estimated by a positive value indicating that the difference between x 3 = % deaths and x 1 = % unharmed increases in the CO region when compared to the N region (north considered as reference); SE region (southeast) where the regression parameter β 310 is estimated by a positive value indicating that the difference between x 3 = % deaths and x 1 = % unharmed increases in the SE region when compared to the N region (north considered as reference); and region S (south) where the regression parameter β 311 is also estimated by a positive value indicating that the difference between x 3 = % deaths and x 1 = % unharmed increases in region S when compared with region N (north considered as reference).
Now assuming a regression model with compositional data defined by (1) and (4) in the presence of the latent factor W denoted by "model 2", that is, assuming dependence between the responses assuming a gamma distribution G(1,1) for the variance σ 2 w of the random factor w i with a normal distribution N(0, σ 2 w ) included in model (4), we have in Table 5, the posterior summaries of interest assuming the MCMC simulation method based on 1000 simulated Gibbs samples (every 400 th simulated samples among 400,000 generated Gibbs samples to get an approximately uncorrelated sample) of the joint posterior distribution for all model parameters obtained using Openbugs software and considering a burn-in sample of size 111,000 discarded to eliminate the effect of the initial parameter values needed for the MCMC algorithm. Convergence of the MCMC simulated samples was monitored by traceplots of the generated Gibbs samples.
From the results presented in Table 5, it is observed that the significative effects (zero not included in the 95% credibility intervals) are the same as those obtained using "model 1".
For the discrimination of the best model, it is used the Deviance Information Criterion (DIC). The DIC criterion (Spiegelhalter et al., 2014) is based on the posterior average of the deviance. Deviance is defined by, where θ θ θ is a vector of unknown parameters of the model; L(θ θ θ ) is the likelihood function and C is a constant (not always known) when comparing two models. The DIC criterion is defined by,  where D( θ θ θ ) is the posterior averaged deviation θ θ θ = E( θ θ θ /y) and p D is the number of model parameters, given by p D =D − D( θ θ θ ) whereD = E(D(θ θ θ /y) is the posterior mean of the deviation that measures the quality of data fit for each model. Table 6 shows the DIC values obtained from the generated Gibbs samples using the Openbugs software for both models considered in the data analysis.  Table 4, it can be observed that the "model 2" is better fitted by the data. Assuming "model 2", the estimated proportions for the four classes given by (5) and the observed proportions are presented in Figure 4. From the plots of Figure 4, it is observed good fit of model 2 to the compositional data associated to accident victims in Brazilian federal roads.

DISCUSSION OF THE RESULTS AND CONCLUDING REMARKS
From the obtained results usig ALR compositional models it is possible to get important conclusions on the study. Since the significative covariates affecting the responses y 2i = log(x 3i /x 1i ) and y 3i = log(x 4i /x 1i ), where x 1i = % unharmed, x 2i = % minor injuries, x 3i = % severe injuries and x 4i = % deaths are given by poor pavement, NE region and year, Figures 5, 6, 7 and 8 show the scatter plots associated to each response and covariate from where it is possible to get important interpretations for the compositional multivariate dataset. From the graphs of Figure 5, it is possible to observe that although there is great variability in the response y 2i there is a slight decreasing in the response with %lousy pavement (pavement in very poor condition) and an increasing in the response y 2i in the NE region when compared to the other regions of Brazil. From the graphs of Figure 6, it is possible to see an increasing in the response y 3i in the year 2019 when compared to the year 2018; an increasing in the response y 3i in the NE region when compared to the other regions of Brazil; a decreasing in the response y 3i in the CO region when compared to the other regions of Brazil; an apparently decreasing in the response y 3i in the SE region when compared to the other regions of Brazil and an apparently decreasing in the response y 3i in the S region when compared to the other regions of Brazil. From the graphs of Figure 7, it is possible to see an increasing of %severe injuries in the NE region when compared to the other regions of Brazil and a decreasing of %unharmed persons in the NE region; in relation to the factor very bad (lousy) pavement, it is difficult to see the effect in %unharmed and %severe injuries.
From the graphs of Figure 8, it is possible to observe that although there is great variability in the responses %deaths and %unharmed, we see a small increasing of %deaths related to the year 2019 when compared to the year 2018; similarly apparently there is an increasing of %deaths in the NE region when compared to the other regions of Brazil; a decreasing of %deaths in the SE and S regions.
In summary, from the obtained results, it is concluded that the rates of serious accidents and deaths are affected by some covariates as the regions of Brazil (especially the NE region where the rates for accidents with severe injuries and deaths are higher than the rates for the other regions of Brazil), years and some sligh effect of pavement conditions of the roads, which could be important for the road managers to take decisions to improve the road conditions in Brazil. This is an important result which could help in future decreasing of the high rates of severe injuries and deaths in the Brazilian federal roads.
As concluding remarks, it is possible to point out that the use of existing compositional Bayesian models could be of great interest in the data analysis of road accidents as seen in this study. It is important to point out that other prior distributions could be considered for the parameters of the model possibly incorporating with prior opinions of engineer experts in road traffic. The use of MCMC methods to get the posterior summaries of interest using free existing simulation softwares like the OpenBugs software could be a great option in the data analysis under a hierarchical Bayesian data analysis which only requires the specification of the likelihood function and the prior distributions for the parameters of the model. It is important to point out that other dependence structures could be assumed for the ALR transformed data, like a multivariate normal distribution for the errors in the compositional model (see for example, Shimizu et al., 2015).
In a future work the results of this study could be extended to the presence of other covariates as weather conditions, type of road (double lane and single lane), roads with and without tolls, period of day, speed of the vehicle at the moment of the accident and many other possible covariates that could affect the responses given by a proportion vector (x 1i = %unharmed, x 2i = %mild injuries , x 3i = %severe injuries and x 4i = %deaths).