
Comparison of Normal, Logistic, Laplace, and Student’s t distributions for experimental error in the Bayesian description of dry matter accumulation in Allium sativum

Comparação das distribuições Normal, Logística, Laplace e t de Student para o erro experimental na descrição bayesiana do acúmulo de matéria seca de Allium sativum

ABSTRACT:

This study assessed distributions associated with the error in Bayesian nonlinear modeling for the description of total plant dry matter accumulation (TDMA) of Allium sativum as a function of days after planting (DAP). According to the DIC criterion, the Logistic and Gompertz models with Student’s t error distribution exhibited the best fit, whereas the Logistic model exhibited the highest DIC with the Logistic error distribution. In general, the DIC differences across all scenarios did not exceed 5. The Bayes factor (BF) criterion showed no difference in the fit of the Logistic and Gompertz models when the four distributions are used for the errors, with BF values not exceeding 2. Posterior distributions and the usual estimators of the Logistic and Gompertz model parameters were similar even when the error distribution varied. In summary, the Bayes factor showed no difference among the four distributions associated with the error in modeling garlic plant growth, whereas alternating between error distributions significantly changed the number of Markov Chain Monte Carlo (MCMC) iterations.

Key words:
Bayesian regression; Nonlinear regression; MCMC; Symmetrical location-scale family; Empirical Bayes

RESUMO:

O objetivo deste trabalho foi avaliar algumas distribuições associadas ao erro na modelagem não linear bayesiana na descrição do acúmulo de matéria seca total da planta (MSTP) de Allium sativum em função dos dias após o plantio (DAP). Pelo critério DIC os modelos Logístico e Gompertz que utilizam a distribuição do erro t de Student apresentaram a melhor qualidade de ajuste, sendo que o modelo Logístico apresentou o maior DIC com a distribuição de erro Logística. No geral, a diferença de DIC em todos os cenários não apresentou valores superiores a cinco. Pelo critério do Fator de Bayes (FB), não houve diferença no ajuste do modelo Logístico e Gompertz quando se utilizam as quatro distribuições para os erros, sendo que os valores de FB não superaram 2. As distribuições a posteriori e os estimadores usuais dos parâmetros dos modelos Logístico e Gompertz apresentaram semelhanças mesmo variando a distribuição do erro. Em suma não houve diferença na utilização das quatro distribuições associadas ao erro na modelagem do crescimento planta de alho pelo fator de Bayes, sendo que os resultados mostram que alternar entre as distribuições dos erros altera de forma significativa o número de iterações de MCMC.

Palavras-chave:
Regressão bayesiana; Regressão não linear; MCMC; Família simétrica de locação-escala; Bayes empírico

INTRODUCTION:

Sigmoid models are widely used in agricultural science to describe animal and plant growth. Most of these models, including the Logistic and Gompertz, are analytical solutions of ordinary differential equations (STEWART, 2016).

According to CORDEIRO & DEMÉTRIO (2008), since the 1970s, nonlinear regression theory has restricted the use of nonlinear models (NLM) to the assumption of normality for the residual and, consequently, for the response variable. With the introduction of generalized linear models (GLM), NELDER & WEDDERBURN (1972) defined the response variables that belong to the exponential family. Likewise, CORDEIRO & PAULA (1989) defined the exponential family for a normal nonlinear model in which the systematic component is not a linear combination of the parameters.

Nonlinear regression with a normal error is susceptible to extreme observations. However, the assumption of normal error can be relaxed using different symmetrical and asymmetrical distributions, in both linear and nonlinear models under Bayesian estimation (DE LA CRUZ & BRANCO, 2009; ROSSI & SANTOS, 2014).

In terms of dimensionality, a dataset to be analyzed should contain at least 4 observations, so that the number of parameters is smaller than the number of observations (n > p). With maximum likelihood estimation, in addition to the 3 usual parameters of the Gompertz and Logistic models (α, β, γ) contained in the location parameter of a symmetrical nonlinear model, it is necessary to estimate the scale parameter (s) and, in the case of the Student’s t error, the degrees of freedom (ν). Dimensionality thus increases from 3 to 5, which cannot be estimated in a 4-point regression. In these cases, Bayesian regression is a necessary option.

In Bayesian inference, the prior distribution is selected before accessing the data, using approaches such as improper priors, conjugate distributions or meta-analysis. In FIRAT et al. (2016) and MACEDO et al. (2017), the Logistic and Gompertz parameters (α, β, γ) follow a normal prior distribution.

When the prior distribution is obtained from the data, the methodology is empirical. Inference that uses an empirical prior distribution is known as the empirical Bayesian approach, which is not strictly Bayesian inference, since the data are used twice, once in the likelihood function and once in the prior distribution. However, this methodology is a good approximation to Bayesian inference (CARLIN & LOUIS, 2008).

In terms of parametric inference, GUJARATI & PORTER (2012) and SOUZA (1998) report that significance tests and confidence intervals of normal nonlinear models are asymptotically valid, while F and t-tests, confidence intervals and confidence regions depend on the asymptotic normality of the parametric estimators. In Bayesian theory, the results of credible intervals and significance tests are valid irrespective of sample size.

This study compared the normal, Student’s t, Laplace and Logistic distributions of the symmetrical location-scale class for the experimental error, using the Bayesian methodology to describe TDMA of garlic as a function of days after planting (DAP).

MATERIALS AND METHODS:

The data used in this study were obtained from the UFV Germplasm Bank (BHG/UFV), which contains 89 Allium sativum fruit accessions. The experiment was conducted in the Zona da Mata region of Minas Gerais State, Brazil (20°45’S, 42°51’W, at 650 m of altitude), at the Universidade Federal de Viçosa, in randomized blocks with 8 repetitions. Accumulated plant dry matter (g) was determined at 60, 90, 120 and 150 DAP (independent variable with n = 4). Each DAP includes 8 repetitions, and the mean of each DAP was used.

The normal distribution is a member of the exponential family, the location-scale family and the symmetrical location-scale family. Distributions other than the normal also belong to the symmetrical location-scale family, which is generally denoted by the letter S. In this regression, the error follows an $\varepsilon_i \sim S(0, \sigma^2)$ distribution with density $f_{\varepsilon_i}(e_i; \mu, s) = s^{-1}\, g\{[s^{-1}(e_i - \mu)]^2\}$, where $\mu \in \mathbb{R}$ is the location parameter and $s > 0$ the scale parameter (CORDEIRO et al., 2000). Figure 1 presents the densities of the S distributions considered, namely the normal,

$$f_Y(y; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2\sigma^2}(y - \mu)^2\right],$$

the Logistic,

$$f_Y(y; \mu, s) = \frac{\exp\left(-\frac{y - \mu}{s}\right)}{s\left[1 + \exp\left(-\frac{y - \mu}{s}\right)\right]^2},$$

the Laplace,

$$f_Y(y; \mu, s) = \frac{1}{2s} \exp\left(-\frac{|y - \mu|}{s}\right),$$

and the Student’s t,

$$f_Y(y; \mu, s, \nu) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}\, s} \left[1 + \frac{1}{\nu}\left(\frac{y - \mu}{s}\right)^2\right]^{-\frac{\nu + 1}{2}}.$$

Figure 1
Normal, Logistic, Laplace and Student’s t model densities and their respective probability density functions, varying the location (µ), scale (s) and degrees of freedom (ν) parameters for the t error.

The nonlinear regression model is defined by $y_i = f(x_i, \theta) + \varepsilon_i$, and the response variable also follows an S distribution, with observations that are independent but not identically distributed, $Y_i \sim S(f(x_i, \theta), s)$. For the description of garlic plant growth, the Logistic, $f(x_i; \alpha, \beta, \gamma) = \alpha[1 + \beta \exp(-\gamma x_i)]^{-1}$, and Gompertz, $f(x_i; \alpha, \beta, \gamma) = \alpha \exp[-\beta \exp(-\gamma x_i)]$, models were used, where α is the asymptotic plant growth, β a value with no biological interpretation and γ the growth rate (RATKOWSKY, 1983).
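The two mean functions can be written directly from their definitions. The following is a minimal sketch; the function names and the parameter values are hypothetical and were chosen only to produce a sigmoid shape over the four DAP values, not taken from the fitted models.

```python
import math


def logistic_growth(x, alpha, beta, gamma):
    """Logistic mean function: alpha * [1 + beta * exp(-gamma * x)]^(-1)."""
    return alpha / (1.0 + beta * math.exp(-gamma * x))


def gompertz_growth(x, alpha, beta, gamma):
    """Gompertz mean function: alpha * exp[-beta * exp(-gamma * x)]."""
    return alpha * math.exp(-beta * math.exp(-gamma * x))


# Hypothetical parameter values, used only for illustration.
dap = [60, 90, 120, 150]
alpha, beta, gamma = 20.0, 500.0, 0.06

logistic_curve = [logistic_growth(x, alpha, beta, gamma) for x in dap]
gompertz_curve = [gompertz_growth(x, alpha, beta, gamma) for x in dap]
```

Both curves increase with DAP and approach the asymptote α from below, which is the behavior the parameter interpretation above describes.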

In order to implement Bayesian regression, only accession 63 was used to fit the Logistic and Gompertz models. The definition of the nonlinear regression considers the Bayes theorem, where the joint posterior $\pi(\alpha, \beta, \gamma, \tau, \nu \mid y)$ is proportional to the product of the likelihood function of the response variable $L(\alpha, \beta, \gamma, \tau, \nu \mid y)$ and the joint prior distribution $\pi(\alpha, \beta, \gamma, \tau, \nu)$, where $y = (y_1, y_2, \ldots, y_n)$:

$$\pi(\alpha, \beta, \gamma, \tau, \nu \mid y) \propto L(\alpha, \beta, \gamma, \tau, \nu \mid y)\, \pi(\alpha, \beta, \gamma, \tau, \nu) \quad (1)$$

The posterior distribution of each parameter was obtained by the Markov Chain Monte Carlo (MCMC) method using the OpenBUGS interface, where the posterior full conditional distribution (PFCD) of each parameter depends on the individual posterior distribution.

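The likelihood function is the product of the probability density functions (PDF) of $Y_i$, which has an S distribution:

$$L(y_i; \alpha, \beta, \gamma, s, \nu) = \prod_{i=1}^{n} \frac{1}{s}\, g\left\{\left[\frac{y_i - f(x_i; \alpha, \beta, \gamma)}{s}\right]^2\right\} \quad (2)$$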
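As an illustration of this likelihood, the sketch below evaluates the log-likelihood of the Logistic growth model under each of the four error families, using the standard forms of the normal, Logistic, Laplace and Student's t densities. It is not the OpenBUGS code used in the study; the function names, the data values and the parameter values are hypothetical.

```python
import math


def log_density(e, s, family, nu=None):
    """Log-density of a zero-centered error e with scale s."""
    z = e / s
    if family == "normal":
        return -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * z * z
    if family == "logistic":
        # The density is symmetric in z, so |z| is used for numerical stability.
        a = abs(z)
        return -a - math.log(s) - 2.0 * math.log1p(math.exp(-a))
    if family == "laplace":
        return -math.log(2.0 * s) - abs(z)
    if family == "student_t":
        return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi) - math.log(s)
                - 0.5 * (nu + 1.0) * math.log1p(z * z / nu))
    raise ValueError(f"unknown family: {family}")


def logistic_growth(x, alpha, beta, gamma):
    """Logistic mean function: alpha * [1 + beta * exp(-gamma * x)]^(-1)."""
    return alpha / (1.0 + beta * math.exp(-gamma * x))


def log_likelihood(y, x, theta, s, family, nu=None):
    """Sum of the error log-densities over the observations."""
    return sum(
        log_density(yi - logistic_growth(xi, *theta), s, family, nu)
        for xi, yi in zip(x, y)
    )


# Hypothetical data and parameter values, used only for illustration.
dap = [60, 90, 120, 150]
tdma = [1.4, 5.9, 14.2, 18.6]
theta = (20.0, 500.0, 0.06)

for family in ("normal", "logistic", "laplace", "student_t"):
    print(family, log_likelihood(tdma, dap, theta, s=1.0, family=family, nu=4.0))
```

Working on the log scale avoids underflow in the product of densities and is also the quantity needed later for DIC and Bayes factor calculations.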

With regard to the determination of prior distributions, the precision parameter is defined as $\tau = 1/s \sim U(10^{-1}, 10^{2})$ for models (b), (c) and (d), and $\nu \sim U(2, 10)$ for the t case. All the choices involving uniform priors are based on Laplace’s principle of insufficient reason. The PFCD of $\sigma^2$ for the normal S case is known and has an inverse Gamma distribution:

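$$\sigma^2 \mid y_i \sim GI\left(\frac{n}{2} + a,\; b + \frac{1}{2}\sum_{i=1}^{n}\left[y_i - f(x_i; \alpha, \beta, \gamma)\right]^2\right) \quad (3)$$

where $\tau = 1/\sigma^2 \sim \text{Gamma}(a, b)$ and n = 4.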

In order to implement the empirical prior distribution for the vector (α, β, γ) and for σ² in the normal case, Gauss-Newton estimates were obtained for the Logistic and Gompertz models of 50 accessions from the database (only 50 of the 89 accessions are sigmoid). These estimates were used to obtain the histograms and densities of each parameter estimate and thereby determine an adequate candidate distribution. After the probability distributions were determined, the hyperparameters were obtained by equating the mean and variance of these distributions to the sample mean and sample variance, respectively.

$$E(X) = n^{-1}\sum_{i=1}^{n} \hat{\theta}_i, \qquad V(X) = S^2_{\hat{\theta}} \quad (4)$$
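The moment-matching step can be sketched as follows. The candidate distributions chosen for each parameter are not listed here, so the normal and Gamma cases below are assumptions made for illustration, as are the function names and the input values. For a Gamma distribution with shape a and rate b, the mean is a/b and the variance is a/b², which gives a = mean²/variance and b = mean/variance.

```python
import statistics


def normal_hyperparameters(estimates):
    """Match a normal prior to the sample mean and sample variance."""
    return statistics.mean(estimates), statistics.variance(estimates)


def gamma_hyperparameters(estimates):
    """Match a Gamma(shape, rate) prior to the sample mean and sample variance."""
    m = statistics.mean(estimates)
    v = statistics.variance(estimates)
    return m * m / v, m / v


# Hypothetical Gauss-Newton estimates of one parameter across accessions.
alpha_hat = [18.2, 21.5, 19.9, 23.1, 20.4]

prior_mean, prior_var = normal_hyperparameters(alpha_hat)
shape, rate = gamma_hyperparameters(alpha_hat)
```

A positive parameter such as a precision or an asymptote would be a natural case for the Gamma form, while the normal form matches the priors used by FIRAT et al. (2016) and MACEDO et al. (2017).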

Figure 2 shows the results of this methodology for the 50 database accessions analyzed in the present study and the results computed were used as empirical prior.

Figure 2
Density of the Gauss-Newton estimates of the parameters (α, β, γ, σ²) of the Logistic and Gompertz models, computed with the density function of ggplot2, and their empirical priors determined by equation 4.

The Heidelberger-Welch and Geweke criteria were used to analyze the convergence of the MCMC chains. The values of nIter (number of iterations), nBurnin (values discarded in the initial iterations of the chain) and nThin (jump between retained values) were selected to meet the two criteria simultaneously.
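The roles of nBurnin and nThin, and the idea behind the Geweke criterion, can be illustrated with a short sketch. This is a simplified version and not the diagnostic applied in the study: the standard Geweke test estimates the variances from the spectral density at frequency zero, whereas the version below uses plain sample variances, which is only reasonable for thinned, approximately uncorrelated draws. The function names and the 10% and 50% window fractions are assumptions.

```python
import math
import statistics


def burn_and_thin(chain, n_burnin, n_thin):
    """Discard the first n_burnin draws and keep every n_thin-th draw after that."""
    return chain[n_burnin::n_thin]


def geweke_z_naive(chain, first=0.1, last=0.5):
    """Z-score comparing the mean of the first and last parts of a chain.

    Simplification: variances are plain sample variances, so autocorrelation
    within each window is ignored.
    """
    n = len(chain)
    head = chain[: int(first * n)]
    tail = chain[n - int(last * n):]
    var_head = statistics.variance(head) / len(head)
    var_tail = statistics.variance(tail) / len(tail)
    return (statistics.mean(head) - statistics.mean(tail)) / math.sqrt(var_head + var_tail)


# A chain that is still drifting gives a large |z|; a stable one gives z near 0.
drifting = [float(i) for i in range(1000)]
stable = [float(i % 2) for i in range(1000)]
z_drifting = geweke_z_naive(drifting)
z_stable = geweke_z_naive(stable)
```

Under this kind of check, a chain is usually taken as compatible with convergence when |z| stays within roughly 2, and nBurnin and nThin are increased until that holds.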

After the nThin and nBurnin values are obtained, the OpenBUGS interface computes the Deviance Information Criterion (DIC) to compare the models fitted by the Bayesian methodology. The Bayes factor, a measure of plausibility, was also used, given that DIC may be affected by posteriors that are bimodal or asymmetrical. The Bayes factor is defined as the ratio of marginal likelihoods $BF = p(y \mid M_1)\, p^{-1}(y \mid M_2)$, with $M_i$ being the i-th model to be compared, and the value of each $p(y)$ is estimated by the harmonic mean:

$$\hat{p}(y) = \left[\frac{1}{G}\sum_{g=1}^{G} \frac{1}{f(y \mid \theta^{(g)})}\right]^{-1} \quad (5)$$

where G is the number of generated MCMC values, $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ and $y = (y_1, y_2, \ldots, y_n)$.
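Equation 5 can be evaluated on the log scale to avoid overflow when the likelihood values are very small. The sketch below assumes that the log-likelihood of the data has already been evaluated at each retained MCMC draw; the function names and the input values are hypothetical.

```python
import math


def log_marginal_harmonic(log_liks):
    """Harmonic mean estimate of log p(y) from per-draw log-likelihoods.

    log p_hat(y) = log G - logsumexp(-log_liks), which is equation 5 in logs.
    """
    neg = [-ll for ll in log_liks]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(log_liks)) - log_sum


def bayes_factor(log_liks_m1, log_liks_m2):
    """Bayes factor of model 1 against model 2 from their per-draw log-likelihoods."""
    return math.exp(log_marginal_harmonic(log_liks_m1) - log_marginal_harmonic(log_liks_m2))


# Hypothetical per-draw log-likelihoods for two competing models.
log_liks_a = [-9.8, -10.1, -10.4, -9.9]
log_liks_b = [-10.2, -10.0, -10.6, -10.3]
bf_ab = bayes_factor(log_liks_a, log_liks_b)
```

A value of bf_ab close to 1 indicates that neither model is favored, which is how BF values below 2 are read in the results.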

RESULTS AND DISCUSSION:

Table 1 shows the usual estimators of the parameter posteriors. According to the DIC comparison criterion, the Logistic (6.94) and Gompertz (4.94) models exhibited the lowest values when the Student’s t error is considered. In line with the Bayesian modeling of quail growth curves by ROSSI & SANTOS (2014), the Student’s t error has a smaller DIC than the normal error. In all the scenarios, except for the normal error, the Gompertz model obtained lower DICs than its Logistic counterpart. According to the Bayes factor criterion, in all the error distribution scenarios there was no evidence that either the Logistic or the Gompertz model is more plausible, and the values obtained were less than 2. The high β estimates obtained for the Logistic and Gompertz models in the present study were also reported by MACEDO et al. (2017), who analyzed the dry matter accumulation of garlic using frequentist and Bayesian regression.

Table 1 and Figure 3 show the (α, β, γ) posteriors of the Logistic and Gompertz models, which exhibit similar graphs and values when the normal (graphs (a), (b) and (c)), Logistic (graphs (e), (f) and (g)), Laplace (graphs (i), (j) and (k)) and Student’s t (graphs (m), (n) and (o)) errors are considered. This demonstrates that alternating the error distribution had little influence on the usual estimators and posterior distributions.

Table 1
Estimates of the parameters (α, β, γ, τ, ν) with the usual mean estimators, HPD credible intervals with lower (LL) and upper (UL) limits, Bayes factor (BF) and Deviance Information Criterion (DIC).

Figure 3
Posterior distributions of the parameters (α, β, γ, τ) of the Logistic and Gompertz models fitted to accession 63, considering an error with normal, Logistic, Laplace and Student’s t distribution.

In graph (q) of Figure 3, the posterior density of ν, in the case of the Student’s t error, exhibited a uniform trend, indicating that the prior dominates the likelihood. This does not occur in (h), (l) and (p), whose graphs are more informative. Similarly, in the Bayesian growth modeling of the “neguinho” and “carioca” bean cultivars under a uniform prior, MARTINS FILHO et al. (2008) obtained (α, β, γ) posteriors with a uniform and a more informative trend, respectively.

In computational terms, the MCMC iterative process and the convergence analysis required some computational time, as explained by PEREIRA et al. (2022). Figure 4 shows the nIter values of the Markov chains that each model needed in the 4 error scenarios. In all the scenarios, except the Gompertz model with normal error, nThin did not exceed 20.

Figure 4
Log-scale graph of the number of iterations (nIter) of the Logistic and Gompertz models considering the Laplace, Logistic, normal and Student’s t distributions.

The Gompertz model with normal error needed an nThin of 1100 to control the high autocorrelation of its chains, which contributed to the nIter of 4 × 10^6. When considering the Logistic error, the same Gompertz model needed an nThin and nIter of 20 and 40,000, respectively, which reveals computational economy. All the chains created in this process passed the Geweke and Heidelberger-Welch tests.

The results of the Bayes factor, the DIC and the fitted graphs (a) and (b) of Figure 5 show no difference among the 4 error distributions in modeling garlic plant growth. Alternating between the 4 symmetrical distributions for the error significantly alters the nIter values and, as such, it is up to the researcher to select the error distribution with the highest computational economy.

Figure 5
Logistic and Gompertz models fitted to the mean of each DAP, considering the error with normal, Logistic, Laplace and Student’s t distribution.

CONCLUSION:

There was no difference among the normal, Logistic, Laplace and Student’s t distributions for the experimental error in the Bayesian nonlinear modeling of garlic using the Logistic and Gompertz models. There are significant differences in the number of MCMC iterations required for each error distribution.

ACKNOWLEDGEMENTS

The present study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brasil - Finance code 001.

REFERENCES

  • CARLIN, B. P.; LOUIS, T. A. Bayesian Methods for Data Analysis. 2 ed. United States: CRC Press, 2008, 246p.
  • CORDEIRO, G. M.; DEMÉTRIO, C. G. B. Modelos lineares generalizados e extensões. Pernambuco, 2008.
  • CORDEIRO, G. M. et al. Corrected maximum-likelihood estimation in a class of symmetric nonlinear regression models. Statistics & Probability Letters, v.46, p.317-328, 2000. Available from: <https://doi.org/10.1016/S0167-7152(99)00118-2>. Accessed: Feb. 10, 2022. doi: 10.1016/S0167-7152(99)00118-2.
  • DE LA CRUZ, R.; BRANCO, M. Bayesian analysis for nonlinear regression model under skewed errors, with application in growth curves. Biometrical Journal, v.51, p.588-609, 2009. Available from: <http://dx.doi.org/10.1002/bimj.200800154>. Accessed: Feb. 10, 2022. doi: 10.1002/bimj.200800154.
  • FIRAT, M. Z. et al. Bayesian analysis for the comparison of Nonlinear Regression Model Parameters: an Application to the Growth of Japanese Quail. Revista Brasileira de Ciência Avícola, v.18, p.19-26, 2016. Available from: <https://doi.org/10.1590/1806-9061-2015-0066>. Accessed: Nov. 10, 2022. doi: 10.1590/1806-9061-2015-0066.
  • GUJARATI, D. M.; PORTER, D. C. Econometria básica. 5 ed. Rio de Janeiro: AMGH ltda, 2012, 924p.
  • MACEDO, L. R. et al. Bayesian inference for the fitting of dry matter accumulation curves in garlic plants. Pesquisa Agropecuária Brasileira, v.52, n.8, p.572-581, 2017. Available from: <https://doi.org/10.1590/S0100-204X2017000800002>. Accessed: Nov. 10, 2022. doi: 10.1590/S0100-204X2017000800002.
  • MARTINS FILHO, S. et al. Abordagem Bayesiana das curvas de crescimento de duas cultivares de feijoeiro. Ciência Rural, Santa Maria, v.38, n.6, p.1516-1521, 2008. Available from: <https://doi.org/10.1590/S0103-84782008000600004>. Accessed: Nov. 10, 2022. doi: 10.1590/S0103-84782008000600004.
  • NELDER, J. A.; WEDDERBURN, R. W. M. Generalized linear Models. Journal of the Royal Statistical Society. Series A (General), v.135, n.3, p.370-84, 1972. Available from: <https://doi.org/10.2307/2344614>. Accessed: Nov. 10, 2022. doi: 10.2307/2344614.
  • PEREIRA, A. A. et al. Bayesian modeling of the coffee tree growth curve. Ciência Rural, v.52, n.9, 2022. Available from: <https://doi.org/10.1590/0103-8478cr20210275>. Accessed: Nov. 10, 2022. doi: 10.1590/0103-8478cr20210275.
  • RATKOWSKY, D. A. Nonlinear regression modeling. New York: Dekker, 1983.
  • ROSSI, R. M.; SANTOS, L. A. Bayesian modeling growth curves for quail assuming skewness in errors. Semina: Ciências Agrárias, v.35, p.1637, 2014. Available from: <https://doi.org/10.5433/1679-0359.2014v35n3p1637>. Accessed: Nov. 10, 2022. doi: 10.5433/1679-0359.2014v35n3p1637.
  • SOUZA, G. S. Introdução aos modelos de Regressão Linear e não linear. Brasília: Embrapa, 1998. 480p.
  • STEWART, J. Cálculo volume 2. São Paulo: Cengage Learning, 2016. 672p.

  • CR-2023-0056.R2

Edited by

Editors: Leandro Souza da Silva (0000-0002-1636-6643); Alessandro Dal’Col Lúcio (0000-0003-0761-4200)

Publication Dates

  • Publication in this collection
    01 Mar 2024
  • Date of issue
    2024

History

  • Received
    01 Feb 2023
  • Accepted
    14 Aug 2023
  • Reviewed
    31 Jan 2024
Universidade Federal de Santa Maria, Centro de Ciências Rurais, 97105-900 Santa Maria, RS, Brazil. Tel.: +55 55 3220-8698, Fax: +55 55 3220-8695
E-mail: cienciarural@mail.ufsm.br