Abstract
We define two new flexible families of continuous distributions to fit real data by compounding the Marshall–Olkin class and the power series distribution. These families are very competitive with the popular beta and Kumaraswamy generators. Their densities have linear representations in terms of exponentiated densities. In fact, as the main properties of thirty-five exponentiated distributions are well known, we can easily obtain several properties of about three hundred fifty distributions using the references of this article and five special cases of the power series distribution. We provide a package implemented in R that shows numerically the precision of one of the linear representations. This package is useful for calculating numerical values of some statistical measures of the generated distributions. We estimate the parameters by maximum likelihood. We define a regression based on one of the two families. The usefulness of a generated distribution and the associated regression is demonstrated empirically.
Key words
generating function; Marshall–Olkin family; maximum likelihood; moment; distribution
Introduction
The Marshall–Olkin (“MO”) family (Marshall & Olkin 1997) adds one parameter to a parent distribution. Let \(G(x)=G(x;\boldsymbol{\eta})\) be the parent cumulative distribution function (cdf) of a random variable with parameter vector \(\boldsymbol{\eta}\). The survival function and probability density function (pdf) of \(G\) are \(\bar{G}(x)=1-G(x)\) and \(g(x)=dG(x)/dx\), respectively.
The cdf and survival function of the MO class with baseline \(G\) are
\[F(x)=\frac{G(x)}{1-\bar{\alpha}\,\bar{G}(x)} \tag{1}\]
and
\[\bar{F}(x)=\frac{\alpha\,\bar{G}(x)}{1-\bar{\alpha}\,\bar{G}(x)}, \tag{2}\]
respectively, where \(\alpha>0\) and \(\bar{\alpha}=1-\alpha\).
Equation (1) can generate many continuous distributions from popular ones. The MO-G density function can be expressed as
\[f(x)=\frac{\alpha\,g(x)}{[1-\bar{\alpha}\,\bar{G}(x)]^{2}}. \tag{3}\]
For \(\alpha=1\), \(f(x)=g(x)\) is the simplest case of (3). Marshall & Olkin (1997) pioneered the MO-Weibull (MOW) distribution, which is a useful extension of the Weibull.
Consider random variables \(Z_1,\ldots,Z_N\) independent and identically distributed (i.i.d.) with cdf \(F(x)\) and pdf \(f(x)\) given by (1) and (3), respectively. Here, \(N\) is a discrete random variable with support \(\{1,2,\ldots\}\). Henceforth, let \(X=\max\{Z_1,\ldots,Z_N\}\) and \(Y=\min\{Z_1,\ldots,Z_N\}\) be two random variables, assuming that \(N\) has the zero-truncated power series (PS) distribution with probability mass function (pmf)
\[P(N=n)=\frac{a_n\,\theta^{n}}{C(\theta)},\quad n=1,2,\ldots, \tag{4}\]
where \(a_n\ge 0\) (for \(n\ge 1\)), \(\theta>0\) is called the power parameter and \(C(\theta)=\sum_{n=1}^{\infty}a_n\,\theta^{n}\) is finite. The probability generating function (pgf) of \(N\) is \(C(\theta t)/C(\theta)\).
Five important distributions are special cases of (4): the zero-truncated Poisson, logarithmic, negative binomial, geometric and zero-truncated binomial distributions.
The cdf of \(X\) conditional on \(N=n\) is \(F(x)^{n}\), and then the unconditional cdf of \(X\) follows from (4) as
\[F_X(x)=\frac{C(\theta\,F(x))}{C(\theta)}. \tag{5}\]
The conditional cdf of \(Y\) given \(N=n\) is \(1-\bar{F}(x)^{n}\), where \(\bar{F}(x)=1-F(x)\) is the survival function in (2), and then the unconditional cdf of \(Y\) follows from (4) as
\[F_Y(x)=1-\frac{C(\theta\,\bar{F}(x))}{C(\theta)}. \tag{6}\]
Equations (5) and (6) define two Marshall–Olkin Power Series-G (MOPS-G) families under baseline G. They provide a strong motivation for explaining the failure time of any mechanism formed by an unknown number of identical and independent (parallel or serial) components. The densities of \(X\) and \(Y\) are obtained by differentiating (5) and (6). We emphasize that these equations can generate many MOPS models. For each baseline G, we can generate ten (\(2\times 5\)) associated models from the five discrete distributions in Equation (4). For \(\alpha=1\), we have the Power Series-G (PS-G) classes under baseline G.
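As a rough numerical illustration of this construction, the following Python sketch evaluates the cdfs of the maximum and the minimum through the series \(C(\cdot)\) of the power series distribution, taking the zero-truncated Poisson case (\(a_n=1/n!\), so \(C(\theta)=\mathrm{e}^{\theta}-1\)) as an example. The function names and the parameter values are hypothetical and are not taken from the MarshallOlkinPSG package; the closed form is compared with the truncated mixture sum over \(n\).

```python
import math


def mo_cdf(g, alpha):
    """Marshall-Olkin cdf G / (1 - (1 - alpha) * (1 - G)) for a baseline cdf value g."""
    return g / (1.0 - (1.0 - alpha) * (1.0 - g))


def c_poisson(theta):
    """C(theta) = exp(theta) - 1 for the zero-truncated Poisson (a_n = 1/n!)."""
    return math.expm1(theta)


def cdf_max(g, alpha, theta, C=c_poisson):
    """cdf of the maximum: C(theta * F) / C(theta), with F the MO-G cdf."""
    return C(theta * mo_cdf(g, alpha)) / C(theta)


def cdf_min(g, alpha, theta, C=c_poisson):
    """cdf of the minimum: 1 - C(theta * (1 - F)) / C(theta)."""
    return 1.0 - C(theta * (1.0 - mo_cdf(g, alpha))) / C(theta)


def cdf_max_series(g, alpha, theta, terms=60):
    """Same cdf from the truncated mixture sum of P(N = n) * F**n."""
    F = mo_cdf(g, alpha)
    total = sum(theta**n / math.factorial(n) * F**n for n in range(1, terms + 1))
    return total / c_poisson(theta)


# Illustrative values only: baseline cdf value 0.4, alpha = 2, theta = 1.5.
print(cdf_max(0.4, 2.0, 1.5), cdf_max_series(0.4, 2.0, 1.5))
print(cdf_min(0.4, 2.0, 1.5))
```

Passing a different `C` (for instance `lambda t: t / (1 - t)` for the geometric case with \(0<\theta<1\)) would give the corresponding member of the family under the same assumptions.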
The minimum and maximum statistics can be applied in several series and parallel systems with identical components and have many industrial and biological applications. In parallel systems, the minimum models the time of the first component to fail, while the maximum models the time to the breakdown of the system. A dual interpretation can be given for systems with serial components. These random variables are also very useful in oncology. For example, suppose we are studying the recurrence of a certain type of cancerous tumor in an individual after some kind of treatment. Then, the time for the first cell to activate and produce cancer cells can be modeled by the generated distribution of the minimum, while the disease manifestation (if it occurs only after an unknown number of factors have been active) can be modeled by the generated distribution of the maximum.
Four new distributions based on the MOPS construction are introduced for illustrative purposes in Section Four special models. We derive linear representations for the densities of the maximum and minimum in Section Expansions. A package in R is presented in Section Numerical evaluation to calculate numerically several mathematical properties of the generated distributions based on the linear representations. General structural properties of the two families are addressed in Section Properties. In Section Estimation, we estimate the parameters of one of the families. We introduce in Section Regression the Marshall–Olkin Truncated Poisson Weibull regression defined from one of the families. In Section Two simulation studies, some simulations examine the accuracy of the maximum likelihood estimates (MLEs) and the behavior of the quantile residuals (qrs). Two applications illustrate the utility of our findings in Section Applications. Finally, we offer concluding remarks in Section Conclusions.
Four special models
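First, consider the zero-truncated Poisson in (4), for which \(a_n=1/n!\) and \(C(\theta)=\mathrm{e}^{\theta}-1\). The cdfs of the Marshall–Olkin Zero-Truncated Poisson-G (MOTP-G) distributions are determined from Equations (5) and (6) as
\[F_X(x)=\frac{\exp[\theta\,F(x)]-1}{\mathrm{e}^{\theta}-1} \tag{7}\]
and
\[F_Y(x)=1-\frac{\exp[\theta\,\bar{F}(x)]-1}{\mathrm{e}^{\theta}-1}, \tag{8}\]
where \(F(x)\) and \(\bar{F}(x)\) are the MO-G cdf and survival function in (1) and (2), \(\theta>0\), and \(X\) and \(Y\) denote the maximum and the minimum, respectively.
The Weibull cdf with scale parameter and shape parameter is (for )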
Then, the cdf and survival function of the MO-Weibull (MOW) distribution are
and respectively.
By inserting the last two formulae in Equations (7) and (8) and differentiating the resulting expressions, we obtain the MOTP-Weibull (MOTPW) densities
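Second, consider the geometric distribution in (4), for which \(a_n=1\), \(C(\theta)=\theta/(1-\theta)\) and \(0<\theta<1\). The cdfs of the Marshall–Olkin Geometric-G (MOG-G) classes follow from Equations (5) and (6)
\[F_X(x)=\frac{(1-\theta)\,F(x)}{1-\theta\,F(x)} \tag{13}\]
and
\[F_Y(x)=1-\frac{(1-\theta)\,\bar{F}(x)}{1-\theta\,\bar{F}(x)}, \tag{14}\]
where \(F(x)\) and \(\bar{F}(x)\) are the MO-G cdf and survival function in (1) and (2), and \(X\) and \(Y\) denote the maximum and the minimum, respectively.
The Burr XII (BXII) cdf is (for \(x>0\))
\[G(x)=1-(1+x^{c})^{-k}, \tag{15}\]
where \(c>0\) and \(k>0\) are shape parameters. For \(k=1\) and \(c=1\) in Equation (15), we have the log-logistic (LL) and Lomax distributions, respectively.
Hence, the cdf and survival function of the Marshall–Olkin Burr XII (MOBXII) distribution are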
and respectively.
By inserting the last two formulae in Equations (13) and (14) and differentiating the resulting expressions, we obtain the MOG-Burr XII (MOGBXII) densities
and
For the MOTPW and MOGBXII distributions of the maximum given in (11) and (18), some plots of the densities and cumulative functions are displayed in Figures 1 and 2, respectively. The various forms of the densities indicate more flexibility than the parent distributions.
Plots of the density and cumulative functions of the MOTPW distribution under four scenarios. (a) , , , and varying . (b) , , , and varying . (c) , , , and varying . (d) , , , and varying .
Plots of the density and cumulative functions of the MOGBXII distribution under four scenarios. (a) , , , and varying . (b) , , , and varying . (c) , , , and varying . (d) , , , and varying .
Figure 3 shows increasing, decreasing, and unimodal shapes for the hazard rate function (hrf) of the MOTPW distribution. It also shows a slightly different hrf with an increasing-decreasing-increasing shape.
Graphics comparing the histograms from two simulated data sets and the MOTPW and MOGBXII densities of under specified parameters are reported in Figure 4. They show good agreement between the simulated values and these densities.
Expansions
We obtain useful linear representations for the density functions of and for two separated cases and . For , we have .
By inserting (1) in Equation (5) and letting , we can write
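First, we consider the density of the maximum when . For \(|z|<1\) and \(b>0\), the negative binomial expansion holds
\[(1-z)^{-b}=\sum_{k=0}^{\infty}\frac{\Gamma(b+k)}{\Gamma(b)\,k!}\,z^{k}. \tag{21}\]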
Expanding as in Equation (21) since , we have
Henceforth, let \(W_a\) denote the exponentiated-G (exp-G) random variable with power parameter \(a>0\). Its cdf and pdf are \(G(x)^{a}\) and \(a\,g(x)\,G(x)^{a-1}\), respectively. Many exp-G properties have been studied exhaustively by several authors (Tahir & Nadarajah 2015). We can write
where for and
Further, using the binomial theorem, we obtain
where for .
By differentiating the last equation, we obtain
We now move to the density of the maximum when . We modify the denominator in (20)
and then apply Equation (21) to find
where (for and ). By differentiating , the density of follows as
Next, we consider the density of the minimum . By inserting (2) in Equation (6), we have
For , we apply expansion (21) in the last equation to
where for and
By using the binomial theorem in , we have
where for .
By differentiating , the density of can be expressed as
We now obtain the density of when . By changing the denominator in Equation (24), we have
Applying expansion (21) in the last equation
where (for and ).
Using the binomial theorem, we can rewrite as
where for
By simple differentiation
where is the exp-G density with power parameter .
Equations (22), (23), (25) and (26) are the main results of this section. These linear representations have great utility for deriving structural properties of the maximum and minimum from well-known exp-G properties. More than thirty-five exp-G models have been studied so far, so it is possible to construct at least three hundred fifty (\(35\times 10\)) MOPS-G models with properties determined from those exp-G properties. In practice, truncating the sums at about ten terms in standard statistical platforms gives precise results.
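The linear representations above rest on the negative binomial expansion \((1-z)^{-b}=\sum_{k\ge 0}\Gamma(b+k)\,z^{k}/[\Gamma(b)\,k!]\) for \(|z|<1\) and \(b>0\). The short Python sketch below, with hypothetical function names and illustrative values of \(z\) and \(b\), shows how the truncation error of this single expansion can be inspected as the number of terms grows. It is not the package's implementation of Equations (22) to (26).

```python
def neg_binomial_series(z, b, terms):
    """Partial sum of sum_k Gamma(b + k) / (Gamma(b) * k!) * z**k over k < terms."""
    total = 0.0
    coef = 1.0  # holds Gamma(b + k) / (Gamma(b) * k!), starting at k = 0
    for k in range(terms):
        total += coef * z**k
        coef *= (b + k) / (k + 1)  # ratio of consecutive coefficients
    return total


# Illustrative values only.
z, b = 0.3, 2.5
exact = (1.0 - z) ** (-b)
for terms in (5, 10, 20):
    print(terms, abs(neg_binomial_series(z, b, terms) - exact))
```

The error depends strongly on how close \(|z|\) is to one, so the number of terms needed in a given application should be checked rather than assumed.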
Numerical evaluation
In order to evaluate the analytical results presented in the previous sections, a package was implemented in the R programming language (R Core Team 2022). The MarshallOlkinPSG package was constructed in a generic way, that is, its most important functions accept any baseline G distribution and even a user-supplied zero-truncated PS distribution.
The library code can be obtained from GitHub at https://github.com/prdm0/MarshallOlkinPSG. On the library’s website (see https://prdm0.github.io/MarshallOlkinPSG) it is possible to have more information on the functions implemented through the documentation and usage examples.
To install the package hosted and maintained on GitHub, it is necessary to previously install the remotes library. With the prerequisite met, the package MarshallOlkinPSG can be installed as:
# Install the remotes package:
# install.packages("remotes")
remotes::install_github("prdm0/MarshallOlkinPSG", force = TRUE)

The function eq_19() implements Equation (23) and can be compared, for example, with the exact MOTPW density in Equation (11). To facilitate this comparison, the function pdf_theorical() implements the exact density. Running help(eq_19) shows an example comparing the two equations. Equation (23) approximates (11) very well when the sums are truncated at finite limits, as is required in applied problems. In other words, the results from eq_19() approximate those from pdf_theorical() very well. The function eq_19() also accepts any baseline cdf as an argument.
The function eval_plot_moptw() allows Equation (23) to be validated numerically by means of plots. The true parameters for the MOTPW density are: , , , and . In addition, only a few terms in the sums are required to obtain a reasonable level of precision, as shown in the plots in Figure 5, where six or eight terms provide very accurate approximations.
Numerical evaluation of Equation (23), implemented in eq_19(), with finite sums, where N and K denote the upper limits of terms in the related sums with the running indices n and k, respectively.
Properties
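We now provide some mathematical properties of the exp-G random variable, say \(W_a\), with power parameter \(a>0\), cdf \(G(x)^{a}\) and pdf \(a\,g(x)\,G(x)^{a-1}\). They can be easily combined with the linear representations of the previous section to find the corresponding properties of the maximum and minimum.
The \(n\)th ordinary moment of \(W_a\) has the form
\[E(W_a^{n})=a\int_{-\infty}^{\infty}x^{n}\,G(x)^{a-1}\,g(x)\,dx=a\int_{0}^{1}Q_G(u)^{n}\,u^{a-1}\,du, \tag{27}\]
where \(Q_G(u)=G^{-1}(u)\) is the quantile function (qf) of \(G\).
Explicit expressions for several exp-G moments can be determined from (27).
The \(n\)th incomplete moment of \(W_a\) follows from the previous algebra as
\[m_n(y)=a\int_{-\infty}^{y}x^{n}\,G(x)^{a-1}\,g(x)\,dx=a\int_{0}^{G(y)}Q_G(u)^{n}\,u^{a-1}\,du, \tag{28}\]
where the integral can be calculated for the great majority of G distributions. The first incomplete moment is the most important case of (28) to find mean deviations and Lorenz and Bonferroni curves.
The moment generating function (mgf) of \(W_a\) follows as
\[M(t)=a\int_{0}^{1}\exp[t\,Q_G(u)]\,u^{a-1}\,du. \tag{29}\]
The mgfs of exp-G distributions can be determined from Equation (29).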
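To make the quantile-function form of the exp-G moments concrete, the sketch below approximates \(a\int_0^{G(y)}Q_G(u)^{n}u^{a-1}\,du\) by a plain midpoint rule in Python; taking the upper limit equal to one gives the ordinary moment. The function names, the midpoint rule, and the Weibull parameterization \(G(x)=1-\exp[-(x/\lambda)^{\gamma}]\) are assumptions made only for this illustration.

```python
import math


def weibull_qf(u, lam, gam):
    """Quantile function of the assumed Weibull cdf G(x) = 1 - exp(-(x / lam)**gam)."""
    return lam * (-math.log1p(-u)) ** (1.0 / gam)


def expg_moment(n, a, qf, upper=1.0, steps=200_000):
    """Midpoint-rule value of a * integral_0^upper qf(u)**n * u**(a - 1) du.

    upper = 1 gives the ordinary moment; upper = G(y) gives the incomplete moment at y.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += qf(u) ** n * u ** (a - 1.0)
    return a * h * total


def qf(u):
    # Hypothetical baseline: Weibull with lam = 1 and gam = 2.
    return weibull_qf(u, lam=1.0, gam=2.0)


# With a = 1 the exp-G variable reduces to the baseline, whose mean is Gamma(1.5) here.
print(expg_moment(1, 1.0, qf), math.gamma(1.5))
print(expg_moment(1, 3.0, qf))
```

The midpoint rule is chosen only because it avoids evaluating the quantile function at \(u=1\); for \(a<1\) the integrand is singular at zero and a more careful quadrature would be advisable.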
Estimation
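The MLEs are appropriate at least in large samples to determine confidence intervals for the parameters. We consider the random variable \(X\) (the maximum) defined from Equations (3) and (5) for any baseline G with unknown parameter vector \(\boldsymbol{\eta}\). By simple differentiation of (5), the density of \(X\) takes the form
\[f_X(x)=\frac{\theta\,f(x)\,C'(\theta\,F(x))}{C(\theta)}, \tag{30}\]
where \(F(x)\) and \(f(x)\) are the MO-G cdf and pdf in (1) and (3), \(C(\theta)\) follows from (4) and \(C'(\theta)=dC(\theta)/d\theta\).
The log-likelihood function for \((\alpha,\theta,\boldsymbol{\eta})\) from a random sample \(x_1,\ldots,x_n\) of \(X\) is
\[\ell=n\log\theta-n\log C(\theta)+\sum_{i=1}^{n}\log f(x_i)+\sum_{i=1}^{n}\log C'(\theta\,F(x_i)). \tag{31}\]
A similar development can be conducted for the minimum defined from Equation (6) for any baseline G.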
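For the zero-truncated Poisson case, \(C(\theta)=\mathrm{e}^{\theta}-1\) and \(C'(\theta)=\mathrm{e}^{\theta}\), so the log-likelihood of the maximum reduces to \(n\log\theta-n\log(\mathrm{e}^{\theta}-1)+\sum\log f(x_i)+\theta\sum F(x_i)\), with \(F\) and \(f\) the MO-G cdf and pdf. The Python sketch below codes this expression for a generic baseline passed as two functions. The function name, the unit-exponential baseline and the data values are hypothetical and serve only as an illustration; any numerical optimizer could then be applied to the returned value.

```python
import math


def motp_loglik(data, alpha, theta, cdf, pdf):
    """Log-likelihood of the maximum under a zero-truncated Poisson N and an MO-G parent."""
    total = 0.0
    for x in data:
        G, g = cdf(x), pdf(x)
        denom = 1.0 - (1.0 - alpha) * (1.0 - G)
        F = G / denom             # MO-G cdf
        f = alpha * g / denom**2  # MO-G pdf
        total += math.log(theta) + math.log(f) + theta * F
    return total - len(data) * math.log(math.expm1(theta))


def cdf(x):
    # Hypothetical unit-exponential baseline cdf.
    return 1.0 - math.exp(-x)


def pdf(x):
    # Matching baseline pdf.
    return math.exp(-x)


# Made-up observations, for illustration only.
sample = [0.3, 0.8, 1.1, 1.9, 2.4]
print(motp_loglik(sample, alpha=2.0, theta=1.5, cdf=cdf, pdf=pdf))
```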
We can find the MLE by maximizing Equation (31) using the MaxBFGS sub-routine (Ox program), the optim function (R), or PROC NLMIXED (SAS). The AdequacyModel package can also maximize (31): in addition to the quasi-Newton BFGS, Nelder-Mead and simulated-annealing methods, it offers the PSO (particle swarm optimization) approach, which does not require initial values. Details are available in Marinho et al. (2019) and at https://github.com/prdm0/AdequacyModel.
These scripts can be executed for a wide range of initial values and may lead to more than one maximum. However, in these cases, we consider the MLEs corresponding to the largest value of the maximum log-likelihood. There are sufficient conditions for the existence of these estimates such as compactness of the parameter space and the concavity of the log-likelihood function, but they can exist even when the conditions are not satisfied. In general, there is no explicit solution for the estimates from maximizing (31), but we can establish theoretical conditions on their existence and uniqueness for very special models by examining the ranges of the score components.
Regression
Consider that are independent random variables from any distribution in (11) assuming that the parameters and vary through them. We propose a new regression based on the response variable in (11) with the systematic components
respectively, where \({\bf v}_{i}^{T}=(v_{i1},\ldots,v_{ip})\), and . Equations (11) and (32) define the MOTPW regression. For , it follows the truncated Poisson Weibull (TPW) regression.
In a similar manner, we can construct many other regressions based on other MOPS-G distributions defined from Equations (5) and (6).
The log-likelihood function for the vector from the MOTPW regression can be reduced to
We obtain the MOTPW distribution for and .
Let be the MLE of . Equation (33) can also be maximized using the gamlss regression framework (Stasinopoulos & Rigby 2008) in R.
Two simulation studies
We perform two simulation studies. The first one examines the accuracy of the MLEs of the parameters of the MOTPW distribution. The second one does the same for the MOTPW regression.
The MOTPW distribution
First, we evaluate the precision of the estimates in the MOTPW distribution based on 1,000 Monte Carlo simulations using the R software. The simulation procedure is as follows:
- The inverse function comes from (7)
- Generate \(u\sim U(0,1)\) and obtain the values of the MOTPW distribution by evaluating this inverse function at \(u\).
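The two steps above amount to inverse-transform sampling. A minimal Python sketch follows, assuming the cdf of the maximum has the zero-truncated Poisson form \([\exp(\theta F(x))-1]/(\mathrm{e}^{\theta}-1)\) with \(F\) the MO-G cdf, and assuming the Weibull parameterization \(G(x)=1-\exp[-(x/\lambda)^{\gamma}]\). The paper's own simulations use R, and the function names, parameter values and seed here are hypothetical.

```python
import math
import random


def motp_quantile(u, alpha, theta, baseline_qf):
    """Invert the cdf of the maximum at u for a generic baseline quantile function."""
    p = math.log1p(u * math.expm1(theta)) / theta  # value of the MO-G cdf
    g = alpha * p / (1.0 - (1.0 - alpha) * p)      # value of the baseline cdf
    return baseline_qf(g)


def weibull_qf(g, lam, gam):
    """Quantile function of the assumed Weibull cdf 1 - exp(-(x / lam)**gam)."""
    return lam * (-math.log1p(-g)) ** (1.0 / gam)


def simulate_motpw(size, alpha, theta, lam, gam, seed=0):
    """Draw `size` values by evaluating the quantile function at uniform variates."""
    rng = random.Random(seed)
    return [
        motp_quantile(rng.random(), alpha, theta, lambda g: weibull_qf(g, lam, gam))
        for _ in range(size)
    ]


# Illustrative parameter values only.
sample = simulate_motpw(1000, alpha=2.0, theta=1.5, lam=1.0, gam=2.0)
print(min(sample), max(sample))
```

Fixing the seed keeps a simulated data set reproducible, which is convenient when the estimates are compared across sample sizes.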
The true parameters are , , and . The average estimates (AEs), biases, and mean squared errors (MSEs) are listed in Table I. The three measures decrease steadily when becomes large.
The MOTPW regression
We perform some Monte Carlo simulations for some values of to investigate the accuracy of the MLEs in the MOTPW regression under four scenarios: Scenario 1: and ; Scenario 2: and ; Scenario 3: and ; Scenario 4: and . We take values greater than and less than one for and .
The explanatory variables are generated in the regression by taking , , and .
For each scenario and value of , one thousand samples are generated and the MOTPW regression is fitted to each generated data set. The quantities reported in Table II are in good agreement with the asymptotic results for the MLEs.
Residual analysis
We investigate the quantile residuals (qrs) to verify the adequacy of the response distribution and to detect outliers in the MOTPW regression. The same approach can be adopted for many other regressions defined from the distributions in (5) and (6). The qrs are given by (Dunn & Smyth 1996)
\[qr_i=\Phi^{-1}\big(\hat{F}(y_i)\big), \tag{35}\]
where \(\Phi(\cdot)\) is the standard normal cdf, \(\hat{F}(y_i)\) is the fitted MOTPW cdf evaluated at the \(i\)th response \(y_i\), and the observation-specific Weibull parameters entering \(\hat{F}\) are defined in Equation (32).
We consider the same scenarios for the simulations in Section Two simulation studies. For each fitted regression, the qrs are calculated from Equation (35). Figures 6, 7, 8, and 9 display QQ plots which show that the empirical distribution of these residuals is close to the standard normal distribution.
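For a continuous response, a quantile residual is the standard normal quantile of the fitted cdf evaluated at the observed response. The Python sketch below illustrates the computation with the standard library. The function name, the exponential cdfs standing in for the fitted MOTPW cdfs, and the response values are assumptions made only for this example.

```python
import math
from statistics import NormalDist


def quantile_residuals(responses, fitted_cdfs):
    """Return Phi^{-1}(F_i(y_i)) for each response y_i and its fitted cdf F_i.

    NormalDist.inv_cdf raises StatisticsError if a fitted cdf value is 0 or 1.
    """
    inv = NormalDist().inv_cdf
    return [inv(cdf(y)) for y, cdf in zip(responses, fitted_cdfs)]


# Hypothetical fitted cdfs: exponential with observation-specific rates.
rates = [0.5, 1.0, 2.0]
fitted = [lambda y, r=r: 1.0 - math.exp(-r * y) for r in rates]
responses = [1.2, 0.7, 0.1]
print(quantile_residuals(responses, fitted))
```

If the fitted model is adequate, these residuals should behave approximately like a standard normal sample, which is what the QQ plots assess.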
Applications
The beta Weibull (BW) and Kumaraswamy Weibull (KwW) distributions have been widely used to fit real data in the last ten years or so. We compare the MOTPW distribution with the BW and KwW distributions since all of them have four parameters. The BW density pioneered by Lee et al. (2007) is
The KwW density introduced by Cordeiro & Castro (2011) has the form
where all parameters are positive.
Application 1: Hourly dollar wage data
The first application refers to hourly dollar wages for US workers. These data are obtained from the SemiPar package (Wand et al. 2005). Table III lists the estimates, standard errors (SEs) in parentheses, and three classical statistics. The lowest values of these measures reveal that the MOTPW is the best model. Next, the likelihood ratio (LR) statistic for comparing the MOTPW and TPW models is which supports the wider distribution.
Figure 10a shows the histogram and the estimated MOTPW density. Figure 10b provides the empirical cdf and the estimated MOTPW cdf, thus revealing that this distribution is appropriate for these data.
Application 2: Diabetes data
We consider two variables from the data reported by Reaven & Miller (1979): the response is the relative weight, defined as the ratio between the actual weight and the expected weight (given the person’s height), and the explanatory variable indicates the diagnostic group (0 = normal, 1 = chemical diabetes, 2 = overt diabetes). The diagnostic group has three levels, so we have two dummy variables (for and ). The objective is to assess the relationship between the relative weight and the levels of the diagnostic group.
The systematic components for the MOTPW regression are
The measures for the fitted regressions are reported in Table IV. Clearly, the MOTPW is the best regression for these data.
Table V provides the estimates, SEs and \(p\)-values for the best regression.
We note that the covariate is significant and is not. So, there is a real difference between the normal and chemical diabetes groups in relation to relative weight and no difference between the normal and overt diabetes groups. The same findings can be seen in Figure 12.
The LR statistic to compare the MOTPW and TPW regressions is (\(p\)-value = 0.032), which indicates that the first regression is superior to the second for these data in terms of model fitting.
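The \(p\)-value of an LR test of this kind can be computed as sketched below in Python, assuming that the reduced model differs from the full one by a single restriction, so that the statistic is referred to a chi-square distribution with one degree of freedom. In that case the upper tail probability equals \(\mathrm{erfc}(\sqrt{w/2})\). The function name and the log-likelihood values are hypothetical and are not the fitted values of this application.

```python
import math


def lr_test(loglik_full, loglik_reduced):
    """LR statistic w = 2 * (l_full - l_reduced) and its chi-square(1) p-value."""
    w = 2.0 * (loglik_full - loglik_reduced)
    # For one degree of freedom, P(chi2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(max(w, 0.0) / 2.0))
    return w, p_value


# Hypothetical maximized log-likelihoods, for illustration only.
w, p = lr_test(loglik_full=-100.0, loglik_reduced=-102.0)
print(w, p)
```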
The plot of the residuals reported in Figure 11a reveals neither outliers nor departures from the general assumptions. The worm plot (Buuren & Fredriks 2001) of the residuals in Figure 11b and the QQ plot displayed in Figure 11c show the adequacy of the MOTPW regression for the current data.
A graphical comparison from the estimated cdfs in Figure 12 also supports the regression analysis.
Conclusions
We define two flexible Marshall–Olkin–Power-Series (MOPS) families of continuous distributions which can be very useful to fit real data. They are obtained by combining the Marshall–Olkin class (Marshall & Olkin 1997) and the power series distribution. Hundreds of continuous distributions can be easily formulated from the two families. We discuss some special distributions and maximum likelihood estimation. We introduce the Marshall–Olkin Truncated Poisson Weibull regression associated with one of the families. Some mathematical properties of these families are presented. We provide a package implemented in R which can be used to determine numerically some mathematical properties of any distribution in the new families. The utility of the proposed models is demonstrated empirically in two applications.
ACKNOWLEDGMENTS
We gratefully acknowledge support from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil.
- BUUREN S VAN & FREDRIKS M. 2001. Worm plot: a simple diagnostic device for modelling growth reference curves. Stat Med 20(8): 1259–1277.
- CORDEIRO GM & CASTRO M DE. 2011. A new family of generalized distributions. J Stat Comput Simul 81(7): 883–898.
- DUNN PK & SMYTH GK. 1996. Randomized quantile residuals. J Comput Graph Stat 5(3): 236–244.
- LEE C, FAMOYE F & OLUMOLADE O. 2007. Beta-Weibull distribution: some properties and applications to censored data. J Mod Appl Stat Meth 6(1): 173–186.
- MARINHO PRD, SILVA RB, BOURGUIGNON M, CORDEIRO GM & NADARAJAH S. 2019. AdequacyModel: An R package for probability distributions and general purpose optimization. PloS One 14(8): e0221487.
- MARSHALL AW & OLKIN I. 1997. A new method for adding a parameter to a family of distributions with application to the exponential and Weibull families. Biometrika 84(3): 641–652.
- R CORE TEAM. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. URL https://www.R-project.org/
- REAVEN G & MILLER R. 1979. An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16(1): 17–24.
- STASINOPOULOS DM & RIGBY RA. 2008. Generalized additive models for location scale and shape (GAMLSS) in R. J Stat Soft 23: 1–46.
- TAHIR MH & NADARAJAH S. 2015. Parameter induction in continuous univariate distributions: Well-established G families. An Acad Bras Cienc 87: 539–568.
- WAND M, COULL B, FRENCH J, GANGULI B, KAMMANN E, STAUDENMAYER J & ZANOBETTI A. 2005. SemiPar 1.0. R package. URL http://cran.r-project.org
Publication Dates
- Publication in this collection: 18 July 2022
- Date of issue: 2022
History
- Received: 28 Dec 2020
- Accepted: 7 June 2021