
A competitive family to the Beta and Kumaraswamy generators: Properties, Regressions and Applications

Abstract

We define two new flexible families of continuous distributions to fit real data by compounding the Marshall–Olkin class and the power series distribution. These families are very competitive with the popular beta and Kumaraswamy generators. Their densities have linear representations in terms of exponentiated densities. Since the main properties of thirty-five exponentiated distributions are well known, several properties of about three hundred fifty distributions can easily be obtained using the references of this article and five special cases of the power series distribution. We provide a package implemented in R that shows numerically the precision of one of the linear representations. This package is useful for calculating numerical values of some statistical measures of the generated distributions. We estimate the parameters by maximum likelihood. We define a regression based on one of the two families. The usefulness of a generated distribution and the associated regression is shown empirically.

Key words
generating function; Marshall–Olkin family; maximum likelihood; moment; distribution

Introduction

The Marshall–Olkin ("MO") family (Marshall & Olkin 1997) adds one parameter to a parent distribution. Let \(G(z)=G(z;\boldsymbol{\tau})\) be the parent cumulative distribution function (cdf) of a random variable \(Z\) with parameter vector \(\boldsymbol{\tau}=(\tau_1,\ldots,\tau_q)\). The survival function and probability density function (pdf) of \(Z\) are \(\bar{G}(z)=\bar{G}(z;\boldsymbol{\tau})=1-G(z;\boldsymbol{\tau})\) and \(g(z)=g(z;\boldsymbol{\tau})\), respectively.

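The cdf \(H(z)\) and survival function \(\bar{H}(z)=1-H(z)\) of the MO class with baseline \(G(z;\boldsymbol{\tau})\) are

\[ H(z)=H(z;\alpha,\boldsymbol{\tau})=\frac{G(z;\boldsymbol{\tau})}{1-\bar{\alpha}\,\bar{G}(z;\boldsymbol{\tau})},\quad z\in\mathbb{R},\ \alpha>0, \tag{1} \]
and
\[ \bar{H}(z)=\bar{H}(z;\alpha,\boldsymbol{\tau})=\frac{\alpha\,\bar{G}(z;\boldsymbol{\tau})}{1-\bar{\alpha}\,\bar{G}(z;\boldsymbol{\tau})}, \tag{2} \]
respectively, where \(\bar{\alpha}=1-\alpha\).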

Equation (1) can generate many continuous distributions from popular ones. The MO-G density function can be expressed as

\[ h(z)=h(z;\alpha,\boldsymbol{\tau})=\frac{\alpha\,g(z;\boldsymbol{\tau})}{[1-\bar{\alpha}\,\bar{G}(z;\boldsymbol{\tau})]^2}. \tag{3} \]

For \(\alpha=1\), \(h(z)=g(z;\boldsymbol{\tau})\) is the simplest case of (3). Marshall & Olkin 1997 pioneered the MO-Weibull (MOW) distribution, which is a useful extension of the Weibull.
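As a minimal sketch of Equations (1) to (3), the following Python functions apply the MO transform to a baseline cdf and pdf. The function names and the choice of a Weibull baseline are illustrative assumptions, not part of the authors' R package.

```python
import math

def weibull_cdf(z, lam, beta):
    """Baseline Weibull cdf G(z) = 1 - exp[-(lam*z)^beta] for z >= 0."""
    return 1.0 - math.exp(-((lam * z) ** beta))

def weibull_pdf(z, lam, beta):
    """Baseline Weibull pdf g(z) for z > 0."""
    u = (lam * z) ** beta
    return beta * lam ** beta * z ** (beta - 1.0) * math.exp(-u)

def mo_cdf(G, alpha):
    """Equation (1): H = G / [1 - (1 - alpha) * (1 - G)]."""
    return G / (1.0 - (1.0 - alpha) * (1.0 - G))

def mo_sf(G, alpha):
    """Equation (2): survival function of the MO class."""
    Gbar = 1.0 - G
    return alpha * Gbar / (1.0 - (1.0 - alpha) * Gbar)

def mo_pdf(G, g, alpha):
    """Equation (3): h = alpha * g / [1 - (1 - alpha) * (1 - G)]^2."""
    return alpha * g / (1.0 - (1.0 - alpha) * (1.0 - G)) ** 2
```

Setting alpha = 1 returns the baseline unchanged, and mo_cdf plus mo_sf always equals one, which gives two quick sanity checks on any implementation.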

Consider \(N\) random variables \(Z_1,\ldots,Z_N\), independent and identically distributed (i.i.d.) with cdf \(H(z)\) and pdf \(h(z)\) given by (1) and (3), respectively. Here, \(N\) is a discrete random variable with support \(\{1,2,\ldots\}\). Henceforth, let \(X=\max\{Z_1,\ldots,Z_N\}\) and \(Y=\min\{Z_1,\ldots,Z_N\}\) be two random variables, assuming that \(N\) has the zero-truncated power series (PS) distribution with probability mass function (pmf)

\[ p_n=P(N=n;\theta)=\frac{a_n\,\theta^n}{C(\theta)},\quad n=1,2,\ldots, \tag{4} \]
where \(a_n>0\) (for \(n\ge 1\)), \(\theta\) is called the power parameter and \(C(\theta)=\sum_{n=1}^{\infty}a_n\,\theta^n>0\). The probability generating function (pgf) of \(N\) is \(P(z)=E(z^N)=C(z\theta)/C(\theta)\).

Five important distributions are special cases of (4): the zero-truncated Poisson, logarithmic, negative binomial, geometric and zero-truncated binomial distributions.
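To make the five special cases concrete, the sketch below lists the standard coefficients \(a_n\) and normalizing functions \(C(\theta)\) of each zero-truncated law and evaluates the pmf in (4). The function names and the size parameter m (used for the negative binomial and binomial cases) are hypothetical; for the binomial case \(a_n\) vanishes for n > m.

```python
import math

def ps_family(name, m=3):
    """Return (a, C) for a zero-truncated power series law.

    m is a hypothetical size parameter used only by the negative
    binomial and binomial cases.
    """
    if name == "poisson":
        return (lambda n: 1.0 / math.factorial(n)), (lambda t: math.exp(t) - 1.0)
    if name == "logarithmic":          # requires 0 < theta < 1
        return (lambda n: 1.0 / n), (lambda t: -math.log(1.0 - t))
    if name == "geometric":            # requires 0 < theta < 1
        return (lambda n: 1.0), (lambda t: t / (1.0 - t))
    if name == "negative_binomial":    # requires 0 < theta < 1
        return (lambda n: math.comb(m + n - 1, n)), (lambda t: (1.0 - t) ** (-m) - 1.0)
    if name == "binomial":             # a_n = 0 for n > m
        return (lambda n: math.comb(m, n)), (lambda t: (1.0 + t) ** m - 1.0)
    raise ValueError(f"unknown family: {name}")

def ps_pmf(n, theta, name, m=3):
    """Equation (4): p_n = a_n * theta^n / C(theta), n = 1, 2, ..."""
    a, C = ps_family(name, m)
    return a(n) * theta ** n / C(theta)
```

Summing ps_pmf over n = 1, 2, ... for any of the five names should return a value very close to one, which confirms that each pair (a_n, C) is consistent.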

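The cdf of \(X=\max\{Z_1,\ldots,Z_N\}\) conditional on \(N=n\) is

\[ F_X(x\mid N=n)=P[X\le x\mid N=n]=H(x;\alpha,\boldsymbol{\tau})^n, \]
and then the unconditional cdf of \(X\) follows from (4) as
\[ F_X(x)=\sum_{n=1}^{\infty}H(x;\alpha,\boldsymbol{\tau})^n\,\frac{a_n\,\theta^n}{C(\theta)}=\frac{C(\theta\,H(x;\alpha,\boldsymbol{\tau}))}{C(\theta)}. \tag{5} \]

The conditional cdf of \(Y=\min\{Z_1,\ldots,Z_N\}\) given \(N=n\) is

\[ F_Y(y\mid N=n)=P[Y\le y\mid N=n]=1-\bar{H}(y;\alpha,\boldsymbol{\tau})^n, \]
and then the unconditional cdf of \(Y\) follows from (4) as
\[ F_Y(y)=1-\sum_{n=1}^{\infty}\bar{H}(y;\alpha,\boldsymbol{\tau})^n\,\frac{a_n\,\theta^n}{C(\theta)}=1-\frac{C(\theta\,\bar{H}(y;\alpha,\boldsymbol{\tau}))}{C(\theta)}. \tag{6} \]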

Equations (5) and (6) define two Marshall–Olkin Power Series-G (MOPS-G) families under baseline G. They provide a strong motivation for explaining the failure time of any mechanism formed by an unknown number N of identical and independent (parallel or serial) components. The densities of X and Y are obtained by differentiating (5) and (6). We emphasize that these equations can generate many MOPS models. For each baseline G, we can generate ten (2×5) associated models from the five discrete distributions in Equation (4). For α=1, we have the Power Series-G (PS-G) classes under baseline G.
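The closed forms in (5) and (6) can be checked against the defining series. The sketch below is written under the assumption that the caller supplies the value H of the MO cdf at the point of interest; all function names are illustrative.

```python
import math

def cdf_max(H, theta, C):
    """Equation (5): F_X = C(theta * H) / C(theta)."""
    return C(theta * H) / C(theta)

def cdf_min(H, theta, C):
    """Equation (6): F_Y = 1 - C(theta * (1 - H)) / C(theta)."""
    return 1.0 - C(theta * (1.0 - H)) / C(theta)

def cdf_max_series(H, theta, a, C, n_terms=100):
    """Truncated version of the series in (5), for comparison only."""
    return sum(H ** n * a(n) * theta ** n for n in range(1, n_terms + 1)) / C(theta)

# Two of the five special cases: zero-truncated Poisson and geometric.
C_poisson = lambda t: math.exp(t) - 1.0
C_geometric = lambda t: t / (1.0 - t)
a_poisson = lambda n: 1.0 / math.factorial(n)
a_geometric = lambda n: 1.0
```

Because the minimum never exceeds the maximum, cdf_min(H, ...) should be at least cdf_max(H, ...) for every H in [0, 1], which is a simple check on the two formulas.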

The minimum (Y) and maximum (X) statistics arise in several series and parallel systems with identical components and have many industrial and biological applications. In parallel systems, the random variable Y models the time of the first component to fail, while X models the time until the whole system breaks down. A dual interpretation can be given for systems with serial components. These random variables are also very useful in oncology. For example, suppose we are studying the recurrence of a certain type of cancerous tumor in an individual after some kind of treatment. Then the time for the first cell to activate and produce cancer cells can be modeled by the generated distribution of Y, while the time to disease manifestation (if it occurs only after an unknown number of factors have been active) can be modeled by the generated distribution of X.

Four new distributions based on the MOPS construction are introduced for illustrative purposes in Section Four special models. We derive linear representations for the densities of X and Y in Section Expansions. A package in R is presented in Section Numerical evaluation to calculate numerically several mathematical properties of the generated distributions based on the linear representations. General structural properties of the two families are addressed in Section Properties. In Section Estimation, we estimate the parameters for one of the families. In Section Regression, we introduce the Marshall–Olkin Truncated Poisson Weibull regression defined from one of the families. In Section Two simulation studies, simulations examine the accuracy of the maximum likelihood estimates (MLEs) and the quantile residuals (qrs). Two applications demonstrate the utility of our findings in Section Applications. Finally, we offer concluding remarks in Section Conclusions.

Four special models

First, consider the zero-truncated Poisson in (4). The cdfs of the Marshall–Olkin Zero-Truncated Poisson-G (MOTP-G) distributions are determined from Equations (5) and (6) as

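\[ F_X(x)=(\mathrm{e}^{\theta}-1)^{-1}\left[\exp\{\theta\,H(x;\alpha,\boldsymbol{\tau})\}-1\right] \tag{7} \]
and
\[ F_Y(y)=1-(\mathrm{e}^{\theta}-1)^{-1}\left[\exp\{\theta\,\bar{H}(y;\alpha,\boldsymbol{\tau})\}-1\right]. \tag{8} \]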

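The Weibull cdf with scale parameter \(\lambda>0\) and shape parameter \(\beta>0\) is (for \(z\ge 0\))

\[ G(z;\lambda,\beta)=1-\exp[-(\lambda z)^{\beta}]. \]

Then, the cdf and survival function of the MO-Weibull (MOW) distribution are

\[ H(z)=H(z;\alpha,\lambda,\beta)=\frac{1-\exp[-(\lambda z)^{\beta}]}{1-\bar{\alpha}\exp[-(\lambda z)^{\beta}]} \tag{9} \]
and
\[ \bar{H}(z)=\bar{H}(z;\alpha,\lambda,\beta)=\frac{\alpha\exp[-(\lambda z)^{\beta}]}{1-\bar{\alpha}\exp[-(\lambda z)^{\beta}]}, \tag{10} \]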
respectively.

By inserting the last two formulae in Equations (7) and (8) and differentiating the resulting expressions, we obtain the MOTP-Weibull (MOTPW) densities

\[ f_X(x)=\frac{\alpha\,\theta\,\beta\,\lambda^{\beta}\,x^{\beta-1}\,\mathrm{e}^{-u}}{(\mathrm{e}^{\theta}-1)\,(1-\bar{\alpha}\,\mathrm{e}^{-u})^2}\,\exp\left[\frac{\theta\,(1-\mathrm{e}^{-u})}{1-\bar{\alpha}\,\mathrm{e}^{-u}}\right] \tag{11} \]
and
\[ f_Y(y)=\frac{\alpha\,\theta\,\beta\,\lambda^{\beta}\,y^{\beta-1}\,\mathrm{e}^{-u}}{(\mathrm{e}^{\theta}-1)\,(1-\bar{\alpha}\,\mathrm{e}^{-u})^2}\,\exp\left[\frac{\alpha\,\theta\,\mathrm{e}^{-u}}{1-\bar{\alpha}\,\mathrm{e}^{-u}}\right], \tag{12} \]
respectively, where \(u=u(x)=(\lambda x)^{\beta}\) in \(f_X(x)\) and \(u=u(y)=(\lambda y)^{\beta}\) in \(f_Y(y)\).
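A short numerical check of (11) and (12) is sketched below: the densities are coded directly and can be compared with finite differences of the cdfs (7) and (8) under the MOW baseline (9) and (10). The function names are illustrative assumptions and do not come from the MarshallOlkinPSG package.

```python
import math

def _parts(x, alpha, lam, beta):
    """Return e = exp(-u) and d = 1 - (1 - alpha) * e with u = (lam*x)^beta."""
    e = math.exp(-((lam * x) ** beta))
    return e, 1.0 - (1.0 - alpha) * e

def motpw_cdf_max(x, alpha, theta, lam, beta):
    """Equation (7) with the MOW cdf (9)."""
    e, d = _parts(x, alpha, lam, beta)
    return math.expm1(theta * (1.0 - e) / d) / math.expm1(theta)

def motpw_cdf_min(y, alpha, theta, lam, beta):
    """Equation (8) with the MOW survival function (10)."""
    e, d = _parts(y, alpha, lam, beta)
    return 1.0 - math.expm1(theta * alpha * e / d) / math.expm1(theta)

def _front(x, alpha, theta, lam, beta, e, d):
    """Common factor of (11) and (12)."""
    g = beta * lam ** beta * x ** (beta - 1.0) * e
    return alpha * theta * g / (math.expm1(theta) * d ** 2)

def motpw_pdf_max(x, alpha, theta, lam, beta):
    """Equation (11)."""
    e, d = _parts(x, alpha, lam, beta)
    return _front(x, alpha, theta, lam, beta, e, d) * math.exp(theta * (1.0 - e) / d)

def motpw_pdf_min(y, alpha, theta, lam, beta):
    """Equation (12)."""
    e, d = _parts(y, alpha, lam, beta)
    return _front(y, alpha, theta, lam, beta, e, d) * math.exp(alpha * theta * e / d)
```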

Second, consider the geometric distribution in (4). The cdfs of the Marshall–Olkin Geometric-G (MOG-G) classes follow from Equations (5) and (6)

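\[ F_X(x)=\frac{(1-\theta)}{\theta}\left[\frac{\theta\,H(x;\alpha,\boldsymbol{\tau})}{1-\theta\,H(x;\alpha,\boldsymbol{\tau})}\right] \tag{13} \]
and
\[ F_Y(y)=1-\frac{(1-\theta)}{\theta}\left[\frac{\theta\,\bar{H}(y;\alpha,\boldsymbol{\tau})}{1-\theta\,\bar{H}(y;\alpha,\boldsymbol{\tau})}\right]. \tag{14} \]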

The Burr XII (BXII) cdf is (for \(z>0\))

\[ G(z;\beta,\lambda)=1-(1+z^{\beta})^{-\lambda}, \tag{15} \]
where \(\beta>0\) and \(\lambda>0\) are shape parameters. For \(\lambda=1\) and \(\beta=1\) in Equation (15), we have the log-logistic (LL) and Lomax distributions, respectively.

Hence, the cdf and survival function of the Marshall–Olkin Burr XII (MOBXII) distribution are

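\[ H(z)=H(z;\alpha,\lambda,\beta)=\frac{1-(1+z^{\beta})^{-\lambda}}{1-\bar{\alpha}\,(1+z^{\beta})^{-\lambda}} \tag{16} \]
and
\[ \bar{H}(z)=\bar{H}(z;\alpha,\lambda,\beta)=\frac{\alpha\,(1+z^{\beta})^{-\lambda}}{1-\bar{\alpha}\,(1+z^{\beta})^{-\lambda}}, \tag{17} \]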
respectively.

By inserting the last two formulae in Equations (13) and (14) and differentiating the resulting expressions with respect to x and y, respectively, we obtain the MOG-Burr XII (MOGBXII) densities

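\[ f_X(x)=\frac{\alpha\,\beta\,\lambda\,(1-\theta)\,x^{\beta-1}\,(1+x^{\beta})^{-\lambda-1}}{\left[1-\theta-(1-\alpha-\theta)\,(1+x^{\beta})^{-\lambda}\right]^2} \tag{18} \]
and
\[ f_Y(y)=\frac{\alpha\,\beta\,\lambda\,(1-\theta)\,y^{\beta-1}\,(1+y^{\beta})^{-\lambda-1}}{\left\{1-[1-(1-\theta)\,\alpha]\,(1+y^{\beta})^{-\lambda}\right\}^2}. \tag{19} \]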
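The MOGBXII densities (18) and (19) can be verified in the same way, by comparing them with numerical derivatives of (13) and (14) under the MOBXII baseline (16) and (17). This is a sketch with illustrative function names, written for \(0<\theta<1\).

```python
def _bxii_sf(x, beta, lam):
    """Baseline Burr XII survival function (1 + x^beta)^(-lam)."""
    return (1.0 + x ** beta) ** (-lam)

def mogbxii_cdf_max(x, alpha, theta, beta, lam):
    """Equation (13) with H from Equation (16)."""
    S = _bxii_sf(x, beta, lam)
    H = (1.0 - S) / (1.0 - (1.0 - alpha) * S)
    return (1.0 - theta) * H / (1.0 - theta * H)

def mogbxii_cdf_min(y, alpha, theta, beta, lam):
    """Equation (14) with the survival function from Equation (17)."""
    S = _bxii_sf(y, beta, lam)
    Hbar = alpha * S / (1.0 - (1.0 - alpha) * S)
    return 1.0 - (1.0 - theta) * Hbar / (1.0 - theta * Hbar)

def _numerator(x, alpha, theta, beta, lam):
    """Common numerator of (18) and (19)."""
    return (alpha * beta * lam * (1.0 - theta)
            * x ** (beta - 1.0) * (1.0 + x ** beta) ** (-lam - 1.0))

def mogbxii_pdf_max(x, alpha, theta, beta, lam):
    """Equation (18)."""
    S = _bxii_sf(x, beta, lam)
    return _numerator(x, alpha, theta, beta, lam) / (1.0 - theta - (1.0 - alpha - theta) * S) ** 2

def mogbxii_pdf_min(y, alpha, theta, beta, lam):
    """Equation (19)."""
    S = _bxii_sf(y, beta, lam)
    return _numerator(y, alpha, theta, beta, lam) / (1.0 - (1.0 - (1.0 - theta) * alpha) * S) ** 2
```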

Plots of the densities and cumulative functions of the MOTPW and MOGBXII distributions of the maximum X, given in (11) and (18), are displayed in Figures 1 and 2, respectively. The variety of density shapes indicates more flexibility than the parent distributions.

Figure 1
Plots of the density and cumulative functions of the MOTPW distribution under four scenarios. (a) α=30, λ=2, β=1.5, and varying θ. (b) α=30, λ=2, β=1.5, and varying θ. (c) θ=0.09, λ=2, β=1.5, and varying α. (d) θ=0.09, λ=2, β=1.5, and varying α.
Figure 2
Plots of the density and cumulative functions of the MOGBXII distribution under four scenarios. (a) α=10, λ=2, β=1.5, and varying θ. (b) α=10, λ=2, β=1.5, and varying θ. (c) θ=0.9, λ=2, β=1.5, and varying α. (d) θ=0.9, λ=2, β=1.5, and varying α.

Figure 3 shows increasing, decreasing, and unimodal shapes for the hazard rate function (hrf) of the MOTPW distribution. It also shows a slightly different hrf with an increasing-decreasing-increasing shape.

Figure 3
Plots of the hrf of the MOTPW model.

Graphics comparing the histograms from two simulated data sets and the MOTPW and MOGBXII densities of X under specified parameters are reported in Figure 4. They show good agreement between the simulated values and these densities.

Figure 4
Plots of the MOTPW (a) and MOGBXII (b) densities and histograms of simulated data.

Expansions

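We obtain useful linear representations for the density functions of \(X\) and \(Y\) in two separate cases, \(\alpha\in(0,1)\) and \(\alpha>1\). For \(\alpha=1\), we have \(H(z;1,\boldsymbol{\tau})=G(z;\boldsymbol{\tau})\).

By inserting (1) in Equation (5) and letting \(G(x)=G(x;\boldsymbol{\tau})\), we can write

\[ F_X(x)=\sum_{n=1}^{\infty}p_n\,\frac{G(x)^n}{[1-\bar{\alpha}\,\bar{G}(x)]^n}. \tag{20} \]

First, we consider the density of the maximum \(X\) when \(\alpha\in(0,1)\). For \(|z|<1\) and \(n=1,2,\ldots\), the negative binomial expansion holds

\[ (1-z)^{-n}=\sum_{k=0}^{\infty}\binom{-n}{k}(-z)^k. \tag{21} \]

Expanding \([1-\bar{\alpha}\,\bar{G}(x)]^{-n}\) as in Equation (21), which is valid since \(\alpha\in(0,1)\), we have

\[ F_X(x)=\sum_{n=1}^{\infty}\sum_{k=0}^{\infty}\binom{-n}{k}(-\bar{\alpha})^k\,p_n\,G(x)^n\,[1-G(x)]^k. \]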

Henceforth, let \(T_s\sim\text{exp-G}(s)\) be the exponentiated-G (exp-G) random variable with power parameter \(s>0\). Its cdf and pdf are \(\Pi_s(x)=\Pi_s(x;\boldsymbol{\tau})=G(x;\boldsymbol{\tau})^s\) and \(\pi_s(x)=\pi_s(x;\boldsymbol{\tau})=s\,G(x;\boldsymbol{\tau})^{s-1}g(x;\boldsymbol{\tau})\), respectively. Many exp-G properties have been studied exhaustively by several authors (Tahir & Nadarajah 2015). We can write

\[ F_X(x)=\sum_{n=1}^{\infty}w_{n,0}\,\Pi_n(x)+\sum_{n=1}^{\infty}\sum_{k=1}^{\infty}w_{n,k}\,\Pi_n(x)\,[1-G(x)]^k, \]
where \(w_{n,k}=w_{n,k}(\alpha,\theta)=\binom{-n}{k}(-\bar{\alpha})^k\,p_n\) for \(n=1,2,\ldots\) and \(k=0,1,\ldots\) Further, using the binomial theorem, we obtain
\[ F_X(x)=\sum_{n=1}^{\infty}w_{n,0}\,\Pi_n(x)+\sum_{n=1}^{\infty}\sum_{k=1}^{\infty}\sum_{i=0}^{k}w_{n,k,i}\,\Pi_{n+i}(x), \]
where \(w_{n,k,i}=(-1)^i\binom{k}{i}w_{n,k}\) for \(i=0,1,\ldots,k\).

By differentiating the last equation, we obtain

\[ f_X(x)=\sum_{n=1}^{\infty}w_{n,0}\,\pi_n(x)+\sum_{n=1}^{\infty}\sum_{k=1}^{\infty}\sum_{i=0}^{k}w_{n,k,i}\,\pi_{n+i}(x). \tag{22} \]

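We now move to the density of the maximum \(X\) when \(\alpha>1\). We rewrite the denominator in (20) to obtain

\[ F_X(x)=\sum_{n=1}^{\infty}\frac{p_n\,G(x)^n}{\alpha^n\,[1-(1-\alpha^{-1})\,G(x)]^n} \]
and then apply Equation (21) to find
\[ F_X(x)=\sum_{n=1}^{\infty}\sum_{k=0}^{\infty}v_{n,k}\,\Pi_{n+k}(x), \]
where \(v_{n,k}=v_{n,k}(\alpha,\theta)=(-1)^k\binom{-n}{k}\alpha^{-n}(1-\alpha^{-1})^k\,p_n\) (for \(n=1,2,\ldots\) and \(k=0,1,\ldots\)). By differentiating \(F_X(x)\), the density of \(X\) follows as
\[ f_X(x)=\sum_{n=1}^{\infty}\sum_{k=0}^{\infty}v_{n,k}\,\pi_{n+k}(x). \tag{23} \]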
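The following sketch truncates the double sum in (23) and compares it with the exact MOTPW density (11) for \(\alpha>1\). It uses the identity \((-1)^k\binom{-n}{k}=\binom{n+k-1}{k}\), so all weights are positive. The function names and the truncation limits N and K are illustrative assumptions; this is not the code of the R package.

```python
import math

def motpw_pdf_exact(x, alpha, theta, lam, beta):
    """Exact MOTPW density of the maximum, Equation (11)."""
    e = math.exp(-((lam * x) ** beta))
    d = 1.0 - (1.0 - alpha) * e
    g = beta * lam ** beta * x ** (beta - 1.0) * e
    return alpha * theta * g / (math.expm1(theta) * d ** 2) * math.exp(theta * (1.0 - e) / d)

def motpw_pdf_series(x, alpha, theta, lam, beta, N=10, K=10):
    """Truncated linear representation (23); valid for alpha > 1."""
    u = (lam * x) ** beta
    G = -math.expm1(-u)                                  # Weibull cdf
    g = beta * lam ** beta * x ** (beta - 1.0) * math.exp(-u)
    r = 1.0 - 1.0 / alpha
    total = 0.0
    for n in range(1, N + 1):
        # zero-truncated Poisson pmf p_n from Equation (4)
        p_n = theta ** n / (math.factorial(n) * math.expm1(theta))
        for k in range(K + 1):
            # v_{n,k}, using (-1)^k * binom(-n, k) = binom(n + k - 1, k)
            v = math.comb(n + k - 1, k) * alpha ** (-n) * r ** k * p_n
            s = n + k
            total += v * s * G ** (s - 1) * g            # v_{n,k} * pi_{n+k}(x)
    return total
```

With the parameter values α=1.20, θ=1.50, β=1.33 and λ=2, a handful of terms in each sum already brings the truncated series close to the exact density, and increasing N and K reduces the gap further.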

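Next, we consider the density of the minimum \(Y\). By inserting (2) in Equation (6), we have

\[ F_Y(y)=1-\sum_{n=1}^{\infty}\frac{\alpha^n\,p_n\,\bar{G}(y)^n}{[1-\bar{\alpha}\,\bar{G}(y)]^n}. \tag{24} \]

For \(\alpha\in(0,1)\), we apply expansion (21) in the last equation to obtain

\[ F_Y(y)=1-\sum_{n=1}^{\infty}\sum_{k=0}^{\infty}q_{n,k}\,\bar{G}(y)^{n+k}, \]
where \(q_{n,k}=q_{n,k}(\alpha,\theta)=(-1)^k\binom{-n}{k}\bar{\alpha}^k\,\alpha^n\,p_n\) for \(n=1,2,\ldots\) and \(k=0,1,\ldots\)

By using the binomial theorem in \(\bar{G}(y)^{n+k}\), we have

\[ F_Y(y)=1+\sum_{n=1}^{\infty}\sum_{k=0}^{\infty}\sum_{i=0}^{n+k}q_{n,k,i}\,\Pi_i(y), \]
where \(q_{n,k,i}=(-1)^{i+1}\binom{n+k}{i}q_{n,k}\) for \(i=0,1,\ldots,n+k\).

By differentiating \(F_Y(y)\), the density of \(Y\) can be expressed as

\[ f_Y(y)=\sum_{n=1}^{\infty}\sum_{k=0}^{\infty}\sum_{i=1}^{n+k}q_{n,k,i}\,\pi_i(y). \tag{25} \]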
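A similar truncation check can be sketched for (25) in the case \(\alpha\in(0,1)\), comparing the triple sum with the exact MOTPW density of the minimum in (12). The function names and truncation limits are assumptions made for illustration; the inner alternating sum is evaluated literally, which is adequate for moderate values of n + k.

```python
import math

def motpw_pdf_min_exact(y, alpha, theta, lam, beta):
    """Exact MOTPW density of the minimum, Equation (12)."""
    e = math.exp(-((lam * y) ** beta))
    d = 1.0 - (1.0 - alpha) * e
    g = beta * lam ** beta * y ** (beta - 1.0) * e
    return alpha * theta * g / (math.expm1(theta) * d ** 2) * math.exp(alpha * theta * e / d)

def motpw_pdf_min_series(y, alpha, theta, lam, beta, N=12, K=20):
    """Truncated linear representation (25); valid for 0 < alpha < 1."""
    u = (lam * y) ** beta
    G = -math.expm1(-u)
    g = beta * lam ** beta * y ** (beta - 1.0) * math.exp(-u)
    abar = 1.0 - alpha
    total = 0.0
    for n in range(1, N + 1):
        p_n = theta ** n / (math.factorial(n) * math.expm1(theta))
        for k in range(K + 1):
            # q_{n,k}, using (-1)^k * binom(-n, k) = binom(n + k - 1, k)
            q = math.comb(n + k - 1, k) * abar ** k * alpha ** n * p_n
            inner = 0.0
            for i in range(1, n + k + 1):
                # q_{n,k,i} / q_{n,k} = (-1)^(i+1) * binom(n+k, i); pi_i = i G^(i-1) g
                inner += (-1) ** (i + 1) * math.comb(n + k, i) * i * G ** (i - 1) * g
            total += q * inner
    return total
```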

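We now obtain the density of \(Y\) when \(\alpha>1\). By rewriting the denominator in Equation (24), we have

\[ F_Y(y)=1-\sum_{n=1}^{\infty}\frac{p_n\,\bar{G}(y)^n}{[1-(1-\alpha^{-1})\,G(y)]^n}. \]

Applying expansion (21) in the last equation gives

\[ F_Y(y)=1-\sum_{n=1}^{\infty}\sum_{k=0}^{\infty}t_{n,k}\,\bar{G}(y)^n\,G(y)^k, \]
where \(t_{n,k}=t_{n,k}(\alpha,\theta)=(-1)^k(1-\alpha^{-1})^k\binom{-n}{k}p_n\) (for \(n=1,2,\ldots\) and \(k=0,1,\ldots\)).

Using the binomial theorem, we can rewrite \(F_Y(y)\) as

\[ F_Y(y)=1+\sum_{n=1}^{\infty}\sum_{k=0}^{\infty}\sum_{i=0}^{n}t_{n,k,i}\,\Pi_{i+k}(y), \]
where \(t_{n,k,i}=(-1)^{i+1}\binom{n}{i}t_{n,k}\) for \(i=0,1,\ldots,n\). By simple differentiation
\[ f_Y(y)=\sum_{n=1}^{\infty}\sum_{k=0}^{\infty}\sum_{i=0}^{n}t_{n,k,i}\,\pi_{i+k}(y), \tag{26} \]
where \(\pi_{i+k}(y)\) is the exp-G density with power parameter \(i+k\), and the term with \(i=k=0\) vanishes because \(\Pi_0(y)=1\) is constant.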

Equations (22), (23), (25) and (26) are the main results of this section. These linear representations are useful for deriving structural properties of the maximum X and minimum Y from well-known exp-G properties. More than thirty-five exp-G models have been studied so far, so at least three hundred fifty (70×5) MOPS-G models can be constructed with properties determined from those exp-G properties. In statistical software, truncating the sums at ten terms is enough to obtain precise results.

Numerical evaluation

In order to evaluate the analytical results presented in the previous sections, a package was implemented in the R programming language (R Core Team 2022). The MarshallOlkinPSG package was constructed in a generic way, that is, its most important functions accept any baseline G distribution and even allow the user to specify the zero-truncated PS distribution.

The library code can be obtained from GitHub at https://github.com/prdm0/MarshallOlkinPSG. On the library’s website (see https://prdm0.github.io/MarshallOlkinPSG) it is possible to have more information on the functions implemented through the documentation and usage examples.

To install the package hosted and maintained on GitHub, the remotes library must be installed first. With this prerequisite met, the MarshallOlkinPSG package can be installed as:

    # Install the remotes package:
    # install.packages("remotes")
    remotes::install_github("prdm0/MarshallOlkinPSG", force = TRUE)

The function eq_19() implements Equation (23), which can be compared, for example, with the exact MOTPW density in Equation (11). To facilitate the comparison, the function pdf_theorical() implements this density. Running help(eq_19) gives an example comparing the two equations. Equation (23) approximates (11) very well when the sums are truncated, as they must be in applied problems. In other words, the results of eq_19() approximate those of pdf_theorical() very well. The function eq_19() also accepts any baseline cdf G(x) as an argument.

The function eval_plot_moptw() allows Equation (23) to be validated numerically by means of plots. The true parameters for the MOTPW density are α=1.20, θ=1.50, β=1.33, and λ=2. Only a few terms in the sums are required to obtain a reasonable level of precision, as shown in the plots in Figure 5, where six or eight terms provide very accurate approximations.

Figure 5
Numerical evaluation of Equation (23) with finite sums, where N and K denote the upper limits of the sums over the running indices n and k, respectively.

Properties

We now provide some mathematical properties of \(T_s\) that can easily be used in the linear representations of the previous section to find the corresponding properties of \(X\) and \(Y\).

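The nth ordinary moment of \(T_s\) has the form

\[ \mu_n=E(T_s^n)=s\int_{-\infty}^{\infty}t^n\,G(t;\boldsymbol{\tau})^{s-1}\,g(t;\boldsymbol{\tau})\,dt=s\int_{0}^{1}Q_G(u)^n\,u^{s-1}\,du, \tag{27} \]
where \(Q_G(u)=Q_G(u;\boldsymbol{\tau})=G^{-1}(u;\boldsymbol{\tau})\) is the quantile function (qf) of \(G\).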

Explicit expressions for several exp-G moments can be determined from (27).
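As an illustration of (27), the sketch below approximates the quantile-based integral with a simple midpoint rule for a Weibull baseline. The function names, the grid size, and the choice of integration rule are assumptions made for this example; any standard quadrature routine could be used instead.

```python
import math

def weibull_quantile(u, lam, beta):
    """Quantile function of the Weibull baseline: Q_G(u) = [-log(1-u)]^(1/beta) / lam."""
    return (-math.log1p(-u)) ** (1.0 / beta) / lam

def expg_moment(n, s, quantile, grid=200_000):
    """Midpoint-rule approximation to s * int_0^1 Q_G(u)^n u^(s-1) du, Equation (27)."""
    h = 1.0 / grid
    total = 0.0
    for j in range(grid):
        u = (j + 0.5) * h            # midpoint of the j-th cell, never 0 or 1
        total += quantile(u) ** n * u ** (s - 1.0)
    return s * h * total
```

Two known cases give a check: for s = 1 the result is the ordinary Weibull mean \(\Gamma(1+1/\beta)/\lambda\), and for a unit exponential baseline (β = 1, λ = 1) with s = 2 the mean of the maximum of two exponentials is 3/2.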

The nth incomplete moment of \(T_s\) follows from the same algebra:

\[ m_n(y)=\int_{-\infty}^{y}t^n\,\pi_s(t)\,dt=s\int_{0}^{G(y;\boldsymbol{\tau})}Q_G(u)^n\,u^{s-1}\,du, \tag{28} \]
where the integral can be calculated for the great majority of G distributions. The first incomplete moment \(m_1(y)\) is the most important case of (28), since it is used to find mean deviations and Lorenz and Bonferroni curves.

The moment generating function (mgf) of Ts follows as

\[ M(w)=E(\mathrm{e}^{w T_s})=s\int_{-\infty}^{\infty}\mathrm{e}^{w t}\,G(t;\boldsymbol{\tau})^{s-1}\,g(t;\boldsymbol{\tau})\,dt=s\int_{0}^{1}\exp\left[w\,Q_G(u)\right]u^{s-1}\,du. \tag{29} \]

The mgfs of exp-G distributions can be determined from Equation (29).

Estimation

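The MLEs are appropriate, at least in large samples, for determining confidence intervals for the parameters. We consider the random variable \(X\) defined from Equations (3) and (5) for any baseline \(G\) with unknown parameter vector \(\psi=(\alpha,\theta,\boldsymbol{\tau}^{T})^{T}\). By simple differentiation of (5), the density of \(X\) takes the form

\[ f_X(x;\alpha,\theta,\boldsymbol{\tau})=\frac{\alpha\,\theta\,g(x;\boldsymbol{\tau})\,C'(\theta\,H(x;\alpha,\boldsymbol{\tau}))}{C(\theta)\,[1-\bar{\alpha}\,\bar{G}(x;\boldsymbol{\tau})]^2}, \tag{30} \]
where \(C'(\cdot)\) is the derivative of \(C(\cdot)\) in (4) and \(H(x;\alpha,\boldsymbol{\tau})=G(x;\boldsymbol{\tau})/[1-\bar{\alpha}\,\bar{G}(x;\boldsymbol{\tau})]\).

The log-likelihood function for \(\psi\) from a random sample \(x_1,\ldots,x_n\) of \(X\) is

\[ \ell=\ell(\psi)=n\log\left[\frac{\alpha\,\theta}{C(\theta)}\right]+\sum_{i=1}^{n}\log[g(x_i;\boldsymbol{\tau})]+\sum_{i=1}^{n}\log[C'(\theta\,H(x_i;\alpha,\boldsymbol{\tau}))]-2\sum_{i=1}^{n}\log[1-\bar{\alpha}\,\bar{G}(x_i;\boldsymbol{\tau})]. \tag{31} \]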

A similar development can be conducted for the random variable Y defined from Equation (6) for any baseline G.
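A generic version of the log-likelihood (31) for the maximum X can be sketched as below. The caller passes the baseline cdf and pdf together with C and its derivative; these argument names are assumptions for illustration, and the resulting function could be handed to any numerical optimizer.

```python
import math

def mops_loglik_max(x, alpha, theta, G, g, C, C_prime):
    """Log-likelihood (31) for the maximum X.

    x        : list of positive observations
    G, g     : baseline cdf and pdf (callables of one argument)
    C        : normalizing function of the power series law in (4)
    C_prime  : derivative of C
    """
    n = len(x)
    total = n * math.log(alpha * theta / C(theta))
    for xi in x:
        Gi = G(xi)
        d = 1.0 - (1.0 - alpha) * (1.0 - Gi)   # 1 - alpha_bar * G_bar
        H = Gi / d                             # MO cdf, Equation (1)
        total += math.log(g(xi)) + math.log(C_prime(theta * H)) - 2.0 * math.log(d)
    return total
```

For the zero-truncated Poisson case, C(t) = exp(t) - 1 and C'(t) = exp(t), and with a Weibull baseline the value should agree with the sum of the logarithms of the MOTPW density (11).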

We can find the MLE \(\hat{\psi}\) by maximizing Equation (31) using the MaxBFGS sub-routine (Ox program), the optim function (R), or PROC NLMIXED (SAS). The AdequacyModel package can also maximize (31), either by the PSO (particle swarm optimization) approach, which does not require initial values, or by the quasi-Newton BFGS, Nelder-Mead and simulated-annealing methods. Details are available in Marinho et al. 2019 and at https://github.com/prdm0/AdequacyModel.

These scripts can be executed for a wide range of initial values and may lead to more than one maximum. However, in these cases, we consider the MLEs corresponding to the largest value of the maximum log-likelihood. There are sufficient conditions for the existence of these estimates such as compactness of the parameter space and the concavity of the log-likelihood function, but they can exist even when the conditions are not satisfied. In general, there is no explicit solution for the estimates from maximizing (31), but we can establish theoretical conditions on their existence and uniqueness for very special models by examining the ranges of the score components.

Regression

Consider that \(X_1,\ldots,X_n\) are independent random variables with density (11), assuming that the parameters \(\lambda\) and \(\beta\) vary across them. We propose a new regression based on the response variable in (11) with the systematic components

\[ \lambda_i=\exp(\mathbf{v}_i^{T}\boldsymbol{\eta}_1)\quad\text{and}\quad\beta_i=\exp(\mathbf{v}_i^{T}\boldsymbol{\eta}_2),\quad i=1,\ldots,n, \tag{32} \]
respectively, where \(\mathbf{v}_{i}^{T}=(v_{i1},\ldots,v_{ip})\), \(\boldsymbol{\eta}_1=(\eta_{11},\ldots,\eta_{1p})^{T}\) and \(\boldsymbol{\eta}_2=(\eta_{21},\ldots,\eta_{2p})^{T}\). Equations (11) and (32) define the MOTPW regression. For \(\alpha=1\), it reduces to the truncated Poisson Weibull (TPW) regression.

In a similar manner, we can construct many other regressions based on other MOPS-G distributions defined from Equations (5) and (6).

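The log-likelihood function for the vector \(\psi=(\alpha,\theta,\boldsymbol{\eta}_1^{T},\boldsymbol{\eta}_2^{T})^{T}\) from the MOTPW regression can be reduced to

\[ l(\psi)=n\log\left[\frac{\alpha\,\theta}{\exp(\theta)-1}\right]+\sum_{i=1}^{n}\log(\beta_i)+\sum_{i=1}^{n}\beta_i\log(\lambda_i)+\sum_{i=1}^{n}(\beta_i-1)\log(x_i)-\sum_{i=1}^{n}(\lambda_i x_i)^{\beta_i}-2\sum_{i=1}^{n}\log\left\{1-\bar{\alpha}\exp[-(\lambda_i x_i)^{\beta_i}]\right\}+\theta\sum_{i=1}^{n}\frac{1-\exp[-(\lambda_i x_i)^{\beta_i}]}{1-\bar{\alpha}\exp[-(\lambda_i x_i)^{\beta_i}]}. \tag{33} \]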
We obtain the MOTPW distribution for λi=λ and βi=β.
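The regression log-likelihood can be sketched directly from the density (11) and the log links in (32). In the code below, V is a list of covariate rows (including a leading 1 when an intercept is wanted); the function name and argument layout are assumptions for illustration only.

```python
import math

def motpw_reg_loglik(x, V, alpha, theta, eta1, eta2):
    """Log-likelihood of the MOTPW regression built from (11) and (32).

    x          : list of positive responses
    V          : list of covariate rows v_i (same length as x)
    eta1, eta2 : coefficient vectors for log(lambda_i) and log(beta_i)
    """
    n = len(x)
    ll = n * math.log(alpha * theta / math.expm1(theta))
    for xi, vi in zip(x, V):
        lam_i = math.exp(sum(v * c for v, c in zip(vi, eta1)))
        beta_i = math.exp(sum(v * c for v, c in zip(vi, eta2)))
        u = (lam_i * xi) ** beta_i
        e = math.exp(-u)
        d = 1.0 - (1.0 - alpha) * e          # 1 - alpha_bar * exp(-u)
        ll += (math.log(beta_i) + beta_i * math.log(lam_i)
               + (beta_i - 1.0) * math.log(xi) - u
               - 2.0 * math.log(d) + theta * (1.0 - e) / d)
    return ll
```

With a single intercept column and no other covariates, every \(\lambda_i\) and \(\beta_i\) is constant and the function reduces to the log-likelihood of the MOTPW distribution.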

Let \(\hat{\psi}\) be the MLE of \(\psi\). Equation (33) can also be maximized using the gamlss regression framework (Stasinopoulos & Rigby 2008) in R.

Two simulation studies

We perform two simulation studies. The first one examines the accuracy of the MLEs of the parameters in the MOTPW distribution. The second one does the same for the MOTPW regression.

The MOTPW distribution

First, we evaluate the precision of the estimates in the MOTPW distribution based on 1,000 Monte Carlo simulations using the R software. The simulation procedure is as follows:

  • The inverse function \(Q(u)=F^{-1}(u)\) comes from (7):

    \[ Q(u)=\lambda^{-1}\left\{-\log\left[\frac{\theta-\log[u\exp(\theta)-u+1]}{\theta+\alpha\log[u\exp(\theta)-u+1]-\log[u\exp(\theta)-u+1]}\right]\right\}^{1/\beta}. \tag{34} \]

  • Generate \(u\sim U(0,1)\) and obtain the values \(x=Q(u)\) of the MOTPW distribution.
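The two steps above can be sketched in Python as follows (the original study used R). The function names and the seed argument are illustrative assumptions; the quantile function is a direct transcription of (34) and can be checked by composing it with the cdf (7).

```python
import math
import random

def motpw_cdf(x, alpha, theta, lam, beta):
    """MOTPW cdf of the maximum, Equation (7) with the MOW cdf (9)."""
    e = math.exp(-((lam * x) ** beta))
    H = (1.0 - e) / (1.0 - (1.0 - alpha) * e)
    return math.expm1(theta * H) / math.expm1(theta)

def motpw_quantile(u, alpha, theta, lam, beta):
    """Quantile function (34) for 0 <= u < 1."""
    L = math.log(u * math.exp(theta) - u + 1.0)
    w = -math.log((theta - L) / (theta + alpha * L - L))
    return w ** (1.0 / beta) / lam

def motpw_sample(n, alpha, theta, lam, beta, seed=None):
    """Inverse-transform sampling: draw u ~ U(0,1) and return x = Q(u)."""
    rng = random.Random(seed)
    return [motpw_quantile(rng.random(), alpha, theta, lam, beta) for _ in range(n)]
```

For example, motpw_sample(1000, 0.7, 1.5, 3.0, 1.0, seed=1) draws one sample of size 1,000 under the true parameter values used in this study.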

The true parameters are λ=3, β=1, θ=1.5 and α=0.7. The average estimates (AEs), biases, and mean squared errors (MSEs) are listed in Table I. The three measures decrease steadily when n becomes large.

Table I
Simulation results for the MOTPW distribution.

The MOTPW regression

We perform some Monte Carlo simulations for some values of n to investigate the accuracy of the MLEs in the MOTPW regression under four scenarios: Scenario 1: θ=0.6 and α=0.4; Scenario 2: θ=0.6 and α=1.4; Scenario 3: θ=1.7 and α=0.4; Scenario 4: θ=1.7 and α=1.4. We take values greater than and less than one for θ and α.

The explanatory variables \(v_1,\ldots,v_n\) are generated in the regression by taking \(\lambda_i=0.5+0.8\,v_i\), \(\beta_i=0.3+0.1\,v_i\), and \(v_i\sim\text{Bernoulli}(0.5)\).

For each scenario and value of n, one thousand samples are generated from the MOTPW regression, which is then fitted to each generated data set. The quantities reported in Table II are in good agreement with the asymptotic results for the MLEs.

Table II
Simulation results for the MOTPW regression.

Residual analysis

We investigate the quantile residuals (qrs) to verify the adequacy of the response distribution and to detect outliers in the MOTPW regression. The same approach can be adopted for many other regressions defined from the distributions in (5) and (6). The qrs are given by (Dunn & Smyth 1996)

\[ qr_i=\Phi^{-1}\left\{[\exp(\theta)-1]^{-1}\left[\exp\left\{\theta\,\frac{1-\exp[-(\lambda_i x_i)^{\beta_i}]}{1-\bar{\alpha}\exp[-(\lambda_i x_i)^{\beta_i}]}\right\}-1\right]\right\}, \tag{35} \]
where \(\Phi(\cdot)\) is the standard normal cdf and \(\lambda_i\) and \(\beta_i\) are defined in Equation (32).
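Equation (35) amounts to applying the standard normal quantile function to the fitted MOTPW cdf. A minimal sketch follows, using NormalDist from the Python standard library; the function name is an assumption, and in practice the arguments would be the fitted values of the parameters.

```python
import math
from statistics import NormalDist

def motpw_quantile_residual(x, lam_i, beta_i, alpha, theta):
    """Quantile residual (35) for one observation x with parameters lam_i and beta_i."""
    e = math.exp(-((lam_i * x) ** beta_i))
    H = (1.0 - e) / (1.0 - (1.0 - alpha) * e)          # MOW cdf, Equation (9)
    F = math.expm1(theta * H) / math.expm1(theta)      # MOTPW cdf, Equation (7)
    # inv_cdf raises StatisticsError if F is exactly 0 or 1
    return NormalDist().inv_cdf(F)
```

If the model is adequate, the residuals computed over the whole sample should behave approximately like a standard normal sample, which is what the QQ plots examine.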

We consider the same scenarios for the simulations in Section Two Simulation Studies. For each fitted regression, the qrs are calculated from Equation (35). Figures 6, 7, 8, and 9 display QQ plots which show that the empirical distribution of these residuals is close to the standard normal distribution.

Figure 6
QQ plots for scenario 1 (θ=0.6 and α=0.4). (a) n=100. (b) n=500. (c) n=1,000.
Figure 7
QQ plots for scenario 2 (θ=0.6 and α=1.4). (a) n=100. (b) n=500. (c) n=1,000.
Figure 8
QQ plots for scenario 3 (θ=1.7 and α=0.4). (a) n=100. (b) n=500. (c) n=1,000.
Figure 9
QQ plots for scenario 4 (θ=1.7 and α=1.4). (a) n=100. (b) n=500. (c) n=1,000.

Applications

The beta Weibull (BW) and Kumaraswamy Weibull (KwW) distributions have been widely used to fit real data in the last ten years or so. We compare the MOTPW distribution with the BW and KwW distributions since all of them have four parameters. The BW density pioneered by Lee et al. 2007 is

\[ f(x)=\frac{c\,\lambda^{c}}{B(a,b)}\,x^{c-1}\exp\{-b(\lambda x)^{c}\}\,[1-\exp\{-(\lambda x)^{c}\}]^{a-1},\quad x>0, \]
where all parameters are positive.

The KwW density introduced by Cordeiro & Castro 2011 has the form

\[ f(x)=a\,b\,c\,\lambda^{c}\,x^{c-1}\exp\{-(\lambda x)^{c}\}\,[1-\exp\{-(\lambda x)^{c}\}]^{a-1}\,\{1-[1-\exp\{-(\lambda x)^{c}\}]^{a}\}^{b-1},\quad x>0, \]
where all parameters are positive.

Application 1: Hourly dollar wage data

The first application refers to hourly dollar wages for n=534 US workers. These data are obtained from the SemiPar package (Wand et al. 2005). Table III lists the estimates, standard errors (SEs) in parentheses, and three classical statistics. The lowest values of these measures reveal that the MOTPW is the best model. Next, the likelihood ratio (LR) statistic for comparing the MOTPW and TPW models is 6.159 (p-value <0.013), which supports the wider distribution.

Figure 10a shows the histogram and the estimated MOTPW density. Figure 10b provides the empirical cdf and the estimated MOTPW cdf, revealing that this distribution is appropriate for these data.

Figure 10
(a) Estimated MOTPW pdf. (b) Estimated MOTPW cdf and the empirical cdf.
Table III
Results for hourly dollar wage data.

Application 2: Diabetes data

We consider two variables from the data reported by Reaven & Miller 1979: the response \(x_i\) is the relative weight, defined as the ratio between the actual weight and the expected weight (given the person's height), and the explanatory variable \(v_{i1}\) indicates the diagnostic group (0 = normal, 1 = chemical diabetes, 2 = overt diabetes). Since the diagnostic group has three levels, we use two dummy variables \(d_{ij}\) (for \(i=1,\ldots,145\) and \(j=1,2\)). The objective is to assess the relation between the relative weight and the levels of the diagnostic group.

The systematic components for the MOTPW regression are

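\[ \lambda_i=\exp(\eta_{10}+\eta_{11}\,d_{i1}+\eta_{12}\,d_{i2})\quad\text{and}\quad\beta_i=\exp(\eta_{20}+\eta_{21}\,d_{i1}+\eta_{22}\,d_{i2}),\quad i=1,\ldots,145. \]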

The measures for the fitted regressions are reported in Table IV. Clearly, the MOTPW is the best regression for these data.

Table IV
Measures for diabetes data.

Table V provides the estimates, SEs and p-values for the best regression.

Table V
Results for diabetes data.

We note that the covariate \(d_{i1}\) is significant and \(d_{i2}\) is not. So, there is a significant difference between the normal and chemical diabetes groups in relation to relative weight, and no significant difference between the normal and overt diabetes groups. The same findings can be seen in Figure 12.

The LR statistic to compare the MOTPW and TPW regressions is w=4.590 (p-value=0.032), which indicates that the first regression is superior to the second for these data in terms of model fitting.
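The reported p-values can be reproduced approximately under the usual assumption that the LR statistic is asymptotically chi-square with one degree of freedom, since the TPW model is the restriction α=1 of the MOTPW model. The sketch below uses the identity \(P(\chi^2_1>w)=\operatorname{erfc}(\sqrt{w/2})\); the function name is illustrative.

```python
import math

def lr_pvalue_1df(w):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(w / 2.0))

# LR statistics reported in the two applications
p_wage = lr_pvalue_1df(6.159)       # MOTPW vs TPW distributions
p_diabetes = lr_pvalue_1df(4.590)   # MOTPW vs TPW regressions
```

Both values are close to the p-values quoted in the text (about 0.013 and 0.032, respectively).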

The plot of the residuals reported in Figure 11a does not detect outliers or departures from the general assumptions. The worm plot (Buuren & Fredriks 2001) of the residuals in Figure 11b and the QQ plot displayed in Figure 11c show the adequacy of the MOTPW regression for the current data.

Figure 11
(a) Residual plot. (b) Worm plots. (c) QQ plot.

A graphical comparison from the estimated cdfs in Figure 12 also supports the regression analysis.

Figure 12
Estimated cdf and the empirical cdf.

Conclusions

We define two flexible Marshall–Olkin–Power-Series (MOPS) families of continuous distributions which can be very useful to fit real data. They are obtained by combining the Marshall–Olkin class (Marshall & Olkin 1997) and the power series distribution. Hundreds of continuous distributions can be easily formulated from the two families. We discuss some special distributions and maximum likelihood estimation. We introduce the Marshall–Olkin Truncated Poisson Weibull regression associated with one of the families. Some mathematical properties of these families are presented. We provide a package implemented in R software which can be used to determine numerically some mathematical properties for any distribution in the new families. The utility of the proposed models is demonstrated empirically in two applications.

ACKNOWLEDGMENTS

We gratefully acknowledge support from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil.

  • BUUREN S VAN & FREDRIKS M. 2001. Worm plot: a simple diagnostic device for modelling growth reference curves. Stat Med 20(8): 1259–1277.
  • CORDEIRO GM & CASTRO M DE. 2011. A new family of generalized distributions. J Stat Comput Simul 81(7): 883–898.
  • DUNN PK & SMYTH GK. 1996. Randomized quantile residuals. J Comput Graph Stat 5(3): 236–244.
  • LEE C, FAMOYE F & OLUMOLADE O. 2007. Beta-Weibull distribution: some properties and applications to censored data. J Mod Appl Stat Meth 6(1): 173–186.
  • MARINHO PRD, SILVA RB, BOURGUIGNON M, CORDEIRO GM & NADARAJAH S. 2019. AdequacyModel: An R package for probability distributions and general purpose optimization. PloS One 14(8): e0221487.
  • MARSHALL AW & OLKIN I. 1997. A new method for adding a parameter to a family of distributions with application to the exponential and Weibull families. Biometrika 84(3): 641–652.
  • R CORE TEAM. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. URL https://www.R-project.org/
  • REAVEN G & MILLER R. 1979. An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16(1): 17–24.
  • STASINOPOULOS DM & RIGBY RA. 2008. Generalized additive models for location scale and shape (GAMLSS) in R. J Stat Soft 23: 1–46.
  • TAHIR MH & NADARAJAH S. 2015. Parameter induction in continuous univariate distributions: Well-established G families. An Acad Bras Cienc 87: 539–568.
  • WAND M, COULL B, FRENCH J, GANGULI B, KAMMANN E, STAUDENMAYER J & ZANOBETTI A. 2005. SemiPar 1.0. R package. URL http://cran.r-project.org

Publication Dates

  • Publication in this collection
    18 July 2022
  • Date of issue
    2022

History

  • Received
    28 Dec 2020
  • Accepted
    7 June 2021
Academia Brasileira de Ciências Rua Anfilófio de Carvalho, 29, 3º andar, 20030-060 Rio de Janeiro RJ Brasil, Tel: +55 21 3907-8100 - Rio de Janeiro - RJ - Brazil
E-mail: aabc@abc.org.br