
Bootstrap confidence intervals for industrial recurrent event data

Abstract

Industrial recurrent event data, in which an event of interest can be observed more than once in a single sample unit, arise in several areas, such as engineering, manufacturing and industrial reliability. Such data provide information about the number of events, the time to their occurrence and their costs. Nelson (1995) presents a methodology to obtain asymptotic confidence intervals for the cost and the number of cumulative recurrent events. Although this is a standard procedure, it may not perform well in some situations, in particular when the available sample size is small. In this context, computer-intensive methods such as the bootstrap can be used to construct confidence intervals. In this paper, we propose a bootstrap-based technique to obtain interval estimates for the cost and the number of cumulative events. Advantages of the proposed methodology are its applicability in several areas and its easy computational implementation. In addition, according to Monte Carlo simulations, it can be a better alternative to asymptotic-based methods for calculating confidence intervals. An example from engineering illustrates the methodology.

Keywords: industrial data; recurrent events; bootstrap; asymptotic theory; confidence intervals



Osvaldo Anacleto (I); Francisco Louzada (II, corresponding author)

(I) Department of Mathematics and Statistics, The Open University, Milton Keynes, United Kingdom

(II) Departamento de Matemática Aplicada e Estatística, Universidade de São Paulo, São Carlos, Brazil. E-mail: louzada@icmc.usp.br


1 INTRODUCTION

In several areas, such as engineering, manufacturing and industrial reliability, we may observe recurrent event data, where the event of interest can be, for instance, the repeated failure of a piece of equipment, a repair to a system that accumulates several repairs, or a bug found in software under study. Several models and methods are available for the analysis of such data, as described in Hougaard (2000). Approaches often used to model recurrent event data, which allow us to learn about an individual process, are those based on Poisson and renewal processes (Cox & Isham, 1980; Cox & Lewis, 1966; Andersen, Borgan, Gill & Keiding, 1993; Lawless, 1987; Follmann & Goldberg, 1988; Prentice, Williams & Peterson, 1981).

For recurrent event data, it is of interest to study the number of events that occur over time. In addition, since a cost can be associated with each event, a study of event costs may also be important to the analyst. From this perspective, a mean cumulative function (MCF) for the number (or cost) of events per sample unit can be defined. Nelson (1988, 1995) presented a non-parametric procedure to calculate confidence intervals for this function. However, as it relies on asymptotic distributional assumptions, the quality of its results can be affected when little information about the event of interest is available, that is, when the sample of units is small. In this context, computer-intensive methods (Davison & Hinkley, 1997; Chernick, 2008) such as the bootstrap can be used to construct confidence limits for the MCF.

In this paper, we present the estimate and confidence limits proposed by Nelson, and introduce a bootstrap-based technique to obtain confidence limits for the MCF. Section 2 describes the model and the MCF, and Section 3 presents the MCF estimate proposed by Nelson (1995) and its variance. Two methods to calculate confidence limits for the MCF are presented in Section 4: the Nelson asymptotic procedure and our proposed technique. Section 5 presents a simulation study that compares the two approaches of Section 4 via coverage probabilities, and in Section 6 the methodology is illustrated on a valve seat replacement data set. Concluding remarks in Section 7 close the paper.

2 MODEL FORMULATION AND THE MCF ESTIMATOR

Consider a population of units exposed to recurrent events. Regardless of censoring, an uncensored cumulative history function $Y_i(t)$ for the cost of events is associated with each population unit $i$; $Y_i(t)$ denotes the cumulative cost of events on unit $i$ up to age $t$. The model proposed by Nelson (1995) is a population of such uncensored cumulative functions, which extend in principle to any time of interest and do not depend on the censoring of a sample.

Figure 1 shows such functions as smooth curves for easy viewing, although each $Y_i(t)$ is better described by a staircase function due to the nature of the data (Nelson, 1995). Since different units undergo different numbers and costs of events at different ages, there is a population distribution of cumulative cost at any given age $t$. It is assumed that this distribution has a finite mean $C(t)$, which is called the population mean cumulative function (MCF) for cost per unit. The MCF is represented in Figure 1 as a dark curve.


3 ESTIMATION

To estimate the population MCF, consider a sample of units exposed to recurrent events, together with their censored histories. Figure 2 shows these censored histories, where each horizontal line represents a unit's cost history, each x denotes the occurrence of an event, and each dashed vertical line denotes the censoring age of a sample unit.


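Following Nelson (1995), note that, in the representation of Figure 2, the units are shown in ascending order of their censoring ages and numbered backward. These censoring ages divide the observed age range into $N$ intervals, which are also numbered backward. Denote the total incremental event cost accumulated over all events in interval $i$ on unit $n$ by $Y_{in}$, where $i, n = 1, 2, \ldots, N$. Based on this representation, the estimate of the mean cumulative cost function at a given age $t$ in interval $I$ is

$$
\begin{aligned}
C^*(t) ={}& \frac{Y_{NN} + Y_{N,N-1} + \cdots + Y_{N1}}{N} \\
&+ \frac{Y_{N-1,N-1} + Y_{N-1,N-2} + \cdots + Y_{N-1,1}}{N-1} \\
&+ \cdots \\
&+ \frac{Y_{II} + Y_{I,I-1} + \cdots + Y_{I1}}{I},
\end{aligned}
\tag{1}
$$

where the costs $Y_{In}$ in the last row are accumulated only up to age $t$.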

Note that the first row in (1) is the total incremental cost of the $N$ units in interval $N$ divided by $N$, the second row is the total incremental cost of the $N-1$ units in interval $N-1$ divided by $N-1$, and so on. The sum in the last row is the total incremental cost up to age $t$ of all $I$ units exposed in interval $I$. Each row therefore represents the average incremental cost per unit for one interval, from $N$ down to $I$. Since the MCF estimate depends on the intervals considered in the representation of Figure 2, this estimate and the confidence limits are conditional on the given censoring ages. It is also assumed that the units considered are a random sample from some population, and that the event histories of each unit are statistically independent of their censoring ages. To avoid ties, it is assumed that the sample ages of recurrences are known exactly and are distinct points on a continuous time scale. The estimate of the mean cumulative number function is the same, except that 1 is used as the cost of each event.
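As a minimal sketch of the estimate described above, the following Python function adds, for each recurrence at or before age t, its cost divided by the number of units still under observation at that age. With distinct ages this is equivalent to averaging the incremental costs interval by interval. The data layout (a list of `(events, censoring_age)` pairs), the name `mcf_estimate` and the three-unit data set are assumptions made for illustration; they are not taken from Nelson (1995).

```python
def mcf_estimate(units, t):
    """Nonparametric MCF estimate at age t.

    units: list of (events, censoring_age); events is a list of (age, cost).
    Illustrative layout only; use cost = 1 to estimate the mean number of events.
    """
    total = 0.0
    for events, censoring_age in units:
        for age, cost in events:
            if age <= t and age <= censoring_age:
                # Units whose censoring age has not yet been passed at this age.
                at_risk = sum(1 for _, c in units if c >= age)
                total += cost / at_risk
    return total


# Hypothetical data: three units with distinct censoring ages and unit costs.
units = [
    ([(10, 1.0), (30, 1.0)], 40),
    ([(20, 1.0)], 25),
    ([], 15),
]

estimate_at_20 = mcf_estimate(units, 20)  # 1/3 + 1/2
estimate_at_35 = mcf_estimate(units, 35)  # 1/3 + 1/2 + 1
```

In this toy data set the event at age 10 is shared among three exposed units, the event at age 20 among two, and the event at age 30 is carried by the single unit still under observation.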

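From the representation in Figure 2 and the formula for the variance of a sum of random variables, Nelson (1995) derived the variance of (1), denoted by $V[C^*(t)]$. Since it is the variance of a sum, it consists of the variances $V(Y_{in})$ of all the incremental costs in (1) and the covariances $V(Y_{in}, Y_{jn})$, which reflect the population autocorrelation between incremental costs in intervals $i$ and $j$. The variance of (1) is then

$$
\begin{aligned}
V[C^*(t)] ={}& \frac{V(Y_{NN}) + V(Y_{N,N-1}) + \cdots + V(Y_{N1})}{N^2} \\
&+ \frac{V(Y_{N-1,N-1}) + V(Y_{N-1,N-2}) + \cdots + V(Y_{N-1,1})}{(N-1)^2} \\
&+ \cdots + \frac{V(Y_{II}) + V(Y_{I,I-1}) + \cdots + V(Y_{I1})}{I^2} \\
&+ 2\,\frac{V(Y_{N,N-1}, Y_{N-1,N-1}) + \cdots + V(Y_{N1}, Y_{N-1,1})}{N(N-1)} \\
&+ \cdots + 2\,\frac{V(Y_{NI}, Y_{II}) + \cdots + V(Y_{N1}, Y_{I1})}{N I} \\
&+ 2\,\frac{V(Y_{N-1,N-2}, Y_{N-2,N-2}) + \cdots + V(Y_{N-1,1}, Y_{N-2,1})}{(N-1)(N-2)} \\
&+ \cdots + 2\,\frac{V(Y_{N-1,I}, Y_{II}) + \cdots + V(Y_{N-1,1}, Y_{I1})}{(N-1) I} \\
&+ \cdots \\
&+ 2\,\frac{V(Y_{I+1,I}, Y_{II}) + \cdots + V(Y_{I+1,1}, Y_{I1})}{(I+1) I}.
\end{aligned}
\tag{2}
$$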

The first block of terms consists of the individual variances of each of the $Y_{in}$ in (1); the second block consists of the covariances between incremental costs in interval $i = N$ and those in each subsequent interval $i = N-1, N-2, \ldots, I$. The third block consists of the covariances between incremental costs in interval $i = N-1$ and those in each subsequent interval $i = N-2, N-3, \ldots, I$, and so on until the last block, which consists of the covariances between incremental costs in interval $i = I+1$ and those in interval $i = I$ up to age $t$.

Since the incremental costs whose variances appear in the first row of the first block in (2) are $N$ independent observations from the same incremental cost distribution for interval $N$, we have

$$V(Y_{NN}) = V(Y_{N,N-1}) = \cdots = V(Y_{N1}) = V(Y_{Nn}).$$

Hence, the sum of the first row of the first block in (2) is $N\,V(Y_{Nn})$. By the same reasoning, the $N-1$ variances in the second row of the first block have a common value $V(Y_{N-1,n})$ and sum to $(N-1)\,V(Y_{N-1,n})$, and so on. Similarly, the covariance terms can be combined, since the covariance terms in any single row of (2) are all equal. For instance, the first row of the second block has $N-1$ covariances with common value $V(Y_{Nn}, Y_{N-1,n})$ and sum $(N-1)\,V(Y_{Nn}, Y_{N-1,n})$. Hence, the variance of $C^*(t)$ simplifies to

$$
V[C^*(t)] = \sum_{i=I}^{N} \frac{V(Y_{in})}{i} + 2 \sum_{i=I+1}^{N} \sum_{j=I}^{i-1} \frac{V(Y_{in}, Y_{jn})}{i}.
\tag{3}
$$

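To estimate the terms in (3), consider the $i$ sample incremental costs $Y_{ii}, Y_{i,i-1}, \ldots, Y_{i1}$ observed in interval $i$. These observed costs are a random sample from the incremental cost distribution of interval $i$. Thus, their sample variance,

$$
V^*(Y_{in}) = \frac{1}{i-1} \sum_{n=1}^{i} \left(Y_{in} - \bar{Y}_i\right)^2, \qquad \bar{Y}_i = \frac{1}{i} \sum_{n=1}^{i} Y_{in},
\tag{4}
$$

is an unbiased estimate of the population variance $V(Y_{in})$. Also, for $j < i$, the population covariance $V(Y_{in}, Y_{jn})$ is estimated by the sample covariance over the $j$ units observed in both intervals,

$$
V^*(Y_{in}, Y_{jn}) = \frac{1}{j-1} \sum_{n=1}^{j} \left(Y_{in} - \bar{Y}_{i(j)}\right)\left(Y_{jn} - \bar{Y}_j\right), \qquad \bar{Y}_{i(j)} = \frac{1}{j} \sum_{n=1}^{j} Y_{in}.
\tag{5}
$$

Substituting (4) and (5) into (3) provides an unbiased estimate of the true variance $V[C^*(t)]$ (Nelson, 1995).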
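The plug-in calculation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the SAS or author implementation: the data layout (a list of `(events, censoring_age)` pairs), the function names and the toy data are hypothetical, the covariance means are taken over the units observed in both intervals, and the sketch simply refuses ages at which only one unit is still exposed, since a sample variance is undefined there.

```python
from statistics import mean


def interval_costs(units, t):
    """Map each interval index i (numbered backward) to the costs Y_in, n = 1..i.

    units: list of (events, censoring_age); events is a list of (age, cost).
    Only intervals that overlap (0, t] are returned; the last one is cut at t.
    """
    # Unit 1 has the largest censoring age, unit N the smallest.
    ordered = sorted(units, key=lambda u: u[1], reverse=True)
    n_units = len(ordered)
    rows = {}
    for i in range(n_units, 0, -1):
        lower = ordered[i][1] if i < n_units else 0.0
        if lower >= t:
            break
        upper = min(ordered[i - 1][1], t)
        rows[i] = [
            sum(cost for age, cost in events if lower < age <= upper)
            for events, _ in ordered[:i]
        ]
    return rows


def sample_cov(x, y):
    """Sample covariance of paired values (the sample variance when x is y)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def mcf_variance(units, t):
    """Plug sample variances and covariances into the simplified variance."""
    rows = interval_costs(units, t)
    if min(rows) < 2:
        raise ValueError("needs at least two exposed units in every interval")
    total = 0.0
    for i, y_i in rows.items():
        total += sample_cov(y_i, y_i) / i
        for j, y_j in rows.items():
            if j < i:
                # Covariance over the j units observed in both intervals.
                total += 2 * sample_cov(y_i[:j], y_j) / i
    return total


# Hypothetical data: three units with distinct censoring ages and unit costs.
units = [
    ([(10, 1.0), (30, 1.0)], 40),
    ([(20, 1.0)], 25),
    ([], 15),
]
variance_at_20 = mcf_variance(units, 20)
```

Because covariance terms enter with their sign, an estimate built this way is not guaranteed to be positive for every data set, which is one more reason to treat very small samples with care.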

4 CONFIDENCE INTERVALS FOR MCF

In this section, the usual procedure to calculate confidence limits for the mean cumulative function is presented, as well as an alternative based on bootstrap techniques.

4.1 Confidence Intervals Based on Asymptotic Theory

Suppose that the $N$ cumulative history functions for cost represented in Figure 2 are a simple random sample from an infinite population. At a given time $t$, the estimator of the mean cumulative function is given by (1). Since this estimator is a sample mean that takes the censored histories into account, by the central limit theorem (Lehmann, 1999) $C^*(t)$ is approximately normally distributed with mean $C(t)$ (the mean cumulative function at time $t$) and variance $V[C^*(t)]$ (Nelson, 1995). Hence, the two-sided approximate normal $100(1-\alpha)\%$ confidence interval for $C(t)$ is given by

$$
C^*(t) \pm K_\alpha \sqrt{V^*[C^*(t)]},
\tag{6}
$$

where $V^*[C^*(t)]$ is the estimator of $V[C^*(t)]$ and $K_\alpha$ is the $1-\alpha/2$ quantile of the standard normal distribution.

This procedure is based on a sample of units, and the asymptotic confidence intervals presented here may not perform well if the sample size is small. In this context, computer-intensive methods such as the bootstrap can be used to construct confidence intervals for the mean cumulative function.
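As a small numerical illustration of the normal-approximation limits, the sketch below computes the estimate plus or minus the normal quantile times the estimated standard error. The function name `normal_limits` and the input values are hypothetical and chosen only to show the arithmetic.

```python
import math
from statistics import NormalDist


def normal_limits(estimate, variance, level=0.90):
    """Two-sided normal-approximation limits: estimate -/+ K * sqrt(variance)."""
    k = NormalDist().inv_cdf(0.5 + level / 2)  # 1 - alpha/2 standard normal quantile
    half_width = k * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Hypothetical inputs: a small MCF estimate with a comparatively large variance.
lower, upper = normal_limits(estimate=0.10, variance=0.01, level=0.90)
```

With these inputs the lower limit falls below zero, because nothing in the normal approximation constrains the limits to the range of a cumulative cost or count.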

4.2 Confidence Intervals Based on the Bootstrap

The bootstrap is a computer-intensive method that can be used to obtain confidence intervals for quantities of interest (Efron, 1979). According to Moretti & Mendes (2003), this technique is especially useful for statistical problems involving a small sample size and for those involving estimators whose distribution (exact or asymptotic) has not yet been obtained. The basic idea is to treat the observed data as a population and to draw samples from it by sampling with replacement from the original sample. If this procedure is repeated many times, different values of the quantity of interest are obtained, which provide an empirical distribution of this quantity. Based on this idea, it is possible to construct percentile confidence intervals (Efron & Tibshirani, 1993; Davison & Hinkley, 1997; Chernick, 2008; Souza, Souza & Staub, 2009) for the MCF by resampling the original set of units exposed to recurrent events and calculating the mean cumulative function estimate for each resample.

Along these lines, an algorithm to obtain the 100 (1 – α) % percentile bootstrap confidence intervals for the MCF is given by the following steps:

Step 1: From the original dataset, obtain B resamples of units based on a sampling with replacement scheme;

Step 2: For each of the B resamples, calculate the Mean Cumulative Function estimate;

Step 3: For each recurrence time of the units in the original dataset, calculate the α/2 and (1 – α/2) percentiles of the empirical distribution of the B estimates obtained from the B resamples.
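The three steps can be sketched in Python for a single age t as below. This is not the authors' program: the data layout, the names `mcf_estimate` and `bootstrap_limits`, the six-unit data set and the rule for picking percentiles (the ordered value at position (B + 1)α/2 and its upper counterpart) are assumptions made for illustration.

```python
import random


def mcf_estimate(units, t):
    """MCF estimate at age t; units is a list of (events, censoring_age)."""
    total = 0.0
    for events, censoring_age in units:
        for age, cost in events:
            if age <= t and age <= censoring_age:
                at_risk = sum(1 for _, c in units if c >= age)
                total += cost / at_risk
    return total


def bootstrap_limits(units, t, n_resamples=399, alpha=0.10, seed=0):
    """Percentile bootstrap limits for the MCF at age t."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Step 1: resample whole units (history plus censoring age) with replacement.
        resample = [rng.choice(units) for _ in units]
        # Step 2: MCF estimate for this resample.
        estimates.append(mcf_estimate(resample, t))
    # Step 3: empirical alpha/2 and 1 - alpha/2 percentiles of the estimates.
    estimates.sort()
    k_lower = max(round((n_resamples + 1) * alpha / 2), 1)
    k_upper = min(round((n_resamples + 1) * (1 - alpha / 2)), n_resamples)
    return estimates[k_lower - 1], estimates[k_upper - 1]


# Hypothetical data: six units, each event with cost 1.
units = [
    ([(12, 1.0), (35, 1.0)], 50),
    ([(20, 1.0)], 45),
    ([(8, 1.0), (15, 1.0), (33, 1.0)], 40),
    ([], 38),
    ([(25, 1.0)], 30),
    ([(5, 1.0)], 28),
]
lower, upper = bootstrap_limits(units, t=25)
```

Resampling whole units keeps each event history together with its own censoring age, and because every resampled estimate is a sum of non-negative costs, the percentile limits cannot be negative.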

A program to calculate the bootstrap confidence intervals for the MCF is available from the authors. Implementations of the variance estimate of (3) and of the asymptotic confidence limits are provided by the SAS software.

5 A SIMULATION STUDY

To compare the confidence intervals provided by the asymptotic theory and by the bootstrap, and to assess the influence of the sample size on both methods, a simulation study was performed to evaluate the coverage probability and the mean range of the confidence intervals presented here.

The study considered sample sizes of 10, 30 and 100. For each sample size, four scenarios defined by the parameter settings for data generation were considered: the number of events in each sample unit was generated from a Poisson distribution with mean 2 or 5, and the recurrence times were generated from a Weibull distribution with scale parameter 1000 and shape parameter 1 or 3, taking the largest time generated for each unit as a censored event. The four scenarios are: Scenario 1 (Poisson mean 2, Weibull shape 1), Scenario 2 (Poisson mean 2, Weibull shape 3), Scenario 3 (Poisson mean 5, Weibull shape 1) and Scenario 4 (Poisson mean 5, Weibull shape 3). Ties were not considered. We studied the behavior of the 90% confidence limits for the mean cumulative number function, which is a particular case of the MCF.
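One possible reading of this data-generating scheme is sketched below in Python. The text does not say how a unit with zero events obtains a censoring age, so the sketch assumes that one extra Weibull age is drawn per unit and that the largest age becomes the censoring age; this choice, the function names and the seed are assumptions, not the authors' code.

```python
import math
import random


def poisson_draw(rng, mean):
    """Knuth's multiplication method for a Poisson variate (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_unit(rng, poisson_mean, weibull_shape, weibull_scale=1000.0):
    """One unit as (events, censoring_age), each event with cost 1."""
    n_events = poisson_draw(rng, poisson_mean)
    # Assumption: draw one extra age so that every unit has a censoring age.
    ages = sorted(
        rng.weibullvariate(weibull_scale, weibull_shape)
        for _ in range(n_events + 1)
    )
    censoring_age = ages.pop()  # largest generated age is treated as censored
    return [(age, 1.0) for age in ages], censoring_age


def simulate_sample(n_units, poisson_mean, weibull_shape, seed=0):
    """A sample of independent units generated under one scenario."""
    rng = random.Random(seed)
    return [simulate_unit(rng, poisson_mean, weibull_shape) for _ in range(n_units)]


# Scenario 1 in the text: Poisson mean 2, Weibull shape 1, here with 10 units.
sample = simulate_sample(n_units=10, poisson_mean=2, weibull_shape=1)
```

The output uses the same `(events, censoring_age)` layout as a list of unit histories, so a generated sample can be passed directly to an MCF estimator written for that layout.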

To determine the coverage probability, an original data set was first generated and its MCF estimate was calculated. Then 399 samples were generated under the same specifications used to generate the original data set, and the MCF estimate was calculated for each of them; that is, the Monte Carlo procedure was repeated 399 times. The number B of bootstrap resamples was also set at 399. According to Hall (1986), this number of replications is enough to obtain a critical level of 0.05 from the 0.95 percentile of the empirical distribution of the test statistic. To calculate the coverage probability, it was necessary to fix percentiles of the recurrence times, since the recurrence times of the 399 samples and of the original sample were generated from a probability distribution and thus vary from sample to sample. The 10th, 25th, 50th, 75th and 90th percentiles were chosen. For each of these percentiles, it was verified whether the confidence intervals from the 399 samples covered the MCF estimate obtained in the original sample. If not, it was also verified whether that MCF estimate lay above the upper limit or below the lower limit.

The results for all scenarios considered are presented in Table 1. For each percentile, it contains the time corresponding to that percentile in the original sample and, for each of the two methods, the coverage probability, the average range of the interval and the standard deviation of the interval range over the 399 samples generated in addition to the original sample. The relative difference of these quantities between the methods is also presented, always with the quantity provided by the asymptotic method in the denominator.

For sample sizes larger than 30, in all scenarios, the bootstrap method and the asymptotic method provide similar coverage probabilities. In addition, the results indicate that the coverage probabilities are underestimated at the smallest percentiles and overestimated at the largest percentiles, and that this discrepancy tends to decrease as the sample size increases, as do the average interval ranges and their standard deviations. However, for sample size 10, the bootstrap method provides smaller average interval ranges as well as smaller standard deviations of these ranges. This indicates that, since the asymptotic method requires a sufficiently large sample for inference, the bootstrap method can be used as an alternative approach to provide confidence limits for the mean cumulative function in the presence of small samples.
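The summary quantities reported for each percentile can be computed along the following lines. The function names, the toy intervals and the sign convention of the relative difference (bootstrap minus asymptotic, divided by asymptotic) are assumptions; the text only fixes the asymptotic quantity as the denominator.

```python
from statistics import mean, stdev


def summarize(intervals, reference):
    """Coverage and interval-range summaries for a list of (lower, upper) limits."""
    n = len(intervals)
    ranges = [hi - lo for lo, hi in intervals]
    return {
        "coverage": sum(lo <= reference <= hi for lo, hi in intervals) / n,
        "below_lower": sum(reference < lo for lo, _ in intervals) / n,
        "above_upper": sum(reference > hi for _, hi in intervals) / n,
        "mean_range": mean(ranges),
        "sd_range": stdev(ranges),
    }


def relative_difference(bootstrap_value, asymptotic_value):
    """Difference between the methods relative to the asymptotic quantity."""
    return (bootstrap_value - asymptotic_value) / asymptotic_value


# Hypothetical intervals from four simulated samples and a reference MCF value.
intervals = [(0.5, 1.5), (0.8, 2.0), (1.2, 2.2), (0.2, 0.9)]
summary = summarize(intervals, reference=1.0)
```

Tracking the two kinds of miss separately shows whether an interval procedure tends to sit too high or too low, which an overall coverage figure alone would hide.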

6 THE VALVE SEATS REPLACEMENT DATA

The methodology was applied to a real data set provided by Nelson (1995). The data record valve seat replacements over time in 41 engines in a fleet. In this case, the recurrent event is the replacement of a valve seat in an engine. The interest lies in verifying whether the replacement rate increases with engine age (in days).

The confidence limits obtained via asymptotic theory and the bootstrap method are presented in Figure 3, as well as in Table 1. Both methods indicate that the replacement rate is constant over time. The confidence limits also become wider as age increases, since information about valve seat replacements decreases over time. However, the asymptotic procedure leads to negative lower confidence limits, which is impossible from a practical point of view. This problem is overcome by our bootstrap procedure. Also, for approximately 92% of the recurrence times, the bootstrap procedure produces confidence interval ranges approximately 50% smaller than the asymptotic ones. Even though these results are not conclusive, they indicate an advantage of the practical use of the bootstrap confidence interval procedure over the asymptotic method.


7 CONCLUDING REMARKS

In this work, we presented the estimate and the confidence intervals based on asymptotic theory proposed by Nelson (1995) for the mean cumulative function, using a non-parametric methodology for recurrent event data. We also presented an implementation of the bootstrap technique for the construction of confidence limits for the MCF. These two procedures were applied to a real data set. Advantages of the proposed methodology are its applicability in several areas and its easy computational implementation.

Our simulation results suggest that the confidence intervals based on the two procedures are similar for moderate and large sample sizes. However, for small sample sizes, the bootstrap method provides smaller confidence interval ranges as well as smaller standard deviations of these ranges. Hence, the bootstrap method can be used as an alternative approach to provide confidence intervals for the mean cumulative function, in particular when there are restrictions regarding the availability of information about the event under study.

We only considered the percentile bootstrap method to develop the confidence intervals for the MCF, since it is the most straightforward one. In this way, we also keep the non-parametric nature of the MCF estimator. However, other bootstrap schemes, such as the normal, the percentile-t and the pivotal methods (Davison & Hinkley, 1997; Chernick, 2008), can also be considered for obtaining confidence intervals for the MCF.

ACKNOWLEDGMENTS

We would like to thank the referees for their very useful comments which substantially improved the paper. The research was partially supported by the Brazilian Organizations CAPES and CNPQ.

REFERENCES

[1] ANDERSEN PK, BORGAN O, GILL RD & KEIDING N. 1993. Statistical Models Based on Counting Processes. New York: Springer.

[2] COX DR & LEWIS PAW. 1966. The Statistical Analysis of Series of Events. London: Methuen.

[3] COX DR & ISHAM V. 1980. Point Processes. London: Chapman and Hall.

[4] CHERNICK MR. 2008. Bootstrap Methods: A Guide for Practitioners and Researchers. New York: Wiley.

[5] DAVISON AC & HINKLEY DV. 1997. Bootstrap Methods and Their Application. Cambridge University Press.

[6] EFRON B. 1979. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7: 1–26.

[7] EFRON B & TIBSHIRANI RJ. 1993. An introduction to the bootstrap. New York: Chapman & Hall.

[8] FOLLMANN DA & GOLDBERG MS. 1988. Distinguishing heterogeneity from decreasing hazard rates. Technometrics, 30: 389–396.

[9] HALL P. 1986. On the number of bootstrap simulations required to construct a confidence interval. The Annals of Statistics, 14: 125–129.

[10] HOUGAARD P. 2000. Analysis of Multivariate Survival Data. New York: Springer-Verlag.

[11] LAWLESS JF. 1987. Regression methods for Poisson process data. Journal of the American Statistical Association, 82: 808–815.

[12] LEHMANN EL. 1999. Elements of Large-sample Theory. New York: Springer-Verlag.

[13] MORETTI AR & MENDES BVM. 2003. Sobre a precisão das estimativas de máxima verossimilhança nas distribuições bivariadas de valores extremos. Pesquisa Operacional [online], 23(2): 301–324.

[14] NELSON W. 1988. Graphical Analysis of System Repair Data. Journal of Quality Technology, 20:24–35.

[15] NELSON W. 1995. Confidence Limits for Recurrence Data-Applied to Cost or Number of Product Repair. Technometrics, 37: 147–157.

[16] PRENTICE RL, WILLIAMS BJ & PETERSON AV. 1981. On the regression analysis of multivariate failure time data. Biometrika, 68: 373–379.

[17] SOUZA MO, SOUZA GS & STAUB RB. 2009. Influência de variáveis contextuais em medidas não-paramétricas de eficiência: uma aplicação com métodos de reamostragem. Pesquisa Operacional [online], 29(2): 289–302.

Received May 18, 2009 / Accepted June 17, 2011

  • Publication Dates

    • Publication in this collection
      03 Apr 2012
    • Date of issue
      Apr 2012

    History

    • Received
      18 May 2009
    • Accepted
      17 June 2011
    Sociedade Brasileira de Pesquisa Operacional Rua Mayrink Veiga, 32 - sala 601 - Centro, 20090-050 Rio de Janeiro RJ - Brasil, Tel.: +55 21 2263-0499, Fax: +55 21 2263-0501 - Rio de Janeiro - RJ - Brazil
    E-mail: sobrapo@sobrapo.org.br