Using multi-parameters distributions to assess the probability of occurrence of extreme rainfall data

Distribuições multi-paramétricas para avaliar a probabilidade de ocorrência de eventos extremos de precipitação pluvial R E S U M O Erosão e saturação hídrica do solo estão frequentemente associadas a eventos extremos de precipitação pluvial, tal como as enchentes. Desta forma, a literatura científica indica a necessidade de se realizar estudos que aprimorem a determinação da probabilidade de ocorrência desses eventos. O objetivo principal do estudo foi comparar o desempenho das distribuições multiparamétricas Wakeby, Kappa e Generalizada dos Valores Extremos na estimativa dos máximos anuais de totais de precipitação (diários, dois e três dias) da localidade Campinas-SP (1890-2012). O objetivo secundário foi avaliar a presença de tendências climáticas e a correlação serial nessas variáveis. A função autocorrelação e o teste de Mann-Kendall não indicaram a presença de significativas tendências climáticas e correlações serias em nenhuma das séries. Com base nos resultados dos testes de aderência concluiu-se que as distribuições Kappa e Generalizada dos Valores Extremos apresentam os melhores desempenhos na descrição probabilística das séries do estudo. Using multi-parameters distributions to assess the probability of occurrence of extreme rainfall data Gabriel C. Blain1 & Monica C. Meschiatti2 1 IIAC/APTA/SAA. Campinas, SP. Fone: (19) 3202-1689. E-mail: gabriel@iac.sp.gov.br (Autor correspondente) 2 IIAC/APTA/SAA. Campinas, SP. Fone: (19) 3202-1689. E-mail: monicameschiatti@hotmail.com


Introduction
Parametric distributions have been widely used to describe the stochastic behavior of a process. By fitting a theoretical distribution to a rainfall series, one is able to estimate the quantile (rainfall amount) associated with a high return period (e.g. 500-years) even though the local rainfall records comprise a shorter period of time.
Currently, the great capacity of ordinary personal computers has allowed the use of complex and flexible parametric distributions to assess the probability of occurrence of several agrometeorological events. The use of the 4-parameter Kappa distribution (KAP) or 5-parameter Wakeby distribution (WAK) is, certainly, an example of this fact. According to Morgan et al. (2011) the KAP and WAK are able to exhibit shapes that other distributions cannot mimic. Because of its flexibility, these two distributions can be used to describe the probabilistic structure of a natural process, such as rainfall, which has several contributing factors (Sen & Niedzielski, 2010;Santos et al., 2011). Accordingly, these multi-parameter distributions have been used in several hydrological and meteorological studies. Sen & Niedzielski (2010) fitted the WAK to river flow data obtained from the Odra River basin in Southern Poland. Su et al. (2009) used the WAK to simulate extreme rainfall data (Preextrem) over the Yangtze River basin. Parida (1999) used the KAP distribution to model the stochastic behavior of the rainfall events observed during the summer season in India. Park et al. (2001) applied the WAK to model summer Pre-extrem data over the Korean Peninsula. This last study was based on time series of maximums of annual daily and 2-day rainfall amounts. For studies carried out in the State of São Paulo, the analysis of annual maximums of 3-day rainfall amounts is also of great R. Bras. Eng. Agríc. Ambiental, v.18, n.3, p. 307-313, 2014. interest (http://www.defesacivil.sc.gov.br/images/stories/PDF/ III%20seminario%20Jica/4_experincias_e_lioes_aprendidas_ com_o_sistema_de_alerta_de_escorregamento_no_estado_ de_so_paulo_modo_de_compatibilidade.pdf).
In spite of all the above-mentioned arguments in favor of the use of the WAK and KAP distributions, the General Extreme Value distribution (GEV) has also been successfully used to assess the probability of occurrence of Pre-extrem values observed in several parts of the globe. After Blain & Moraes (2011) evaluated the fit of several distributions (normal, lognormal, gamma and GEV) to daily Pre-extrem values obtained from distinct locations of the State of São Paulo, they recommended the use of the GEV for the aforementioned purpose.
Blain (2011) and Blain & Camargo (2012) also stated that the GEV function may be used to assess the probability of occurrence of daily Pre-extrem values obtained from the weather stations of Campinas-SP and Ubatuba-SP, respectively. Pujol et al. (2007) used the GEV to describe the probabilistic structure of monthly and annual Pre-extrem series of the French Mediterranean region. Garcia et al. (2007) applied the GEV to evaluate trends in Pre-extrem daily series of the Iberian Peninsula . El Adlouni et al. (2007) and Cannon (2010) applied the GEV function, in its non-stationary form, to model the Pre-extrem data recorded at Randsburg-USA.
The GEV is a generalized function in the sense that it joins the three existing extreme-value distributions (Gumbel, Fréchet and Weibull) into a single model (Queiroz & Chaudhry, 2006). The use of the GEV relies on the Extremal Type Theorem that states that if exist sequences of constants {b and a > 0} such that Pr{[(M-b)/a]≤z} → G(z), where G is a non-degenerate distribution and the maximums (M) of this process are independent and identically distributed (idd), then this function belongs to one of the three above mentioned extreme-value distributions (Coles, 2001;Wilks, 2011).
According to Nadarajah & Choi (2007) the main drawback of studies that use the KAP and WAK is that they are not extreme value distributions. Consequently, there is no theoretical justification to model annual Pre-extrem data by using these distributions (Nadarajah & Choi, 2007). However, according to Su et al. (2009) the use of the KAP and WAK may be seen as one of the recent developments in statistical analysis of Pre-extrem data. Moreover, according to authors such as Wilks (2011), among many others, the appropriateness of the GEV to each location should be evaluated along with other distributions.
In spite of the different methods used by the aforementioned studies, all of them agree on the need to carry out studies that improve the assessment of the probability of occurrence of extreme rainfall values. According to Hay (2007) disruption to critical agricultural activities, soil moisture saturation, runoff and soil erosion, are among the agricultural impacts associated with extreme rainfall events.
Considering this background, the main goal of this study was to compare the performance of the WAK, KAP and GEV in fitting the annual maximums of daily, 2-day and 3-day rainfall amounts obtained from the weather station of Campinas, State of São Paulo, Brazil (1890-2012. As a secondary aim and also considering that the distributions were calculated under the idd assumption of the extremal type theorem, the presence of climate trends and serial correlation in these pre-extrem series was also evaluated.

Material and Methods
Annual maximums of daily, 2-day and 3-day rainfall amounts were used from the weather station of Campinas, State of São Paulo, Brazil, (1890-2012; 22º 54' S; 47º 05' W; 669 m). According to the Köppen system, the climate of Campinas is CWa. The consistency of the series has already been assessed by Blain (2011). All statistical methods were performed at the 0.05 significance level. Descriptive statistics such as the mean, median and standard deviation were also calculated.
According to Maia et al. (2007), fitting a parametric distribution to a given data set is only appropriate for uncorrelated process. Thus, the auto-correlation function (acf) was used to verify if the data sample is free from such a component. The acf coefficients were estimated up to lag 4 (years). Although the number of lags (#Lags) is an arbitrary choice, the upper limit #Lags ≤ N/4 was respected (N is the length of the series). Further information regarding the acf can be found in several studies including Wilks (2011). The Mann-Kendall (MK) test (Kendall & Stuart, 1967) was used to evaluate the presence of climate trends in the Pre-extrem data. According to Önöz & Bayazit (2011) the MK test can be seen as one of the most used non parametric methods for trend detection.

The Wakeby distribution
The WAK is explicitly defined only by its quantile function (Eq. 1) in which ξ, α, β, γ and σ are the parameters of this distribution.
The range of the possible values calculated by Eq. 1 is given by Eq. 2.
The parameters of Eq. 1 can be estimated by means of several methods such as moments (mom), maximum likelihood (mle), probability weighted moments (pwm) and L-moments (lmom). According to Su et al. (2009), the estimates calculated from the pwm are less complicated and sometimes more accurate than those obtained from the mle. According to Park et al. (2001) for (1) (2) Pre extrem ; if 0 and 0 Pre extrem ; if 0 and 0 the WAK, the mle is not easily obtained because its probability distribution, as previously described, is not explicitly defined. Also according to Park et al. (2001) the lmom are frequently more tractable than the mle and it also gives reliable estimations. Finally, according to Chen et al. (2006), the pwm and lmom gives equivalent results. Thus, by following Park et al. (2001), Su et al. (2009), Sen & Niedzielski (2010 and Morgan et al. (2011) the WAK parameters were estimated by using the lmom.

The Kappa distribution
The cumulative distribution function (cdf) of the KAP is described by Eq. 3. methods usually provide equivalent estimations. The parameters of Eq. 6 were estimated by means of the mle, as described in Coles (2001); Pujol et al. (2007); Nadarajah & Choi (2007) ;Furió & Meneu (2010), and by means of the lmom, as described in Queiroz & Chaudhry (2006). This last study provides interesting information regarding the understanding and the use of the lmom in hydrological studies. Hereafter, for the sake of brevity, all estimates will be referred to as parameters.

Goodness-of-fit tests
The chi-square test (χ 2 ) and the Kolmogorov-Smirnov test (KS) are frequently used goodness-of-fit tests (Wilks, 2011). However, the χ 2 test operates more naturally for discrete random variable because to calculate it, the range of the data must be divided into discrete classes. In this view, the KS test compares the empirical and the cdf functions. Consequently, for continuous distributions, the KS test is often more powerful than the χ 2 (Wilks, 2011).
Nevertheless, if (and only if ) the parameters of the parametric distribution have not been estimated from the same batch of data used to evaluate the fit of the parametric distribution, the original algorithm of the KS test is applicable (Wilks, 2011, Steinskog et al., 2007and Vlček & Huth, 2009. Given that the parameters of the WAK, KAP and GEV were fitted using all available data, the KS test had to be modified. This adapted method is called Kolmogorov-Smirnov/Lilliefors test (KS-L). The statistical simulations required for calculating the KS-L test were based on the procedure called "non-uniform random number generation by inversion". By following Wilks (2011) Nb was set to 10000.
According to Wilks (2011), although the KS-L test is able to indicate an inadequate fit, it is not capable of indicating the specific nature of the problem. Moreover, according to Sansigolo (2008), the KS-L is only appropriated for evaluating the central part of the distributions. Since this study deals with extreme rainfall amounts, special attention must be given for the upper tail of the distributions. Thus, the quantile-quantile plots (QQ), as described by Wilks (2011), were also used for comparing the fitted distribution to the observed data. By using the QQ plots one is able to verify how and where the cdf-summary is not adequate (Wilks, 2011). Naturally, the QQ plots were used only for the distributions that, according to the KS-L, can be used to describe the probabilistic structure of the variable under study.
The square root of the mean squared error (RMSE) and the modified index of agreement (dmod; Willmott et al., 1985) were used to evaluate the results depicted in the QQ plots. The RMSE calculation was based on the average squared difference between the Cartesian points of the QQ plots. Since the RMSE was calculated by squaring the difference between the observed and the estimated rainfall value, it is sensitive to larger errors (Wilks, 2011). The RMSE has the same physical dimension of the data under analysis. Accordingly, it can be thought as a magnitude for the estimates errors (Wilks, 2011). The dmod is dimensionless and it is bounded by 0 and 1. A perfect fit would The upper bound of Eq. 3 is given by: The lower bound of Eq. 3 is given by: By following Parida (1999), Morgan et al. (2011) and Santos et al. (2011) the parameters of Eq. 3 (ξ, α, k, and h) were also estimated by using the lmom.

The generalized extreme value distribution
The cdf of the GEV function is described by Eq. 6.
The range of Eq. 6 is given by According to Wilks (2011), the GEV is frequently fitted by using either the mle or the lmom methods. Also according to this author, for moderate and large sample sizes, these two lead to dmod = 1. When applied to a given model, dmod will be lower or, at most, equal to the original index of agreementdmod (Willmott et al., 1985).
The 95% confidence interval (CI) for each RMSE and dmod value was obtained from a bootstrap technique described in several studies including Willmott et al. (1985). The introduction of the CI on the estimates of each RMSE and dmod allowed to verify whether the performances of the distributions are statistically (and not just numerically) different. An example of the use of the bootstrap is given in the next section.

Dealing with uncertainties
The CI for a given quantile, associated with a return period (1/F), can be obtained from several methods. In this study, the CI were estimated by means of a bootstrap resampling technique. Generally speaking, this technique consists of generating a large number (Nb) of bootstrapped samples by drawing at random (and with replacement) from the original Pre-extrem dataset. Each bootstrapped sample has the same length of the original series. The parameters of Eqs. 1, 3 and 6 are then obtained from each of these Nb bootstrapped samples. The new parameters are used to generate Nb quantiles (q-boot) associated with a given return period. The q-boot values are put into ascending order and the 95% CI for each (real) quantile value is obtained from these Nb q-boot distributions. By following Park et al. (2001) Nb was set to 2000. Further information regarding this last procedure can be found in several studies such as Park et al. (2001), Khaliq et al. (2006) and Wilks (2011).

Results and Discussion
The results obtained from the acf function indicated the presence of no serial correlation in the daily, 2-day and 3-day Pre-extrem amounts. As one may note (Table 1) all acf coefficients remained within the white noise limits. Thus, it may be assumed that using the parametric distributions to assess the probability of occurrence of the Pre-extrem data will result in loss of no significant information due to the presence of serial correlation (Maia et al., 2007). This lack of temporal persistence also allowed us to apply the MK test in its original form (Önöz & Bayazit, 2011, among many others).
The MK test also indicated the presence of no significant trend in all series (Table 1). Concerning the assumptions associated with the use of the Extremal Type Theorem, the results obtained from the acf function and MK provided evidences in favor of the hypothesis that the daily, 2-day and 3-day Pre-extrem data are independently and identically distributed. In addition, it may be assumed that the probabilistic structure of the series under evaluation did not significantly change over the years of 1890-2012. This lack of climate trends in annual extreme rainfall data is consistent with the results found by Sansigolo (2008) for the weather station of Piracicaba-SP (1917, by Blain (2011) also for the weather station of Campinas (1890-2010) and by Blain & Moraes (2011) for several locations of the State of São Paulo .

Goodness-of-fit tests and parameter estimates
In spite of its abovementioned flexibility, the WAK cannot be used to assess the probability of occurrence of the 3-day rainfall amounts. As one may easily note (Table 2), the lowest (observed) 3-day value violates the range of possible values in which the WAK can be computed (Eq. 2; 66.91 mm). Thus, no further consideration concerning the use of the WAK to describe the probabilistic structure of the 3-day rainfall amounts could be carried out.
As described in the study of Wilks (2011), among many others, the null hypothesis of the KS-L test state that the Preextrem data were drawn independently from the parametric distribution under analysis. In this view, the results obtained  from the KS-L test (Table 3) indicated that the WAK failed to correctly describe the probabilistic structure of the daily rainfall amounts because the KS-L value is greater than the critical value (the Ho could not be accepted). The Ho of the KS-L was accepted for all other cases (Table 3). Therefore, from statistical inference theory, it may be inferred that the KAP, GEV-mle and GEV-lmom provide a satisfactory assessment of the probability associated with the daily, 2-day and 3-day Pre-extrem data. The WAK distribution was only suitable for assessing the probability of occurrence of the 2-day Pre-extrem data.
After evaluating the results depicted in Figure 1, it is worth emphasizing that the values of the RMSE and dmod agree with Table 3. Kolmogorov-Smirnov/Lilliefors test applied to annual maximums of daily, 2-day and 3-day rainfall amounts obtained from the weather station of Campinas, State of São Paulo, Brazil (1890-2012 the results of the KS-L test in the sense that the Kappa ( Figure  1G, H and I), GEV-mle ( Figure 1A, B and C) and GEV-lmom ( Figure 1D, E and F) can be used to assess the probability of occurrence of the Pre-extrem values. For instance, the highest RMSE value, observed for the daily Pre-extrem series, is 3.50 mm [2.05 : 5.10 ] (Figure 1 A). Although this value is the largest error shown in Figure 1, it is considerable lower than minimum value of this daily series (Table 1). Moreover, all dmod values are larger than 0.90.
The distributions under analysis are (A to F) General Extreme Value distribution (GEV) and (G to I) Kappa. The parameters of the GEV distribution were estimated by using the methods of l-moments (lmom) and maximum likelihood (mle). The parameters of the Kappa distributions were estimated by using the lmom. The modified Willmott's index of agreement (dmod) and the square root of the mean squared error are also shown By considering the 95% CI presented in Figure 1, one may easily note that the performances of the KAP, GEV-mle and the GEV-lmom for describing the daily and 3-day series were similar to each other. In addition, for the 2-day Pre-extrem amounts, all distributions ( Figure 1B, E and H and Figure 2) have shown similar performances.
The distribution under analysis is the Wabeby. The parameters were estimated by using the l-moments method. The modified Willmott's index of agreement (dmod) and the square root of the mean squared error are also shown.
From the results depicted in Figures 1 and 2 Table 3, it may be indicated that the GEV-mle, GEV-lmom and the KAP distributions have presented the best performance in all cases. Finally, it may also be pointed out that the lmom was as good as the mle for providing reliable estimates of the GEV parameters. This last result agrees with Wilks (2011) in the sense that the mle and lmom provide equivalents estimates of the GEV parameters for moderate to large data samples.

and in
The quantiles or rainfall amounts associated with different return periods are shown (Table 4) as an example of a practical use of the results obtained in this study. After Coles & Pericchi (2003) stated that uncertainties in extreme value analysis are substantial, they indicated that it is essential to design with such uncertainties taken into account. The results shown in Table 4 agree with this statement especially for those quantiles associated with high return periods (100 and 500 years).
The results presented in this study agree with previous results found by Blain (2011) in the sense that the GEV-mle can be used to assess the probability of occurrence associated with daily Pre-extrem data obtained from the weather station of Campinas. Nevertheless, this study indicated that the GEVmle can also be used to evaluate the probability of occurrence associated with 2-day and 3-day Pre-extrem data obtained from the same location. In addition, it was also observed that the 4-parameter KAP distribution and the GEV-lmon are as good as the GEV-mle for assessing these aforementioned probabilities.

Conclusions
1. The annual maximums of daily, 2-day and 3-day rainfall amounts obtained from the weather station of Campinas, State of São Paulo, Brazil (1890-2012 have shown the presence of no climate trends and serial correlations. 2. Considering the parametric distributions evaluated in this study (Wakeby, Kappa and Generalized Extreme Value), it was verified that the Kappa distribution, with its parameters estimated through the method of l-moments, and the Generalized Extreme Value, with its parameters estimated through both methods of maximum likelihood and l-moments, have shown the best performance in fitting the abovementioned series.
3. The use of the two abovementioned distributions to assess the probability of occurrence of daily, 2-day and 3-day rainfall amounts in Campinas is recommended.