Simulation of the occurrence of drought events via copulas

This study presents a method based on Archimedean and Gaussian copulas to simulate the occurrence of hydrological droughts. The droughts were characterized by theory of runs for four threshold levels and six univariate probability distributions were evaluated to represent the probabilistic behavior of their severities and durations. The Akaike Information Criterion was used to select the better univariate probabilistic models, while their hypotheses of goodness-of-fit to the historical data were evaluated by Kolmogorov-Smirnov test. Based on the univariate probability distributions of severities and durations, Archimedean and Gaussian copulas were used in the bivariate analysis of the drought events. The proposed method proves to be a useful tool to simulate the occurrence of drought events, preserving the laws of probability of the severities and durations and the dependency between both.


INTRODUCTION
Droughts are complex phenomenon and their impacts vary from region to region and generally are associated with periods whose precipitation levels are insufficient. The droughts are perceived in different ways in various parts of the world. According to Hudson & Hazen (1964), this diversity of perceptions can be observed in the periods used to characterize the droughts, e.g.: (a) any period of 6 days or more without precipitation in Bali, Indonesia; (b) 2 years or more without precipitation in Egypt; (c) any year on the Nile River without flooding, regardless of any precipitation level.
The deficiency of precipitation is considered as the primary cause of the droughts while the effects on water resources system of a region are considered as secondary causes or drought impacts, e.g., low river streamflow and low soil moisture. Thus, if the interest is to study the causes of the droughts, attention should be focused on precipitation, but if the interest is to study the impacts of the droughts, attention should be focused, e.g., on the streamflow or soil moisture data (Dracup et al., 1980a). In this context, the droughts can be classified according to the nature of the deficits (Dracup et al., 1980a;Beran & Rodier, 1985;Wilhite & Glantz, 1985;Mishra & Singh, 2010): (a) meteorological droughts, when a lack of precipitation occurs during a period over a region; (b) agriculture droughts, when the soil moisture is insufficient to support crops; (c) hydrological droughts, when the surface and underground water resources are inadequate to support their respective uses and; (d) socio-economical droughts, when the economic activities and people's lives are impacted by the lack of water supply .
The impact of the droughts depends on the several factors, e.g., social vulnerability of the region, the weather, the multiple water uses across the river basin or the region, etc. Generally, a drought is derived from low-impact cumulative events over time, as opposed to floods, earthquakes and cyclones, which are high-impact extreme events (Cunha et al., 1983).
The droughts can be treated from two viewpoints: conceptual and operational. From a conceptual viewpoint the definition of the drought is in relative terms, e.g., "a dry period", "a severe drought", providing little guidance. While, from the operational viewpoint, the definitions use the frequency analysis of the severities or durations for a given historical period (Mishra & Singh, 2010). Thus, to study the droughts' impacts, a clear definition is needed, e.g., which indicates when a drought event starts and ends, and what its impact.
In 1967, Yevjevich published the paper entitled "An objective approach to definitions and investigations of continental hydrologic droughts", where he introduced a threshold procedure denominated of "theory of runs" to identify and characterize hydrological droughts. Since then, the threshold procedure has been used in drought studies, regardless of its nature (meteorological, hydrological and agricultural).
The theory of runs allows defining the main characteristics of a drought event. According to Yevjevich (1967), from a threshold level it is possible to define the characteristics of a drought event, e.g., its duration and its severity. The duration corresponds to the period in which the interest variable (e.g., streamflow or precipitation) is below the threshold level, while the severity is the cumulative volume of the variable associated with the duration.
The main variables used to describe the drought events, severities and durations, are strongly correlated and this characteristic is not considered in univariate analyze. In this context, the multivariate analyze allows us to consider the correlation between these variables.
Thus, the multivariate analysis of droughts has been the focus of attention by many researchers. Shiau & Shen (2001) and Salas et al. (2005) obtained joint probability distribution of severities and durations based on the product of the severities' conditional probability distribution (for a given duration) by the durations' probability distribution. In this case, to obtain the conditional probability distributions, a larger data samples are required and it is not always available in the historical data. Thus, the researchers used techniques such as generation of time series to increase sample size.
The copulas, introduced by Sklar in 1959, have been an alternative used by several researchers for the multivariate analysis of droughts. Copulas are functions that link univariate probability distributions of random variables preserving the dependency between them (Sklar, 1959;Nelsen, 2006;Joe, 1997).
The concept of copula has started to be used in hydrology from 2003 with the first application in rainfall modeling (Michele & Salvadori, 2003). Important contributions on the subject also can be found in ,  and Genest & Favre (2007). In the study of the meteorological droughts can be mentioned, among others, the works of Michele & Salvadori (2003), Shiau (2006) She & Xia (2018) and Azam et al. (2018).
The objective of this study is to present a method to simulate the occurrence of drought events using a bivariate analysis of the severities and durations based on Archimedean and Gaussian copulas.

Study and area data
In this study, the characteristics' droughts of the Camanducaia River, located in the southeast region in Brazil, were evaluated. The Camanducaia River basin is located in the Unit of Management of Water Resources 5 (UGRHI 5) composed of the river basins of the Piracicaba, Capivari and Jundiai Rivers. The UGRHI 5 is part of the Tietê River basin located in the State of São Paulo (Figure 1).
The Camanducia River basin has a drainage area of 1,032 km 2 , where 860 km 2 are located in the State of São Paulo, and the other part located in the State of Minas Gerais. The Camanducaia River has no storage reservoir and is responsible for the total or partial water supply of 11 cities in the region (Figure 2).
The Camanducaia River is also the main stream affluent of the Jaguari River, which is an important component of water supply in the region and is part of the Cantareira System that supplies water to the Metropolitan Region of São Paulo with more than 10 million people.  The monthly streamflow data of the streamflow gauging station 3D-002 located in Monte Alegre do Sul city, Latitude 22º 41' 44" and Longitude 46° 40' 25" were used. The streamflow data has 876 monthly records from January 1944 to December 2016 (73 years). Only eight monthly streamflow records required estimations. Figure 3 shows historical data of average monthly streamflow from the streamflow gauging station 3D-002 located in Monte Alegre do Sul city. The main statistics of the streamflow data for the streamflow gauging station 3D-002 are shown in Table 1

Characterization of drought events
Let 0 Q the threshold level or truncation level, a drought event starts when the streamflow ( ) 0 When the threshold level is applied in the streamflow time series, a sample containing multiple drought events is generated, where each event is characterized by its severity ( k S ) and duration ( k D ). Figure 4 shows the application of the theory of runs for the definition and characterization of the drought events considering a constant threshold level equal to 0 Q .
where k D is the duration for the k -drought event, , initial k t and , end k t are the start and end time of the k-drought event, and t D is the time step of t (daily, weekly, monthly or yearly). For the severity of a drought event is calculated by: where k S is the severity of the k-drought event, , initial k t and , end k t are the start and end time of the k-drought event, ( ) Q t is the streamflow at time t, and 0 Q is the threshold level. The main variables (severity, duration, number of events) and, mainly, the probability distributions of the severities and durations depend on the threshold level. The threshold level may also influence the dependence structure of the drought events.
In this study, the drought events were characterized for four constant threshold levels corresponding to 98th, 95th, 92th and 90th percentiles of the flow-duration curve based on the historical data of average monthly streamflow, also called P98, P95, P92 and P90, respectively. The threshold levels are represented by low values of streamflow with high probability of being exceeded. Figure 5 shows the scatter plot of the severities and durations of the drought events for the Camanducaia River considering the threshold levels P98, P95, P92 and P90 for the monthly streamflow data from 1944 to 2016. Table 2 shows the main characteristics of the drought events of the Camanducaia River obtained by the theory of runs, as the number of events, the average (Ave), maximum (Max) and minimum (Min) values of the severities and durations and the concordance measures between both, represented by Kendall's t and Spearman's r.
The use of the theory of runs to obtain drought events can generate events that are very close, and therefore, dependent on each other. According to Zelenhasic & Salvai (1987), the dependence between these drought events can be deal by arbitrary simplifications, such as not considering events with insignificant severity and/or considering subsequent events as a single event, provided there is a relatively short time interval between them where ( ) 0 Q t Q > . The Box-Pierce and Ljung tests were used to evaluate the hypothesis of independence of the observations in series for the severities and duration of the drought events from the historical data considering a 95% significance level. According to the p-values of the tests, all greater than 0.05, the hypothesis of independence   for the observations in series cannot be rejected (Table 3). Thus, the use of the simplifications proposed by Zelenhasic & Salvai (1987) was not necessary.

MARGINAL DISTRIBUTION OF PROBABILITY OF THE SEVERITIES AND DURATIONS
In this study, six-univariate probability distributions for severities and durations of the drought events of the Camanducaia River were tested: 2-parameters Gamma (GAM2), Exponential (EXP), Weibull (WBL), Log-Normal (LOGN), Generalized Pareto (GP) and Generalized Extreme Value (GEV). Table 4 shows the univariate probability distributions, their density functions with their parameters.
The parameters of the univariate probability distributions were estimated by Maximum Likelihood Estimation (MLE) method. The selection of the better univariate probabilistic model and the verify of its hypothesis of goodness-of-fit to the historical data were made by AIC and KS test, respectively. The KS test's p-value was calculates by Monte Carlo simulation with a significance level of 95%. Thus, for p-values greater than 0.05 the hypothesis of goodness-of-fit cannot be rejected.

COPULAS
Copulas are functions that link univariate probability distributions to form a multivariate probability distribution and it were first suggested by Sklar in 1959.   Based on the Corollary 1, the copula is a multivariate probability distribution, with univariate probability distributions, which contains the information on how the variables depend on each other.
The copulas are divided into families, which try to model the most varied types of dependency between variables, such as: Elliptical (Normal, or Gaussian, and t); (b) Archimedean (Clayton, Gumbel, Frank, Ali-Mikhail-Haq and Joe); (c) Extreme Value (Gumbel, Galambos, Husler-Reiss, Twan and t-EV); and (d) and others (Plackett and Farlie-Gumbel-Morgenstern). These copulas can be found in R package "copulas". Others packages provide many more alternatives.
A common fact in the drought studies is the reduced number of events observed from historical data. The reduced number events can indicate a false perception of water safety, since the observed events from historical data may not be critical in terms of durations, severities or both. Thus, as the severities and durations of the droughts are highly correlated, the copulas are presented as alternatives to simulate the occurrence of events, preserving the correlations and the probability laws these variables.
In this study, six copulas were tested for the bivariate analysis of the severities and durations of the drought events for the Camanducaia River, being five Archimedean copulas, Clayton (CLA), Frank (FRA), Gumbel (GUM), Joe (JOE) and Ali-Mikhail-Haq (AMH), and one Elliptical copula of the Gaussian type (GAU). Table 5 shows the copulas used in this study, their functions, and their parameters.
The copulas parameters were estimated by the Maximum Pseudo-Likelihood (MPL) method. The selection of the better bivariate model and the verify of its hypothesis of goodness-of-fit to the historical data were made by AIC and Goodness-of-Fit test (GOF), respectively. The GOF test's p-value was calculate by a parametric bootstrap according to Genest & Rémillard (2008) with a significance level of 95%. Thus, for p-values greater than 0.05 the hypothesis goodness-of-fit cannot be rejected.
Note: a, b, l, m, s, and x parameters. Table 5. Copulas tested to model the dependence between severities and durations of the drought events for the Camanducaia River.  x 2 xy y u u 2 1 1 2 2

Copula Copula's function and parameter ranger
According to Sklar (1959), if H is a joint probability distribution with margins 1 F and 2 F , then there exists a twodimensional copula C such that for all ( )

Univariate analysis of the severities and durations
Figures 6 and 7 compare the six theoretical cumulative distribution functions (CDF) with the empirical CDF calculated from the observed data of severities and durations of the drought events for the Camanducaia River from 1944 to 2016. Table 4 and tested to the severities and durations of the drought events. The p-values shown in Table 6 indicate that the hypothesis of goodness-of-fit of all probability distributions cannot be rejected for the severities considering all threshold levels under study. Figure 6 corroborates with the p-values of Table 6, showing that the theoretical CDFs are very close to the empirical CDFs. Thus,   Table 6, the probability distributions with better fit to the severities of the drought events for the Camanducaia River are LOGN for the P98, P95 and P92 threshold levels and GP for the P90 threshold level.

Tables 6 and 7 show the AIC values and the KS test's p-values for each probability distribution presented in
Due to the discrete nature of the durations, not all probability distributions represent well its probabilistic behavior (Figure 7). For the durations, it was not possible to obtain a suitable fit for the GEV probability distribution for all threshold levels under study. Excluding GEV, Table 7 shows that the hypothesis of goodness-of-fit of all the probability distributions cannot be rejected for the P95 threshold level. For the P98, P92 and P90 threshold levels, the hypothesis of goodness-of-fit cannot be rejected only for the EXP and GP probability distributions. According to the AIC values, the probability distributions with a better fit to the durations of the drought events for the Camanducaia River are EXP for the P98, P92 and P90 threshold levels and LOGN for the P95 threshold level. Table 8 shows, for each threshold level, the probability distributions selected and their estimated parameters to represent the probabilistic behavior of the severities and durations of the drought events for the Camanducaia River.
Thus, for the threshold levels under study (P98, P95, P92 and P90), the LOGN and EXP probability distributions are the most adequate to represent the probabilistic behavior of the severities and durations of the drought events for the Camanducaia River, with the exception of the P90 threshold level where the GP probability distribution was more adequate to the severities.    Table 2 shows the main characteristics of the severities and durations of the drought events for the Camanducaia River. Kendall's t and Spearman's r indicate a significant dependence between the severities and durations, suggesting the joint modeling of these variables. Based on the univariate probability distributions in Table 8, the copulas presented in Table 5 were tested to represent jointly the severities and durations of the drought events for the Camanducaia River. Table 9 shows the AIC values and the GOF test's p-values for each copula tested. According to the AIC, the copulas with better fit are JOE for the P98, P92 and P90 threshold levels and GAU for the P95 threshold level.

Bivariate analysis of the severities and durations
According to the GOF test's p-values (Table 9), the hypothesis of the severities and durations being represented jointly by FRA, GUM, JOE and GAU copulas cannot be rejected for all threshold levels under study. The hypothesis of severities and durations being represented jointly by a CLA copula cannot be rejected for the P90 threshold level and by an AMH copula for the P95 threshold level. The results show that asymmetric copulas with upper tail dependency and symmetric copulas without tail dependency can represent suitably the joint behavior of the severities and durations of the drought events for the Camanducaia River.
Table 10 shows, for each threshold level, the estimated copulas' parameters selected (with better fit) to represent jointly the severities and durations of the drought events for the Camanducaia River. Figure 8 shows the severities and durations observed during the drought events for the Camanducaia River, their joint probabilities (isolines) and the 2000 events simulated (pseudo-observations) based on the copulas in Table 10. Table 11 shows the 95% confidence intervals for the severities and average durations, and for the concordance measures between the severities and durations (Kendall's τ and Spearman's r). The confidence intervals were calculated from Monte Carlo simulations, where 100 samples, each with 2000 drought events, were generated from the Table 10 copulas. For the confidence intervals, the Shapiro-Wilk test was used to verify the normality hypothesis of the data of severity and average duration, Kendall's τ and Spearman's r, obtained from Monte Carlo simulations. The normality hypothesis cannot be rejected, because all p-values were greater than 0.05. Table 12 shows the average (Ave), maximum (Max) and minimum (Min) values of the severities and durations for the 2000 simulated drought events for the Camanducaia River (pseudo-observations) and their respective Kendall's τ and Spearman's r.    Box-Pierce and Ljung tests were used verified the hypothesis of independence of the data in series for the simulated severities and durations. According to the p-values of the tests, all greater than 0.05, the hypothesis of independence of the simulated data in the series cannot be rejected.
Comparing Tables 2, 11 and 12, in some cases, the expected values of severities and durations observed are not contained in confidence intervals calculated from the Table 10 copulas. It is occurs because the MLE used in this study to estimate of the probability distributions parameters, consists in maximizing the log-likelihood, unlike the Moment Matching Estimation (MME) method that consists in equalizing theoretical and empirical moments.
Thus, based on the values of the Tables 2, 11 and 12 the method presented in this study is able of simulating the occurrence of drought events for the Camanducaia River located in the southeast region of Brazil, preserving the laws of probability of the severities and durations and the dependency between both.

DISCUSSION
Considering the threshold levels under study, the results presented in the study show that the hypothesis of goodness-of-fit of the six probability distributions (GAM2, EXP, WLB, LOGN, GP and GEV), cannot be rejected to represent the behavior probabilistic of the severities for the Camanducaia River. For the durations, this fact occurs only for the EXP and GP.
With respect to the probability distributions with better fit to the historical data, we have for the severities the LOGN for P98, P95 and P92 threshold levels and GP for the P90 threshold level. For the durations, we have EXP for P98, P92 and P90 threshold levels and LOGN for the P95 threshold level.
It is important to note that due to the discrete nature of durations not all probability distributions tested in this study to represents suitably the probabilistic behavior of the variable. For the GEV probability distribution, it was not possible to obtain a suitable fit for all threshold levels under study. Researchers such as Michele et al. (2013) proposed a randomization technique to circumvent this problem, generating continuous samples of durations.
With regard to the joint representation of the probabilistic behavior of severities and durations of the drought events for the Camanducaia River, the use of the copulas JOE and GUM, asymmetric copulas with upper tail dependency, FRA and GAU copulas, symmetrical copulas without tail dependency, cannot be rejected for the all threshold levels under study. The CLA and AMH copulas can represent jointly the severities and durations only for the P95 and P98 threshold levels, respectively.
The results of the bivariate analysis show that, for the P98, P92 and P90 threshold levels, the joint representation hypothesis of the severities and durations cannot be rejected for JOE copula. For these same threshold levels, the second copula with the better fit is a GUM, followed by GAU and FRA copulas, respectively. For P95 threshold level, the GAU copula is the one with better fit, followed by GUM, FRA and JOE copulas, respectively.
Thus, it is important to emphasize that if the interest is in the extreme events, the asymmetric copulas with upper tail dependency are more suitable. If the extreme events are not relevant, the symmetric copulas without tail dependency can be used.
The limited number of historical observations of drought events can indicate a false perception of water safety, not indicating the possible risks of supply failure for human or other uses. The method presented in this study showed a strong potential to subsidize the risk analysis and, consequently, to assist in the planning and management of droughts in a river basin located in the southeast region of Brazil.
The results this study show that the coupling of the univariate probability distributions allowed the generation of a large number of drought events, with several combinations of severities and durations in comparison to the events observed from the historical data.
For the threshold levels under study, most observed drought events have durations of 1 to 2 months (from 70% to 86%). This fact may lead decision-makers to a false perception of water safety. For example, for the P98 threshold level, considering the monthly streamflow data from 1944 to 2016, 80% of the drought events have a duration of 1 and 2 months, with only two events with durations of 3 and 7 months and severities of 0.89 and 4.94 m 3 /s x month, respectively. However, when considering the sample of events generated by copula (in this case a Joe copula), a number of 348 events can occur to this durations with severities ranging from 0.004 to 14.49 m 3 /s x month. For the P95 threshold level, the observed data present only 3 drought events with durations of 3 to 9 months and severity ranging from 0.74 to 7.97 m 3 /s x month. Considering the events generated, 318 events may occur with durations of 3 to 9 months and with severity ranging from 0.09 to 32.99 m 3 /s x month. For the threshold level P95, the copula with better fit was to Gaussian.
In summary, the method presented in this study successfully reaches its goal, presenting itself as a useful tool for simulated the occurrence of drought events for the Camanducaia River located in the southeast region of Brazil.