INTRODUCING A CAUSAL PAR(p) MODEL TO EVALUATE THE INFLUENCE OF CLIMATE VARIABLES IN RESERVOIR INFLOWS: A BRAZILIAN CASE

The Brazilian electricity energy matrix is essentially formed by hydraulic sources which currently account for 70% of the installed capacity. One of the most important characteristics of a generation system with hydro predominance is the strong dependence on the inflow regimes. Nowadays, the Brazilian power sector uses the PAR(p) model to generate scenarios for hydrological inflows. This approach does not consider any exogenous information that may affect hydrological regimes. The main objective of this paper is to infer on the influence of climatic events in water inflows as a way to improve the model’s performance. The proposed model is called “causal PAR(p)” and considers exogenous variables, such as El Niño and Sunspots, to generate scenarios for some Brazilian reservoirs. The result shows that the error measures decrease approximately 3%. This improvement indicates that the inclusion of climate variables to model and simulate the inflows time series is a valid exercise and should be taken into consideration.


INTRODUCTION
The Brazilian electricity generation is mainly composed by hydroelectric plants owned by multiple players; the Brazilian National Interconnected System (NIS) integrate power and trasmission lines from the South, Southeast, Midwest, Northeast and part of the North region. Only 1.7% of the country's electricity production capacity relies outside the NIS, in small isolated systems located mainly in the Amazon region [21].
Planning the Brazilian energy sector means, basically, making decisions about the dispatch of hydroelectric and thermoelectric plants, with the risk of financial losses or energy rationing, as happened such strong in 2001 [25], affecting almost all Brazilian regions. There are, basically, two approaches to predict the natural inflow: physical and statistical models, where the first one includes the rainfall-runoff hydrological model and the second covers data-driven methods such as time series. To perform monthly forecasts and simulation the classical Periodic Autoregressive (p) model [29], has been widely used. This type of model adjusts the series using the estimated parameters of the historical data [13], and does not consider any exogenous information that could affect the hydrological regimes and, consequently, the electricity generation. Several examples of the application of PAR( p) can be found in the literature, see for example [16] who generate and forecast monthly inflows of the Ganges River with the PAR model. A quick literature search also returns an extensive set of univariate models applied to reservoir inflows (e.g. [27]).
However, models that incorporate explanatory variables, specifically climate variables, have been only recently developed. Some examples of these works in chronological order, are: [32] studied the relation between the Sea Surface Temperature (SST) pattern over the Atlantic and the Pacific Oceans and the variability of water availability in the Amazon Basin; [8] introduced a procedure for further conditioning the inflow probability distributions by considering the recent measurements of climatic variables, [28] developed a semiparametric approach for forecasting inflow at multiple gaging locations on climate precursors; [14] applied Artificial Neural Network to model the complex relationship between inflow and climatic phenomenon; [10] investigated the potential of the Bayesian dynamic modelling approach through an application to forecast a hydrologic time series using relevant climate index information; [11] included climate information in a periodic auto-regressive model in order to provide monthly inflow forecasts for 54 hydropower sites in Brazil; and [12] applied Bayesian Dynamic Models to model and forecast the water inflow for Brazilian hydropower reservoirs, and concluded that the incorporation of climate variables such as rainfall precipitation and El Niño variables, increased the accuracy of both modelling and prediction.
Considering the context and the relevance of the subject, this paper aims to investigate and propose methodological advances in time series modelling and stochastic simulation to generate synthetical hydrological scenarios to model the Brazilian hydrothermal dispatch.

109
The proposed method, called causal PAR(p), intends to include, exogenously, the meteorological phenomena influence by adjusting a Dynamic Regression (Autoregressive Distributed Lags Model) to the traditional PAR(p) residuals and climate series, incorporating the regression coefficient in the traditional modelling. Several preliminary analyses will be concieved in order to obtain a greater understanding of the involved series and its applicability into the proposed method.
Besides this main goal, which is to present a new approach to model the inflow series, this work also intends to generate synthetical scenarios that better represent the original historical series as well as confidence intervals for out-of-sample forecasts.
The paper is organized as follows: section 2 presents the theoretical background, with a brief description of the Periodic Autoregressive model, the mathematical details of the proposed approach (causal PAR(p)), and a short description of the Bootstrap technique, used to generate synthetical scenarios. Section 3 describes the input variables, their connection with reservoir inflows the underlying system and the Brazilian case study; section 4 presents a exploratory analysis of the available variables. The results from the traditional model and the proposed approach are shown on section 5, and section 6 sums up the work and summarizes its conclusion and final remarks.

THEORETICAL BACKGROUND
The proposed framework to include the climate series behaviour in the PAR(p) modelling, named as causal PAR(p), is composed using two main techniques: PAR(p) and Dynamic Regression Model. These techniques are briefly presented in what follows.

Traditional PAR(p)
Periodic Autoregressive models can also be referred to PAR(p), where p corresponds to the order of the model, in other words, the number of autoregressive terms identified in the model. The PAR(p) model fits to each series period an AR(p) model. Generally, p is a vector, p = [ p 1 , p 2 , . . . , p 12 ], where each element provides the order for each period (month, in case of monthly series). For more details about the PAR(p) model see [7]. PAR(p) model is mathematically described as follows: where, Z t is the seasonal series of period S.
S is the number of periods (S = 12 to monthly series).

Dynamic Regression Model
A Dynamic Regression model can be described by the following general equation: Where a t is the dependent (or output) variable; X i,t are the explanatory (or inputs) variables; are finite order lag polynomials of degrees r, s 1 , . . . , s k , respectively, and ε t is assumed to be white noise. Such a formulation can be seen in [23] and for more mathematical details see [5].

Causal PAR(p)
The step by step sequence to perform the causal PAR(p) follows the three steps described below: 1. Estimate the traditional PAR(p) model; 2. Find the significant explanatory variables by applying the Dynamic Regression model; and 3. Estimate the causal PAR(p) model.
In the first step the traditional PAR(p) is estimated and the residuals series are extracted to be used in the second step, that fits the Dynamic Regression model. In this step, the exogenous variables are one of the inputs and the residuals generated by the traditional PAR( p) are the outputs. Then, the coefficients obtained in step two and the exogenous variables are inserted in the mathematical formulation of the traditional PAR(p), generating the causal PAR(p).

111
Where, X i,t are the exogenous variables, β i the coefficient and e t is the white noise.
Therefore the innovation at the proposed approach is to introduce exogenous variables in the PAR(p) modelling. Further studies would consider the development of some kind of periodic transfer function.
Afterwards, the synthetical scenarios generation are carried out applying the Booststrap technique to the residual series as well as in obtaining confidence interval. The methodology is detailed in the next section.

Synthetical scenarios' generation and confidence interval
In order to simulate synthetical scenarios the Bootstrap technique is used. This technique, first developed by [6], is a method of sampling with replacement the observations of a random sample that allows the assessment of the variability of an estimator. Such technique generates as many new samples as one wishes, called "Bootstrap sample", usually with the same size of the original sample. In the context of time series, there are basically two ways to apply this technique: Bootstrap in the residuals and the method called Moving Blocks [9].
In this paper Bootstrap is used in the residuals, due to the fact that for all the studied series it is possible to extract residuals, thus ensuring the hypotheses of a random sample (i.e., independent and identically distributed observations) a required condition to apply Bootstrap.
A formal description of the method is: consider R 1 , ..., R N the random sample and B the number of residuals series to be generated. B residual series are drawn with replacement from the original sample, generating B Bootstrap residuals series of size N each: Anderson et al. [2] computed Gaussian prediction intervals for the estimated parameters of a periodic autoregressive moving average (PARMA) models. In the present study the 5% confidence interval is developed from the simulated series by calculating the quantiles 2.5% and 97.5%.

DATA ANALYSIS
In Brazil, the fifteen major river basins have an installed capacity of approximately 92 Giga Watts [GW]. The Parana river basin has the highest hydroelectric potential (around 54 GW). In these major rivers there are around 190 hydroelectric power plants currently in operation [22], and these plants operate in a cascade scheme. As an illustration Figure 1 displays the cascade scheme for Paranaíba and Grande basins with 19 reservoirs, represented by triangles, and 15 hydroelectric with no reservoir (circles).
This way decisions taken at the upstream reservoirs will impact the inflow of the downstream reservoirs. The historical data available is the natural inflow 4 for each reservoir, on a monthly The climate variables were selected trough a literature search [17]. However, the authors intend to broaden this search for other variables to be considered in the model building. The selected variables are basically related to El Niño and the Sunspots numbers; the variables representing El Niño/La Niña phenomenon are: Southern Oscillation Index (SOI), Equatorial SOI, Niño variations and ONI.

INTRODUCING A CAUSAL PAR(p) MODEL TO EVALUATE THE INFLUENCE OF CLIMATE VARIABLES
The Southern Oscillation Index (SOI) is calculated based on the difference between the atmospheric pressure at sea level in the regions of Tahiti (in the Western Pacific) and Darwin (Australia, Western Pacific). The Equatorial SOI measures the average difference of atmospheric pressure at sea level between two regions centered on the equator: Indonesia and East Pacific. The sea surface temperature anomaly is a proxy for El Niño and La Niña. Thus, this index is used to classify and quantify such phenomena in four Niño regions: Niño 1+2, Niño 3, Niño 4 and Niño 3.4, defined as follows by NOAA in 2014. Through the location of the Niño regions it is possible to conclude that regions Niño 1+2 and Niño 3 better identify temperature anomalies for the Eastern Pacific Ocean sea surface and region Niño 4 for the Western Pacific. The Niño 3.4 region is centralized in the Pacific, which allows a better understanding of anomalies across it. Therefore, currently the Niño 3.4 region is the official measure used to represent SST (Sea Surface Temperature). However, depending on the study, other regions may be a better alternative.
The threshold for the normal state of this index is between −0.5 • C and +0.5 • C. The criteria commonly used to define an El Niño phenomenon consists of five consecutive averages of SST anomalies above +0.5 • C. Similarly, for La Niña, this criterion remains, but now the SST anomaly should be below −0.5 • C. The time series for all regions are provided by NOAA, on a weekly and monthly basis. For the weekly series, the data starts in 1990 and ends at the current week and the monthly data starts at 1982 and ends up at the current month.
The Oceanic Niño Index (ONI) measures the average sea surface temperature anomalies for the region Niño 3.4, removing the existing warming trend on it. The ONI uses multiple periods based on thirty years to perform the calculation for five successive years. The base periods uses a fifteen year interval, for the lower and upper bound, for example, for 1950 and 1955 the base period considered starts in 1936 and ends in 1965. The El Niño and La Niña are indicated in the same manner as the SST index, the time series is monthly and is provided by NOAA.
Sunspots comprehends solar surface regions of high magnetic field, which have considerably lower temperature than its surroundings and thus appears as a dark area. The magnetic flux amount on the sun surface varies over eleven year periods, known as sunspot and solar cycles. During this cycle there is a minimum and a maximum magnetic flux, which is not only difficult to identify the sunspots and but also they appear almost all the time. The cycle reachs its maximum aproximately every eleven years, therefore the observed cycle duration corresponds to eleven years.
The daily and monthly number of sunsposts calculation is accomplished with the Relative Index American number of sunspots. This index indicates the solar phenomenon occurrence taking into account their relationship with the Earth, including geomagnetic variations and ionosphere effects. The Solar Division from American Association of Variable Star Observers coordinates the data collection program and the analysis of this phenomenon. Thus, the National Geophysical Data Center (NGDC), provides the historical data from the number of sunspots per month since 1749 and forecasts have been produced until December 2019.

EXPLORATORY ANALYSIS
In order to provide a better understanding of the series used in this research, this section presents an exploratory analysis that includes: evolution of the series through time using line's graphic; variables' probability distribution using the histogram; descriptive statistics and correlation between the reservoir series and the climate variables.
Also, to reduce the problem dimension (currently with 192 reservoir inflow series), were selected the eight basins that present the biggest correlation with each one of the climate variables. As an example, for the SOI Standard variable the reservoir Curuá-una was the one with the highest correlation among all 192 possibles. Table 1 presents the selected hydroelectric power plants and the Pearson's correlation 5 with the climate variables. Note that the biggest values were found between the reservoirs Quebra Queixo and Monjolinho with variable Niño 1+2 (0.369 and 0.353, respectively). This does not mean that in the climate variables selecting process this particular variable necessarily will be picket up in the model formulation. This is due to the fact that the variable selection is carried out via a backward process that selects only the significant variables.
The descriptive statistics for each one of the selected reservoirs is shown in Table 2. See that Balbina contains the higher values for mean, median, standard deviation and quartiles. On the other hand Lajes is the one with the lowest values.
Observing Figure 2, which shows the reservoirs time series plot, it is possible to identify, a strong periodicity in all inflow series.
In the first and third columns of Figure 3 the histogram of each reservoir series is presented. It seems quite clear that there is a strong asymmetry indicating that the data might follow a Weibull distribution, which is the distribution that usually models natural events. The second and fourth columns shows the Autocorrelation Functions (ACF) and from them it is possible to confirm the presence of periodicity by observing significantly picks each six months.
Moving now to the climate variables, in Table 3    One important point to check is the presence of significant correlations between the various El Niño series. Table 4 shows the Pearson correlations 6 , and as expected, there is a negative correlation between the Equatorial SOI and the SOI index with the others, as for those, El Niño is represented by negative values, while for other variables, this same phenomenon is represented by positive values. Also note that ONI and Niño 3.4 show the highest correlation, a consistent result with their concepts, since both measure the sea surface temperature anomaly in the same region. Figure 4 shows each climate variable evolution through time, highlighting that for the variables representing El Niño events, the extreme values coincide with stronger El Niño or La Niña occurrences. It is also observed the strong 11 year cycle presented in the time series of Sunspots.  The last analysis presented is a map that contains the localization of the selected reservoirs and the climate variables measured area, that is where the El Niño/La Niña proxy is measured. From   In Figure 6 the SOI regions are represented, i.e.: the measurement areas for the SOI Standard and the Equatorial SOI index, together with the reservoir localization.

RESULTS
In order to evaluate the proposed methodology, the causal PAR(p) approach was applied to model the eight reservoirs inflow series previously presented. The set of possible input variables are the variables representing El Niño/La Niña and Sunspots phenomenon. Since the proposed approach has been developed to improve the current model it will be shown comparisons between the traditional PAR(p) and the causal. The last result obtained is the generation of scenarios with both approaches and a comparison with real values.
In order to carry out an out-of-sample analysis, the last year of the available data (2014) was omitted from the initial analysis. So, the modelling set is the data ranging from January 1982 up to December 2013, resulting in 384 observations or 32 years. It is worth mentioning that all the results were generated using the R software [26] and the packages pear [15], TSA [4], tseries [30] and dynlm [33].
To estimate the models performance, three metrics were considered: the Mean Absolute Scaled Error (MASE), given by the Mean Absolute Percentage Error (MAPE), given by and the Root Mean Squared Error (RMSE), given by where Y t is the observed value, F t is the fitted value and n is the sample size.

Modelling reservoirs inflow series
As detailed before, the first step of the model is to fit the traditional PAR(p) to each of the reservoir inflow series. There are many ways to select the best PAR(p) model, i.e. the model's order, [31], for example, applied a genetic algorithm, while [19] uses the Bootstrap technique. Using information criterion is a common approach to select the best model, more specifically Akaike's Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC) [1]. Both criteria are based on the likelihood function and a penalization term for the number of parameters in the model, however AIC is better for prediction as it is asymptotically equivalent to cross-validation, while BIC is best for explanation as it allows consistent estimation of the underlying data generating process. Since the ultimate goal is to analyse the out-of-sample model performance, in this work it is used the AIC criterion to select the PAR(p) orders.
To better explain the causal PAR(p) model fitting step by step, Monjolinho reservoir is used as example series in what follows. In Table 5 the model orders obtained and their respective weights are displayed.
From the previous step the residuals series is obtained and with these numerous tests is performed using dynamic regression to select the best climate variables. The idea is to select only the significant climate variables (p-value < 0.05) in the fitting exercise. By that all the explanatory variables are considered as candidates and, those that show no statistical significance are eliminated one at a time until the final formulation is reached.
Using again the Monjolinho, Table 6 displays the causal variables selecting process, where the "x" means the variable is selected and "-" means it was removed. In Model 1 all the variables are included since the backward approach is performed, in Model 2 the most insignificant variable were removed from the previous Model and so on until the last model where only significant variables remain: SOI Standard, Equatorial SOI and Equatorial SOI (-1). The "(-1)" means that a lagged variable is used.
Thus, Table 7 shows which variables were selected to join the causal PAR(p) modelling, for each reservoir inflow series. Note that some of them consider lags and also it is possible to visualize that the variable Equatorial SOI was the most used, whereas Niño 1+2 and Niño 3 were not at all picked by any reservoir.

Moving to
Step 3 of the model building, the causal model must be estimated in accordance with the explanatory variables selected. In Table 8 the three error measurements statistics obtained to both, PAR(p) and causal PAR(p) models are displayed. Note that the inclusion of climate variables leads to a small reduction in the error statistics, i.e., 1% in the MASE, 5% in the MAPE and 2% in the RMSE.

Generating synthetical scenarios
The synthetical scenarios generation is performed, with the residuals series, using the Bootstrap technique. First, the BDS test [3] confirms that the residuals obtained by both, traditional and causal PAR(p) models are white noise (p-value greater than 0.05).
The first procedure in this section is the generation of in-sample scenarios and the respective confidence intervals for the period 1982 up to 2013. Table 9 shows the error statistics comparing the estimated means of the scenarios generated using the traditional PAR(p) and the causal PAR(p) with the actual value of the series. Note that, in general, the causal PAR(p) slightly outperforms the traditional PAR(p): 3% in MASE, MAPE and RMSE.
Next, for the generation of synthetical reservoir inflow scenarios, one should ensure that all simulated values are positive; as a negative value does not make sense. The initial approach does not guarantee this constraint, so the strategy employed was to resample with replacement the corresponding residuals until a positive value for it is obtained, similar to the proposal of [20].     These results revealed that the causal PAR(p), produces better fit and coverage in comparision with the traditional PAR(p). This results is aligned and corroborates the findings of [12]. The improvement provided by the proposed approach compared with Lima's work relies on the fact that in this study eight climate variables were tested for each reservoir while Lima's study includes only two variables (Niño 3.4 and precipitation).

CONCLUSIONS AND FINAL REMARKS
The PAR(p) is currently used to estimate the operational costs of the Brazilian hydro-thermal optimal dispatch, and it does not take into account any possible exogenous information that may possibly affect the hydrological regimes and therefore the power generation. Causal PAR(p) is a novel approach that proposes the inclusion of explanatory variables exogenously, in the PAR(p) model, using Dynamic Regression. In this study, the inclusion of climate variables related to the El Niño and Sunspots variable were considered to model the reservoirs inflow series. This new approach was able to generate better results when compared with the current PAR(p).
Another important contribution of this research is the literature review made to identify possible climate variables that could influence the hydrological regime, resulting in eight possible variables; seven related to the El Niño phenomenon and one representing the Sunspots.
The major contribution of this paper is the upgrading of the current approach by including climate variables, that result in better results than the traditional PAR(p). Depending on the location of the reservoir different climatic variables were considered in the model fitting.
Finally, the causal PAR(p) was used to generate the synthetic scenarios and its comparison with the measurements each month of 2014 shows that the averages generated with the simulated scenarios reproduce quite well the future values.
In general the proposed approach performs slightly better than the currently used methodology. As a further improvement of this research one can envisage the construction of a model similar to a periodic transfer function, in such way that the explanatory variables are related to the original series and not only with the residuals generated by the PAR(p). Besides that, the search for other variables that could influence the hydrological regime is another possible improvement. Another possible future step is to aggregate the reservoirs by major basins, aiming to analyse if the climate variables can further improve the results by increasing the range of inflow data.
As a final word, although the causal PAR(p) produces only minor improvements this result is very important as it may result in different electricity prices when introduced in the optimization model used in the Brazilian hydrothermal dispatch.