A regional similarity-based approach for sub-daily rainfall nonparametric generation Uma abordagem de similaridade regional para geração não paramétrica de precipitação subdiária

Rainfall time series with high temporal resolution are required for estimating storm events for the design of urban drainage systems, for performing rainfall-runoff simulation in small catchments and for modeling flash-floods. Nonetheless, large and continuous sub-daily rainfall samples are often unavailable. For dealing with the limited availability of high-resolution rainfall records, in both time and space, this paper explored an alternative version of the k-nearest neighbors algorithm, coupled with the method of fragments (KNN-MOF model), which utilizes a state-based logic for simulating consecutive wet days and a regionalized similarity-based approach for sampling fragments from hydrologically similar nearby stations. The proposed disaggregation method was applied to 40 rainfall gauging stations located in the São Francisco and Doce river catchments. Disaggregation of daily rainfall was performed for the durations of 60, 180 and 360 minutes. Results indicated the model presented an appropriate performance to disaggregate daily rainfall, reasonably reproducing sub-daily summary statistics. In addition, the annual block-maxima behavior, even for low exceedance probabilities, was relatively well described, although not all expected variability in the quantiles was properly summarized by the model. Overall, the proposed approach proved a sound and easy to implement alternative for simulating continuous sub-daily rainfall amounts from coarse-resolution records.


INTRODUCTION
Rainfall time series with high temporal resolution are often required for estimating storm events for the design of urban drainage systems, for performing rainfall-runoff simulation in small catchments and for modeling flash-floods. Nonetheless, large and continuous sub-daily rainfall samples are, most often than not, unavailable. As compared to daily records, the fine-scale rainfall measurements are usually smaller in length and more affected by missing data, which may severely compromise the statistical inference of sub-daily precipitation block-maxima . In addition, the sub-daily pluviograph network is usually not sufficiently dense for properly describing the spatial variation of short-duration storm bursts and, as a result, it is often unfeasible to obtain a full picture on the time-space dependence structures of sub-daily rainfall events (Li et al., 2018).
As a means for dealing with the limited availability of high-resolution rainfall records, in both time and space, a number of techniques for obtaining continuous sequences of short-duration precipitation events from the disaggregation of daily rainfall amounts has been discussed in the literature. According to Sharma & Mehrotra (2010) and Diez-Sierra & del Jesus (2019), the most common models for this purpose encompass: (i) parametric point-process models using Poisson clustering, such as the Newman-Scott and the Bartlett-Lewis rectangular pulses (Koutsoyiannis & Onof, 2001;Yusop et al., 2014); (ii) the self-similarity approach, which resorts to random cascades (Kang & Ramirez, 2010;Müller & Haberlandt, 2018) or fractals/ multi-fractals (Serinaldi, 2010); and (iii) nonparametric models based on resampling techniques, such as the method of fragments Li et al., 2018).
Point-process models are reported to preserve most summary statistics of sub-daily rainfall amounts and properly reproduce extreme events (Diez-Sierra & del Jesus, 2019). However, they require the estimation of a relatively large set of parameters (Pui et al., 2012), which should increase modeling uncertainty. In addition, such models were originally designed for generating fine-scale rainfall sequences and, for employing them for disaggregation, it becomes necessary to introduce "adjusting procedures" for correcting the simulated rainfall amounts (e.g., Koutsoyiannis & Onof, 2001). On the other hand, random cascade models, although parsimonious (Sivakumar & Sharma, 2008), are frequently unable to properly reproduce the observed variance, the extreme rainfall amounts and the temporal dependence structure of the full sequence of sub-daily rainfall, which, to some extent, constrains their application for flood risk assessment and design purposes (Diez-Sierra & del Jesus, 2019). Moreover, both approaches often require sub-daily rainfall information at the target site itself for parameter estimation, which makes it difficult to properly validate models and to use them for simulation in locations where only daily records are available.
Nonparametric models, in turn, may provide flexible means for simulating rainfall disaggregation, despite of their limited ability for extrapolation and reproduction of quantiles' variability (Costa et al., 2015). First, such models are able to reasonably reproduce samples' summary statistics and, to some extent, the block-maxima behavior for the finer-scale precipitation increments (Lu & Qin, 2014;Li et al., 2018;Diez-Sierra & del Jesus, 2019). In addition, the very nature of nonparametric models avoids establishing assumptions regarding both the underlying distribution of the stochastic process from which sub-daily rainfall stems and the functional relationship between continuous and coarse-aggregated rainfall amounts, which may reduce modeling uncertainty (Pui et al., 2012;Sivakumar, 2017). Finally, once some degree of hydrological similarity between gauging stations is established, the aggregation of regional information for inference in the target site is straightforward Li et al., 2018), which may be useful for circumventing the requirements of fine-scale information at the referred location.
One of the earliest nonparametric techniques comprises the so-called method of fragments, originally proposed by Svanidze, in 1961, for stochastic generation of streamflows (Svanidze, 1980). Since its introduction, the method of fragments has been widely used for precipitation and streamflow disaggregation (see, for instance, Li et al., 2018 and references therein). In the context of rainfall disaggregation, the method of fragments is often coupled with the k-nearest neighbors approach discussed in Lall & Sharma (1996), which resorts to the bootstrapping technique (Efron, 1979) for sampling rainfall amounts from the most similar events in a given sample (for the sake of brevity, the outlined coupled model is hereafter termed KNN-MOF).
Despite its widespread use, the KNN-MOF approach, as originally designed, presented two theoretical underpins: (i) the continuity at the boundaries, i.e., in successive days, is not warranted, which may entail an unsuitable reproduction of clustered rainfall events; and (ii) high-resolution rainfall information is required at the target site. For addressing these issues, in this paper we explore an alternative version of the KNN-MOF model, which utilizes a state-based logic for simulating consecutive wet days (Sharma & Srikanthan, 2006) and a regionalized similarity-based approach for sampling fragments from hydrologically similar nearby stations Mehrotra et al., 2012), eliminating the needs of sub-daily records in the location of interest. By resorting to the outlined rationale, one may expect to obtain a well-suited disaggregation model, which could provide a reasonable framework for estimating short-duration design events from coarse-resolution records.
The remainder of the paper is organized as follows. The next section presents the material and methods, which encompass a brief description of the study area and the dataset, and describes with detail the simulation algorithm, the measure of hydrological similarity between rainfall stations and the strategy for model validation. Next, the case study is discussed. Finally, the conclusions and envisaged research developments are addressed.

Study area and dataset
The proposed disaggregation method is applied to a set of 40 rainfall gauging stations in sub-basins 40 and 41, which are included in the São Francisco river catchment, and in sub-basin 56, which is a component of the Doce river catchment. Sub-daily rainfall data were made available by the Brazilian Geological Service (CPRM), which operates the selected gauging stations. The data set presents data resolution of 1 minute and the average size of the utilized sub-daily time series is 6 years, with average yearly precipitation amounts varying from 800 to 1650 mm. Those days with missing values were simply discarded from the analysis.
The physical attributes of the catchments used in this study were obtained from the digital platform of the Brazilian Water National Agency (Agência Nacional de Águas, 2019). To identify the spatial distribution of those stations a GIS platform is employed.
For validating the disaggregation model, we utilized the rainfall gauging station with the longest record of sub-daily rainfall in the study area, namely, the Caeté gauging station (code ANA 01943010), located in the sub-basin 40. High-resolution rainfall data in such a station encompasses 28 years, spanning from 1990-2017, with average yearly precipitation amount of 1260 mm. However, 5 years of this data were removed due to high number of missing values. Figure 1 presents the study area, highlighting the target site of Caeté.

Stochastic simulation of sub-daily precipitation with the KNN-MOF model
In this paper, we utilize a variant of the k-nearest neighbors resampling method developed by Lall & Sharma (1996), which includes a regional similarity-based approach, as discussed in Westra et al. (2012), for synthetically generating sub-daily rainfall sequences. The rationale behind the proposed approach is that, for the set of wet days, one may consider that the joint distribution of some attribute of sub-daily rainfall, associated to a particular duration, and the daily rainfall amounts is the same at the target site and at a set of nearby stations, entailing hydrologic similarity. Rainfall attributes must be defined as to uncover as much information as possible regarding the full sequence of sub-daily rainfall increments. Once hydrologic similarity is established, logistic regression models are employed in order to identify the neighboring gauging stations with higher probability of similarity with the (ungauged) target site, on the basis of the catchments' physical attributes, which, in turn, allows information transfer for rainfall disaggregation.
The resampling algorithm is based on the random selection of sub-daily rainfall "fragments" from hydrologically similar nearby gauging stations. The so-called "fragments" are defined as vectors containing the full sub-daily rainfall sequence, aggregated with respect to a particular duration and normalized by the total rainfall amount observed in a given day. For defining hydrological similarity, we employ the following hydrological attributes, as computed for the durations of 60, 180 and 360 minutes, for each wet day : • The maximum rainfall intensity, for each duration, normalized by the total rainfall amount for that day; • The fraction of zeros, i.e., for a particular duration, the percentage of intervals with no rainfall; and • The timing of maximum rainfall intensity, which accounts for the time of the day in which the rainfall burst of maximum intensity occurs.
For assessing the hydrological similarity between two stations, we resort to the two-dimensional two-sample Kolmogorov Smirnov test (2D2S KS), as proposed by Fasano & Franceschini (1987), at the significance level of 5%. Such a test provides evidence as to whether the joint distributions of each of the hydrological attributes and the daily rainfall amounts time series are statistically similar in a pair of gauging stations. The 2D2S KS test statistic is given by the following equation (Press & Teukolsky, 1988): (2) Figure 1. Location of the gauging stations for this case study.
A regional similarity-based approach for sub-daily rainfall nonparametric generation 4/16 N 1 and N 2 denote, respectively, the number of sample points in samples 1 and 2, D is the maximum difference of the integrated probabilities along the four quadrants defined over a given sample point (see Press et al., 1992 for details), and r is the complement of the average Pearson correlation coefficients, as estimated from both joint distributions. For computing the term in the right-hand side of Equation 1, one must numerically approximate the following expression : For additional information on the 2D2S KS test, the reader is kindly referred to Press & Teukolsky (1988) and Press et al. (1992). The similarity between two stations is summarized by the 2D2S KS test result, which comprises a binary response u, with value of 1 being attributed to statistically similar stations and a value of 0 otherwise.
For modeling the binary response, as obtained from the hydrological attributes' similarity evaluation, a logistic regression model, which utilizes physical catchments' attributes as predictors, is established. Here, we consider a set of 4 potential predictors, namely, the absolute differences in latitude, in longitude, in elevation and in the product between the differences in latitude and longitude, which is intended to be a proxy for the Euclidian distance between the two gauging stations. In formal terms: in which, with vector β encompassing the regression coefficients and vector ν the previously outlined predictors. Equation 4 provides the probability that two rainfall gauging stations are similar, conditioned on the physical attributes of the catchments. For disaggregating the daily rainfall amounts in the target site, a set of nearby stations with higher values of ( ) = P u 1 must be retained. It should be noted that no formal procedures for variable selection are employed in regression analysis since we intend to evaluate the influence of each of these predictors with respect to the considered rainfall durations.
Following Westra et al. (2012), we select the neighbors on the basis of the average probability of similarity stemming from the three hydrological attributes. Seasonality is dealt with by assessing similarity for different periods defined along the water year (Oct.-Sep.), which, for our purposes, were defined as to coincide with the 4 seasons in the southern hemisphere. After identifying the most similar nearby stations, the disaggregation algorithm can be applied to the daily rainfall time series at the target site. In short, the algorithm works as follows: 1. The sequence of daily rainfall amounts at the target site is derived from the historical record; 2. Sequences of daily rainfall are obtained in the nearby stations identified with the previously described regional similarity-based approach. Dimensionless fragments (fr), for each season, day and duration, are then estimated with Equation 6: denotes the sub-daily rainfall amount in the gauging station s, aggregated along a duration m, in a given day i; 3. For each wet day, similar daily rainfall depths are sought at the nearby stations. For accounting for seasonal characteristics in sub-daily sequences, the referred daily amounts are only evaluated within a moving window of ±15 days (Pui et al., 2012;Westra et al., 2012), centered at the day of interest. Furthermore, for allowing the proper simulation of successive wet days, the sampling strategy is also conditioned on the previous and next days' states (wet or dry); 4. For a given day, all sample points along the period of record located inside the moving window and with similar wetness previous and next-day states are extracted from the set of neighbors. Such sample points are then ranked with respect to the absolute deviation from the daily amount at the target site, and the k-nearest neighbors are kept for resampling; 5. A random number is drawn from a standard uniform distribution and compared to the probability expressed by Equation 7: in which j denotes the rank of each of the k retained fragments. This allows the random selection of a fragment; and 6. The sub-daily rainfall amount is then computed by multiplying the fragment, as selected in previous step, by the daily precipitation depth at the day of interest.
After identifying the disaggregation models, for each season and duration, we utilize a rainfall gauging station with measurements encompassing a long period-of-record for assessing their predictive abilities. At each sub-daily scale, a set of 100 model runs  is simulated at the target site with the disaggregation algorithm. The performance of the models is then evaluated by comparing the mean values of a set of summary statistics for wet days, which encompasses the mean, the variance, the coefficient of skewness and the quantile curve, for the sub-daily temporal scales, with the observed counterparts. Also, for the quantile curves, the model ability in synthetizing the variability of the maximum rainfall amounts is assessed through the computation of the 95% confidence intervals.

RESULTS AND DISCUSSION
After obtaining the sub-daily rainfall data in the 40 rainfall gauging stations, for the durations of 60, 180 and 360 minutes, the hydrological attributes -maximum intensity, fraction of zeros and timing of the maximum intensity -were estimated for each site.
Next, the 2D2S KS test was employed for assessing the similarity of pairwise stations with respect to each of the hydrological attributes and each season.
Results from the 2D2S KS test demonstrated that, for the maximum intensity and time of maximum intensity attributes, the model presented an appropriate performance, identifying similarity for a number of stations. On the other hand, for the remaining attribute, i.e., the fraction of zeros, similarity was not readily established, except for the period spanning from June to August, which may be partially ascribed to the small number of wet days in these months and to the relatively small precipitated amounts in such days. A similar problem was reported by Westra et al. (2012) when disaggregating daily rainfall amounts in stations spread across Australia, which may, to some extent, point to a flawed proposal of hydrological attributes for defining similarity. As a result, the fraction of zeros attribute was not considered for subsequent analyses.
The next step was identifying the logistic regression models, which relate the binary response of each hydrological attribute to the catchments' physical characteristics. Table A1 (Appendix A) presents a summary of the elevation, latitude, longitude for the gauging stations, including those of the target site. These attributes were used as predictors in the regression model (which does not consider the Caeté station). Regression coefficients were then estimated by means of the iteratively reweighted least squares approach. Table 1 presents the regression coefficients for each of the hydrological attributes for the duration of 60 minutes and for each season. Results for the remaining durations are presented in Tables A2 and A3 also in Appendix A. One may notice that the difference in longitude is the physical characteristic with higher coefficient values (significant for most durations at the 5% level), which suggests that, for the study area, it is the most influencing physical factor to identify similarity.
For each hydrological attribute, the probability of a given nearby station of being similar to the target site was computed with the obtained regression model. For identifying the donor sites, the mean value of the two probabilities was estimated in each season, and the best ranked stations were kept for rainfall disaggregation. Table A4 (Appendix A) presents the averaged probabilities, for the duration of 60 minutes, with respect to the Caeté station. As it can be seen, the probabilities may vary depending on the season, presenting higher values in less rainy periods, making it easier to correlate the attributes. Also, Table A4 indicates the need to separate the model by seasons, as different stations were selected for each one.
After defining potential donor sites, we proceeded to the calibration of the disaggregation model. This process requires the specification of the k-nearest neighbors (k), the number of donor gauging stations (S) and the maximum absolute deviation of the daily rainfall amounts with respect to that observed in the target site. The summary statistics for the wet days were calculated to evaluate model performance.
The first parameter to be estimated was the number of neighbors, k. For assessing the impact of this parameter in the model, we fixed the remaining parameters, using S = 10 and the deviation = 10%. Results are depicted in Table 2, which also presents the relative error between the observed records and the mean simulated value (i.e, the mean value of the predictions' ensemble), for each statistic, for the 60 minutes duration. Positive values indicate overestimation and negative values underestimation.
From Table 2, one may observe that, in this case study, the usage of the number maximum of neighbors does not have a substantial impact in the model, due to the difficulties to obtain more than 10 days that are within the maximum absolute deviation of 10%. Considering the number of data and gauging stations that Westra et al. (2012) had available for their work, this parameter would be essential in order to limit the number of days to resample. Therefore, considering the mentioned aspects, this parameter will not be employed for further analyses and durations. The second parameter to be estimated was the maximum absolute deviation of the daily rainfall amounts. For the duration of 60 minutes and  A regional similarity-based approach for sub-daily rainfall nonparametric generation 6/16 keeping S = 10, we tested deviations of 5% and 10%, in order to assess which one would entail a better performance. Results are presented by Table 3. One may notice that decreasing the deviation to 5% did not result in substantial changes in the statistical properties and also caused a reduction of model variability, as there are less fragments selected to be resampled. In view of this result, we defined the deviation of 10% for the remaining simulations.
Lastly, to estimate the number of gauging stations (S) to be used for disaggregation of the target site, we considered the values of 5, 10 and 15. Results for the 60 minutes duration are presented in Table 4 and those for the remaining durations are located in Tables A5 and A6 in Appendix A. One may observe that, for the duration of 60 minutes, the model was more efficient in reproducing the observed values when 15 stations were used in the disaggregation. However, when increasing the duration to 180 or 360 minutes, the model was able to disaggregate the series at the target site with less neighboring stations.
Considering the results obtained from the calibration step, which are the maximum number for the k-nearest neighbors, the maximum absolute deviation of 10% and number of donor gauging stations (S) of 15 for 60 minutes duration and 5 for durations of 180 and 360 minutes, the final simulations were carried out. 100 model runs were performed and, on each run, the observed daily time series was disaggregated. As the fragments are randomly chosen (step 5 of the disaggregation algorithm), one may expect different outcomes at each run, which, in turn, should converge for the observed summary statistics on average terms. Figure 2 presents the results of each one of the 100 model runs for the 60 minutes duration, considering the assessed statistical    Figure 3, it represents the deviations from the observed values for each iteration, for the duration of 60 minutes. For other durations, the reader is kindly referred to Figures A1 to A4 located in Appendix A. For all statistical properties and durations, the model resulted on deviations of ±10%, which indicates an appropriate performance. It is important to highlight that, for the duration of 60 minutes, the model can better reproduce the observed mean and variance, as compared to the other durations, which tend to overestimate them. Other important aspect is that this work considered considerably less data than Westra et al. (2012) to resample the fragments for each simulation and yet presented reasonable results. The mentioned authors had available about 250 years of data among the selected gauging stations for resampling, while, in this work, the selected number of stations resulted in approximately 90 years of data for 60 minutes duration (15 stations with 6 years each) and 30 years of data for 180 and 360 minutes duration (5 gauging stations with 6 years each).
To evaluate the model ability in synthetizing the variability of the annual maximum rainfall amounts, the quantile curves were constructed considering the 95% confidence intervals (CI), as shown in Figures 4 to 6. The hollow points indicated the recorded values from the gauging station, whereas the continuous line represents the median of the 100 simulations and the dashed lines represent the confidence bands.
Overall, the results indicate an appropriate model performance to estimate the annual maximum rainfall amounts, mainly for smaller durations and higher exceedance probabilities.   A regional similarity-based approach for sub-daily rainfall nonparametric generation 8/16 However, the model tends to overestimate the rainfall amounts with lower exceedance probability for the 60 minutes duration and underestimate them for the 360 minutes duration.
It is worth highlighting that, for the durations of 60 and 180 minutes, most of the observed values fall inside the confidence bands, which suggests that, to some extent, the resampling algorithm is able to capture the expected estimation variability that stems from sampling errors. This was also observed by Westra et al. (2012) and Li et al. (2018). For the 360 minutes duration, despite enclosing most observed points, the model worsens its performance, with narrow confidence intervals, which are not being able to resample with variability. Further analysis must be made in order to define a method to compensate and deal with this aspect.
In order to visualize the spatial distributions of the selected gauging's obtained from the similarity probabilities, Figures A5 and A6 (Appendix A) were produced. The stations employed to disaggregate the target site daily time series for the duration of 60 minutes are indicated for each season, with the rank of each station located previous to its code. As the observed behavior is similar for the other durations, we presented in this paper the figures only for the mentioned duration.
As it can be seen, for the period of December to February (DJF) the model selects the geographically nearest stations, which are probably affected by similar meteorological systems, as the most similar -even though the climate conditions are not expected to vary considerably due to the relatively small size of the study area. The behavior for the periods between March to May (MAM) and September to November (SON) are quite similar and, in these cases, the model selects more distant stations -yet still in the same catchment of the target site. For the period between June and August (JJA), which is the driest season of the year, it is easier to identify similarity due to the small number of rainy days. This fact explains the reason for the selection of distant stations instead of the close ones. As such, regarding the rank of similarity probability, it is interesting to observe that the highest ranked stations are often distant from the target site, indicating the hydrological attributes have higher influence over the physical counterparts for some stations.

CONCLUSIONS
This article presented a method for disaggregating daily rainfall, which is based on an alternative version of the KNN-MOF model that utilizes a state-based logic for simulating consecutive wet days (Sharma & Srikanthan, 2006) and a regionalized similarity-based approach for sampling fragments from hydrologically similar nearby stations . Similarity identification was made by considering hydrological attributes, such as maximum intensity and timing of the maximum intensity, and physical attributes, such as absolute difference in latitude, in longitude, and in the product between the differences in latitude and longitude.
Considering the hydrological attributes for identifying stations' similarity, the maximum intensity and timing of the maximum intensity performed appropriately. However, due to difficulties regarding the fraction of zeros, such an attribute was not employed in this study. Concerning the physical attributes, the one with higher influence was the longitude, which was identified through the logistic regression model coefficients.
The similarity probabilities, which are the results of combining both hydrological and physical attributes, presented different values depending on the season of the year, with higher values in less rainy periods, making it easier to correlate the attributes. Also, it is important to consider seasonality, as different stations are selected for each season.
After selecting potential donor sites, we proceeded to the calibration of the disaggregation model, identifying which parameters resulted in better model performance, considering the statistics summary for the wet days. The "optimum" model was achieved considering the non-specification of a number for the k-nearest neighbors, the maximum absolute deviation of 10% and number of gauging stations (S) of 15 for 60 minutes duration and 5 for durations of 180 and 360 minutes.
The simulations indicated an appropriate model performance when comparing to the statistics obtained for the validation gauging station of Caeté. Regarding the annual maximum rainfall amounts, the quantile curves constructed show the most part of observed values are between the 95% interval. However, the model presents narrow intervals for higher rainfall durations, which implies difficulties in representing the expected quantiles' variability for such durations. Also, the model is able to appropriately reproduce the quantiles of low exceedance probability for the 180 minutes duration, indicating its capacity to reproduce extreme rainfall amounts which are important for estimating storm events for the design of drainage systems and performing rainfall-runoff simulations. For the 60 minutes duration there is an overestimation and for the 360 minutes duration there is an underestimation of such quantiles.
We acknowledge that further analyses must be made in order to compensate and deal with the aforementioned difficulties and a broader comparison with other methods with distinct theoretical foundations and complexity levels, such as that of the disaggregation coefficients (Teodoro et al., 2014), is necessary for allowing some generalization of the proposed approach. However, the obtained results are deemed promising for addressing the problem of rainfall disaggregation, particularly in what concerns extreme events, when only daily records are available.

Authors contributions
Milena Guerra de Aguilar: Literature review, generation and discussion of results and paper writing.
Veber Afonso Figueiredo Costa: Contribution in the paper conception, results discussion and paper writing. A regional similarity-based approach for sub-daily rainfall nonparametric generation 12/16   Figure A1. Annual maximum, mean, variance and skewness results for each one of the 100 model runs for the 180 minutes duration. Dots denote model predictions and the solid line the mean value of the observed statistics. A regional similarity-based approach for sub-daily rainfall nonparametric generation 14/16