Estimation of monthly rainfall missing data in Southwestern Colombia: comparing different methods

ABSTRACT Historical rainfall records are relevant in hydrometeorological studies because they provide information on the spatial features, frequency, and amount of precipitated water in a specific place, therefore, it is essential to make an adequate estimation of missing data. This study evaluated four methods for estimating missing monthly rainfall data at 46-gauge stations in southwestern Colombia covering 1983-2019. The performance of the Normal Ratio (NR), Principal Components Regression (PCR), Principal Least Square Regression (PLSR), and Artificial Neural Networks (ANN) methods were compared using three standardized error metrics: Root Mean Square Error (RMSE), Percent BIAS (PBIAS), and Mean Absolute Error (MAE). The results generally showed a better performance of the nonlinear ANN method. Regarding the linear methods, the best performance was registered by the PLSR, followed by the PCR. The results suggest the applicability of the ANN method in regions with a low density of stations and a high percentage of missing data, such as southwestern Colombia.


INTRODUCTION
Proper water management depends mainly on the quantity and quality of hydroclimatological information.Knowledge of the spatial distribution and seasonality of rainfall is relevant for the economy and society of a region (Morales-Acuña et al., 2021).Reliable and complete data is required to take action on actions such as flood and drought risk prediction and assessment, rainfall forecasting, dry spells, desertification, climate variability studies, and water resource planning, among others (Canchala et al., 2019;Kuok et al., 2010;Miró et al., 2018).Knowing the processes involved in the water balance of a region is essential due to the influence of climate change on precipitation worldwide with increasingly variable patterns, according to the projections of the Intergovernmental Panel on Climate Change (2022).
Southwestern Colombia recurrently presents a lack of information related to climate records due to aspects such as the low density of gauge stations and social and public order problems that make it difficult to access them compared to other regions of Colombia (Canchala et al., 2019).This problem affects hydroclimatological analyzes for water resource management and planning purposes.On the other hand, this region presents a complex climatology, due to the influence of climatic phenomena such as El Niño Southern Oscillation (ENSO), Madden-Julián Waves, Chocó low-level Jet, among others (Canchala et al., 2020a;Cerón et al., 2021b;Puertas & Carvajal, 2008;Rueda & Poveda, 2006;Sedano-Cruz et al., 2013;Torres, 2012), coupled with its proximity to Ecuador and influence of the Intertropical Convergence Zone (ITCZ).
In this sense, the main objective of this research is to evaluate and compare four methods for estimating missing monthly rainfall data at 46-gauge stations in southwestern Colombia with historical data between 1983-2019, using four performance metrics.Hence, this article is arranged as follows: Section 2 describes the study area and data.Section 3 describes the methodology used.Section 4 includes the results and discussion, and finally, section 5 shows the conclusions.

Study area
The study area is in Southwestern Colombia (Department of Nariño).It has an approximate area of 33,268 km2 that occupies a geostrategic position because the Andes Mountain range crosses its limits, the Tropical Pacific Ocean.It registers significant topographic changes in small distances (Canchala et al., 2020b), which provides the regional diversity of reliefs, thermal floors and microclimates (See Figure 1).Furthermore, it is in the south of the Colombian Biogeographic Chocó; recognized as a biodiversity hotspot that shelters approximately 3% of the world's plant species (Poveda & Mesa, 2000), characterized by registering three important rainfall core that range between 7000 mm.year -1 to 9000 mm.year -1 (Cerón et al., 2021b).Likewise, it registers three natural regions where 52% is made up of the Pacific region, with high rainfall that ranges between 3000-7000 mm.year -1 , 40% belongs to the Andean region, where the presence of moors, volcanoes, and relief stand out.Rugged with rainfall that ranges between 1000-2000 mm.year -1 , and the remaining 8% belongs to the Amazon jungle region where there is high biodiversity of communities and species and rainfall that varies between 3500 mm.year -1 -4500 mm.year -1 (Canchala et al., 2019(Canchala et al., , 2020a;;Cerón et al., 2021b).

Rainfall dataset
Monthly rainfall time series with 37 years of observation  from 46-gauge stations located in southwestern Colombia were used.The data was provided by Instituto de Hidrología, Meteorología y Estudios Ambientales (IDEAM) of Colombia, and its spatial distribution is shown in Figure 1.

METHODS
Four methodologies for estimating missing monthly rainfall data in Southwestern Colombia were used: Normal Ratio (NR), Principal Components Regression (PCR), Principal Least Square Regression (PLSR), and Artificial Neural Networks.(ANN).The first three methods are linear, and the last corresponds to a nonlinear methodology.Initially, the rainfall records were organized, and the Del Castillo-Gómez et al.

3/16
exploratory and confirmatory data analysis was carried out, where the behaviour of the rainfall was determined through descriptive statistical measures, as well as the percentage of missing data in the 46 study stations.Subsequently, homogeneous rainfall groups were formed using statistical cluster analysis and similarity measures for the three linear methods.Finally, missing data were imputed, and method performance was assessed using four performance metrics: Root Mean Square Error (RMSE), Percent bias (PBIAS), Mean Absolute Error (MAE), and Pearson correlation coefficient (r).The best method for estimating missing data was selected based on these results.In the flowchart of Figure 2, the methodology used to estimate the missing data in the different study stations is registered.

Pre-processing data
Preliminary analysis of the rainfall time series was performed through exploratory and confirmatory data analysis.This analysis consists of applying graphical and quantitative statistical tools to identify patterns and anomalies in the data and observe the behaviour of rainfall in the selected stations (Zhao et al., 2011;Castro et al., 2012).
Furthermore, the consistency of the data was verified using the Spearman correlation coefficient (ρ), used to quantify the degree of correspondence between the stations and identify anomalous gauge stations.ρ ranges from -1 to 1, with the maximum (minimum) value being the perfect positive (negative) correlation.The statistical significance of the correlations was determined using the T-student test with a significance level of  =0.05.
For the NR method, the Spearman correlation coefficient was used as a quantitative criterion for selecting neighbouring stations for the estimation of missing data since it is not multiple estimation methods.Stations with values of ρ ≥ 0.5 were selected, a criterion used by Cruz-Roa & Barrios (2018), establishing that the stations with correlation coefficients constitute a homogeneous hydrological cluster.

Regionalization
Defining homogeneous rainfall regions is essential for hydrological applications, particularly for regions with high spatio-temporal variability in rainfall patterns (Zhang et al., 2016).Therefore, cluster analysis (CA) was used to classify stations with similar characteristics and properties.Estimation of monthly rainfall missing data in Southwestern Colombia: comparing different methods For clustering purposes to regionalize the monthly rainfall, we use the HCA with square of Euclidean Distance for measure the observations, represented by Equation 1, and Ward's method for the linkage rule.According to Hershey et al. (2010), Darand & Reza (2014), Gois et al. (2020) and Silva (2020) this fusion results in the most distinctive clusters.: where i x and i y are elements of comparison, which in this case represent the rainfall between stations x and y at the i-th instant of time.
Subsequently, hierarchical groups were formed by Ward's method, allowing the grouping of gauge stations with similar behaviours of rainfall.Finally, each region's spatial coherence and rainfall patterns were verified.
This study chose Ward's method because it has been the clustering technique most used in climate regionalization (Fazel et al., 2018;Canchala et al., 2022).Overall, it outperforms other algorithms in terms of segregation, providing relatively dense groups with minimums within-group variance.Ward's method recognizes the minimum variance within groups, joining elements with a minimal sum of squares between them (Hervada-Sala & Jarauta- Bragulat, 2004;Santos et al., 2015).

Normal Ratio (NR)
It is a non-multivariate conventional method initially used by Paulhus & Kohler (1952) and is currently used in numerous hydrological studies (Adilah & Hannani, 2021;Moraes Cordeiro & Blanco, 2021;Burhanuddin et al., 2016;Caldera et al., 2016;Arango et al., 2012;Puertas & Carvajal, 2008).It is used when there are missing data in a specific month for a weather station with neighbouring stations without missing records for the same months as the station of interest and with similar topographical features.For this, the expression of Equation 2 is used: where; n is the number of pluviometric stations with continuous record data close to station x, in which the record will be completed, x P is the rainfall of station x during the month to be completed, i P is the rainfall of stations 1 up to n during the month to be completed, x N is the multiannual monthly rainfall of station x, and i N the multiannual monthly average rainfall of stations 1 to n.

Principal Components Regression (PCR)
Principal Component Analysis (PCA) or Empirical Orthogonal Functions (FOEs) is a statistical technique used in climate research as a tool to analyze meteorological series with high spatio-temporal dimensionality and noise (Carvajal-Escobar & Marco, 2005;Taylor et al., 2013).Together with regression, they are multivariate techniques that allow data from multiple monitoring sites to be incorporated into a statistical model while minimizing the effects of non-correlation between the measured variables (Shlens, 2014), which is why it is widely used in hydroclimatology to extract linear relationships between variables in a data set (Lu & Hsieh, 2003), reduce dimensionality and avoid multicollinearity (Cerón et al., 2021a;Ocampo-Marulanda et al., 2022).
Principal Components Regression is a method introduced by Massy (1965) that allows transforming the p original explanatory variables into a new set of k uncorrelated variables, called Principal Components (PCs), with k p < .These variables are later used as explanatory variables in a multiple linear regression model to estimate the value of the response variable (Wyatt et al., 2020).
A subset of PCs was selected to improve the quality of the estimation of the model, and the Kaiser-Gutman (K-G) criterion was used to determine the number of these to use.The K-G criterion recommends choosing the first k PCs associated with eigenvalues greater than 1.0.In addition to the above, the empirical criterion of the percentage of accumulated variance was also used, which suggests keeping the PCs that accumulate an explained variance of approximately 80% % (Cuadras, 2007).
The relationship between the set of gauge stations was established using Equation 3, based on the multiple linear regression model: where the variable Y represents the rainfall of one of the gauge stations, 0 β is the intercept, 1 , , p β β … is the partial regression coefficients associated with the p gauge stations ( ) to be considered predictors, and e represents the random error component associated with the regression model.The predictors were transformed considering k p < linear combinations to reduce the possible effect of multicollinearity by including monitoring stations that may present a spatial correlation as predictor variables: with 1, , ,   l k = … called components, and subsequently adjusting a linear regression model (Equation 5) using them as new predictors: a a a a = + + +…+ + (5) Del Castillo-Gómez et al.

5/16
Additionally, if the coefficients jl a (loads or weights) are selected as follows , components l Z will be orthogonal, therefore avoiding potential problems of multicollinearity.The procedure described above was carried out iteratively, varying the gauge station to be considered as the response variable and using the other available stations as predictor variables.
Principal Least Square Regression (PLSR) Partial Least Squares (PLS) regression is a model for multivariate prediction or estimation of data (Andersson, 2009).It was introduced by Wold (1975) and combined features of principal component analysis and multiple regression.Like Principal Component Regression (PCR), it allows the original explanatory variables to be transformed into a new set of independent variables (components) (Wold et al., 2001), with the difference that PCR regression does not consider the response variable to determine the components, since it only maximizes the explained variance of the set of predictor variables to be introduced in the regression model; while PLS regression determines these components taking into account both the original predictor variables and the response variable (it maximizes the covariance between the original explanatory variables and the response variable(s).
PLS regression with a single response variable is called PLS1; if there is more than one response variable, the PLS regression is called the PLS2 regression (Andersson, 2009).In this research, the PLS1 algorithm was used, and the empirical criterion of the percentage was used, which suggests conserving the components that allow accumulating at least 80% of the explained variability of the response variable.

Artificial Neural Networks (ANN)
Nonlinear Principal Component Analysis (NLPCA) is a method using ANN, proposed by Scholz et al. (2005) and used in the estimation of missing hydroclimatological data by Miró et al. (2017) and Canchala et al. (2019).This method uses an auto-associative neural network approach based on decoding (second phase of NLPCA), better known as inverse NLPCA, which corresponds to a nonlinear generalization of the standard Principal Component Analysis (PCA).
The NLPCA uses a reconstruction function , where the objective of gen ϕ is to estimate the data set x approximated the target data y, minimizing the root mean square error (See Figure 3).The NLPCA toolbox used in this study is available at Scholz (2023).
The organization structure of the input and output layers is carried out through different types of architectures; the most used model is the Multi-Layer Perceptron (MLP), which has a structure with an input layer, one or several layers hidden and an output layer (Canchala et al., 2020a).Rumelhart developed this model, also called the error propagation model or back-propagation model, which uses the Delta Learning Rule learning method (Demir & Keskin, 2021).For this research, the architecture employed in Canchala et al. (2019) was used.
A learning algorithm was used for preliminary recognition of the data and identification of historical rainfall patterns of the time series associated with different weather phenomena that influence the region.This was done using the back-propagation algorithm, which propagates the error between the actual output data and the estimated output data in the neural network.The above is described by Equation 6: ( ) where E is the total error, M is the number of input data, j E is the error of the squared difference between the actual data ( ) mj x ) and the estimated data ( ˆ) mj x ) (Khalili et al., 2016).For the ANNs, the architecture [46-45-46] was used, and different iterations were evaluated in the neural network (5,000, 8,000 and 10,000).The best results were obtained using x is the output layer (Monthly rainfall dataset reconstructed).
Estimation of monthly rainfall missing data in Southwestern Colombia: comparing different methods 10,000 iterations; this greater number of iterations allowed better recognition of the database and identification of atypical patterns resulting from climatic phenomena and anomalies that influence the study area.

Performance metrics
The evaluation of the estimation of missing data of the four (4) methods used was carried out through three (3) standardized performance metrics: RMSE, MAE, and ABIAS, which allowed verifying the adjustment in each of the gauge stations studied.Also, the Pearson correlation coefficient (r) was used to measure the similarity between the observed and estimated data.The metrics used are represented in Table 1.

Descriptive statistical analysis
Table 2 shows the descriptive statistics of the gauge stations' monthly rainfall time series for the period 1983-2019.
The studied gauge stations showed altitude differences, varying from 3 m.a.s.l in the plain of the Pacific region to 3120 m.a.s.l in the Andes Mountain range.The fluctuations of rainfall in the territory are mainly due to the altitude differences, in addition to orographic aspects, presence of the Andes mountain range, limits with the Pacific Ocean, proximity to the Amazon basin, and influence of climatic phenomena at different time scales, such as the ENSO phenomenon, the Madden-Julián Oscillation, and ITCZ migration, among others (Cerón et al., 2021a;Puertas & Carvajal, 2008;Rueda & Poveda, 2006;Serna et al., 2018;Torres-Pineda & Pabón-Caicedo, 2017).It is highlighted that the areas with the highest average annual rainfall are located in the Pacific region, with records of up to 8689.39 mm.year -1 .
In general, the monthly rainfall time series showed high variation with respect to the mean, ranging between 174.11 mm.year 1 and 1347.6 mm.year 1, with variation coefficients that ranged between 9% and 41%, with the greatest variability being recorded in the Pacific region.This region is characterized by not presenting a defined rainfall trend (Guzmán et al., 2014); some areas register high rainfall during the year and, in others, low rainfall, mainly in stations near the Pacific Ocean.These results are consistent with Cerón et al. (2021b), who reported a core of high rainfall in the south of the Colombian Pacific (3000 to 7000 mm.year -1 ), and Canchala et al. (2022) who reported in the Pacific region of Nariño rainfall ranges from 2500 to 8650 mm.year -1 .
The high variability of rainfall in the study area is mainly due to the movement of the ITCZ, the influence of the Chocó jet stream, and the incidence of the La Niña and El Niño phases of the ENSO, which have been documented by different authors who have studied the hydroclimatology of Colombia and specifically the southwestern Colombian (Canchala et al., 2022;Cerón et al., 2021b;Poveda & Mesa, 1999;Urrea et al., 2019).Furthermore, the asymmetry coefficient was positive (negative) in 80.44% (19.56%) of the stations, indicating that in these stations, most of the rainfall values are higher (lower) than the average.

Regionalization of monthly rainfall
The regionalization of the monthly rainfall was performed using Ward's method, which showed the conformation of 3 homogeneous regions formed into three groups of 33-, 12-and 1-gauge station, which is consistent with the natural regions of the Andean region (AR), Pacific region (PR) and Amazon region (AMR), respectively, as shown in Figure 4.As well, these results are consistent with the rainfall regionalization performed by Canchala et al. (2022) through the application of the nonlinear technique called Kohonen's self-organized maps, also with Guzmán et al. (2014), using PCA and Jaramillo-Robledo & Chaves-Córdoba (2000) who used statistical clusters.
The three identified homogeneous regions show different patterns of rainfall interannual variability.The gauge stations of the PR are characterized by registering average rainfall between where, i z is the original observed value of monthly rainfall in month i, ˆi Z is the estimated value of monthly rainfall in month i, Z is the average of observed rainfall, ẑ is the average of rainfall estimated, and n is the number of observations (months).

7/16
2500-8689 mm.year -1 (See Figure 4), with a monomodal regime where the highest (lowest) rainfall is recorded from April to July (August to March) (See Figure 4a).On the other hand, the gauge stations located in AR show a variation of average rainfall between 600 to 1500 mm.year -1 , showing a bimodal cycle (see Figure 4b), where there are two periods of high rainfall corresponding to March-April-May (MAM) and October-November-December (OND).Finally, the MON station was classified in the AMR, with rainfall between 2500-4500 mm.year -1 , registering a monomodal cycle (See Figure 4c) with maximum rainfall in the months of June-July-August (JJA).These results are consistent with findings reported by Canchala et al. ( 2019

Gauge stations correlations
According to the correlation criterion for selecting stations (ρ≥0.5)described in the methodology for the NR method and according to the results obtained, nine neighbouring stations were selected, homogeneously distributed in the three regions.For the PR, the BAR, MIR, and RBB stations were selected; for the AR, the PIS, RMO, and SAM were selected; and for the AMR, which only has one gauge station (MON), three neighbouring stations of the department of Putumayo CPC, CHP, and TTS (located to the east of the study region) were used.More details about the correlation results for PR, AR and AMR are depicted in Supplementary Materials S1, S2, and S3, respectively.

Missing data estimation
The missing data estimation results using the NR, PCR, PLSR, and ANN methods in the PR, AR, and AMR regions are shown in Figures 5, 6, and 7, respectively.For purposes of comparative analysis, the results obtained at the gauge station with the highest percentage of missing data for each region are depicted in this study.In this sense, the gauge stations selected for PR, AR and AMR were Mataje (MAT-9.46%),Guasca (GCA-8.11%)and Monopamba (MON-3.38%),respectively.The results of the missing data estimation of all gauge stations are available in Supplementary Material S4.
Figure 5 shows the missing data estimation obtained for MAT gauge station.For the NR (Figure 5a), PCR (Figure 5b), and PLSR (Figure 5c) methods, underestimated values were obtained in the periods between the years 1986-1988, 1991-1994, 1997-1999, and 2013-2014, highlighting that in 1998 the observed value was underestimated by more than 60%.On the other hand,  Del Castillo-Gómez et al.

9/16
overestimates of 33% (38%) were obtained for the 2010-2011 (2016-2019) period.These periods mostly coincide with events related to typical increases and decreases in rainfall in the study area, mainly associated with the ENSO phenomenon described by Trenberth (2018), which in turn are consistent with the results obtained in the missing data estimation in southwestern Colombia by Canchala et al. (2020a), andOcampo-Marulanda et al. (2022).
The results of the missing data estimation in PR show that these three methods are sensitive to rainfall variability.In contrast, the missing data estimation using the ANN method shows that the reconstructed time series for MAT gauge station does not record high underestimates and overestimates (Figure 5d); instead, the time series estimated follows the behaviour of the observed values.This result shows that the ANN method has a high capacity to reconstruct historical data despite the high rainfall variability in the region.This shows agreement with the results obtained by Santos et al. (2021), Chiu et al. (2021), Canchala et al. (2019), Khalili et al. (2016), andMiró et al. (2017), who used ANN models to estimate missing data of rainfall time series in areas with high variability and obtained low estimation errors, indicating a high capacity in the reconstruction of time series in areas with climatic complexity.
Figure 6 shows the rainfall missing data estimation of the GCA gauge station of the AR.In general, the time series presented high variations in rainfall, highlighting high rainfall in the years 1985, 2002, 2004, and 2010, where rainfall was greater than 200 mm.month - .Particularly in the year 1985, the NR (Figure 6a), PCA (Figure 6b) and PLSR (Figure 6c) methods registered underestimates of 65%, 70%, and 63%, respectively.In contrast, the ANN method (Figure 6d) registered less underestimation (26%), showing superiority in its performance over the other methods.
Figure 7 shows the rainfall missing data estimation for the MON gauge station of the AMR, where maximum rainfall events are observed in the years 1986, 1988, 1992, 2002, 2011, 2013, and 2015, which were underestimated with the methods PCR (Figure 7b) and PLSR (Figure 7c).For its part, the estimation made by NR (Figure 7a), registered a better fit compared to PCR and PLSR, showing that, for some periods, this method was superior to the two linear methods; however, the estimates made with ANN (Figure 7d) showed high precision and accuracy in the reconstruction of the series.For example, the maximum event recorded in July 1986 was underestimated by 6%, 51%, 53%, and 0% with the NR, PCR, PLSR, and ANN methods, respectively.
Broadly, the best rainfall missing data estimation in the gauge stations of the three regions PR, AR and AMR was obtained with the nonlinear ANN method.This method showed superiority and high capacity for time series reconstruction even when high rainfall variability is recorded, which can reduce the performance of the methods.The results of the linear methods NR, PCR, and PLSR, were not the best; however, it is highlighted that in some periods the estimations depicted high similarity with the observed values; being consistent with the results obtained by Silva et al. (2007)

Performance metrics
The performance metrics described in the methodology section were used to quantify the performance of each of the missing data estimation methods evaluated in this research.The standardized performance metrics are shown in Figure 8, which range between 0.0 and 0.75, where the best (worst) performance is identified with values close to 0.0 (0.75).According to the three-error metrics, RMSE, MAE, and ABIAS, NR was the method with the largest errors, which ranged from 0.24-0.62,0.23-0.44,and 0.18-0.44,respectively.In contrast, the ANN method depicted superiority  Estimation of monthly rainfall missing data in Southwestern Colombia: comparing different methods in the imputation of missing over the other methods registering errors that ranged between 0.01-0.29 (RMSE), 0.01-0.22(MAE), and 0.01-0.22(ABIAS).On the other hand, it was evidenced that the estimated errors using NR, PCR, and PLSR were higher in the PR than in the AR and AMR, while using the ANN method, the lowest errors in their order were registered in the PR (0.01-0.03),AR (0.01-0.29) and AMR (0.02).For the PCR method, the error metrics ranged between 0.25-0.59(RMSE), 0.18-0.41(MAE) and 0.18-0.41(ABIAS) and for the PLSR, the error metrics ranged 0.22-0.54(RMSE), 0.16-0.48(MAE) and 0.18-0.41(ABIAS).
In general, the best method for rainfall missing data estimation in Southwestern Colombia is ANN, due to registering the lowest errors in the three natural regions of Nariño.It is highlighted that the best performance was recorded in the reconstruction of the rainfall time series of the PR, which registers high variability (See Table 2) and requires greater precision in their estimation, considering that it is a region with low-density of gauge stations, and therefore with lack hydroclimatological information.This result contrasts with Canchala et al. (2019).They obtained higher RMSE values in the PR and lower values in the AR of southwestern Colombia, highlighting differences in the study period and the number of gauge stations evaluated.Regarding the three linear methods, the best performance was registered by the PLSR, followed by the PCR and the NR.
Finally, to measure the statistical relationship between the observed and the estimated time series using the four methods already described, the Pearson correlation coefficient was estimated (See Figure 9).
The correlation coefficients confirmed the results obtained through the performance metrics.The Pearson coefficient for the estimation performed by the ANN method ranged between 0.81 -1.00, showing a high correlation between the estimated and observed values.In contrast, it was observed that the lowest correlations were recorded in the estimates of the NR method, highlighting

11/16
that the lowest coefficients are observed in the PR, where there is a high variability of rainfall (See Table 2).Furthermore, using the three linear methods, the best estimations were obtained in the AR.In contrast, with the nonlinear method (ANN), the best estimates were recorded in the PR, showing a high capacity for recognising the nonlinear relationships that allow overcoming difficulties associated with the high variability, scarce information, and low density of stations in this region.
The results obtained in this study are consistent with Miró et al. (2017), who evaluated ten missing data imputation methods and reported that the best results were obtained with nonlinear methods, highlighting the methods based on the NLPCA approach.Additionally, it is consistent with Canchala et al. (2019) and Ocampo-Marulanda et al. (2021).They used methodologies based on the NLPCA approach to monthly rainfall missing data estimation and extreme rainfall indices, registering low errors in their estimation.On the other hand, this research shows that methods based on artificial neural networks have a higher capacity to fill in missing data than linear inference methods, as shown in the studies made by Kim & Pachepsky, (2010) in Chesapeake Bay, United States; Teegavarapu (2012) in Kentucky, United States; Londhe et al. (2015) in Pune district, Maharashtra, India; Miró et al. (2017) in the Iberian Peninsula, Spain; Demir & Keskin (2021) in Turkey and Khalili et al. (2016) in Mashhad, Iran.Finally, it is highlighted that among the linear methods evaluated (PLSR and PCR), they show good performance and are easy to apply, being very useful when knowledge of artificial intelligence is unavailable.

CONCLUSIONS
The following conclusions are reached in the study: The ANN is the best method for the monthly rainfall missing data estimation in the AR, PR, and AMR of Southwestern Colombia due to this nonlinear method shows a high capacity to identify atypical patterns and reconstruct time series without the need to implement auxiliary variables such as altitude, geographical position, among others.This result is relevant for the study area, characterized by recording complex topographical conditions due to the Andes Mountains, the influence of the ENSO climate variability mode that occurs in the Pacific Ocean, the proximity to the Amazon Forest, and other conditions that influence the variability of the rainfall.
Among the linear methods evaluated, PLSR was the method that registered the best performance.However, the results were similar to those obtained by the PCR method.For both methods, the best results were obtained in the AR and AMR, contrary to the PR, which showed the highest errors and low correlation rates.Furthermore, it is highlighted that for the PLSR, a smaller number of CPs were used, which explained a significant percentage of the variance of the data.This aspect represents an advantage over PCR.
Using the NR method, a good performance was obtained in the time series reconstruction, very similar to that obtained in some gauge stations by the PCR and PLSR methods, showing synchrony between the observed and estimated data in extreme periods of low and high rainfall linked to ENSO Phenomenon.This is a positive aspect because it is a conventional method, easy to apply and does not require specific software knowledge; however, it cannot be used in areas without neighbouring stations.
Finally, we found that in areas with a lower density of gauge stations, such as PR (12-gauge stations) and AMR (1 gauge station), the performance of the ANN method was better compared to AR (33 stations), where a lower performance was observed.According to the results obtained in the metrics used, contrary to the NR, PCR, and PLSR methods, which presented better results in AR.Notably, the ANN method recognized the different rainfall patterns in the evaluated time series that are associated with nonlinear behaviour, which improved the quality of the estimates.

Figure 1 .
Figure 1.Geographic location of the study area and spatial distribution of rain gauge stations.

Figure 2 .
Figure 2. Flowchart of the methodology used in the study.

Figure 3 .
Figure 3. Flowchart of the NLPCA Inverse.n x is the input layer (Monthly rainfall dataset), m y is the bottleneck layer of the NLPCA model, and nx is the output layer (Monthly rainfall dataset reconstructed).

(
2021b), who studied the rainfall variability in the department of Nariño associated with ENSO and other climatic anomalies in the period 1983-2016 and identified in their studies the PR and AR, as two homogeneous regions with different rainfall regimes.In contrast, Arango et al. (2012) and Guzmán et al. (2014) identified three regions: PR, AR and AMR in the study area, with different interannual patterns consistent with those identified in this study.

Figure 4 .
Figure 4. Average annual rainfall in Nariño (1983-2019) and rainfall regime for the three identified natural regions a) Pacific Region, b) Andean Region, and c) Amazon Region.

Figure 5 .
Figure 5.Comparison of time series between observed and estimated rainfall for the methods evaluated at the MAT gauge station (PR).The blue line indicates observed rainfall, and the red line indicates the estimated rainfall.
in some regions of Sri Lanka; Pizarro et al. (2009) in the Maulé region of Chile; Cruz-Roa & Barrios (2018) in the Coello river basin, in the department of Tolima, Colombia; Morales-Martínez et al. (2019) in Tabasco, Mexico; Armanuos et al. (2020) in Ethiopia and Adilah & Hannani (2021) in the state of Pahang, Malaysia, in which the NR obtained a good fit in specific periods of the time series evaluated, and in some cases, throughout the study period.

Figure 6 .
Figure 6.Comparison of time series between observed and estimated rainfall for the methods evaluated at the GCA gauge station (AR).The blue line indicates observed rainfall, and the red line indicates the estimated rainfall.

Figure 7 .
Figure 7.Comparison of time series between observed and estimated rainfall for the methods evaluated at the MON station (AMR).

Figure 8 .
Figure 8. Performance metrics between observed and estimated rainfall using NR, PCR, PLSR, and ANN.The reference stations correspond to the neighbouring stations used for the NR method.

Figure 9 .
Figure 9. Pearson correlations between observed and estimated rainfall using NR, PCR, PLSR, and ANN.

Table 2 .
Analysis of descriptive statistics of the monthly rainfall in the gauge stations ofNariño (1983Nariño ( -2019)).