Application of multivariate methods and geoestatistics to model the relationship between CO 2 emissions and physicochemical variables in the Hidrosogamoso reservoir , Colombia

Aim: This article deals with the estimation of a model for CO2 emissions in the Hidrosogamoso reservoir based on the organic matter level and water quality. This is in order to determine the impact of the creation of a tropical reservoir on the generation of greenhouse gases (GHG), and to establish the water quality and emissions dynamics. We hypothesize that the spatial variability of emissions is determined by water quality and carbon cycling in water. Methods: Multivariate techniques were applied to determine the relationships between CO2 and certain physicochemical variables measured in the reservoir between February and May 2015, taking samples in 10 stations and measuring 14 variables (water quality parameters and CO2). Factor, cluster, discriminant and regression analysis, as well as the geostatistical technique kriging, were used. Results: We observed that all variables except dissolved organic carbon have strong linear relationships. Nitrate, total-P, total solids and total suspended solids are related due to the presence of nutrients in the water; chlorophyll a and biodegradable dissolved organic carbon due to organic carbon; and alkalinity and dissolved solids due to dissolved minerals. The sampling stations can be classified into two homogeneous groups. The first consists of the stations peripheral to the reservoir and the second of stations inside the reservoir. This difference is due mainly to the behavior of chlorophyll a and biodegradable dissolved organic carbon, and these two variables are also the best predictors for CO2, with a maximum adjustment of 70%. Conclusions: Our main conclusion is that the production of CO2 is due to decomposition of flooded organic carbon, depends on the soils flooded and the tributary water quality, and that the production of this gas will, based on the literature, continue for 5 to 10 years depending on the nature of the forest flooded.


Introduction
Hydroelectricity constitutes roughly 16% of the world's electrical generation according to the 2015 Hydropower Status Report (IHA, 2015) of the International Hydropower Association. In 2015, 33.7 GW of new capacity was installed, including 2.5 GW of pumped storage, bringing total hydropower capacity to 1,212 GW worldwide (IHA, 2016). This form of energy production has been considered a clean source of energy for a long time. However, it has recently been shown that some reservoirs emit greenhouse gases (GHG) at levels equivalent to those generated by fossil fuels for comparable quantities of energy (Fearnside, 2004), (Fearnside, 2016a).
estimate latent variables to evaluate their impact on CO 2 generation; 2) to determine spatial dynamics through analysis of state variables of emissions and water quality in the Hidrosogamoso reservoir system using grouping methods and spatial modeling for the identification of critical zones in the production of GHG; 3) to describe the relationship between CO 2 generation; water quality and organic carbon in order to establish the significant variables of GHG generation. The study aims to contribute to the management of water resources in Hidrosogamoso and to the determination of the impacts of future hydropower projects on regional and global carbon balance, especially in tropical areas where most of the projected new dams are located.

Study area
Hidrosogamoso ( Figure 1) is located in Santander, Colombia (7° 6'11.57"N, 73°24'37.09"W) in an area where the Sogamoso River crosses the Serranía de la Paz, 75 km from the mouth of the River Magdalena and 62 km from the confluence of the Suárez and Chicamocha rivers. The primary part of the Sogamoso River used in the generation of electrical energy features a 190 m high dam and a powerhouse of subterranean machines with the three biggest generators in Colombia. With 820 MW of installed capacity and an annual average generation of 5.056 GWh per year, it represents the fourth largest installed capacity in the country. Table 1 presents the main characteristics of the reservoir.

Physical Characteristics
The zone in which the reservoir is inserted is characterized by rains caused by movements of the intertropical confluence zone during the year. In the first half of the year, this zone displaces from south to north and produces an increase in precipitation for the months of April and May. In the second part of the year a movement of the zone from north to south generates rains in the months of October and November. Likewise, in the second semester the precipitation is higher because the movement of the intertropical confluence zone brings masses of air saturated with water from the Atlantic Ocean.
In the reservoir area, two centres of high rainfall can be observed: one over the Opón Swamp and another between the River Sogamoso and la ciénaga de Paredes. The high level of sunlight in the region is due to the small number of of geographic obstacles. This allows the land to receive a greater number of  hours of sunlight during the day, unlike regions with broken reliefs. This phenomenon, in combination with high temperatures of 27 °C and above, determines the high levels of evaporation that favour connective-type precipitation. The humidity average is greater than 80%, which provides a continuous availability of water on the ground throughout the year (GIDROT et al., 2011).

Hydrological Characteristics
The reservoir's major tributaries are the Chicamocha, Suárez, Sogamoso and Chucurí rivers and the Aguas Blancas stream. Table 2 shows their main characteristics.

Variables monitored
The reservoir monitoring begun before impoundment and continued after it. For this study we used samples taken in 2015, the first year after impoundment. The sample size was n=20, with 2 sampling campaigns in 10 sampling stations. We measured the following variables: total alkalinity, total phosphorus, dissolved solids, total suspended solids, total solids, dissolved organic carbon, biodegradable dissolved organic carbon, chlorophyll "a", biological oxygen demand after 5 days, chemical oxygen demand, ammonia nitrogen, total Kjeldahl nitrogen, nitrates and carbon dioxide. The samples were collected following the criteria established by the Standard Methods (APHA, 2012) and the sampling protocols of the laboratory of the "Pollution Control and Diagnosis Group-GDCON". After collecting the samples of CO 2 in the Tedlar stock exchanges, we transported them at environmental temperature to the laboratory (via land route to avoid abrupt changes of pressure), where the gases were later analysed using a 7890To gas chromatograph with a 5975C mass spectrometer and a conductivity detector. The parameters monitored, abbreviations used, units and reference methods are shown in Table 3.

Data processing, multivariate statistical and geostatistical methods
The variables DBO 5 , DQO, ammonia-N and NTK were discarded because their measurements showed values below their respective quantification limits of 5 mg DBO 5 .L -1 , 25 mg O 2 .L -1 , 5 mgNH 3 -N.L -1 and 5 mgN.L -1 . The Kolmogorov -Smirnov (KS) nonparametric test was performed to determine whether the data followed a normal distribution. According to the test, any variable from a normal distribution with a 95% confidence level cannot be rejected. The correlations between the variables (except CO 2 as the endogenous variable in the regression) were calculated using the Pearson correlation coefficient and organized in the correlations matrix. A correlation is considered strong (correlation is bounded between -1 and 1) if the coefficient is greater than 0.5 or lower than -0.5. The factor analysis multivariate statistical method was then used to reduce the dimensionality of the data (high correlations) and find underlying factors that generate variables (Thurstone, 1931). Heirachical cluster analysis was used to identify potential clusters in the sampling stations (Rencher & Christensen, 2003). Discriminant analysis was applied to confirm the clusters and to determine the variables that grouped these statistics (Johnson & Wichern, 2007). Multiple regression was used to determine a model that explained the existent relation between the variables analyzed and CO 2 at the post-filling stage of the Hidrosogamoso reservoir. Finally, spatial changes in the CO 2 and physicochemical variables were analysed with the geostatistical interpolation technique kriging. All mathematical and statistical calculations were performed using Microsoft Office Excel 2010 (SN: RJRGF-8H8PH-R4B8M-WBMBQ-9TDH9), Statgraphics Centurion XVI (SN: S520-6BOA-56S1-YKON-3EM1) and Stata 13 (1990516033). Geostatistical analysis and construction of the maps were done in ArcMap 10.2.2 (ID: 4401505451).

Factor Analysis
Factor analysis (FA) is intended to explain a set of variables observed by a small number of latent or unobserved variables called factors (Peña, 2002). In factor analysis the variables are represented as linear combinations of a small set of random variables (with m <<p) called factors. Where the factors are latent constructs that generate the variables. The factor model is: where µ is the vector of averages of X, L is the matrix that contains the loads or weights of the jth factor j f is the ith variable i X , F is the matrix that contains the factors and ε is the vector of errors that identifies the part of the variable that is divergent from the other variables.

Cluster analysis
Cluster Analysis (CA) aims to find an optimal grouping in which the observations or objects within each cluster (group) are similar but the clusters are different from each other (Rencher & Christensen, 2003). There are two approaches to cluster analysis: hierarchical and non-hierarchical. The differences between the two methods are detailed in Johnson (1998) and depend on the aim of the research. The hierarchical clustering approach is more common because the groupings are observable on a graph called a dendrogram, making it possible to visualize the formation of groups.
Cluster analysis usually involves at least three steps. The first is the measurement of some form of similarity or association between the entities (observations or objects), to determine how many groups really exist in the sample. The Euclidean or square Euclidean distance is usually used for this. The second step is the actual clustering process, whereby entities are partitioned into groups (clusters). Optimal results can be obtained using Ward's method. The final step is to profile the entities to determine their composition. This profiling is often achieved by applying discriminant analysis to the groups identified by the cluster technique (Hair et al., 2010).

Discriminant analysis
Discriminant analysis (DA) is a multivariate technique used to determine the variables responsible for the separation of the units within groups (Bierman et al., 2011). DA can resolve a series of questions including, but not limited to, the determination of statistical differences between two or more groups. This determines which independent variables contribute to the majority of the differences between the groups and makes it possible to establish procedures to classify units inside the groups (Hair et al., 2010). DA can be considered as regression analysis where the endogenous variable Y is categorical, taking values or categories for each group, and the exogenous variables are the continuous variables that determine to which group the units belong.

Regression analysis
The models used in regression analysis provide equations that describe the relationships of dependence between one or more endogenous variables (denoted as Y's) explained by a set of exogenous variables (denoted as X's). This results in the model shown below: where Y is the matrix containing the endogenous variables (explained), X is the matrix of exogenous variables (explanatory), β contains the weights or marginal contributions that have X's in Y's, and e is the error vector.

Geostatistical methods of interpolation: kriging
Geostatistical methods are widely used in exploration and reservoir modelling. They have the advantage of quantifying and predicting spatial variability in order to build models and make decisions (Chihi et al., 2013). Giraldo (2002) states that kriging is a method of interpolation from geostatistics that predicts values of a variable at unsampled locations based on the information collected. The interpolation process is performed as follows: Measurements of the variable of interest Z are made in points of the study area. In this case, realizations of the variables are obtained and it is necessary to predict in the when there are no measurements. In these circumstances, the ordinary kriging method proposes that the value of the variable can be predicted as a linear combination of the n random variables as follows: where i λ represents weights of the original values. These weights are calculated based on the distance between the sampled points and the point where the corresponding prediction will be made. The sum of the weights must be equal to one because the expected value of the prediction must be equal to the expected value of the variable. This is known as the requirement of unbiasedness.

Factor analysis
The correlations between physicochemical variables exogenous variables are shown in Table 4.
Here it is observed that all variables except DOC have strong linear relationships. For example, the strongest relationship is 0.93 for the total-P and TSS, the explanation for which is that more than 90% of the phosphorus in fresh water bodies (like reservoirs) exists in the form of organic phosphates and cellular constituents in the biota and is adsorbed to inorganic and dead particulate organic materials (Wetzel, 2001) as dissolved solids. The maximum inverse relationship occurs between BDOC and nitrates (-0.72). According to Wetzel (2001), a decrease in nitrogen occurs with increasing organic carbon content in lakes, because water bodies containing a high concentration of dissolved organic matter receive increasingly greater proportions of their organic content from organic plant material produced in wetland and littoral marsh areas surrounding the water. The inverse relationship between TSS and chlorophyll is explained because the high presence of suspended solids blocks light input, which is essential for photosynthesis, thereby interrupting the production of chlorophyll. The inverse relationship between total-P and chlorophyll is explained by the consumption of nutrients by photosynthetic organisms. Primary producers (chlorophyll producers) are potentially limited by carbon, nitrogen or phosphorus. The last of these, followed by nitrogen, is most likely to limit algal production in lakes, where it is less abundant in solution (Hairston & Fussmann, 2001). The dependence between water quality (alkalinity, phosphorus and nitrates) and organic carbon (dissolved solids, total suspended solids, total solids, biodegradable dissolved organic carbon and chlorophyll) can be explained by the anthropic interventions that occur in the area along the watershed and by the variation in precipitation and flow during the study period (Coletti et al., 2010).  Wu et al. (2014), report that wastewater discharges and the nutrients that reach the reservoirs are the main factors responsible for changes in water quality. These changes generate favorable conditions for GHG emissions not only in the reservoir, but in the rivers themselves (Demarty & Bastien, 2011;Rasera et al., 2008;Sawakuchi et al., 2014;Zhao et al., 2013).
Due to these high correlations, DOC is excluded in the factor analysis due to its low coefficient. Therefore, we can say that there are underlying factors or latent variables that generate data (Rencher & Christensen, 2003). Accordingly, factor analysis was applied in order to estimate these latent variables using maximum likelihood, due to the normality of the variables. Varimax rotation was used to make the variables orthogonal and so permit their use in subsequent analyses. The result is three factors with a cumulative variance percentage of 100% without loss of information, as shown in Table 5.
The factor loadings in each variable are shown in Table 6, where the most weighted factor determines the largest contribution of a factor to a variable, regardless of the sign. This is how Factor 1 generates Nitrates, total-P, ST and SST and, as explained in the correlations, the reason for this is that a large percentage of nutrient loads are often associated with suspended particles (Thornton et al., 1990). Factor 2 generates Chlorophyll and BDOC, as chlorophyll is a biodegradable organic compound which generates dissolved carbon in the water. Factor 3 generates alkalinity and SD, through minerals dissolved in the water.

Cluster analysis and discriminant analysis
In order to find similarities in the sampling stations, we applied cluster analysis using the squared Euclidean distance and Ward's method. The variables Factor 1, Factor 2, Factor 3, DOC and CO 2 were used. Figure 2 shows the resulting dendrogram, where it is clear that the stations are divided into two groups: Cluster 1, comprised by E1 (Esgamo), E10 (Chicamocha River), E6 (Chucurí River) and E9 (Suarez River); and Cluster 2, comprised by E7 (Aguas Blancas stream), E5 (Chucurí reservoir), E8 (Tablazo), E4 (Trigueros), E2 (Outlet) and E3 (Dam). These clearly show that each group can be defined by its spatial location,  as shown in Figure 1. Cluster 1 contains stations on the periphery of the reservoir, while Cluster 2 consists of those located inside it. Although the information reported covers a short period of time, it can be said that the spatial parameters and CO 2 behave differently in the water of the periphery compared to that within the reservoir. In Shrestha & Kazama (2007), it was found through hierarchical cluster analysis that the number of clusters was also determined by the spatial characteristics of the site of study. Similary, in Varol et al. (2012), each cluster represented the geographical location of sampling sites located at each reservoir dam. Another cluster analysis is presented in Belkhiri & Narany (2015), where two clusters defined by the changes in the physicochemical characterictics of the water were formed. This confirms the influence of spatial distribution on water quality. To monitor the water analysis we used DA for as a complement to CA, as in Bierman et al. (2011);Shrestha & Kazama (2007) and Varol et al. (2012). It was found that the cluster yielded a correct classification in 100% of cases, as shown in Table 7. The classification functions are shown in Table 8, where the determining variables in the standings are CO 2 and Factor 2 (composed of Chlorophyll and BDOC) which means that these three variables are the main determinants of spatial variability due to their high R 2 value and p significance below 0.05. Figure 2. The dendogram shows the result of clustering the n = 9 sampling stations in the Hidrosogamoso reservoir using Ward's method and squared Euclidian distance. El Esgamo, Río Chicamocha, Río Chucurí and Río Suárez stations forms a first cluster separate from the second one formed by Quebrada agua blanca, Chucurí embalse, Tablazo, Trigueros and Presa stations. The firts one located at the peripheria and the second one near the center of the reservoir. Table 7. Classification Table for the two groups from cluster analysis. Each row tabulates the results for cases that actually belong to a particular group and the columns show how often they were classified as belonging to each group. The percentage of cases correctly classified is 100%.

Actual Cluster
Group Size Predicted Cluster 1 2 1 4 4 (100.00%) 0 (0.00%) 2 5 0 (0.00%) 5 (100.00%) Percentage of cases correctly classified: 100.00%. The spatial variations of water quality and CO 2 emissions analyzed through cluster and discriminant analysis allow us to conclude that organic carbon is the main contributor to the spatial differences in emissions, due to human interventions to the ecosystem. In places where the chlorophyll and BDOC rise, CO 2 emissions also rise. This is shown in Figure 3, mainly in the center (around the dam in E3) where the main tributary is the Chucurí River. This zone had a "mosaic of crops, pastures, wooded pastures and natural spaces" before the impoundment, adding to which most of the urban settlements in the area discharge directly into the river, providing large amounts of organic matter that can reach the reservoir in large quantities.
The geographical location of the reservoirs has an impact on organic matter storage and water temperature, which influence GHG emissions (Yang et al., 2015). Carbon dioxide is also released from soil carbon in flooded land which, similarly to trees, is a source that will be exhausted in the future (Fearnside, 2015b). Research at the Petit Saut reservoir in French Guiana has found evidence that soil carbon is the main source for GHG produced in the initial emission pulse after the flood (Abril et al., 2005), whereas in temperate zones such as northern Canada it has been found that GHG emissions have no relation to the type of flooded soil (Duchemin et al., 1995).

Regression analysis
As reservoirs have an expected life time of over 100 years, it is essential to develop models for the generation of GHG, considering biogeochemical dynamics and spatial and temporal changes (Wang et al., 2018). In order to find an equation that describes the relationship between CO 2 and the physicochemical parameters, a regression model was adjusted where carbon dioxide is the endogenous variable and DOC, Factor 1, Factor 2 and Factor 3 are the exogenous variables. The resulting model, based on Equation (2) and ordinary least squares adjustment, is presented in Equation (4). Here, the only determinant for the endogenous variable was Factor 2, consisting of BDOC and Chlorophyll. The model fit was 70%.
Based on this we can conclude that BDOC and Chlorophyll are the determinants for the generation of CO 2 in the reservoir at the postfilling stage. This relationship is consistent with that indicated in Huttunen et al. (2002), Pérez & Restrepo (2008) and Wetzel & Likens (2013), concerning the relationship between photosynthetic activity and carbon dioxide.
The water quality of the tributaries and the area flooded with their soils, as well as the remaining vegetation, are very important for water quality. This is because they can serve as a source of nutrients for the reservoir for a long time (Gunkel, 2009). The availability of large . High levels of CO 2 and medium levels of chlorophyll and BDOC will be generated in areas close to the center of the reservoir (E4). Close to the dam (E1, E2, E3) will produce high values of BDOC and chlorophyll and low levels of CO 2 . In contrast, high levels of CO 2 and low levels of BDOC and Chlorophyll are predicted in the major tributary area (E5, E6, E7). amounts of dissolved organic carbon, i.e. BDOC and Chlorophyll, stimulates the generation of GHG and has been reported as being related to increases in primary productivity (Abril et al., 2005). Primary productivity has different effects with respect to the carbon balance as the reservoir ages. It may first arise from the dissolved CO 2 produced by decomposition of the MO, provided that the partial pressure of CO 2 in the water remains higher than the concentration of equilibrium with the air. In the long term, as the CO 2 production of the flooded CO decreases, the CO 2 fixed by the photosynthesis in the euphotic layer will be taken from the atmosphere, which will lead to a new flow of carbon towards the reservoir (Abril et al., 2005;Galy-Lacaux et al., 1999). Evidence has been found that in approximately 10 years these emissions decrease, as in the case of Petit Saut where emissions decreased from 200,000 tons CO 2 .year -1 in the first years post-flooding to 70,000 tons CO 2 .year -1 10 years later.

Spatial distribution of CO 2 and Factor 2
Using the statistical analysis results described in section 3.1 we proceeded to construct maps to account for the spatial variability of these measurements, separating BDOC and Chlorophyll for better interpretation. Figure 3 presents the interpolations based on ordinary kriging.
In areas around the center of the reservoir (E8) we can see that high levels of CO 2 and medium levels of Chlorophyll and BDOC are predicted. In areas close to the dam (E3) we can observe high values of BDOC and chlorophyll, and low levels of CO 2 . In contrast, high levels of CO 2 and low levels of BDOC and chlorophyll are predicted in the major tributary area. Those changes can be explained as follows: In a recently flooded reservoir such as Hidrosogamoso, fluxes of CO 2 have been reported to increase following impounding, owing to the decomposition of fresh plant biomass (Huttunen et al., 2002). Therefore, in new reservoirs, the production of CO 2 is mainly attributable to the decomposition of flooded organic matter and not to the phenomena of respiration and production by primary producers, the main managers of photosynthetic activity, which involve BDOC and chlorophyll. On the other hand, the high values predicted in the major tributary area are explained by the input of industrial, livestock and domestic wastewaters (Yang et al., 2015), into the rivers that feed the reservoir. Gaseous emissions are at their maximum level during the first 2 to 3 years after impounding and then slowly decrease with time (Kelly et al., 1997;Galy-Lacaux et al., 1999;Rosa et al., 2003;St. Louis et al., 2000). This shows that there is initially rapid loss of the more labile fraction of the flooded terrestrial OM, followed by a progressive decrease in quantity and bioavailability of the OM remaining at the bottom of the reservoir (Abril et al., 2005).

Conclusion
In this case study a range of statistical techniques, both multivariate and geostatistical, were applied to evaluate and model the spatial variability and relationships between various physico-chemical variables and the production of carbon dioxide in the Hidrosogamoso reservoir during the first year after impoundment.
We observed that all variables, except DOC, have strong linear relationships. Nitrates, total-P, TS and TSS are related due to the presence of nutrients in the water; chlorophyll and BDOC due to organic carbon in chlorophyll; and alkalinity and DS by dissolved minerals. The sampling stations can be classified into two homogeneous groups. The first consists of the stations peripheral to the reservoir and the second of stations within the reservoir. Based on this information, future decisions supported by cost studies could be made about reducing sampling stations. We model the relationship between carbon dioxide and measured variables, finding that chlorophyll and BDOC are the best predictors for CO 2, with a maximum adjustment of 70% for the period studied. Chlorophyll and BDOC are responsible for spatial variations of CO 2 . The spatial predictions for these three variables reported that high levels of CO 2 and medium levels of chlorophyll and BDOC will be generated in areas close to the center of the reservoir. Meanwhile, the reservoir will produce high values of BDOC and chlorophyll and low levels of CO 2 in the area close to the dam. In contrast, high levels of CO 2 and low levels of BDOC and Chlorophyll are predicted in the major tributary area. Finally, our main conclusion is that the production of the CO 2 is due to the decomposition of flooded organic carbon, the production of which will continue for 5 to 10 years depending of the nature of the forest flooded, according to the literature.
The relationship between water quality and GHG generation, added to the spatial changes, allows us to conclude that there are critical zones for the generation of GHG. These are mainly in places located near the center of the reservoir, due to the nature of the flooded organic matter being degraded, and to the high quantity of organic matter resulting from the water quality of the main tributary, which transforms into GHG. This allows us to accept the hypothesis that spatial variability of emissions is caused by water quality and carbon cycling in water.
Based on the above, the present study illustrates the usefulness of statistical and spatial analysis for the analysis and interpretation of phenomena presented in reservoirs, such as the generation of greenhouse gases.