Spatial and Temporal Potential Groundwater Recharge : the Case of the Doce River Basin , Brazil

Little is known about the groundwater recharge potential of weathered tropical catchments, where increasing water uptake is widespread to meet various water demands. This study aimed to estimate the volume of groundwater recharge of the Doce River Basin, Minas Gerais, Brazil. The BALSEQ model was applied to calculate the water balance over a period of two years (2007-2009). Evapotranspiration, runoff, and potential groundwater recharge (PGR) were calculated, using daily data on rainfall, potential evapotranspiration, and plant-available water. A soil survey was undertaken for all major soils occurring in the basin. Soils samples were used to determine hydraulic conductivity, bulk density, and water content at field capacity and at the permanent wilting point. Vegetation data were obtained from the literature and used to determine the following parameters: canopy interception, crop coefficient, and root depth. The estimated groundwater was spatially predicted using the Random Forests model with digital elevation, vegetation index, pedological, lithological, and climate maps. During the two years, an average of 32 % of rainfall was converted to groundwater. Annually, the percent of rainwater converted to groundwater varied between 27 and 48 % for all soil classes, highlighting the great temporal variability. The spatial prediction showed a volume of approximately 17,484 and 35,410 m of rainfall being converted to groundwater for the first and second year, respectively. The BALSEQ model showed a feasibility for the water balance calculation and can be reapplied for updating the groundwater maps of the Doce River Basin. These maps could then be used to guide land use planning programs, with the aim to protect water resources.


INTRODUCTION
Groundwater recharge is determined by a combination of variables such as soil type and management, vegetation cover, topography, climatic, and geological conditions.In Brazil, groundwater is increasingly used to supply urban centers, industries, irrigation, agriculture, and recreation, which coexist with a lack of information regarding the total amount of water being taken from aquifers (Hirata et al., 2017).
A correct evaluation of aquifer recharge is fundamental for the calculation of water availability, especially in areas of high demand for water resources.In Brazil, groundwater has become more than a complementary water source for multiple uses.Aquifers in Brazil are used to supply water for up to 40 % of the population and play a strategic role in times of climate change with long and severe drought periods (Hirata and Conicelli, 2012;Hirata et al., 2017).Furthermore, aquifers are important groundwater reserves that should be preserved to provide water at affordable prices (Resende Filho et al., 2015).
Despite the importance of groundwater for multiples uses, knowledge of its status at a national scale is extremely limited.Regional studies on the status of the quantity and quality of groundwater are only available for specific sites in Brazil, and the data are outdated and scattered (Hirata and Conicelli, 2012).Such studies have only focused on São Paulo and the northeast of Brazil (Zoby and Matos, 2002).
Infiltration, together with rainfall, runoff, and evapotranspiration, constitutes the hydrological water cycle, which is a phenomenon of continuous circulation and distribution of water in the earth's soil, surface, atmosphere, and oceans.Infiltration represents the most important hydrological process for aquifer recharge and is determined by variables such as soil type and management, vegetation cover, topography of relief, as well as climatic and geological conditions (Sophocleous, 1992;de Vries and Simmers, 2002;Brandão et al., 2006).
Groundwater recharge is broadly defined as water that reaches an aquifer from any direction (down, up, or laterally).Both types of aquifer, (i) confined, where it is completely filled with water and separated from the surface by an overlying aquifer or an almost impermeable rock layer, and (ii) unconfined, where the water table is exposed to the earth's atmosphere through the unsaturated zone and can be recharged by the water that infiltrates and percolates the soil zone by a direct, indirect, or localized mechanism, depending on the local condition (Lerner, 1997).
Groundwater recharge can be quantified above and/or below the saturated zone.Models which operate above the saturated zone are considered predictive, giving estimates of potential recharge, while models which work below the saturated zone provide estimates of actual recharge (Rushton, 1988;Scanlon et al., 2002;Oliveira, 2004).Potential recharge, according to Rushton (1988), refers to the groundwater from the net balance of precipitation over evapotranspiration, which subsequently disappears through a local discharge system or by evaporation from the saturated zone, but may become 'permanent' recharge by lowering the water table after extraction.Models for actual recharge estimates rely on water table levels or tracer methods.Predictive models, on the other hand, consist of soil water budget methods, which use climate and soil data to estimate the amount of water that passes the root zone.Within each zone, techniques are generally classified into physical, tracer, or numerical modeling approaches for groundwater quantification.
Approaches to quantify the direct recharge of groundwater include the following: water balance estimation, hydrogeological models, chemical tracer methods, and baseflow separation, whereas indirect recharge methods are based on water table fluctuation and the Darcy law.Direct recharge is conceptually the easiest to define and forms the basis of numerous commonly used recharge estimation techniques (de Vries and Simmers, 2002).In addition, recharge can be calculated by direct measurements using the soil moisture budget via sensors (neutron or TDR probes) (Udawatta et al., 2011) and lysimeters.In contrast to sensors, which are relatively inexpensive, lysimeters are expensive and therefore not routinely used to calculate recharge (Scanlon et al., 2002).
For an overview of models, techniques, and challenges related to estimate groundwater recharge, the following studies provide detailed descriptions of the models and examples of applications, with case studies based on specific environmental conditions: Lerner et al. (1990), Sophocleous (1992), Lerner (1997), Simmers (1998), de Vries and Simmers (2002), Healy and Cook (2002), Scanlon et al. (2002), Oliveira (2004), among others.Despite the existence of numerous methods to calculate groundwater recharge, the hydrological conditions, quantity, and quality of data determine the selection of a desirable method (Ghandhari and Alavi Moghaddam, 2011;Kinzelbach et al., 2002).The water balance (water budget) method is the most common method applied throughout the different zones (Scanlon et al., 2002).
For calculating water recharge, we selected a numerical model which estimates the potential groundwater by means of the daily water balance calculation in the soil layer above the saturated zone.The model, BALSEQ, was developed by Lobo-Ferreira (1981) and used to estimate the recharge of unconfined aquifers in several areas in Europe (Lobo-Ferreira, 1981;Lobo-Ferreira and Delgado-Rodrigues, 1998;Paralta et al., 2003;Oliveira, 2004).The BALSEQ with modifications, developed by Oliveira ( 2004), was later used by Lobo-Ferreira et al. (2015) and Hugman et al. (2017).
Water balance calculations using the BALSEQ model were also applied in studies on the Bardez and Kakinada basins in India and China, respectively, by Chachadi et al. (2001Chachadi et al. ( , 2004)).In Brazil, the model was successfully applied by Brito et al. (2009) in the state of Ceará, in the limestone (Karst) area of Minas Gerais (Camargo et al., 2011), and by Pontes et al. (2016) at the watersheds of the Serra da Mantiqueira.
Like most unsaturated-zone techniques, the numerical models calculated by BALSEQ provide point estimates of recharge.Hence, recharge is usually integrated as input data in a spatial model to allow the regional quantification of groundwater.Different modeling approaches have been used for this purpose by using spatial maps of environmental variables with strong relationships with groundwater recharge.
Despite the success of Random Forests, due to its robustness and power of generalization demonstrated in groundwater mapping, its applications are still limited.Hence, we chose this model for the mapping of potential groundwater recharge in the unconfined aquifer of the Doce River Basin.
The Doce River is one of the largest rivers in southeast Brazil; its surface water and groundwater supply more than 3.3 million people, and the river is used for urban, mining, forestry, and industry purposes.Despite its importance, the water quality and quantity across the basin is critical due to widespread deforestation and inappropriate land use, along with soil degradation.Soil and forest degradation have led to soil erosion and, consequently, the silting of rivers and flooding, besides a direct impact on the increase in water price (Resende Filho et al., 2015).
Rev Bras Cienc Solo 2019;43:e0180010 The largest metallurgical complex of the Latin America is located in the Doce River Basin, resulting in water contamination related to mining activities, as sadly demonstrated by the environmental disaster that occurred on November 2015 (Neves et al., 2016).On this occasion, the collapse of a large dam at Fundão, Mariana's municipality, affected all riverine cities along the states of Minas Gerais and Espírito Santo, with a population of 2 million people left without access to clean water.Groundwater supply was particularly important for urban supplies during such an emergency.
The main objective of this study was to calculate the daily water balance for a two-year period and to develop a spatial prediction of groundwater recharge of the unconfined aquifer of the Doce River Basin in Minas Gerais State, Brazil.

MATERIALS AND METHODS
The methodological approach used, including data acquisition, processing, and modelling of groundwater estimate, and spatial prediction, is presented in figure 1.In this study, we used the BALSEQ model to obtain daily sequential estimates of water balance for the entire Doce River Basin.In addition, we used the Random Forests model for the prediction of groundwater recharge, using environmental covariates as proxies.

Study area: Doce River Basin
The Doce River Basin is located in southeastern Brazil, with 86 % of its area (71,472 km 2 ) in the state of Minas Gerais and 14 % in Espírito Santo.For the present study, we only considered the Minas Gerais sector.Three types of climates are found in the area, according to the Köppen classification system: Cwb in the highlands of the Espinhaço and the Mantiqueira mountain ranges, Cwa in the dissected upper highlands of the Doce River, and Aw on the middle plateau and the lower depressions (Alvares et al., 2014).
Average annual rainfall on the depression and low plateau varies from 1,000-1,200 mm, combined with high temperatures, contributing to high evaporation rates (Cupolillo, 2015).The temperatures of the coldest month in the basin are below 18 °C, but this never occurs in areas below 250-300 m.The rainy season ranges from October to March and the dry period from April to September (Cupolillo, 2015).
The basin is the nuclear domain of a hilly landscape known as "Mares de Morros".The landforms are dominated by convex hills on deep weathered soils with pronounced dissection.Nowadays, forest areas correspond to 26 % of the basin, while pastures occur in 67 %, and eucalypt plantations in 3 %.The remaining 8 % of the area are represented by transitional Brazilian Savanna (Cerrados), rupestrian vegetation (Campos Rupestres), and agricultural areas (Probio, 2006).

Pedological
The soil cover of the Doce River Basin was mapped based on 123 representative profiles of the main soil classes occurring in the area (Figure 2a).For each soil profile, data of the A and B diagnostic horizons were gathered.The soil classes mapped and the number of profiles per class were as follows: Latossolos (55), Argissolos (41), Cambissolos (17), and Neossolos Litólicos (10).
Rev Bras Cienc Solo 2019;43:e0180010 For each soil profile, undisturbed samples were collected for A and B horizons to determine the water content at field capacity (FC) and at permanent wilting point (PWP) by using Richard's chamber method (Richards, 1949), applying a 10-kPa and a 1,500-kPa tension, respectively.Soil bulk density was determined by the volumetric method, and the hydraulic conductivity of saturated soil (K 0 ) was evaluated according to Teixeira et al. (2017).Soil water recharge was calculated for both A and B horizons, accounting for a total of 231 soil samples.

Meteorological
Data of daily rainfall for the period from 1975 to 2010 was obtained from regional stations operated by the Brazilian National Waters Agency (Agência Nacional de Águas -ANA).We acquired data from 88 stations located inside the basin and from 44 stations in the surrounding area (Figure 2b).
Evapotranspiration of reference (ET 0 ) was calculated via the FAO-Penman-Monteith method (Allen et al., 1998).For the ET 0 calculation, daily climatic data of minimum and maximum temperature, wind speed, average relative humidity, atmospheric pressure, and solar radiation of 31 meteorological stations were used.The data were obtained from nine stations monitored by a private company (CENIBRA -Celulose Nipo-Brasileira S/A) and from 22 stations operated by the National Institute of Meteorology (Instituto Nacional de Meteorologia -INMET), with eight stations within the basin and fourteen in the surroundings (Figure 2c).For the spatial prediction of these meteorological variables, the Thiessen Polygon was used, which is a commonly used methodology for computing the mean value for catchment meteorological observations.These variables were then used to calculate the potential evapotranspiration (ETp).The distribution of the meteorological station is presented in figure 2.
The selected period for groundwater modeling was based on the available data.In the study area, the majority of rainfall stations (132 used) have a time series available of more  Considering that less than 50 % of the stations have a 5-year period when the groundwater was modeled (2011), we chose to work on a 3-years period, using daily data without filling the gaps.
As we aimed to evaluate the spatial pattern of the average recharge in a large basin, we found it more consistent to select a time series with high data quality instead of the traditional approach of splitting the time series into calibration and validation periods.Furthermore, a poor spatial distribution of rainfall data is one of the main sources of uncertainty in hydrological modeling (Pluntke et al., 2014).

Vegetation
Land use and land cover were classified during the soil survey, recording the geographical position of each of the soil profiles by using a GPS positioning system.The vegetation parameters obtained were as follows: crop coefficient (K c ), plant root depth (depth), and the index of vegetation cover (Ic), which determined the percentage of rainfall intercepted by vegetation cover (Table 1).The calculation of effective soil depth up to which water is subject to evapotranspiration (rooting depth) was based on the literature (Allen et al., 1998;Almeida and Soares, 2003;Kobayashi, 2007), considering values for the culture mid-season stage.
The coefficient K c was used to estimate plant water consumption throughout the development stages as it allows to calculate the crop potential evapotranspiration (ETp) by multiplying K c by ET 0 .We used a K c value of the average stage of the culture (Allen et al., 1998), whereas the same values were used for forests as well as eucalyptus and pine forests.Similarly, the same values were used for pasture and open grassy vegetation on highlands (Table 1).
The Ic refers to the percentage of water retained in the aboveground vegetation (leaves, branches, and trunks) during a rainfall event, which evaporates without reaching the ground; values of I C were obtained from the literature (Almeida and Soares, 2003;Arcova et al., 2003;Moura et al., 2009).An Ic of 10 and 15 % of rainfall was considered for eucalyptus and forest vegetation, respectively.These percentages were deducted from the total daily rainfall in areas with forest cover.

Model of sequential water balance -BALSEQ
Considering the variables rainfall, actual evapotranspiration, variation of water storage, surface runoff, and recharging potential, the water balance was calculated using the BALSEQ model, as presented in the workflow in figure 3. The calculation of the balance can be expressed by the equation 1: Eq. 1 in which P is the rainfall precipitation in mm; ETR is the actual evapotranspiration (mm); DAi is the variation (final -initial) of the soil water storage (mm); SR is the surface runoff (mm); I is the infiltration (mm); and ε is the error of the water balance.
As the aim of BALSEQ is to estimate the Potential Groundwater Recharge (PGR), PGR is calculated as the difference between the soil water content (Hi) and the water stored in the soil (TWS) (Equation 2).To calculate the sequential water balance, it is necessary to know the values of rainfall (P) and ET 0 for each time interval of the balance, as well as the total water stored in the soil (TWS): (1) adapted from conifers (Allen et al., 1998), (2) adapted from grazing pasture (Allen et al., 1998), (3) Alencar et al. ( 2009). (a) Depth for coffee (Kobayashi, 2007). (b) Root depth for forest, eucalyptus, and shrub (Almeida and Soares, 2003).Ic = interception by canopy cover (Almeida and Soares, 2003;Moura et al., 2009;Arcova et al., 2003).-= no value.
Rev Bras Cienc Solo 2019;43:e0180010 TWS = FC × Depth × Bd × 10 -1 Eq. 2 in which TWS is the water in mm; FC is the soil water content at field capacity (g g -1 ); Bd is the soil bulk density (Mg m -3 ), Depth is the soil depth subjected to evapotranspiration in centimeters (usually the effective root depth), and the factor 10 -1 is used to obtain the water depth in millimeters (mm).Rev Bras Cienc Solo 2019;43:e0180010 The calculation of TWS in the model proposed by Lobo-Ferreira (1981) considers the soil water content at field capacity (FC) minus the water content at the permanent wilting point (PWP).In contrast, we only considered the water content at FC to obtain a more precise recharge.As the water at PWP is strongly adsorbed on soil particles, and unavailable for most plants, any increase would not effectively contribute to groundwater recharge.
Runoff was determined by the Curve Number (CN) method, which has been developed by the US Department of Soil Conservation (SCS, 1993) and uses data of K 0 , soil water at FC and the soil classification according to the soil class and vegetation cover.
The calculation of the water balance on the BALSEQ model run continuously for the two-year period (2007)(2008)(2009), which means the components SR, I, and ETR in the balance accounted for 100 % of the total rainfall at the end of the two-year period.However, we also calculated the annual balance, separately for each year, by subtracting the variables of each year from the balance of the total period.
The bottomland areas of Gleissolos (Gleysols) and Neossolos Flúvicos (Fluvisols) and the water bodies do not directly contribute to aquifer recharge, and thus, they were not considered in the groundwater recharge prediction, although they have been considered for runoff calculations.
The period of time used to calculate the water balance was defined based on continuous data availability, setting a maximum limit of 15 % of data missing for each station used in the model.The water balance was calculated for two hydrological years (2007-2008 and 2008-2009).The available data to calculate evapotranspiration was the main constrain for the two-year period under study.
Since it is a daily water balance model, days without data were excluded from the model, and the balance continued on the following day.Thus, the number of days within the two-year period of calculation differed between stations, which can result in non-systematic errors, especially in stations with missing data during the rainy season.
The water content stored in the soil on the first day of the balance (Ai) to the start of the groundwater recharge calculations was determined interactively, in which the soil water content determined by the model at the end of the two hydrological years was considered as the initial water content at the first day of the balance.This procedure was repeated until the end of the balance, in which the sum of SR, ETR, and I was equal to the total precipitation in this period.
The BALSEQ model was developed for seasonal regions, which require daily balancing due to the water deficit caused by the difference between rainfall and potential evapotranspiration, on a daily basis.This aspect does not limit its application in a tropical humid environment.As pointed out by Lobo-Ferreira and Delgado-Rodrigues (1998), the model is sensitive to the parameters Curve Number and Total Water Storage Capacity (TWS).Although it was designed for a different region, it has been successfully applied to tropical conditions in Brazil in a large number of hydrological studies.The parameter soil water capacity (TWS) was adopted using field data of soil and standard laboratory analyses.Plant evapotranspiration cannot be directly measured and is therefore best estimated by the information available.
The BALSEQ model does not allow any assumptions related to the local climate.The model uses local climatological data as input to calculate evapotranspiration, runoff, and infiltration.The volume of infiltrated water that is potentially converted to groundwater recharge, when it exceeds the water storage capacity at a given soil depth, depends upon the local characteristics.This depth is measured according to the effective depth of the vegetation root system.
Rev Bras Cienc Solo 2019;43:e0180010 Therefore, the model is expected to give a reliable water balance according to the assumptions adopted.The main limitation of the model is that it gives punctual estimates, and to overcome this limitation, we used spatial maps as covariates of groundwater and applied the Random Forests model for the spatial prediction of groundwater recharge.

Model assessment
A three-step assessment was performed as follows: 1) sensitivity analysis; 2) uncertainty analysis; and 3) model validation.

Sensitivity analysis
Sensitivity analysis plays an important role in the calibration of hydrological models because it enables the evaluation of which parameters should have their values changed in the calibration.In addition, it provides information on the magnitude and direction (increase or decrease) of this change.One-at-time (OAT) sensitivity analysis presents the sensitivity of a variable (in this case PGR) to changes in a given parameter when all other parameters are kept constant.We used the sensitivity index (SI) (Equation 3) proposed by McCuen and Snyder (1986): in which SI is the sensitivity index; O and I stand for the output and input parameter value, respectively, associated to the minimum (I 1 , O 1 ), maximum (I 2 , O 2 ), and average value (Ī, Ō) of a given parameter.
We tested the following parameters: field capacity (FC), bulk density (Bd), crop coefficient (Kc), root depth, and curve number (CN).The model sensitivity to one parameter depends on the value of other parameters, as we did not have the "correct" values of the parameters.The initial values used for the analyses were defined according to the standard deviation and the average value of each parameter tested as follows: FC = 0.25 (g g -1 ); Bd = 1.1 (Mg m -3 ); Kc = 1; Depth = 0.50 m; and CN = 50.

Uncertainty analysis
All models are prone to uncertainties that can be structural, related to the algorithm, to its assumptions and to the parameters used.Uncertainties are also due to input data and caused by all types of data errors, the spatial distribution of measurement data, and the temporal resolution of measured data.The rainfall data is the main source of uncertainty in hydrological models (Fischer et al., 2014).
Traditionally, uncertainty analysis in hydrological models has been made using simulations in which the model performance is assessed while changing parameter values.These techniques have high computational costs, and only the uncertainty of one parameter is assessed.In the case of BALSEQ, the advantage of the model is its simplicity, which allows a discussion of uncertainties in a comprehensive way to evaluate not only the parameters, but also the model structure and the input data used in the groundwater modeling.

Model validation
To The land uses in these two areas are conventional coffee plantations (42° 31' 45.4" W and 20° 41' 53.9" S) and coffee plantations with trees (Agroforestry system) (42° 32' 34.5" W and 20° 39' 52.7" S); both areas are on Latossolos, with a total rainfall of 1,250 mm.The water balance for the 10-month period on the two locations was calculated using the BALSEQ model, applying local data (soil, vegetation, rainfall, and evapotranspiration) and the groundwater recharge compared with the soil moisture data of sensors installed at a depth of 1 m.

Spatial prediction of the groundwater using the Random Forests model
Groundwater estimated with the BALSEQ was used to predict the water recharge for locations not sampled in the basin, using the Random Forests model with maps of variables correlated with groundwater.Random Forests is a tree-based model developed by Breiman (2001), whose working principle is based on the construction of many trees, resulting in a "model of forests" without any pruning during the growth of the trees.
The setting of the model parameters [number of trees in the forest (ntree), number of predictors sampled for splitting at each node (mtry), and the number of divisions at the end of each tree (nodesize)] depends on the size of the dataset.Breiman (2002) explains how to define the parameters for obtain optimal estimates.In the case of a regression model, as in this study, basically a "nodesize" equal to 1 is usually appropriated and should not be very large to avoid a high computing demand.The "mtry" should be tested considering values in a range of the squared root of the number of predictors and a number twice larger or twice smaller, whereas the "ntree" does not affect the model accuracy; a number between 1,000 and 5,000 trees generally provides good results.
Each tree in the model is constructed from a selection of samples from the total dataset, which is done by the bootstrap method, while another subset is kept for assessing model accuracy.This selection, based on the data split, allows for a more robust error estimate, using the second set of samples called Out-Of-Bag (OOB).The OOB data set is excluded from the training dataset.The mean square error (MSE-OOB) is calculated from the aggregation of the predictions of all OOBs of the trees (Liaw and Wiener, 2018).
The MSE of the OOB is also used to evaluate the importance of covariates in the prediction.
For each tree in the Random Forests model, the prediction error of the OOB portion of the data is recorded (MSE).Subsequently, another run is performed after permuting each predictor variable.The differences between the two are then averaged over all trees and normalized by the standard deviation of the differences.
The subdivision of the trees at each node is made from the best set of subdivisions, randomly selected from the predictors of the total dataset, which prevents over-fitting.
The advantages of the method, according to Breiman (2002), are as follows: (a) higher prediction performance; (b) no overfitting; (c) low correlation of individual trees, as tree diversity increases through of the use of a limited number of predictors; (d) low deviation due to the mean of a large number of trees; and (e) robust error estimates using the OOB dataset.

Covariates used as a proxy of groundwater recharge
Random Forests was run in R software (R Development Core Team, 2016), using the Random Forests Package (Liaw and Wiener, 2018).All maps used to predict water recharge of the Doce River Basin were resampled for a 1,000-m spatial resolution.The choice of the spatial resolution was based on the scale and resolution of the maps.The covariates were selected as candidates for spatial models of recharge, based on factors driving the spatial variation of groundwater drivers identified in the literature (Lerner, 1990;Sophocleous, 1992;de Vries and Simmers, 2002;Scanlon et al., 2002), including rainfall, geology and soil, land cover, topography and landform, and groundwater condition.
To consider model parsimony, we used the RF importance rank of the OOB estimates to select the best predictor among the following terrain attributes: elevation above sea level, multiresolution index of the ridge top flatness (MrRTF), multi-resolution index of valley bottom flatness (MRVBF), slope, standardized elevation, topographic wetness index, and valley depth.The maps used as proxies for water recharge in the Random Forests model were as follows: • Annual rainfall and temperature, spatial resolution of 1,000 m (Hijmans et al., 2005).These two variables stand for the climate factor of the model; • Actual evapotranspiration (mm), with a spatial resolution of 920 m (Trabucco and Zomer, 2010); • Elevation above sea level, derived from a digital elevation model (DEM) generated with SRTM images with a spatial resolution of 90 m (Jarvis et al., 2008), representing the relief; • Annual solar radiation, generated on ArcGis using the DEM of SRTM images; • Lithological map representing the parent material, scale 1:1,000,000 (CPRM, 2004); • Pedological map, scale 1:500,000 (FEAM/UFV/SEMA, 2011); • Vegetation index EVI (Enhanced Vegetation Index), used as a proxy of vegetation (MODIS, 2009).

Assessment of Random Forests performance in groundwater prediction
To assess the performance of Random Forests in terms of predicting groundwater recharge, we partitioned the dataset in model training (75 %) and model validation (25 %).The model fit was assessed by means of the accuracy metrics, Root Mean Squared Error -RMSE (Equation 4) and R-squared (R 2 ), calculated using the OOB estimates of model training.The R 2 of Random Forests is a measure of model-fit comparable to the R 2 values of other models.This "pseudo R 2 " is labeled the "percentage of variance explained by the model".In addition to the model training, RMSE and R 2 were calculated for an external validation dataset to assess the power.
The RMSE indicates the magnitudes of the errors in the predictions by the global standardized correction included in the index calculation, as follows: in which: x i represents the recharge estimated by BALSEQ; xi the recharge predicted by the Random Forest model; and n is the number of locations used for the estimates.

Soil data
The statistical distribution of soil properties used as parameters to calculate the water balance for the BALSEQ model is shown in table 2. The results correspond to the expected behavior of the soil class surveyed in the basin, where Latossolos, due to a greater development of the granular structure, show the highest hydraulic conductivity, followed Rev Bras Cienc Solo 2019;43:e0180010 by Neossolos Litólicos, and Cambissolos.Argissolos, especially in the B horizon, show a lower infiltration capacity, resulting from the textural gradient formed by clay illuviation in the subsurface.The good consistency of soil parameters with reference to soil types is the high correlation between soil properties, such as bulk density, water content, and hydraulic conductivity.
Bulk density varied between 0.87 and 1.87 Mg m -3 .Neossolos Litólicos are young, shallow soils with primary minerals and higher values of bulk density.Consistently, on average, the Latossolos showed the lowest values (1.15 Mg m -3 ).
The Curve Number (CN) showed large variations for Neossolos Litólicos and Latossolos (SD = 17), with a minimum of 25 and a maximum of 79 and 80, respectively.

Sensitivity analysis
All parameters tested showed a negative response to the OAT sensitivity analyses (Figure 4).The parameters that presented the highest sensitivity were Kc (SI = -1.83),CN (SI = -0.89),and the variables used in the TWS (SI = -0.75).
The parameters FC, Bd, and Depth had an equivalent effect on the calculation of groundwater using the BALSEQ, as these variables are used in the calculation of water available for evapotranspiration (TWS).Hence, the sensitivity index values for these three variables are represented in the same graph (SI = -0.748).The TWS represents the water available for evapotranspiration.The higher the TWS, the higher the ETR and, consequently, the smaller the water available for recharge.The same can be observed for the crop coefficient (Kc), which varies according to the type of vegetation cover, directly influencing evapotranspiration.Therefore, the higher the value of Kc, and TWS and its components, the lower the PGR, which explains the negative signal of the SI.
The variable CN also presented a negative SI value, mainly because an increase in CN causes a direct increase in surface water flow, with lower infiltration and PGR.It should be noted, however, that for a negative variation in CN, the change in PGR is small and shows a rapid decrease for variations above zero.Other parameters in the model showed a linear behavior.
The parameter Kc showed the highest sensitivity, followed by CN and the combined parameters Bd, Depth, and FC.Thus, these parameters should be used for model calibration.Similar results were obtained by Pontes et al. (2015Pontes et al. ( , 2016)), when testing the BALSEQ model in small catchments in the Serra da Mantiqueira.

Conceptual
The BALSEQ water budget equation was developed for situations in which there is no artificial recharge, no surface runoff affluent from adjacent areas, the water table is always below the soil depth subject to evapotranspiration, there are no impediment subsurface layers, and there are no preferential paths of water in the soil.

Each of these assumptions refers to possible uncertainties not computed by the model.
There is no evidence to indicate the occurrence of artificial recharge or surface inflow affluent to the study area.Therefore, it can be discarded as a potential source of uncertainty.
On the other hand, the occurrence of saturated zones above the root depth, that is, places where the root system has access to the free aquifer, is an important source of uncertainty in the estimation of actual evapotranspiration, especially for forest vegetation.In this type of vegetation cover, the root system can reach saturated zones of the soil and absorb water to meet the evapotranspirative demand, which would cause an underestimation of the actual evapotranspiration.However, in this case, it would not directly affect the estimate of the potential recharge, since the saturated zone already constitutes water recharge.
Possible impermeable subsurface layers and preferential pathways are also important sources of uncertainties in PGR estimation.Given the high spatial variability of soil properties, the calculation of the effects of preferential pathways and dense layers is highly complex when included in hydrological models.Therefore, they are incorporated as uncertainties even in the most complex and robust models (Fatichi et al., 2016).One of the advantages of BALSEQ is the simplicity of its structure, which allows it to be adapted to different situations.
The preferential pathways of water flow are not traced and could be considered a limitation of the model.However, the criterion of classifying the recharge (e.g., direct, indirect, localized) is difficult to adhere to (Lerner, 1997), since the flow of infiltration through fissures, fractures, root channels, and galleries can potentially lead to direct recharge.In addition, there are other mechanisms, such as the variability of water percolation in soils with varying soil structure and pedoclimate, which can account for different recharge potentials and are not considered in the model.
Regarding the model parameters, related to soil data, it is worth highlighting the extensive work required for soil data surveys to characterize the physical soil properties (soil classification, bulk density, field capacity, and hydraulic conductivity), which contributes to minimize the uncertainties regarding soil parameters used in the model.Furthermore, FC and Bd present a small sensitivity index (-0.36)and therefore only slightly impact BALSEQ PGR estimations.
Rev Bras Cienc Solo 2019;43:e0180010 The CN is a critical variable for groundwater modeling, especially at values higher than 50, representing most values for the soils in the basin.Although CN has been developed for USA soils, it has been applied in hydrological models for tropical soils, particularly for soils in Minas Gerais (Silva et al., 2018).
The crop coefficient (Kc) is used to represent the evapotranspiration of each crop and varies according to the phenological stage of the crop.However, a fixed and constant Kc value was used for each crop due to the great complexity involved in taking this variation into account.This simplification introduces uncertainties in the estimation of evapotranspiration and, consequently, of PGR.On the other hand, the Kc values for the vegetation cover found in the basin are widely available in the literature, which reduces the degree of uncertainty regarding evapotranspiration.
Another difficulty in adopting depth values for trees is that the BALSEQ model does not work under conditions where the water table or even the capillary effect is in contact with roots, influencing evapotranspiration.Thus, it was decided to adopt an average fixed value for eucalyptus (Depth = 2.5 m) and for sub-deciduous seasonal forests (Depth = 3.5 m), which were lower than those suggested by other authors (Almeida and Soares, 2003).

Input data uncertainty
Regarding the modelling period (3 years), our choice was to exclude data of gauge stations with more than 15 % of missing data rather than filling the missing values with estimates based on the neighboring stations.Because the daily water balance uses a data series with missing data for some days (missing data were excluded, and the balance was continued on the next day), the number of days differed from season to season, which can result in non-systematic errors, especially with missing data from rainy seasons.To overcome this issue, we suggest that, as new data is collected, the groundwater model should be rerun for further estimates.
The accuracy of estimates is closely related to climatic and soil datasets used, particularly the climatic variables.Regarding the climatic variables, the number of rainfall gauge stations has a direct impact on the temporal and spatial accuracy of estimates.Rainfall is a key variable in the model since we use a water balance model in which the main input for the water budget is the amount of rain.This means that a large variation in input data will lead to increasing variability.In the Doce River Basin, the number of rainfall gauge stations was about three times that of meteorological stations.
Evapotranspiration is a variable typically estimated due to difficulties in finding measured data.The estimates use meteorological data and are hampered by missing data, which could not be estimated due to the short time series and the low number of neighbor stations.As already commented, the constraints for the time-scale used in this study were limited by the data of the meteorological stations.
Once more data is collected, a more accurate estimate will be possible, and new stations will lead to a decrease in spatial variation.

Model validation
For this comparison, it was considered that soil moisture at a depth of 1 m exceeded the water content at field capacity (FC) at that same depth, representing groundwater (Figure 5).
Qualitatively, it can be stated that the estimates of BALSEQ responded correctly to the recharge events, since the events were identified simultaneously by both methods and in both areas, agroforestry and conventional agriculture (Figure 6).The volume of recharge could not be quantified using the sensor method, since a volume of control should be Rev Bras Cienc Solo 2019;43:e0180010 adopted, which would dictate the volume of recharge infiltrated.Therefore, the analysis of validation was only of a qualitative nature.
Another observation that corroborates with the results of BALSEQ, especially for forested areas, is that in the agroforestry system, there was no recharge event between April and May, contrary to the conventional agricultural area.One possible explanation for this would be that after the rainy period (December to February), the trees in the agroforestry system increased the demand for water up to a depth of 1.00 m, leading to decreased soil moisture.In April, at the beginning of the rainy season, the subsoil under the agroforestry system should be drier, not exceeding field capacity.In addition, the rapid tree growth, after the first rains, increased the demand for water for evapotranspiration, with rapid water consumption at subsequent rainfall events.

Estimates of water balance using the BALSEQ model
The calculation of the water balance over the two-year period, using the BALSEQ model, was continuously run from the first day of the hydrological year of 2007 to the end of the hydrological year of 2009 for each point of the 123 soils.This means that the water balance considering runoff, evapotranspiration, and the potential recharge summed up 100 % of the accumulated rainfall at the end of the two-year period.The annual balance indicated the annual pattern of groundwater recharge, whereas the total period allowed to observe the long-term trend.
The statistical summary of the water balance variables (rainfall, evapotranspiration, runoff, and groundwater recharge) is presented in table 3. Overall, considering the annual and biannual time length, the highest percentage of rainfall was evapotranspirated (53-73 %), while surface runoff accounted for only 4-11 %.Groundwater recharge varied from 32-35 % of the total precipitation, while annual rainfall varied from a minimum of around 500 mm to a maximum of 1,662 and 1,998 mm.
During the two-year period, the average recharge was 757 mm, which accounts for 32 % of the rainfall.Recharge was considerably smaller in the first year compared to the second year.On average, 25 and 35 % of the rainfall were converted to recharge in the first and second year, respectively.The factors that contributed to the lower percentage of water percolating to the deeper soil layers in the first year were the lesser volume of water rainfall and the higher volume of evapotranspirated water.The volume of groundwater recharge observed for the Doce River Basin was similar to that observed in other studies in nearby areas.For example, Paiva (2006) estimated the recharge of the Piranga River watershed, a tributary of the Doce River, and found a potential recharge of 33 % of rainfall, using the water balance method.Also, using the water table method, a study on a watershed in the municipality of Araguari, Minas Gerais, showed a recharge of 38 % of the mean rainfall (Velasquez et al., 2007).
In a study on a karst area, using the BALSEQ model, Camargo et al. (2011), observed a higher volume of recharge (43 % of the annual rainfall) than that determined in the present study, which is most likely a result of the high rock porosity in karst areas, facilitating the percolation of water to the deeper soil layers.Rev Bras Cienc Solo 2019;43:e0180010 calculated with the BALSEQ to estimates obtained with the base runoff method was considered satisfactory.
Overall, among soil classes an average of 27-48 % of the rainfall contributed to groundwater, with the lowest rate found for Argissolos and the largest for Neossolos.Latossolos occurs in 66 % of the basin and therefore feature areas of importance to maintain water recharge, with low levels of surface runoff and a moderate rate of recharge.On the other hand, Argissolos showed the lowest potential for recharge and the second largest for the generation of surface runoff; this soil type is therefore more susceptible to erosion and losses of soil and water (Figure 7).
Neossolos Litólicos, although presenting the highest potential for recharge, with up to 48 % of rainfall turning into groundwater, only covered 3 % of the basin (~5,000 km 2 ); nevertheless, such areas are of high importance for the preservation and maintenance of groundwater levels.
Cambissolos showed the highest percentage of runoff (13 %); this was expected, as Cambissolos typically occur on steep reliefs, contributing to increased runoff rates (Figure 7).

Spatial prediction of groundwater recharge
The potential groundwater was spatially predicted using environmental maps considering the interactions of climate, lithology, morphology, soil type, and vegetation, based on the Random Forest model.The parameters of the Random Forest model which gave the best results were as follows: number of trees (ntree = 1,000), number of predictors sampled for splitting at each node (mtry = 2), and the number of divisions at the end of each tree (nodesize = 5).These parameters were defined after simulation using values of the intervals indicated by Breiman (2002).
The importance of each predictor variable used in the model, measured by its contribution to decreasing the MSE of prediction, is presented in figure 9.
Rainfall appears as the main driver of recharge, followed by evapotranspiration, lithology, and elevation.Rainfall contributed to 16-17 % of MSE decrease.Solar radiation, temperature, and the vegetation index EVI contributed with similar importance, while the soil map showed the lowest contribution (maximum 7 %), which can be related to the level of generalization of the map (Figure 8).The soil map was simplified based on the first component of the map unit, which corroborates with its lower importance on spatial prediction of the recharge.
Previous studies on mapping PGR conducted by Naghibi et al. (2017) and Rahmati et al. (2016) showed elevation as the main predictor, and in both studies, the Random Forest model performed better than the other models evaluated.In the study of Naghibi et al. (2017), for an area of 1,524 km 2 in Iran, the authors observed that morphometric maps, such as those showing elevation and slope, were highly important for the Random Forest model compared to lithology and land use maps.This study compared Random Forest performance with Kernel-based support vector machine methods.In our study, we obtained an R 2 of 0.22 and 0.27 for predicting groundwater recharge for the first and second year, respectively, and of 0.27 for the two-year period (Table 3).
The accuracy of prediction based on the external validation dataset showed a higher  Rev Bras Cienc Solo 2019;43:e0180010 accuracy for the two-year period, with an R 2 of 0.34 and a lower RMSE.For the annual recharge, the accuracy showed little change (Table 4).
Considering the scale and resolution of maps used for the spatial prediction of recharge, as well as the natural limitation of soil data and climatic variables, we consider the accuracy of the prediction to be satisfactory.In addition, we highlight the fact that the regional scale of the Doce River Basin, which comprises a large diversity of environmental conditions, contributes to the high prediction variability.
A visual analysis of the spatial distribution of the groundwater recharge across the basin showed a similar annual recharge pattern over the studied period (Figure 9).The maps for the two-year period and the maps of each year separately depicted higher recharge rates on the western border of the basin (Espinhaço Range), where rainfall and elevation levels are higher.Also, the headwaters of the Doce River at the south are also areas of high potential recharge on the plateau.
Groundwater recharge rates are directly related with rainfall distribution.Areas of higher recharge (>500 mm year) are concentrated on the western side of the basin, where rainfall above 1,500 mm (Figures 2b and 2c).Temperature, on the other hand, is generally lower in areas with a maximum groundwater recharge.Maximum monthly temperature varies from 21-30 °C.The forest cover is also more preserved on the western side of the basin, as well as at the headwaters of the basin to the south (Figure 2d).The combination of higher elevation and rainfall with lower temperature and a denser vegetation is therefore favorable for soil water percolation and groundwater recharge.The identification (stratification) of zones of the basin into areas of different PGR levels can be a valuable tool for land planning.
Using the quantile interval, the groundwater maps were classified into four different groundwater potential zones, including low, medium, high, and very high groundwater  potential zones (Figure 9).In this regard, the use of the quantile method for stratifying the PGR into classes has effectively been applied by Nampak et al. (2014), Naghibi and Pourghasemi (2015), and Razandi et al. (2015).

CONCLUSIONS
The estimation of the water balance for the Doce River Basin, using the BALSEQ model, is feasible with a combination of soil, vegetation, precipitation, and evapotranspiration data.Hence, the amount of water that can potentially percolate and reach the water table can effectively be quantified and modelled.
An innovative aspect of this study is spatial prediction of groundwater recharge using the statistical Random Forest model, which differs from the commonly used methods based on the weighted average method, using conventional soil maps.
The spatial prediction of groundwater recharge using the Random Forests model showed a moderate performance, which is compatible with the scale of the maps used as predictors.
Both models, the BALSEQ model to estimate the recharge and the Random Forest model for the spatial prediction of the recharge, showed feasibility of updating the groundwater map when new datasets were available.
The maps of groundwater recharge of unconfined aquifers could be used to guide regional land and water use planning to help protecting the groundwater resources in the Doce River Basin.In future studies, data on infiltration and groundwater percolation, obtained at the field level, should be considered for the validation of estimates, ensuring a greater reliability of the groundwater recharge at the basin scale.

Figure 2 .
Figure 2. Location of the Doce River Basin in the state of Minas Gerais (MG), Brazil, and the thematic maps: (a) soil types and location of the soil profile used, (b) rainfall gauge stations, (c) meteorological gauge stations, (d) land coverage, and (e) elevation above sea level.

Figure 4 .
Figure 4. Variation in groundwater recharge due to the BALSEQ model parameters (a) field capacity, soil bulk density, and effective depth of the root system; (b) crop coefficient; and (c) curve number.

Figure 5 .
Figure5.Soil moisture and water content at field capacity (FC) at a depth of 1 m in the agroforestry system.The area above the red line corresponds to the total volume of recharge (water per volume of soil).

Figure 6 .
Figure 6.Comparison between recharge calculated with the BALSEQ model and recharge measured with the sensor in a conventional coffee plantation (a) and an agroforestry system (b) (PGR = potential groundwater recharge; Rp = potential recharge with BALSEQ; R sensor = recharge with the TDR sensor).

Figure 7 .
Figure 7. Mean values of the water balance components for the different soil classes for the two-year period from 2007-2009; the values in percentage refer to the (%) of rainfall in the water balance components (Potential recharge, Runoff, Evapotranspiration).

Figure 8 .
Figure 8. Percentage of covariate contributions in the spatial prediction of groundwater recharge in the Doce River Basin.

Random Forest (Spatial prediction of groundwater in the Doce River Basin) Rainfall (daily)
Flowchart of the methodological groundwater estimation and spatial prediction in the Doce River Basin, Minas Gerais, Brazil.

Table 1 .
Vegetation types and number of profiles sampled in the basin, with respective parameters adopted for the vegetation in the water balance model Allen et al. (1998)values were obtained fromAllen et al. (1998), except: weather station (E-5000, Irriplus Equipment's) and soil moisture reflectance sensors (CS 616, Campbell Scientific)(Udawatta et al., 2011).
validate the groundwater recharge from the BALSEQ model in the typical soil conditions, vegetation, and climate in the Doce River Basin, two locations at the headwaters (municipality of Araponga) were monitored by collecting soil moisture data during a period of 10 months (09/01/2009 to 07/07/2010).The locations were monitored using Rev Bras Cienc Solo 2019;43:e0180010 a

Table 2 .
Statistical distribution of the soil properties used as parameters of water balance in the BALSEQ model Teixeira et al. (2017)d permanent wilting point (PWP) determined by Richard's chamber method; bulk density (Bd) (volumetric method), and hydraulic conductivity of saturated soil (Ko) according toTeixeira et al. (2017); classification of soil into hydrologic groups according to the Curve Number (CN) method(SCS, 1993).

Table 3 .
Statistical summary of the water balance for the two-year period(2007)(2008)(2009)and the hydrological years2007-2008 and  2008-2009in the Doce River Basin, obtained with the BALSEQ model for the point data estimates PGR = potential groundwater recharge; SR = superficial runoff; ETR = evapotranspiration, values in (%) are given in percentage of rainfall; SD = standard deviation; CV = coefficient of variation.Rev Bras Cienc Solo 2019;43:e0180010 Rahmati et al. (2016)mapped PGR classified in categorical classes (low, moderate, and high) for a catchment of 226 km 2 .Their results showed elevation as the most important predictor, contributing to 50 % of the decrease in accuracy.Similar to the present study, the lithological map showed a relatively low contribution.Rahmati et al.
(2016)compared the Random Forest model with a model of maximum entropy, which performed significantly better.

Table 4 .
Statistical distribution of the groundwater recharge predicted for the Doce River Basin for two hydrological years Min = minimum; Max = maximum; SD = standard deviation; CV = coefficient of variation; R 2 = R-squared; RMSE = root mean squared error.Rev Bras Cienc Solo 2019;43:e0180010