Assessing two precipitation data sources at basins of special interest to hydropower production in Brazil

Accurate estimates of precipitation amounts are necessary to evaluate river flows, assess water-related risks (floods and droughts) and quantify water availability for a broad range of water uses, such as water supply, agriculture, navigation and energy production. Especially in the context of operations in the Brazilian electricity sector, where the electrical system is essentially hydrothermal and more than 65% of its production comes from hydroelectric generation, real-time observed precipitation plays a key role as a primary input for hydrological models and river flow forecasting. It is thus crucial to build knowledge on and quantify river basin precipitation and its uncertainties. In this paper, we evaluate two sources of real-time (or near real-time) precipitation data: the TRMM-MERGE dataset from CPTEC and the CPC dataset distributed by NOAA. Our assessment is based on 41 river basins in South America and covers the period 1997-2017. We investigated differences at different time resolutions (daily, monthly and annual precipitation) and their impact on the simulation of streamflows. Substantial differences were found between the two data sources, which seem to be amplified in the second decade. A spatial trend was found towards higher TRMM-MERGE precipitation values than CPC values when moving north and west within the study area. We also found evidence that differences in precipitation propagate to simulated flows, with large percent differences in precipitation resulting in even larger percent differences in streamflow.


INTRODUCTION
Knowledge of surface precipitation is particularly important in hydrological research and operations. Accurate estimates of precipitation amounts are necessary to estimate river basin runoff, assess water-related risks (floods and droughts) and evaluate water availability for a broad range of water uses (e.g., water supply, agriculture, hydropower, and environmental protection).
At the global scale, gridded precipitation products have emerged since the late 1990s (Huffman et al., 1997; Adler et al., 2003). These products usually provide monthly estimates of surface precipitation from merged analyses that blend precipitation estimates from satellite data and in-situ rain gauge observations. While they can be useful for global climate change impact studies, finer spatial and temporal resolutions are often needed for hydrological applications that involve daily decision-making at continental or national scales, such as flood forecasting or hydropower operations (Alfieri et al., 2013; Fan et al., 2016; Emerton et al., 2016; Siqueira et al., 2018).
In Brazil, hydropower generation is responsible for 65.2% of the electric production (Empresa de Pesquisa Energética, 2015). The inflows to hydroelectric plants have a considerable influence on planning the operation of the electrical system, as well as on setting energy prices in the short-term market. Computer models that optimize the system's operation, solving the hydrothermal dispatch problem, run once a week, every Thursday, providing forecasts of inflows on a daily and weekly basis for the first five weeks and on a monthly basis for the next months (Operador Nacional do Sistema Elétrico, 2016).
Currently, the National Operator of the Electric System (ONS) uses historical natural flow records in statistical models for monthly forecasts and daily precipitation data from rain gauges to run hydrological models for daily flow forecasts at dozens of hydropower plants distributed across the Brazilian territory. Daily precipitation data are provided by the national electricity generators. When real-time precipitation data show gaps or inconsistencies, corrections have to be applied to obtain a better fit between simulated and observed flows during the warm-up phase of the hydrological models, i.e., before a forecast is issued (Operador Nacional do Sistema Elétrico, 2016). This is an important step to achieve accurate inflow forecasts to the hydropower plants.
Uncertainty in real-time daily precipitation data may come from various sources, such as the low density of the gauging network in some regions, human errors when reading the data, measurement problems at the gauging stations, and data communication failures. Observed real-time precipitation will therefore always be an approximation of the actual precipitation falling inside the river basins.
However, accurate real-time precipitation estimates are not the only challenge in hydropower operation. Long-term records of unbiased daily gridded observed precipitation data are also crucial when running sub-seasonal to seasonal forecasting systems. This is the case when using the Ensemble Streamflow Prediction (ESP) method, a widely applied technique to generate ensembles of possible future scenarios of streamflow over several weeks and months ahead. The method is based on using a continuous hydrological model to estimate initial hydrological conditions (using real-time meteorological data as input) and future meteorological forecasts (based on historical sequences of meteorological data) to obtain streamflow predictions several months ahead (see recent applications in, for instance, Crochemore et al., 2016; Bennett et al., 2017; Arnal et al., 2018; Harrigan et al., 2018). Reliable and consistent long-term historic meteorological data are therefore also crucial when running seasonal forecasting systems. For the operation of the Brazilian hydropower system at seasonal lead times, a necessary preliminary step to setting up an ESP system is to ensure that a homogeneous long-term precipitation time series over the whole country is available. Beck et al. (2017) listed a group of 22 gridded rain datasets, but only a few of them have, simultaneously, a daily temporal resolution and a spatial resolution smaller than or equal to 0.5° covering South America. Among them, only two datasets were available in real time (or near real time), which is a necessary characteristic to use the data in forecasting systems. These datasets are the CPC Unified Gauge-Based Analysis of Global Daily Precipitation of the US National Center for Atmospheric Research (NCAR) and NASA's IMERG (Integrated Multi-satellitE Retrievals for GPM, the Global Precipitation Measurement mission), the successor of the Tropical Rainfall Measuring Mission (TRMM) data products.
Sun et al. (2017) also reviewed 30 currently available global datasets, based on gauging stations, satellite estimates and reanalysis. They compared 22 datasets from daily to annual time scales and found large discrepancies between them. The magnitude of the differences in annual precipitation estimates between two different data sources was found to be as high as 300 mm/yr. The discrepancy in precipitation amounts, however, varied from region to region and according to the time scale. According to the authors, important differences can limit the capacity of the products to be used for climate monitoring, attribution and model validation.
Another example is provided by Negrón Juárez et al. (2009), who analyzed six databases, some based on radar and others based only on gauging stations. They observed that the largest difference among the datasets over the Brazilian Amazon region was 8% during the peak of the rainy season (December to March). Over the Brazilian northeast region, the maximum difference in the wet season rainfall total (February to April) was 30 mm, or 18%.
Many other authors have studied the uncertainties and the differences among different sources of observed precipitation data in different areas of the world, with some studies focusing on South America (e.g., Demaria et al., 2011; Scheel et al., 2011; Falck et al., 2015; Mantas et al., 2015). Overall, their main conclusion is that differences between different sources of observed precipitation data are common, and often vary in space and time.
The objective of this study is to evaluate two real-time (or near real-time) sources of gridded daily observed precipitation data available over the Brazilian and adjacent territory, namely the TRMM-MERGE (Rozante et al., 2010) and the CPC (Chen et al., 2008) datasets. The differences between these two datasets are investigated in space and time. We evaluated the datasets over 41 river basins of special interest to hydropower production. We also considered the evolution in time of the deviations and the deviations at different temporal resolutions (annual, monthly and daily time steps). The impact of using these different datasets in hydrological modelling is also presented for two case studies.

Study area
The study area covers 41 river basins that represent 31 main hydroelectric power plants in Brazil. These river basins vary in size, with drainage areas ranging from 9 300 km² to 382 000 km².
The study area extends from the north (Madeira River, Xingu River, Tapajos River, Tocantins River, and others) to the south of Brazil (Iguaçu River basin), and includes also river basins located at the central part of the country (Paraná River, Grande River, São Francisco River). Figure 1 shows the study area, with the main hydropower plants indicated in the map and the basin areas associated with the hydropower plants delimited in red.

Observed precipitation datasets
Two datasets are evaluated in this study: the TRMM-MERGE and the CPC datasets.
The TRMM partnership project between the US National Aeronautics and Space Administration (NASA) and the Japan Aerospace Exploration Agency (JAXA) started in November 1997, with the main goal of studying and monitoring precipitation in tropical regions (Kummerow et al., 2000). The TRMM satellite uses several instruments to detect rainfall, including radar, microwave imaging, and lightning sensors (Maggioni et al., 2016;Huffman et al., 2017). Although the TRMM products are considered valuable for numerical validation and simulations (Rozante et al., 2010), systematic errors have been detected, especially on the coast of the northeast of Brazil and in the south region of Brazil, close to the triple frontier between Brazil, Argentina, and Paraguay.
The TRMM-MERGE data was developed by CPTEC/INPE to reduce the interpolation problems observed in regions with a low density of rain gauges, which cause under- and over-estimations in the TRMM products. It combines gauging station datasets from the Global Telecommunications System (GTS), telemetric stations from various agencies and companies in South America and the real-time TRMM rainfall product (3B42RT), providing an improved quality gridded dataset with a spatial resolution of 0.25° for evaluation of models and operational uses (Rozante et al., 2010). Basically, the merging technique consists in identifying the TRMM grid boxes where observations are present, discarding the two grid boxes adjacent to the observation point, and, finally, interpolating the TRMM precipitation and the ground observations using the Barnes objective method (Barnes, 1973, apud Rozante et al., 2010). The TRMM-MERGE daily precipitation data used in this study was obtained from the CPTEC FTP site (Centro de Previsão do Tempo e Estudos Climáticos & Instituto Nacional de Pesquisas Espaciais, 2018). The information is available in the grib2 format, at 0.25° resolution, and the historic period covers 1997 to 2017.
The CPC data (Chen et al., 2008) is a product developed by the US NOAA's Climate Prediction Center. It comes from a project created to develop a group of automatic procedures to do quality control for the GTS daily precipitation products, comparing historical gauge records, concomitant observations at nearby stations, satellite estimates and numerical model forecasts. As a product of this project, NOAA/NCEP provides a daily observed precipitation gridded dataset with 0.5° spatial resolution since 1979.
The CPC data used in this study was obtained from the NCEP/NOAA FTP site (National Oceanic and Atmospheric Administration, 2018). The information is available in a grib2 format, and was retrieved for the historic period that covers 1979 to 2017.

METHODOLOGY
Our methodology consists of two main analyses. First, we evaluate basic statistics from the two different precipitation data sources for the average precipitation over each river basin. Secondly, we investigate the impact of the differences in precipitation on streamflow simulations on two selected river basins, one with a low and another with a high difference between the two precipitation data sources. For the flow analysis, we set up and calibrated the HEC-HMS model, as presented below.

Average precipitation for each river basin
The first step to obtain the average precipitation is to delimit the contour of the basins and generate the shapefile with the basins that will be superposed on the gridded data files. In this study, we used the DELFT FEWS-CEMIG System (Pinto et al., 2013; Werner et al., 2013; Schwanenberg et al., 2015; Gibertoni et al., 2017) to obtain the average precipitation for each basin and time step. We configured the FEWS system to read the shapefiles and to use the geographic information provided to extract the average precipitation. The system uses a workflow of routines to perform the calculations. The method used to obtain the average precipitation is the "Average Area" method, which takes the mean of the data points inside the shape of each basin.
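As an illustration of the idea behind an area average of this kind, the sketch below takes the mean of the grid values whose cell centres fall inside a basin polygon. It is not the FEWS implementation: the ray-casting test, the function names, the 4 x 4 grid and the square "basin" are all hypothetical and only meant to show the principle.

```python
import numpy as np

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; the last vertex connects to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def basin_average(grid, lons, lats, polygon):
    """Mean of the grid values whose cell centres fall inside the polygon.

    grid has shape (len(lats), len(lons)); lons and lats are cell-centre coordinates.
    """
    mask = np.array([[point_in_polygon(lon, lat, polygon) for lon in lons]
                     for lat in lats])
    if not mask.any():
        raise ValueError("no grid cell centre falls inside the basin polygon")
    return float(grid[mask].mean())

# Hypothetical 0.25 degree grid (values in mm/day are made up) and square basin.
lons = np.array([-50.875, -50.625, -50.375, -50.125])
lats = np.array([-20.875, -20.625, -20.375, -20.125])
grid = np.arange(16, dtype=float).reshape(4, 4)
basin = [(-50.75, -20.75), (-50.25, -20.75), (-50.25, -20.25), (-50.75, -20.25)]
basin_mean = basin_average(grid, lons, lats, basin)  # mean of the 4 inner cells
```

For small basins on a coarse grid, a cell-centre test like this one can select very few cells, which is one reason why the two datasets (0.25° and 0.5°) may not sample a basin in the same way.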

Basic statistics of precipitation differences
We analyzed the daily, monthly and annual precipitation totals of the two data sources, and their basic statistics. We also evaluated the percent differences between the TRMM-MERGE and CPC precipitation data at each time t, where P1 is the precipitation from TRMM-MERGE and P2 is the precipitation from CPC. A positive (negative) difference indicates a higher (lower) value of precipitation for the TRMM-MERGE dataset.
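A minimal sketch of such a percent difference is given below. The choice of the CPC value (P2) as the reference in the denominator is an assumption made here for illustration; the sign convention (positive when TRMM-MERGE is higher) follows the text. The function name and the annual totals are hypothetical.

```python
def percent_difference(p1, p2):
    """Percent difference of P1 (TRMM-MERGE) relative to P2 (CPC).

    Assumption: CPC is taken as the reference (denominator).
    Returns None when P2 is zero, since the ratio is undefined.
    """
    if p2 == 0:
        return None
    return 100.0 * (p1 - p2) / p2

# Made-up annual totals in mm for three hydrological years.
trmm_merge = [1500.0, 1320.0, 1100.0]
cpc = [1200.0, 1320.0, 1250.0]
diffs = [percent_difference(a, b) for a, b in zip(trmm_merge, cpc)]
```

With a different reference (P1, or the mean of P1 and P2), the magnitudes change but the sign of each difference does not.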
In the annual analysis, we considered the hydrological year from October 1st to September 30th to calculate the annual totals. From the time series of the annual percent differences, we estimated the following basic statistics: the minimum, the first quartile, the mean, the third quartile, and the maximum (Naghettini & Pinto, 2007). For the monthly analysis, we used box-plot representations of the monthly percent differences to display the distribution of differences in monthly precipitation.
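The annual aggregation and the five summary statistics can be sketched as follows. The labelling of a hydrological year by the calendar year in which it ends, the quartile interpolation rule and the synthetic input series are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date, timedelta
import statistics

def hydrological_year(d):
    """Label a date by the calendar year in which its hydrological year
    (October 1st to September 30th) ends. The labelling is an assumption."""
    return d.year + 1 if d.month >= 10 else d.year

def annual_totals(dates, precip):
    """Sum daily precipitation (mm) into hydrological-year totals."""
    totals = defaultdict(float)
    for d, p in zip(dates, precip):
        totals[hydrological_year(d)] += p
    return dict(totals)

def summary(values):
    """Minimum, first quartile, mean, third quartile and maximum."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "mean": statistics.mean(values),
            "q3": q3, "max": max(values)}

# Synthetic series: 1 mm/day from 1997-10-01 to 1999-09-30 (730 days).
start = date(1997, 10, 1)
dates = [start + timedelta(days=i) for i in range(730)]
totals = annual_totals(dates, [1.0] * len(dates))

# Made-up annual percent differences for one basin.
stats = summary([-10.0, 0.0, 10.0, 20.0, 30.0])
```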
For the daily analysis, we estimated the Empirical Cumulative Distribution Function (ECDF), $F_n(y)$ (Naghettini & Pinto, 2007), of the daily precipitation values for each dataset. We considered only the amounts higher than 1 mm/day over the basin areas. For observations $x = (x_1, x_2, \ldots, x_n)$, $F_n(y)$ is the fraction of observations less than or equal to $y$:

$$F_n(y) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le y)$$

where $I(x_i \le y)$ is the indicator function, equal to 1 when $x_i \le y$ and 0 otherwise, and $n$ is the number of data points.
Finally, to visualize the geographic impact of the differences between the datasets and their evolution in time, we used maps to represent the percent differences of the annual average precipitation for two decades: 1998-2007 and 2008-2017.
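The empirical distribution function used in the daily analysis, restricted to amounts above 1 mm/day, can be sketched as below. The quantile rule (smallest observation whose ECDF value reaches the target probability) is one common convention and is an assumption here; the daily values are made up.

```python
import math

def ecdf(values, y):
    """F_n(y): fraction of observations less than or equal to y."""
    return sum(1 for v in values if v <= y) / len(values)

def ecdf_quantile(values, prob):
    """Smallest observation y such that F_n(y) >= prob."""
    ordered = sorted(values)
    # The small tolerance guards against floating-point noise in prob * n.
    k = math.ceil(prob * len(ordered) - 1e-9) - 1
    return ordered[max(k, 0)]

# Made-up basin-average daily precipitation (mm/day).
daily = [0.0, 0.4, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
wet = [p for p in daily if p > 1.0]   # keep only amounts above 1 mm/day
f5 = ecdf(wet, 5.0)                   # fraction of wet days with at most 5 mm/day
q90 = ecdf_quantile(wet, 0.9)         # quantile for non-exceedance probability 0.9
```

Applying the same two functions to each dataset gives quantiles that can be compared at a common probability level.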

Flow analysis with the HEC-HMS Model
Streamflow modeling for flow forecasting has to be performed using a continuous model for the simulations in order to evaluate the initial conditions at the onset of the forecasts. For this, it is necessary to choose a more sophisticated configuration of the HEC-HMS model (Feldman, 2000). Below, we describe the modules used and the method chosen for each module in order to build the hydrological modeling approach used in this study (Scharffenberg, 2016).
• Canopy method: Simple Canopy. Precipitation is intercepted until the canopy storage capacity is filled; all excess precipitation falls to the surface. The potential evapotranspiration is used to empty the canopy storage.
• Surface method: Simple Surface. Precipitation that arrives on the soil is captured until the storage capacity of the surface is filled; runoff then starts with the excess precipitation. The water on the surface infiltrates into the soil according to the soil's infiltration capacity.
• Loss method: Soil Moisture Accounting (SMA). This loss method uses three layers (soil storage, upper groundwater, and lower groundwater) to represent the dynamics of water movement in the soil. For the given precipitation and evapotranspiration, the model calculates the surface runoff, groundwater flux, losses and deep percolation over the whole basin. The method is capable of simulating wet and dry cycles and can be used for long periods of continuous simulation.
• Transformation method: Clark Unit Hydrograph. This method is a synthetic unit hydrograph; its main parameters are the time of concentration, which defines the travel time in the sub-basin, and the storage coefficient, which accounts for storage effects through a linear reservoir.
• Base-flow method: Linear Reservoir. This method uses a linear reservoir to model the recession of the base-flow after a precipitation event, conserving mass. The lateral outflow of the groundwater is connected with the infiltration from the soil moisture accounting loss method.
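To make the linear reservoir concept that appears in the transformation and base-flow methods concrete, the sketch below routes an inflow series through a single reservoir whose outflow is proportional to its storage (S = k Q). It is a generic textbook formulation, not the numerical scheme of HEC-HMS; the function name, the storage coefficient and the flows are hypothetical.

```python
import math

def linear_reservoir(inflow, k, q0=0.0, dt=1.0):
    """Route an inflow series through one linear reservoir (storage S = k * Q).

    Uses the exact solution of dS/dt = I - Q for inflow held constant over
    each time step. k and dt must be in the same time unit (e.g. days).
    """
    decay = math.exp(-dt / k)
    q = q0
    outflow = []
    for i in inflow:
        q = q * decay + i * (1.0 - decay)
        outflow.append(q)
    return outflow

# Recession without recharge: outflow decays exponentially from q0.
recession = linear_reservoir([0.0] * 5, k=10.0, q0=100.0)

# Constant recharge: outflow converges to the inflow rate.
steady = linear_reservoir([50.0] * 500, k=10.0)
```

A larger storage coefficient k gives a slower recession, which is the behaviour the base-flow method is meant to reproduce between precipitation events.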
The HEC-HMS configuration selected has 26 free parameters to be calibrated against observed flow data. In our study, observed streamflow data comes from ONS (National Operator of Electric System). The calibration was performed manually, by comparing simulations with observations and minimizing volume errors.
For the analysis of the impact of the different precipitation data sources on the simulations of streamflow, we calibrated two selected basins. They represent the extremes of precipitation differences observed during the annual precipitation analysis. These basins are: i) the UHE Campos Novos, located in the south region, on the Canoas River (Uruguay river basin), where the differences between the precipitation data sources are smaller than 5% and there is no strong seasonality in precipitation, and ii) the UHE Tapajos, located in the north region, on the Tapajos river (Amazon basin), with annual differences between precipitation data sources higher than 40% and a strong precipitation seasonality. These basins are shown in solid blue in Figure 1.
In the flow analysis, we want to investigate what happens if we calibrate the hydrological model with one dataset and simulate it with another dataset. The hydrological model is first calibrated for the complete data period, from October 1997 to September 2017, with one climatic forcing and then run to simulate streamflow using the other climatic forcing, over the same period. Since we have two climatic forcing datasets, the calibration and simulation procedure is done twice.
In order to analyze the effects of the different precipitation datasets used as input to the hydrological models, we plotted the ECDF of both the TRMM-MERGE and the CPC simulated time series. We also evaluated the simulated flows against observed flows using four numerical criteria as performance indicators: NSE, RMSE, KGE and R².
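As a sketch of how these four criteria can be computed from paired series of observed and simulated flows, the functions below follow the usual textbook definitions. Taking R² as the squared Pearson correlation and using the original KGE formulation of Gupta et al. (2009) are assumptions about the exact variants used; the flow values are made up.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def rmse(obs, sim):
    """Root mean square error, in the units of the flows."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability ratio and bias ratio."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()      # variability ratio
    beta = sim.mean() / obs.mean()     # bias ratio
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

def r2(obs, sim):
    """Coefficient of determination, taken here as the squared Pearson correlation."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.corrcoef(obs, sim)[0, 1] ** 2)

# Made-up flows (m3/s): the simulation has a constant bias of +1.
obs = [1.0, 2.0, 3.0, 4.0]
sim = [2.0, 3.0, 4.0, 5.0]
scores = {"NSE": nse(obs, sim), "RMSE": rmse(obs, sim),
          "KGE": kge(obs, sim), "R2": r2(obs, sim)}
```

In this example the correlation is perfect, so R² does not detect the bias, whereas NSE, RMSE and KGE all penalize it. This is one reason to report several criteria together.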
NSE - The Nash-Sutcliffe efficiency (Nash & Sutcliffe, 1970) measures how good the results of the model are when compared with a benchmark simulation given by the mean observed flow. Values equal to 1 indicate a perfect fit and values smaller than zero indicate that the mean is a better predictor than the model.
RMSE - The Root Mean Square Error is a common measure of the accuracy of a model. It is calculated by taking the square root of the average of the squared differences between observed and simulated values. It can be interpreted as the standard deviation of the model prediction error. A smaller value of RMSE indicates better model performance.
KGE - The Kling-Gupta efficiency (Gupta et al., 2009) is an alternative criterion to the NSE, and was proposed to assess the qualities of a model in terms of its ability to represent the water balance, flow variability and correlation. Values range between −∞ and 1 and, as for the NSE, values close to 1 indicate a more accurate model.
R² - The coefficient of determination is an indication of how well one variable correlates with the other. It is represented as a value between 0 and 1. The closer the value is to 1, the better the fit, or relationship, between the two variables.

RESULTS

Evaluation of annual precipitation totals
Table 1 shows the basic statistics of the percent differences for annual precipitation totals for each basin. From this table, we can see that the statistics of the percent differences in total annual rainfall range between -31% and +42%. Large differences are more often positive than negative. This means that when differences are high, it is more frequently due to higher values of rainfall given by the TRMM-MERGE dataset.
From Table 1, we also note that the magnitude of the differences in annual precipitation varies according to the basin. Some basins tend to exhibit a similar behavior in terms of basic statistics of the percent differences between annual precipitation totals. We detected eight groups of similar behavior (Basin Groups I to VIII, hereafter, BGI to BGVIII). These groups are indicated in Table 1.
Figure 2 shows how the differences in annual precipitation totals between data sources evolve over the years. We selected one basin representative of each BG. We can see that the basins in the north region (BGI and BGII) more often display high positive differences, with values tending to increase with time, mainly after 2010. These BGs are generally affected by strong variations of one of the sources: either TRMM-MERGE presents very high annual precipitation totals over the whole period (BGI) or CPC presents very low annual totals for a more recent period (BGII). These results illustrate how large the uncertainties in precipitation can be in this region, where the density of gauges is low. For the other regions, where the gauge density is higher, the variation of the differences is smaller and tends to be more linked to specific time periods. BGIII, BGIV and BGVI (northeast, central-west and southeast regions), for instance, more often display high values for the negative differences, with these occurring either in the initial years or in the final years of the study period. In the south region, BGVII and BGVIII present alternating years of positive and negative differences throughout the study period.

Evaluation of monthly and daily precipitation
Figure 3 shows the statistical distribution (boxplot) of the monthly differences. The line in red represents the monthly average precipitation of the TRMM-MERGE data. It provides a reference to compare the magnitude of the deviations to the totals. With this information, it is possible to visualize the differences along the months and the seasons in each of the eight groups of basins with similar behavior. We can see that monthly precipitation differences are higher for the basins of the north region (BGI and BGII). For these groups, the TRMM-MERGE data source presents more precipitation than the CPC data in practically all months, and especially during the rainy months (November to March). For the other regions, the differences vary around zero, with the higher variations occurring during the wet months. The TRMM-MERGE dataset can display either wetter or drier months than CPC, depending on the region.
Figure 4 shows the ECDF curves of the daily precipitation values greater than 1 mm/day from the two data sources. The line in red represents the TRMM-MERGE data and the blue line, the CPC data. Each graph shows one basin representative of the basin groups defined in Table 1. We can see that the cumulative distribution functions of daily precipitation are very similar. Differences can only be seen in the basins in the north region (BGI and BGII) and in the extreme south region (BGVIII), where the TRMM-MERGE dataset presents higher values of precipitation for almost all probabilities. This analysis illustrates the tendency of the basins in the more central regions to present more similarity between the data sources than the basins located in the extreme north and south regions.
Table 2 shows an example of the differences we can expect in precipitation quantiles (mm/day) for the probability of non-exceedance of P=0.9. The quantiles were extracted from the ECDF curves for all basins. The percent differences confirm, in numbers, the behavior shown in the ECDF plots (Figure 4), i.e., a tendency of the extreme regions (north and south) to have higher differences (TRMM-MERGE daily precipitation greater than CPC precipitation) and of the central regions to have similar values for the same probability.

Variation of annual precipitation differences in space and time
We investigated whether the differences between precipitation data from TRMM-MERGE and CPC vary when considering two time periods: Figure 5 shows a map with the average percent differences of annual precipitation for the period 1998-2007 and Figure 6 shows the same for the period 2008-2017. The shades of green represent positive differences and the shades of red represent negative differences. Table 3 shows the average difference for each basin and each decade.
The differences between the TRMM-MERGE and the CPC data sources present a clear spatial and temporal behavior. For the first period (1998-2007), some basins in the north region (BGI and BGII) and in the extreme south region (BGVIII) display the largest positive average values of percent differences, while basins in the south-east region (BGIII and BGIV) display the highest negative differences. Positive average percent differences become higher and spread over the north and central regions in the last period (2008-2017). Negative average percent differences do not spread over the area in the second decade. They are, however, higher in the basins of BGVI.

Impact of different precipitation data on flow simulations
The results of the experiment with the HEC-HMS hydrological model calibrated for the UHE Campos Novos and the Tapajos basins are shown in Table 4. It shows model performance when the model is calibrated with the TRMM-MERGE precipitation data and then used in simulation with the CPC precipitation data as forcing, and vice versa. Both calibration and simulation runs are performed over the same period, 1997-2017. We can see that model performance is good in both basins, with NSE and KGE values ranging between 0.71 and 0.94 in calibration and between 0.45 and 0.79 in simulation. As expected, the performance of the model behaves according to the magnitude of the differences between the precipitation data sources: for the basin where the differences are small (Campos Novos), the performance in calibration is similar to the performance in simulation (differences in performance indicators are between -4% and -8% when calibrating with TRMM-MERGE and simulating with CPC, and between -1% and 3% when calibrating with CPC and simulating with TRMM-MERGE). For the basin with a high difference in precipitation datasets (Tapajos), the decrease in performance from calibration to simulation is clear, with stronger differences in performance indicators. The most important losses are in accuracy (RMSE).
Figure 7 shows the ECDF of observed flows and of daily flows when calibrating the model with TRMM-MERGE and simulating with CPC. We can see that for the basin where the precipitation data of both sources are similar, the ECDF curves are also very similar. However, for the basin with a higher difference between the two sources, the ECDF curves show a clear difference. For the same probability level, the flows simulated with the CPC data are lower than the observed flows and the flows simulated with the TRMM-MERGE data. The ECDF curves obtained when using the CPC data for calibration and the TRMM-MERGE for simulation present a similar behavior (not shown in this paper).
Finally, in Figure 8, we present a comparison between the percent differences of annual precipitation (TRMM-MERGE minus CPC) and the percent differences of annual streamflow simulations, when using TRMM-MERGE data in calibration and CPC data in simulation. The points in the first quadrant indicate the situations where the values of precipitation and simulated flow are smaller when using the CPC data. The third quadrant indicates the opposite, the situations where the CPC precipitation is more intense than the TRMM-MERGE and therefore also the streamflow simulations based on CPC data.
We can see that in the basin with small differences between the precipitation datasets (Campos Novos), the relation between the differences of precipitation and the differences of flow (orange lines) tends to be closer to the diagonal line, with a slope that is higher when the CPC precipitation is higher than the TRMM-MERGE precipitation. For the basin with the high difference in precipitation datasets (Tapajos), the slopes of the regression lines (blue lines) are higher than 1 for both the first and third quadrants. The slope of the line in the third quadrant is also larger than that of the line in the first quadrant. The graphs with the CPC data used in calibration and the TRMM-MERGE data used in simulation (not shown in this paper) present a similar behavior for the Campos Novos basin. For the Tapajos basin, where the precipitation of the simulation dataset is more intense than the precipitation in the CPC calibration dataset, the differences in flows are amplified.
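The per-quadrant slopes discussed above can be illustrated with the sketch below, which fits one line to the points of the first quadrant and one to the points of the third quadrant. Forcing the lines through the origin is an assumption made here for simplicity, since the fitting procedure behind the regression lines is not specified; the percent differences are made up.

```python
def slope_through_origin(x, y):
    """Least-squares slope of y = b * x (line forced through the origin)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def quadrant_slopes(dp, dq):
    """Slopes for the first quadrant (dp > 0, dq > 0) and the third quadrant
    (dp < 0, dq < 0), where dp and dq are percent differences of annual
    precipitation and of annual simulated flow, respectively."""
    first = [(p, q) for p, q in zip(dp, dq) if p > 0 and q > 0]
    third = [(p, q) for p, q in zip(dp, dq) if p < 0 and q < 0]
    s1 = slope_through_origin(*zip(*first)) if first else None
    s3 = slope_through_origin(*zip(*third)) if third else None
    return s1, s3

# Made-up percent differences for five hydrological years.
dp = [10.0, 20.0, 40.0, -5.0, -10.0]   # precipitation differences (%)
dq = [15.0, 30.0, 60.0, -10.0, -20.0]  # simulated flow differences (%)
s_first, s_third = quadrant_slopes(dp, dq)
```

A slope above 1 in either quadrant means that a given percent difference in precipitation corresponds to a larger percent difference in simulated flow.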

DISCUSSION
The TRMM-MERGE precipitation dataset uses rain gauge data to calibrate the system: the sensors are subject to quality control and telemetric rain data are used to improve the interpolation of the gridded data (Huffman et al., 2017). The CPC precipitation data uses satellite information to perform the quality control of the rain gauge data and to create a more reliable grid of precipitation data (Chen et al., 2008). Although both sources use information from satellites and rain gauges to different degrees and in different ways, they both try to represent the same variable, and one could expect them to provide similar datasets.
In our comparative study, the first signal of the differences between TRMM-MERGE and CPC precipitation datasets appears in the results from the basic statistics of annual precipitation totals. We found high differences, up to 42%, for the minimum annual precipitation, as well as for the first quartile (up to 25%).
The other values (mean, third quartile and maximum) also show differences, but they are not as strong. This shows that, in the absence of a more accurate dataset of ground precipitation data, and considering that the observed precipitation in a certain basin is an estimation of the actual precipitation, uncertainties are present in any dataset and, for our study, both datasets need to be considered.
Our study provides an extensive analysis of differences in precipitation data for a wide range of basins in South America, covering the continent from north (3°N) to south (30°S) and a variety of climatic conditions. Our results show that the 41 basins studied can be grouped into eight groups of similar behavior. The results from group I, the Madeira river basin, show that the precipitation values from TRMM-MERGE are higher than the values obtained from the CPC dataset for the whole study period, with differences that tend to increase in the last years. The analysis of monthly precipitation totals shows that the highest differences are observed during the wet season, which goes from November to April. During the wettest months, it is common to observe differences higher than 100 mm/month and some maximum values higher than 200 mm/month, which can be higher than the monthly average precipitation. The analysis of the statistical distribution of daily precipitation shows that the TRMM-MERGE daily precipitation quantiles tend to be higher than the CPC quantiles for the same probability of occurrence. This is observed for the majority of the probability quantiles.
The results for group II, representing other basins in the north region, exhibit smaller differences between the precipitation datasets for the first decade of the data period (1998-2007), with alternating positive and negative differences. However, in the last decade (2008-2017), the differences become more often positive, showing that CPC precipitation tends to be lower than TRMM-MERGE precipitation. These differences in the last decade translate into high variations during the wet season, from January to March. It is common to observe differences higher than 100 mm/month. The maximum differences occur at the basins in the extreme north, with values near 200 mm/month, which can be around 70% to 80% of the monthly averages. The ECDF curves of daily precipitation also show that TRMM-MERGE daily precipitation tends to be higher than the CPC values for the same probability of occurrence. The magnitude of the differences depends on the basin.
These results for the north region are consistent with the results obtained by Negrón Juárez et al. (2009). The authors show positive differences when comparing TRMM-MERGE and CPC precipitation data. Their differences were smaller, but the values were computed considering only the first decade of our data period.
The analysis of the northeast and east basins, group III, shows annual precipitation differences smaller than in the north region, with differences that, in the majority of the basins, are lower than 10%. The values tend to be slightly negative in the first decade of the study period and become slightly positive during the second decade. For most of the basins, the smallest differences are observed during the last five years, which may indicate that the two data sources are becoming more similar over the years in terms of annual precipitation totals. The differences in monthly precipitation show that the majority of differences in this group of basins are between -25 mm/month and 25 mm/month. The highest differences occur again during the wet season, November-March, but they are often smaller than 20% of the monthly average precipitation. In terms of daily precipitation, the ECDF curves do not indicate a great difference between the two data sources.
The basins in the center-west and southeast regions, group IV, tend to show negative differences between datasets in the first decade, ranging from small values near 5% to larger values near 25% for some basins. In the second decade of the study period, the differences become positive (i.e., TRMM-MERGE precipitation data are higher than CPC data), but the values are smaller than 10%. For monthly precipitation, the highest differences occur in November, December, January and February, notably with strong negative values that are sometimes close to -100 mm/month. For the daily precipitation values, the probability distribution curves do not indicate a great difference between the two sources of precipitation data.
The three basins in group V, in the southeast region, display a different behavior from the others. The differences in annual precipitation are slightly positive in the first decade and tend to be negative during the second decade. The highest differences occur between 2008 and 2013, with values near -30%. During the last five years of the study period, the differences become small again, near 5%. The monthly differences are higher from November to February. In terms of daily values, there is a tendency for TRMM-MERGE daily precipitation to be lower than the CPC precipitation for the same probability of occurrence.
In group VI, basins in the neighborhood of the south region, the differences in annual precipitation are very small during the first decade and become positive during the second decade, with values higher than 10%. The highest differences occur during the wet season, varying, in the majority of basins, between -50 mm/month and 50 mm/month. The ECDF curves show that the daily precipitation from TRMM-MERGE tends to be higher than CPC precipitation for the same probability of occurrence.
Group VII, basins in the south region, displays small negative differences during the first decade, which then become positive during the last decade, with some values higher than 10%. These basins do not have a clear seasonality, and differences in monthly precipitation are spread over the whole year, with the majority of values between -50 mm/month and 50 mm/month. For daily precipitation values, the ECDF curves show that the TRMM-MERGE daily precipitation values are higher than the CPC values for the same probability of occurrence.
Finally, in the extreme south region, group VIII exhibits a change of behavior in terms of differences in annual precipitation when compared with the other south basins. Differences are more often positive in the first decade and tend to decrease during the second decade. In terms of monthly values, this group behaves like group VII, without a wet season with higher differences. For daily precipitation values, the statistical distributions show that the TRMM-MERGE daily precipitation values are often higher than the CPC values for the same probability of occurrence.
Our analyses clearly show a regional pattern in the differences between the two precipitation data sources. As we move to the north and west regions of the study area, the annual differences tend to become more positive. This spatial variability in annual precipitation differences is amplified in the second and most recent decade of the study period. The basins located in the northeast, east, and southeast regions have smaller differences, and these differences tend to become more positive in the second decade, although to a smaller degree. The exceptions are the basins of group V, which tend to display more negative differences in annual precipitation during the last decade.
We also evaluated the impact of the observed differences on the simulation of streamflows, using calibrated hydrological models in two basins (Campos Novos and Tapajos), representative of the lowest and highest differences between precipitation data sources. The analysis showed how sensitive the models are to changes in precipitation, confirming the general findings in Fan (2015). If the two precipitation data sources used in calibration and simulation are similar, model performance is also similar. However, when they are very different, the performance indicators showed that the hydrological model tends to lose performance. The amount of performance loss may vary according to the quality of the data source used in the calibration. In our study, in the Tapajos basin, where the differences between TRMM-MERGE and CPC data are high, the performance loss was stronger when calibration was performed with TRMM-MERGE and CPC was used in simulation than in the opposite situation. The empirical relationship between annual precipitation and flow values shows that the dispersion is higher when dealing with the CPC data. This indicates that CPC data have more uncertainty, which impacts the results when the CPC data are used to simulate flows in a model that was calibrated with another data source. The higher uncertainty in the CPC data in this basin can be explained by the low density of rain gauge stations in this area. The use of satellite information may give more accuracy to the TRMM-MERGE dataset in this case.
Another result of our study is that the hydrological model seems to propagate and amplify the differences in precipitation data into differences in streamflow simulations. Small differences in precipitation result in similar small differences in streamflow. However, large differences in precipitation seem to result in even larger differences in streamflow.
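The amplification described above can be expressed as the ratio between the percent difference in simulated streamflow and the percent difference in precipitation. The sketch below is only an illustration of this ratio: the function names are hypothetical, the reference source in the denominator is an assumption, and the input values are invented rather than taken from the simulations in this study.

```python
def percent_difference(value_a, value_b):
    """Percent difference of value_a relative to value_b."""
    return 100.0 * (value_a - value_b) / value_b


def amplification_factor(precip_a, precip_b, flow_a, flow_b):
    """Ratio of the streamflow percent difference to the precipitation one.

    Values above 1 indicate that the model amplifies the precipitation
    difference; values near 1 indicate a similar relative difference.
    """
    precip_diff = percent_difference(precip_a, precip_b)
    flow_diff = percent_difference(flow_a, flow_b)
    if precip_diff == 0.0:
        raise ValueError("precipitation inputs are identical; ratio undefined")
    return flow_diff / precip_diff


# Hypothetical annual precipitation (mm/year) and simulated mean flow (m3/s).
factor = amplification_factor(
    precip_a=2200.0, precip_b=2000.0, flow_a=1150.0, flow_b=1000.0
)
```

With these invented values, a 10% difference in precipitation corresponds to a 15% difference in streamflow, giving a factor of 1.5.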

CONCLUSIONS
This study aimed to evaluate the differences between precipitation data obtained from two sources over 41 river basins in South America, the TRMM-MERGE and the CPC datasets, for the period 1997-2017. We investigated differences for different time resolutions (daily, monthly and annual precipitation), at different locations and according to their impact on the simulation of streamflow.
The results show that differences vary in space and time, and according to the temporal aggregation of the precipitation values. The second decade tends to amplify the observed differences in the majority of the basins.
Some basins show considerable differences, notably in terms of daily and monthly precipitation values, with an expected impact on the simulation of daily streamflows, which are also affected by the uncertainty of each precipitation data source. In addition, a spatial pattern in the differences between the precipitation sources was detected, with differences becoming more positive (i.e., TRMM-MERGE values are higher than CPC values) as we move north and west in the study area.
Based on the results of this study, we recommend caution when working with a single source of historical precipitation data to calibrate hydrological models, since this source can display uncertainties and errors that vary in space and time. In our study, we showed that it is a complex problem to determine a precipitation data source that is the best for all situations, especially when no observed dataset can be used as ground truth or reference, as in the case of large continental areas such as South America.
When it comes to maximizing the performance of streamflow simulations, it becomes important to extract information from all available data sources. In this study, we illustrated how hydrological models can be sensitive to changes in the precipitation data, especially when these changes reflect large differences between forcing data sources. The use of observed streamflow is an alternative to help select the best precipitation data source. The comparison between observed and simulated streamflows is an indirect way to carry out the precipitation data analysis, but it can, nevertheless, be useful in hydrological applications at large river basins.
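The indirect selection of a precipitation source through observed streamflow can be sketched as follows. The performance indicator is not specified in this section, so the Nash-Sutcliffe efficiency is used here purely as an assumption. The function names and the flow values are hypothetical and do not represent results of this study.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, lower values are worse."""
    mean_observed = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - residual / variance


def best_source(observed, simulations):
    """Return the source whose simulated flows score highest, plus all scores."""
    scores = {
        name: nash_sutcliffe(observed, simulated)
        for name, simulated in simulations.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical observed and simulated flows (m3/s), for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
simulations = {
    "TRMM-MERGE": [12.0, 19.0, 31.0, 38.0],
    "CPC": [15.0, 15.0, 25.0, 45.0],
}
best, scores = best_source(observed, simulations)
```

In practice, the ranking may depend on the basin, the period and the indicator chosen, so a single score should be interpreted with care.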
Further research will focus on assessing the uncertainties and investigating how data sources such as TRMM-MERGE and CPC can be combined, with varying weights according to basin location and time of the year, to provide a more robust long time series of precipitation data for hydrological model calibration and simulation. The goal is to have time series of forcing data that minimize the errors between observed and simulated flows in the past, so that these time series can be used for seasonal forecasting in the hydropower sector within the traditional ESP (Ensemble Streamflow Prediction) method, where hydrological models and historical precipitation are used to generate a set, or an ensemble, of possible flow scenarios dependent on the initial states of a given basin in real time.
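One possible form of such a combination is a weighted average whose weight depends on the basin and the month. The sketch below is a hypothetical illustration of this idea, not a method developed in this study: the linear blending, the function names, the default weight and all numerical values are assumptions. In an actual application, the weights would have to be estimated, for instance by minimizing the errors between observed and simulated flows.

```python
def blend(merge_mm, cpc_mm, weight_merge):
    """Weighted average of the two sources; weight_merge applies to TRMM-MERGE."""
    if not 0.0 <= weight_merge <= 1.0:
        raise ValueError("weight_merge must be between 0 and 1")
    return weight_merge * merge_mm + (1.0 - weight_merge) * cpc_mm


def blend_series(records, weights, default_weight=0.5):
    """Blend records of (basin, month, merge_mm, cpc_mm).

    weights maps (basin, month) to the TRMM-MERGE weight; pairs without an
    entry fall back to default_weight (an arbitrary choice in this sketch).
    """
    return [
        blend(merge_mm, cpc_mm, weights.get((basin, month), default_weight))
        for basin, month, merge_mm, cpc_mm in records
    ]


# Hypothetical weights and monthly totals (mm/month), for illustration only.
weights = {("Tapajos", 1): 0.75, ("Campos Novos", 1): 0.5}
records = [
    ("Tapajos", 1, 300.0, 200.0),
    ("Campos Novos", 1, 150.0, 140.0),
    ("Tapajos", 7, 40.0, 20.0),
]
blended = blend_series(records, weights)
```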