Distribution of rainfall probability in the Tapajos River Basin, Amazonia, Brazil

Studies on the probability of rainfall and its spatiotemporal variations are important for planning water resources and optimizing the calendar of agricultural activities. This study models the occurrence of rain with a first-order, two-state Markov Chain (MC) in the Tapajos River Basin (TRB), Amazonia, Brazil. Cluster analysis (CA), based on the Ward method, was used to classify homogeneous regions and select samples for checking the probability of rainfall occurrence by station. The historical series of daily rainfall data from 80 stations were used for the period 1990-2014. The CA technique identified 8 homogeneous regions and their probability of rainfall occurrence, helping to determine which regions and periods have the greatest need for irrigation. The probabilities of occurrence of dry and rainy periods in the TRB were used to define the dry season (May through September) and the rainy season (October through April). The elements of the transition probability matrix varied over time and also showed the influence of the geographical position of the rainfall gauge stations in determining dry and rainy periods at specific sites in the TRB.


INTRODUCTION
Rainfall is a climatic factor of major environmental importance, especially in relation to floods and prolonged periods of drought (Marengo et al., 2010). The rainfall regime affects environmental conditions and almost all productive activities of society, since it is the main water supply for human and economic activities (Almeida et al., 2011). An understanding of the potential occurrence of dry and rainy days can contribute decisively to decisions on planting times, water deficit risk assessment and irrigation system design (Keller Filho et al., 2006; Marcelino et al., 2012; Carvalho et al., 2017). According to Vela et al. (2007), historical rainfall data are highly relevant for monitoring the impacts caused by excess or prolonged shortage of rain within a specific region. Beyond the great importance of pluviometric data in the planning and design of engineering works, prolonged rainy weather can be a limiting factor for the use of agricultural equipment, affecting previously established schedules (Fernandes et al., 2002).
The Central-West region, for example, ranks first nationally in maize and soybean production, with the State of Mato Grosso as the main producer of both crops, producing approximately 64% of maize and 60% of soybeans, with more than 4,498,000 and 9,518,000 hectares planted, respectively (CONAB, 2019). According to Garcia et al. (2013), in a study carried out in the Sinop region (MT), the best growing season for maize corresponds to the wet season during spring and summer, which favors high soil water availability due to rainfall and a lower probability of consecutive dry days. The authors also state that, to obtain a good return from agricultural activity, it is important to know the conditions of the environment and the crop from before planting until postharvest, since rainfall and air temperature, together with the photoperiod, are the main meteorological variables for productivity. Semenov (2008) and Martin et al. (2007) state that stochastic models applied in hydrology are often used to complement daily climatological data. In addition, such models are used to assess the effect of climate change on daily precipitation (Yoo et al., 2016). Markov Chain (MC) models are often proposed to rapidly obtain forecasts of the weather state (dry or rainy) and its transitions throughout the year (Lennartsson et al., 2008; Breinl et al., 2013). In the present work, the MC method is used to model the occurrence of daily precipitation, as is common in the literature (Sharif et al., 2007; Damé et al., 2007; Selvaraj and Selvi, 2010; Sukla et al., 2016; Yoo et al., 2016). The emphasis on the application of the Markov chain derives from the use of the information from the previous day (dry or rainy) to provide a prognosis about the possible occurrence of a dry or rainy day for a given region (Carvalho et al., 2017).
The daily precipitation model, based on MC, presents several advantages, such as the ease of parameter estimation and data generation when compared to other models also used in rainfall modeling. For example, the Poisson model has more complex structures and greater difficulties in parameter estimation (Yoo et al., 2016), while MC basically involves two components: the occurrence of binary precipitation (precipitation or absence of precipitation) and the quantity of precipitation on wet days (Breinl et al., 2013). MC models are also the most-indicated models for the study of rainy- and dry-day sequences, since other probability models have difficulty in describing the daily persistence of the occurrence conditions (Sukla et al., 2016). Another factor is the rapid response made possible by the model, which may contribute to a greater economic return for rural producers, as it supports the identification of alternative sowing dates. This planning can be optimized if the time distribution is associated with a spatial distribution of occurrences by characterizing regions with similar behaviors.
Rev. Ambient. Água vol. 14 n. 3, e2284 - Taubaté 2019
Thus, regions with hydrological similarities can be classified into groups. The literature reports that the multivariate cluster analysis (CA) method provides strong results for determining homogeneity (Yang et al., 2010; Cabrera et al., 2012; Lyra et al., 2014). This study sought to determine the distribution of rainfall probability using first-order, two-state (dry and rainy days) Markov chains, grouped by cluster analysis to identify pluviometrically homogeneous regions in the Tapajos River Basin, Amazonia, Brazil. Other authors have used rainfall (daily, monthly or annual) and other information in their studies (Raziei et al., 2012; Gonçalves et al., 2016).

Study area and data
The Tapajos River Basin (TRB) drains an area of 493,200 km² and comprises 6% of Brazilian territory. It occupies areas in the states of Mato Grosso (MT), Pará (PA) and a small part of Amazonas (AM), and lies between 2° and 15° S and between 53° and 61° W. According to Kottek et al. (2006), the TRB presents two climate types in the Köppen-Geiger climate classification. The predominant climate from the headwaters to the center of the basin is classified as "Aw", that is, rainfall in the summer, the characteristic climate of the savannahs. From the center to the mouth, the climate is classified as "Am", that is, a tropical monsoon climate, with a dry season and intense rains for the rest of the year. In the study by Santos et al. (2014) in the TRB, within the "Am" climate area, the months with the lowest rainfall are from May to October. In the "Aw" climate, the dry season runs from April to September. July is the month with the lowest rainfall in both climates, with mean values of 36 mm/year and 6 mm/year, respectively.
Daily rainfall data from the 80 gauge stations used in this study were retrieved from the National Water Agency (ANA) database, via HIDROWEB (http://hidroweb.ana.gov.br), and from the National Institute of Meteorology (INMET), via BDMET (http://www.inmet.gov.br/projetos/rede/pesquisa/). Because the data were daily, no statistical treatment was applied to fill gaps in the records. Bertoni and Tucci (2007) present several gap-filling methodologies and note that none is indicated for daily gaps; they are recommended rather for monthly or annual gaps. In selecting the periods of the historical rainfall series, a maximum admissible gap fraction of 1.8% in each data series was applied as the criterion. This limit was adopted based on the study by Baú et al. (2013), in which the chosen stochastic model (Markov Chains) could be applied reasonably well under this limit, although a small negative effect on the accuracy of the results cannot be ruled out. Figure 1 shows the location of the study area and the distribution of rainfall gauge stations. Table 1 provides information on the stations.

Determination of rainy or dry days
Rainy or dry days are determined by applying the Markovian stochastic process, a widely used technique in the literature (Detzel and Mine, 2011; Dash, 2012; Stowasser, 2012; Szyniszewska and Waylen, 2012; Baú et al., 2013). The rainy or dry state is associated with a probability of occurrence. The stochastic process adopted in the study models rainfall occurrences by a first-order MC (the probability of the rainfall state on the current day "t" depends only on the rainfall state of the previous day, t-1) with two states (dry or rainy). The probabilities of transition between states are arranged in a transition matrix (MT) (Equation 1).
MT = \begin{bmatrix} P_{00} & P_{01} \\ P_{10} & P_{11} \end{bmatrix} (1)
The transition probabilities between states follow Equations 2 and 3. To define the rainfall state on day "t", the current day is tagged as "Xt", with "0" for dry days and "1" for rainy days. First-order MCs consider combinations of dry (0) and rainy (1) states as follows: P00 is the probability of no rain today given that it did not rain yesterday; P01 is the probability of no rain today given that it rained yesterday; P10 is the probability of rain today given that it did not rain yesterday; P11 is the probability of rain today given that it rained yesterday.
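As a hedged reading of the verbal definitions above, the four entries can be written as conditional probabilities in which the first subscript is today's state and the second is yesterday's state. This notation is inferred from the text and is not a copy of the original Equations 2 and 3. Under that reading, the two probabilities that share the same previous-day state must sum to one:

```latex
P_{ij} = \Pr\left(X_t = i \mid X_{t-1} = j\right), \qquad i, j \in \{0, 1\}
P_{00} + P_{10} = 1, \qquad P_{01} + P_{11} = 1
```

In other words, each column of MT sums to one, so only two of the four probabilities are free for a given month and station.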
The probabilities are calculated by counting occurrences in the historical records of the desired rainfall station, as described in Equations 4, 5, 6 and 7, where "N" represents the number of occurrences of each combination of dry/rainy days in the historical series of rainfall station (j). Where: N00 - number of dry days with previous dry day; N01 - number of dry days with previous rainy day; N10 - number of rainy days with previous dry day; N11 - number of rainy days with previous rainy day.
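The counting step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, gaps are assumed to be marked with None and skipped, and the ratios are assumed to be relative frequencies conditioned on the previous day's state (Equations 4 to 7 are not reproduced in this text).

```python
def transition_counts(states):
    """Count day pairs by (today, yesterday) state; 0 = dry, 1 = rainy, None = gap."""
    counts = {"N00": 0, "N01": 0, "N10": 0, "N11": 0}
    for yesterday, today in zip(states, states[1:]):
        if yesterday is None or today is None:
            continue  # a pair touching a gap carries no transition information
        counts[f"N{today}{yesterday}"] += 1
    return counts


def transition_probabilities(counts):
    """Relative frequencies conditioned on yesterday's state (assumed estimator)."""
    dry_prev = counts["N00"] + counts["N10"]  # pairs whose previous day was dry
    wet_prev = counts["N01"] + counts["N11"]  # pairs whose previous day was rainy
    return {
        "P00": counts["N00"] / dry_prev if dry_prev else None,
        "P10": counts["N10"] / dry_prev if dry_prev else None,
        "P01": counts["N01"] / wet_prev if wet_prev else None,
        "P11": counts["N11"] / wet_prev if wet_prev else None,
    }


# Hypothetical 12-day record with one missing day.
example_states = [0, 0, 1, 1, 1, 0, 0, 1, None, 1, 1, 0]
example_counts = transition_counts(example_states)
example_probs = transition_probabilities(example_counts)
```

In practice the same counting would be repeated per station and per calendar month to obtain the monthly probabilities.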
The daily rainfall values taken as indicative of dry periods, also called minimum values, vary considerably among authors in the literature, for example 0.3 mm (Baú et al., 2013), 5.0 mm (Pizzato et al., 2012; Viana et al., 2002), 0.2 mm (Calgaro et al., 2009), 0.1 mm (Dourado Neto et al., 2005; Keller Filho et al., 2006), 0.85 mm (Barron et al., 2003) and 1 mm (Jeong et al., 2013; Santos et al., 2009), where days with rainfall below these limits were classified as dry. Mehrotra and Sharma (2009) defined a rainy day as one whose measured value reaches the threshold of 0.3 mm. Andrade Junior et al. (2001) and Viana et al. (2002) defined a dry day based on the occurrence of water deficit, that is, dry days are those on which rainfall is less than the reference evapotranspiration. In the study by Vasconcellos et al. (2003), the authors defined a day as dry when the water storage in the soil, according to the water balance, is equal to or less than a certain critical value, conditioned by the atmospheric demand. However, the above thresholds depend on the aim of the study, the activity and the type of environmental management under development. In the current study, a day is considered rainy if the rainfall recorded on that day is equal to or greater than 0.1 mm; if it is less than 0.1 mm, the day is considered dry. This is the criterion used by INMET. This value is equivalent to the smallest amount recorded by the pluviograph in conventional meteorological stations (CMS).
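The 0.1 mm criterion translates into a one-line classification rule. The sketch below is illustrative only; the function and constant names are hypothetical, and representing a missing observation as None is an assumption made here.

```python
RAINY_THRESHOLD_MM = 0.1  # criterion adopted in the study (same as INMET)


def classify_day(rain_mm, threshold=RAINY_THRESHOLD_MM):
    """Return 1 (rainy) if rain_mm >= threshold, 0 (dry) otherwise, None for a gap."""
    if rain_mm is None:
        return None  # missing observation stays missing (assumption of this sketch)
    return 1 if rain_mm >= threshold else 0


# Hypothetical daily totals in mm, with one missing day.
daily_rain_mm = [0.0, 0.05, 0.1, 12.4, None, 0.3]
daily_states = [classify_day(value) for value in daily_rain_mm]
```

Passing a different threshold, such as 0.3 or 1.0, reproduces the alternative criteria cited from the literature.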

Sensitivity analysis
Sensitivity analysis was performed to determine the minimum length of a historical series of daily precipitation required to estimate the probability of occurrence. To generate these probabilities, we verified the length of the historical series and their start and end dates. Many studies in hydrological modeling are limited by the available series, both by the short period of observations and by the low density of the data collection network. To analyze the behavior of pluviometric probabilities over time, a historical series of 30 years was selected, considering a gap limit of up to 1.8%. The four probabilities of monthly occurrence were generated for periods of 1, 2, 3, ..., 29 and 30 years. The station selected was station code 455001 (ID: 09), located in the state of Pará (PA), in group 1 (G1) (Figure 1 and Table 1). The selected period was from 07/01/1984 to 06/30/2014 for the 30-year series; for the remaining periods (1 to 29 years) the final date was kept constant (day, month and year), varying only the start year at each interval. To evaluate the behavior of the probabilities, a simple linear regression analysis was performed between the probabilities of occurrence generated for the 30-year series (kept as the independent variable) and the respective probabilities of occurrence of each period of 1 to 29 years (dependent variables). Thus, for P00, P01, P10 and P11, the coefficient of determination (R²), the adjusted coefficient of determination (R²a), the standard error, and the mean absolute and relative errors were obtained, adopting a 95% confidence level (p-value below 0.05) to accept the hypothesis that there is a correlation between the variables.
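The regression step can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function name is hypothetical, "standard error" is taken to be the standard error of the estimate, and the absolute and relative errors are taken as mean deviations of the short-series value from the 30-year value (months with a zero reference value are skipped in the relative error).

```python
import math


def regression_summary(x, y):
    """Ordinary least squares of y on x with the fit statistics named in the text."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)  # one predictor
    abs_errors = [abs(yi - xi) for xi, yi in zip(x, y)]
    rel_errors = [abs(yi - xi) / xi for xi, yi in zip(x, y) if xi != 0]
    return {
        "intercept": intercept,
        "slope": slope,
        "r2": r2,
        "r2_adj": r2_adj,
        "std_error": math.sqrt(ss_res / (n - 2)),
        "mean_abs_error": sum(abs_errors) / len(abs_errors),
        "mean_rel_error": sum(rel_errors) / len(rel_errors) if rel_errors else None,
    }


# Hypothetical monthly P11 values: 30-year reference (x) versus a 3-year series (y).
p11_30_years = [0.80, 0.82, 0.78, 0.70, 0.45, 0.20, 0.10, 0.15, 0.35, 0.60, 0.72, 0.79]
p11_3_years = [0.78, 0.85, 0.74, 0.66, 0.50, 0.24, 0.08, 0.12, 0.40, 0.55, 0.75, 0.80]
fit = regression_summary(p11_30_years, p11_3_years)
```

With 12 monthly values per probability, one such fit would be produced for each of P00, P01, P10 and P11 and for each series length.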

Cluster Analysis
Cluster analysis (CA) is a multivariate statistical technique whose purpose is to group individual items (such as objects, places or samples) into several groups according to a classification criterion, so that there is homogeneity within a group (or variables) and heterogeneity among the other groups formed, based on their characteristics (Lyra et al., 2014; Gonçalves et al., 2016). The CA technique was used to determine the homogeneous pluviometric groups in the TRB, using all the values of the probability of rainfall occurrence resulting from the MC. The grouping criterion used was the method proposed by Ward Jr. (1963), a hierarchical grouping method that forms groups so as to always achieve the smallest internal error between the vectors that make up each group and the average group vector. According to Johnson and Wichern (2007), this method is also called "minimal variance", because in each step of the method the two clusters that have the smallest distance between them are combined to form a single group. At each step, Ward's method uses Equation 8, which governs its operation and convergence.
Where E(G1G2) is the mean rate for two clusters and \bar{x}_v is the cluster mean for each variable "v".
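The merging rule can be illustrated with a small pure-Python sketch. Because Equation 8 is not reproduced in this text, the sketch assumes the standard form of Ward's criterion, in which the cost of merging two clusters is the resulting increase in the within-cluster sum of squared deviations from the cluster mean. All function names are hypothetical.

```python
def centroid(points):
    """Mean vector of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def ward_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares if the two clusters are merged."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    squared_gap = sum((u - v) ** 2 for u, v in zip(ca, cb))
    na, nb = len(cluster_a), len(cluster_b)
    return na * nb / (na + nb) * squared_gap


def ward_clusters(points, n_clusters):
    """Agglomerate singletons, always merging the cheapest pair, until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        _, i, j = min(
            (ward_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]  # j > i, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Hypothetical stations described by two probabilities each.
toy_stations = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
toy_groups = ward_clusters(toy_stations, n_clusters=2)
```

This brute-force version recomputes every pair at each step, which is adequate for 80 stations but is not how optimized libraries implement the method.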
Euclidean distance, the geometric distance between two objects i and i', was employed to measure similarity between clusters. Let xij be the observation of the i-th rainfall gauge station (i = 1, 2, ..., n), with reference to the j-th variable in each class (j = 1, 2, ..., p) studied. The standardized Euclidean distance (DE) between two stations i and i' is calculated by Equation 9.
Where xij is the j-th characteristic of the i-th individual and xi'j is the j-th characteristic of the i'-th individual. Figure 2 shows the operational scheme of the methodology applied in this study, guiding the sequence of each step.
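A hedged sketch of the distance computation follows. Equation 9 is not reproduced in this text, so the sketch assumes the common definition of the standardized Euclidean distance, in which each variable's difference is divided by that variable's sample standard deviation across all stations. The function name is hypothetical.

```python
import math
import statistics


def standardized_euclidean(stations, i, i_prime):
    """Distance between stations i and i_prime, scaling each variable by its std dev."""
    n_vars = len(stations[0])
    total = 0.0
    for j in range(n_vars):
        column = [station[j] for station in stations]
        spread = statistics.stdev(column)  # sample standard deviation of variable j
        if spread == 0:
            continue  # a constant variable cannot separate stations
        total += ((stations[i][j] - stations[i_prime][j]) / spread) ** 2
    return math.sqrt(total)


# Hypothetical stations with two variables; the second is constant.
toy_matrix = [[0.0, 10.0], [1.0, 10.0], [2.0, 10.0]]
toy_distance = standardized_euclidean(toy_matrix, 0, 2)
```

Standardizing keeps variables with large numerical ranges from dominating the distance.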

Sensitivity analysis
For the sensitivity analysis, the selected historical series was the one that presented values of R² and R²a equal to or greater than 0.8 in relation to the 30-year historical series. From 3 years of data, considering a gap limit equal to or lower than 1.8% for each analyzed period, it was possible to note that, with the exception of the linear coefficient (a), the angular coefficient and the coefficients of determination (R²) and adjusted determination (R²a) presented the same values (Table 2). This is due to the equation of the transition probabilities (Equation 2), since the behavior of P00 and P10 is complementary, as are the behaviors of P01 and P11. Therefore, these parameters tend to present the same slope of the regression line, although they cut the x-axis at different points. It was observed that R² and R²a presented values ranging from 0.8 to 0.9 in the historical series of 3, 4 and 5 years. From 6 years onward, the values of the coefficients are all above 0.9; therefore, the results for the other years are not shown in Table 2. Based on the analysis of the absolute and relative errors in each month for each probability of occurrence, the highest errors were found in the transition months (May and November) between the dry and rainy periods of the region where the rainfall station is located.
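The shared slope and R² follow directly from the complementarity. A short derivation, assuming the fitted line for P00 is written y = a + b x, with x the 30-year value and y the shorter-series value (the symbols x, y, x', y' and b are introduced here for illustration), and using P10 = 1 - P00 for both series:

```latex
y = a + b\,x, \qquad x' = 1 - x, \qquad y' = 1 - y
y' = 1 - a - b\,x = 1 - a - b\,(1 - x') = (1 - a - b) + b\,x'
```

The line for P10 therefore has the same slope b and a different intercept, 1 - a - b. The residuals only change sign, so R² and R²a are unchanged. The same argument applies to the pair P01 and P11.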
As the length of the historical series available for generating the probabilities increases, the errors in the estimates decrease. Figure 3 (a, b, c and d) shows the standard error and the mean absolute and relative errors for P00, P01, P10 and P11, respectively. The confidence level was 95%, with a p-value of less than 0.05 in all analyses. The highest p-value was 0.0146, for the 1-year historical series, indicating strong statistical evidence of the relationship between the data. The 95% confidence interval for the slope of the lines for 3 years of data had 0.51 and 0.93 as lower and upper limits, respectively, for P00 and P10. For P01 and P11, the lower and upper limits were 0.86 and 1.26, respectively. Since zero is not included in the confidence interval, it is possible to confirm the existence of a positive relationship between the analyzed data. To verify whether this behavior also holds at other stations, the analysis was repeated for 8 distinct rainfall stations, with these codes: 758000 (ID: 27), 556000 (ID: 18), 555002 (ID: 23), 255001 (ID: 11), 857000 (ID: 29), 1454000 (ID: 80), 1358002 (ID: 71) and 1054000 (ID: 47). The starting date was the same as in Table 1; the first 3 years of each historical series were separated and related to the maximum period of that series used in this study. In addition to the analyses performed at station code 455001 (ID: 09), the study also sought to verify possible interferences in the determination coefficients. The results appear in Figure 4. Analyzing the results of both tests, the length of the historical series had a greater influence on the probabilities of occurrence than changing only the start and end dates used to obtain the Ps (simulated precipitation).
This can be explained by the MT itself: the larger the amount of information available to generate the Ps, the closer the coefficient of determination is to 1 (Table 2 and Figure 4).
In a historical series of 3 years, the limit of 1.8% corresponds to approximately 19 days of gaps, provided they are distributed throughout the 12 months rather than occurring consecutively or concentrated in a single month. Otherwise, the shorter the total available period, the greater the interference in the estimate of the Ps for the month in question. Another possible interference is the occurrence of ENSO phenomena, since these are related to the frequency and distribution of dry and rainy days throughout the year, and the state of the previous day is the information used to calculate the Ns of each probability. The occurrence of large-scale natural events (e.g., volcanic eruptions and forest fires) and low-frequency ocean-atmosphere phenomena (El Niño and La Niña) are pointed out by Salas et al. (2012) as factors that affect the statistical balance of hydrological series. In the study by Baú et al. (2013) in the Paraná Hydrographic Basin III, the analysis of the probabilities of occurrence showed that daily precipitation maintained a pattern of quantity and occurrence that coincided with the appearance of El Niño and La Niña phenomena. These possible interferences explain the reduction in the values found for the analyzed coefficients (R² and R²a); the lowest coefficient obtained was 0.73 for the R²a of P01 at station code 556000 (ID: 18), which, although lower than the others, can still be considered a good correlation between the data. The p-value remained below 0.05 in all analyses. Thus, from 3 years of data, the probabilities of occurrence tend to behave in a way statistically similar to the probabilities of the longer historical series, which allows the use of historical series of 3 years or more. It is worth mentioning that the historical series in this study range from 3 to 25 years, so it is not necessary to exclude any rainfall station from this analysis.
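The 19-day figure can be checked with simple calendar arithmetic. The sketch below is illustrative: the function name is hypothetical, the period is counted inclusively, and the result is rounded down, which is an assumption about how "approximately 19 days" was obtained.

```python
from datetime import date

MAX_GAP_FRACTION = 0.018  # 1.8% limit adopted in the study


def allowed_gap_days(start, end, max_fraction=MAX_GAP_FRACTION):
    """Whole number of missing days tolerated between start and end (inclusive)."""
    total_days = (end - start).days + 1
    return int(total_days * max_fraction)


# Last 3 years and full 30 years of the series used in the sensitivity analysis.
gap_days_3_years = allowed_gap_days(date(2011, 7, 1), date(2014, 6, 30))
gap_days_30_years = allowed_gap_days(date(1984, 7, 1), date(2014, 6, 30))
```

For the 3-year window this gives 1,096 days in total, of which 1.8% is about 19.7 days.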

Probability of rainfall occurrence and Cluster Analysis
The four probabilities of occurrence (P00, P01, P10 and P11) were determined for each month of the historical series of the rainfall gauge stations of the TRB, calculated according to the number of dry and rainy days (Ns). Figures 5 and 6 show the boxplots of probabilities P10 and P01, respectively, for the 80 stations studied. In each boxplot (or box diagram), the vertical bar indicates the minimum and maximum values of the sample, and discrepant values or outliers (if any) are represented by circles. The horizontal lines of the gray box represent, from bottom to top, the 1st quartile, the 2nd quartile (the median) and the 3rd quartile. It may be observed that probability P01 presents less dispersion and asymmetry than probability P10. The presence of outliers, especially in the transition months (May and October), may be explained by the individual behavior of each station along the TRB. Mean probabilities in the basin indicate that the dry period occurs between May and September and the rainy season from October to April. A similar result was obtained by Collischonn et al. (2008), who reported that the region had two well-defined seasons, a dry season from May to September and a rainy season from October to April, with annual rainfall varying between 1,600 mm and 2,700 mm. In Figure 5, representing P10, the month of February showed the lowest data variability. However, the lowest rainfall probabilities occurred in July, a fact confirmed by Figure 6 for P01: once more, dry days were highly probable in July. As a rule, May is considered the transition month between the rainy and dry periods in the TRB. The change is more pronounced in June because, depending on the rainfall station and its location, rainfall probability approaches zero, confirming July and August as the driest months in the TRB.
Figure 7 shows the dendrogram with cluster formation and the results of the sensitivity analysis concerning the selection of the Euclidean distance. Rainfall gauge stations, clustered according to similarity, are represented on the x-axis. The y-axis represents the linkage distance, which measures similarity. The data used for the grouping were all 48 values of Ps (4 probabilities for each of the 12 months) obtained for each of the 80 rainfall stations. In the sensitivity analysis, the distance 3.5 was selected (red line) because it gave a better distribution of the formed clusters, resulting in 8 homogeneous clusters. The stations of each cluster are identified in Table 3, while Figure 8 shows their distribution in the TRB.

Cluster Identification
Analyzing the distribution of clusters in the TRB area, it can be observed that clusters 1, 2, 3 and 4 are located in the northern region of the basin, while clusters 5, 6, 7 and 8 are distributed in the central and southern areas. Stations in the same cluster, although presenting different probabilities of rainfall occurrence, have dry and rainy periods that fall in similar months. This can be verified in Figure 9 (a, b, c, d, e, f, g, h), in which the probabilities of the stations in each cluster were plotted. The clusters have in common June, July and August as the driest months, except cluster 4, in which the fewest rainy days occur in August, September and October. For cluster 3, although the rainfall stations show relatively different values of P10, the dry period is the same (June to November). Figures 10 and 11 show the results for one station from each cluster.
Figures 10 and 11 demonstrate that transition probabilities provide information on the dry or rainy periods of each station. It is possible to predict the magnitude of each period at each rainfall station. This may be noticed when comparing the probabilities at stations subjected to different climatic factors, for instance, stations 11 (code: 255001) and 80 (code: 1454000), located respectively north and south of the TRB, where different biomes predominate: the Amazon biome at the mouth and the savannah biome at the headwaters (Mancuzzo et al., 2011). This gives each area particular characteristics with regard to climate and rainfall frequency. Climatic factors correspond to the static geographical features of the landscape, such as latitude, altitude, relief and vegetation (Mendonça and Danni-Oliveira, 2007). When the values of P10 at stations 11 and 80 are compared, one may observe that the probability of rain after a dry day presented a rainy and a dry period across a good part of the months, and that it was greater in the region where pluviometric station 255001 is installed. Similar results were reported by Mancuzzo et al. (2011) for the state of Mato Grosso (MT), where rainfall was highest in areas characterized by the Amazon biome and lowest in the Savannah and Pantanal biomes. According to Ziegler et al. (2004), the vegetation cover may influence the rainfall percentage of a given region, since, depending on soil cover, the recharge of surface and underground aquifers tends to increase or decrease through its direct effect on the flow component. In their study conducted in Cáceres (MT), a municipality located in the south of the state of Mato Grosso, Pizzato et al. (2012) found that rainfall behavior in this region differed from that reported by Moreira et al. (2010) for Nova Maringá (MT), in the north of the same state.
The authors concluded that, in the northern region, the drought period occurs earlier when compared to the Pantanal Matogrossense region.
In the study by Moraes et al. (2005), the authors observed that December marks the beginning of the rainy season in most localities of the state of Pará (PA). In a small area in the south of the state, however, the rainy season may begin in October, and in a wide band that runs from southwest to southeast, including the center-south, it begins in November. This result does not differ from that found by Menezes et al. (2015), who divided the state of PA into three homogeneous rainfall regions, with the highest rainfall (mm) occurring in the south-north region of the state. The occurrence of rainfall also increases in most of the state of Pará in December. Regarding annual rainfall distribution in the state of Pará, and taking into consideration the occurrence of El Niño and La Niña, a study by Gonçalves et al. (2016) showed that both events triggered the highest rainfall indexes mainly in the northeast of the state, while the south region had the lowest rainfall.
At station 255001, the period with the highest rainfall probability runs from January to June, and the period with the lowest rainfall probability runs from July to December, contrary to the result of Moraes et al. (2005). At station 1454000, however, the period with the highest rainfall probability lies between October and March, and the lowest rainfall probability is more pronounced between April and September. In most stations, especially those located between the headwaters and the center of the basin, probabilities P10 indicated a low transition probability in the driest period, i.e., values close to zero. This transition probability increased significantly at stations located at the mouth. This may be explained by the type of climate of the region ("Am"), characterized by a brief dry season and intense rains during the rest of the year. Figure 11 also reveals that P11 values were higher than those of P10. According to Baú et al. (2013), these results tend to confirm the hypothesis of persistence, that is, that the rainfall state of the previous day is carried over into the probability of rainfall.

CONCLUSION
The Tapajos Basin region has two well-defined seasons, a dry season from May to September and a rainy season from October to April, with May and October characterized as transition periods. The elements of the transition probability matrix vary in time and also show the influence of the geographical position of the rainfall gauge stations on the determination of dry and rainy periods in specific localities of the Tapajos River Basin. Further, 8 regions of homogeneous rainfall probability were identified by the clustering technique. This identification shows the difference in the specific behavior of each rainfall station within the Tapajos River Basin. Extending the analysis to the probability of occurrence of different rainfall volumes according to annual variation may be recommended. For agricultural activity, the definition of these regions