IDENTIFICATION OF HOMOGENEOUS RAINFALL ZONES DURING GRAIN CROPS IN PARANÁ, BRAZIL¹

The aim of this study is to identify homogeneous rainfall zones in the winter and summer 1st and 2nd crops in the state of Paraná, Brazil. The zones were defined by clustering using the expectation-maximization (EM) algorithm to transform seasonal rainfall series. Monthly average rainfall data collected from 157 weather stations for 20 years (1996 to 2015) were employed. The results show that the number of homogeneous zones varied among growing seasons. The summer crop presented two clusters, with rainfall averages of 1489 a


INTRODUCTION
The state of Paraná has the highest agricultural production in Brazil. In 2016, the agricultural gross domestic product was R$ 88.83 billion. Agricultural production accounted for 50% of this amount, and grain exceeded 38 million tons (Seab, 2015).
The climatic conditions of Paraná allow three distinct growing seasons: summer crop, from September to December; second crop, from January to March; and winter crop, from March to July (Iapar, 2017a). The three main grain crops grown in Paraná are soybean, maize and wheat. Maize is sown from January to April and harvested from May to September; soybean is sown from October to December and harvested from January to April; and wheat is sown from April to July and harvested from August to December (Conab, 2017).
Climate, especially rainfall, has a significant impact on agricultural production (Pereira et al., 2014), with outcomes ranging from high yields to partial or total crop losses. Therefore, knowledge of rainfall dynamics is crucial in decision-making to mitigate the damaging effects of droughts and floods (Romani et al., 2010). Rainfall in Paraná is higher than that in the rest of Brazil and is variable throughout the state (Ely & Dubreuil, 2017). The average annual rainfall varies between 1200 and 3500 mm, and most of the state presents averages between 1400 and 2000 mm (Iapar, 2017b). However, despite the high annual totals, rainfall exhibits a seasonality that is unsuitable for crop development, which limits crop yield and causes agricultural losses during heavy rainfall.
Considering the different growing seasons and high variability in rainfall in Paraná, systems that accurately analyze climate information are essential. Fayyad et al. (1996) proposed the application of data mining. This technique is the primary step in the Knowledge Discovery in Databases (KDD) process.
Among data mining techniques, clustering uses rainfall time series to define homogeneous rainfall zones, in which each rainfall station corresponds to a data time series, and a homogeneous zone is formed by grouping similar time series (Dourado et al., 2013). Oliveira-Júnior et al. (2017) reported that the identification of homogeneous zones was fundamental for agricultural planning. Marzban and Sandgathe (2006) showed that clustering is an effective technique when using observed meteorological data.
Engenharia Agrícola, Jaboticabal, v.39, n.6, p.707-714, nov./dec. 2019
Clustering has been used to define similar climatic zones in different regions of Brazil (André et al., 2008; Comunello et al., 2013; Moreira et al., 2016), including the state of Paraná (Fritzons et al., 2011; Pansera et al., 2015). However, no studies have defined homogeneous rainfall zones in the winter and summer 1st and 2nd crops in Paraná. Owing to the high production of soybean, maize, and wheat in Paraná, Sampaio et al. (2006) emphasized that a rainfall study based on the crops cultivated in each region should be carried out. Overall, the objective of this study is to identify homogeneous rainfall zones in the winter and summer 1st and 2nd crops in the state of Paraná, Brazil.

MATERIAL AND METHODS
The study area comprised the state of Paraná (Figure 1). According to Thornthwaite's classification, the climate of Paraná is divided into 12 classes. Class C1 (C1dA'a' and C1wA'a') is more common in higher latitudes (north, northwestern, and northern pioneer), whereas classes B2 and B3 (B2rB'4a' and B3rB'3a') are predominant in lower latitudes and on the east coast (south-central, southwest, and southeast). The most prevalent class in Paraná is C1rA'a', corresponding to 30.74% of the state. Class B1rA'a' is the second most represented class in the state (21.28%) and is predominant in the west and central-west regions (Aparecido et al., 2016).
The study followed the KDD process described by Fayyad et al. (1996), with five steps: data selection, preprocessing, formatting/transformation, data mining, and interpretation. Rainfall data were obtained from the Paraná Water Institute database maintained by the Hydrological Information System. We chose homogeneous and continuous time series and selected 157 weather stations that collected daily rainfall data from January 1996 to June 2016, covering most of the state of Paraná and corresponding to 20 crops (Figure 1). The 157 stations represent different municipalities, and the obtained maps were expressed at municipality level for better visualization and analysis of the results (Figure 1).
A monthly rainfall database was constructed for each weather station using the obtained data. Rainfall values for the months of each growing season and year were summed, forming a series of accumulated rainfall data for each season (summer crop, second crop, and winter crop).
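The seasonal accumulation described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the month windows are taken from the growing seasons listed in the Introduction (so March falls in two seasons), and the names `SEASON_MONTHS` and `seasonal_totals` are hypothetical.

```python
from collections import defaultdict

# Assumed month windows, taken from the Introduction (March is shared
# by the second crop and the winter crop). The windows actually used
# to build the series in the study may differ.
SEASON_MONTHS = {
    "summer_crop": (9, 10, 11, 12),
    "second_crop": (1, 2, 3),
    "winter_crop": (3, 4, 5, 6, 7),
}


def seasonal_totals(monthly):
    """Sum monthly rainfall (mm) into one value per season and year.

    monthly: dict mapping (year, month) -> rainfall in mm for one station.
    Returns a dict mapping (season, year) -> accumulated rainfall in mm.
    """
    totals = defaultdict(float)
    for (year, month), mm in monthly.items():
        for season, months in SEASON_MONTHS.items():
            if month in months:
                totals[(season, year)] += mm
    return dict(totals)


station = {(2000, 3): 80.0, (2000, 8): 50.0, (2000, 9): 100.0, (2000, 10): 150.0}
totals = seasonal_totals(station)
```

In this toy input, August is outside every window and is ignored, whereas March contributes to both the second crop and the winter crop.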
Clustering techniques were applied to identify outliers. However, the outliers were not excluded because they were observed values. There were neither gaps nor inconsistent values in the database.
Data normalization was applied to the entire database: the highest value of each attribute was assigned a value of 1, whereas the lowest value of each attribute was assigned a value of 0. The other values were proportionally allocated within this interval. This technique was applied using Weka software version 3.8.0. This software contains visualization tools and algorithms for data analysis and predictive modeling, as well as graphical interfaces to facilitate access to these features (Arora & Suman, 2012).
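The rescaling described above is min-max normalization, x' = (x - min) / (max - min). A minimal sketch for a single attribute is given below; it does not reproduce the Weka filter, and the function name `min_max_normalize` and the sample values are illustrative only.

```python
def min_max_normalize(values):
    """Rescale one attribute linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: every value maps to 0 (a convention chosen here).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Three seasonal totals in mm, used only as sample input.
normalized = min_max_normalize([634.0, 1489.0, 2695.0])
```

The lowest total maps to 0, the highest to 1, and 1489 mm maps to 855/2061, or roughly 0.41.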
Clustering was used for data mining in Weka software version 3.8.0. Cluster analysis involves the organization of a set of patterns, commonly represented as vectors of attributes or points in a multidimensional space, into groups according to a measure of similarity (Pontes Júnior et al., 2018). The similarity between attributes was measured by the Euclidean distance. For cluster analysis, a partitioning method was adopted using two algorithms: K-means and expectation-maximization (EM).
K-means is a very popular method (Dhanachandra et al., 2015) in which data are grouped according to their proximity to each other as a function of the Euclidean distance. The goal of this method is to divide a set of N data vectors into K non-overlapping clusters (K ≤ N) using K centroids adequately positioned in the data space. Then, each data vector is associated with a centroid by similarity.
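The assignment of data vectors to centroids by Euclidean distance can be sketched with Lloyd's iteration, the usual way of running K-means. This is an illustrative version only: the initial centroids are passed in explicitly (Weka's initialization is not reproduced), and the function name `kmeans` is hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on tuples of equal length.

    Returns (labels, centroids), where labels[i] is the index of the
    centroid closest to points[i] by Euclidean distance.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


points = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
labels, centers = kmeans(points, [(0.0, 0.0), (1.0, 1.0)])
```

With these four normalized points and two starting centroids, the first two points form one cluster and the last two the other.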
EM alternates between an expectation (E) step and a maximization (M) step. The E step computes, under the current model parameters, the probability that each instance belongs to each cluster, and the M step uses these probabilities to refine the model parameters. The two steps form an iterative process in which the parameters updated in the M step are used to recompute the probabilities in the next E step. As a result, EM allows learning the parameters that govern the distribution of data in which some characteristics are missing. The algorithm is based on statistical calculations that estimate maximum-likelihood parameters in cases in which equations cannot be directly solved because the model depends on unobserved latent variables (Torres et al., 2017).
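The alternation between the two steps can be illustrated with a one-dimensional Gaussian mixture. The sketch below is a simplified assumption, not the Weka implementation: it handles a single attribute, starts from user-supplied means, and is not hardened against degenerate (empty) components. The names `gaussian_pdf`, `e_step`, and `em_gmm_1d` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def e_step(data, weights, means, variances):
    """E step: probability that each observation belongs to each component."""
    resp = []
    for x in data:
        dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    return resp


def em_gmm_1d(data, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM and return hard cluster labels."""
    k, n = len(initial_means), len(data)
    means = list(initial_means)
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, min_var)
    variances = [overall_var] * k  # start every component at the data variance
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        resp = e_step(data, weights, means, variances)
        # M step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, min_var)  # floor avoids zero variance
    resp = e_step(data, weights, means, variances)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


data = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
weights, means, variances, labels = em_gmm_1d(data, [0.0, 1.0])
```

Unlike K-means, each observation receives a membership probability for every component; the hard label is simply the most probable component after the final E step.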
K values (number of groups) from 2 to 10 were tested to obtain a number of clusters most consistent with the actual distribution of data. The classifiers J48, Naive Bayes, and sequential minimal optimization (SMO) were used to select the optimal number of clusters and the algorithm that analyzed the data more accurately.
The J48 algorithm is the Weka implementation of C4.5, a successor of the ID3 algorithm, and is used for constructing classification trees (Quinlan, 1996). Classification trees are generated from a training dataset, and at each node, the algorithm chooses an attribute that most efficiently subdivides the sample set into homogeneous subsets according to their class (Giasson et al., 2013).
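How a tree learner scores a candidate split can be sketched with entropy-based information gain, the criterion used by ID3; C4.5 normalizes this quantity into a gain ratio, which is omitted here for brevity. The function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting a numeric attribute at threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children


# Normalized rainfall for four stations and the cluster each was assigned to.
rain = [0.1, 0.2, 0.8, 0.9]
cluster = [0, 0, 1, 1]
gain_mid = information_gain(rain, cluster, 0.5)
gain_low = information_gain(rain, cluster, 0.15)
```

A threshold of 0.5 separates the two clusters perfectly and therefore scores higher than a threshold of 0.15, which leaves one subset mixed.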
Naive Bayes is a simple but highly practical classifier with a wide range of applications. It generates a group of probabilities that are estimated by calculating the frequency of the characteristic value of each training data instance. Given a new instance, this classifier estimates the probability that the instance belongs to a specific class (Gao et al., 2018).
SMO is an algorithm for training support vector machines (SVM) (Zhang et al., 2018) and was created to solve the quadratic programming problem posed by SVM training. SMO analyzes sparse datasets as well as binary or non-binary input data. SMO also handles missing data via global substitution and nominal attributes (Witten & Frank, 2005).
The ideal number of clusters in each growing season was determined from the accuracy percentage of each clustering algorithm according to the classifiers, the Kappa index, and the judgment of a domain expert.
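The Kappa index compares the observed agreement between cluster labels and classifier predictions with the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of Cohen's kappa follows; the function name `cohen_kappa` and the example labels are illustrative only.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Cohen's kappa between two equally long label sequences."""
    n = len(actual)
    # Observed agreement: fraction of instances with matching labels.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Chance agreement from the marginal label frequencies.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)


cluster_labels = [0, 0, 0, 0, 1, 1, 1, 1]
classifier_output = [0, 0, 0, 1, 1, 1, 1, 0]
kappa = cohen_kappa(cluster_labels, classifier_output)
```

Here six of eight labels agree (p_o = 0.75) and chance agreement is 0.5, giving a kappa of 0.5.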
Descriptive statistics were computed for the ideal number of clusters to better analyze the data. The clusters were mapped in ArcGIS version 10.3 for better visualization of the results.
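The per-cluster summary reported in the results (average, maximum, minimum, and coefficient of variation) can be sketched as below. The use of the sample standard deviation for the coefficient of variation is an assumption, and `describe_cluster` is a hypothetical helper name.

```python
import statistics


def describe_cluster(rainfall_mm):
    """Summary statistics for the seasonal rainfall totals of one cluster."""
    mean = statistics.mean(rainfall_mm)
    sd = statistics.stdev(rainfall_mm)  # sample standard deviation (assumed)
    return {
        "n": len(rainfall_mm),
        "mean": mean,
        "min": min(rainfall_mm),
        "max": max(rainfall_mm),
        "cv_percent": 100.0 * sd / mean,  # coefficient of variation
    }


summary = describe_cluster([900.0, 1000.0, 1100.0])
```

A lower coefficient of variation indicates a more homogeneous rainfall regime within the cluster.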

RESULTS AND DISCUSSION
The EM clustering algorithm was the most adequate for analyzing rainfall time series, with higher success rates and Kappa indices (Table 1) than the K-means algorithm (Table 2). EM was previously used for analyzing climatic variables (Kalaiselvi & Geetha, 2016; Vrac & Yiou, 2010), and the K-means algorithm was efficient in analyzing groups of spatiotemporal rainfall characteristics in the state of Rio Grande do Sul (Boschi et al., 2011). In contrast, Richetti et al. (2018) evaluated the regionalization of the state of Paraná from 1989 to 2013 and concluded that K-means obtained the best results and fewer errors when compared with the EM algorithm.
The boxplot of the rainfall clusters in each growing season is shown in Figure 2. There were three outliers in cluster 1 during the summer 1st crop and one outlier in cluster 2 during the summer 2nd crop. During the winter crop, however, the data had a lower dispersion than in the other seasons and no outliers. Average rainfall data in regions with discrepant values should be carefully used for irrigated agriculture, hydraulic works, and climatic zoning because the high variability in rainfall may cause severe damage by the lack or excess of water (Lima et al., 2008).
For the summer 1st crop, clustering generated two homogeneous rainfall zones, designated as clusters 0 and 1. A total of 96 stations were grouped in cluster 0, which had the lowest rainfall, with an average of 1489 mm, a maximum of 2695 mm, and a minimum of 634 mm (Table 3). Cluster 0 was predominantly found in the northwestern, northern, and east-central regions and in the metropolitan region of Curitiba (Figure 3). These regions presented similar rainfall regimes and were strongly affected by continentality. Farias et al. (2001) found that the risk of droughts in the northwest of Paraná was high during the most critical phases of soybean cultivation.
Sixty-one weather stations were grouped in cluster 1, with an average rainfall in the summer 1st crop of 1925 mm, a maximum of 3269 mm, a minimum of 1077 mm (Table 3), and the highest rainfall rate for all crops. Both clusters presented rainfall above the requirement for soybean crops (450-850 mm) (Carvalho et al., 2013).
The rainfall rate was lowest in the summer 2nd crop (Sampaio et al., 2006), and four clusters were formed during this crop (Figure 4).
Thirty-seven weather stations were grouped in cluster 0, which had the lowest average rainfall (1004 mm), lowest minimum rainfall (510 mm), and lowest maximum rainfall (2006 mm). Cluster 0 predominantly included the north side of the state of Paraná (Figure 4), where water deficit occurs (Fritzons et al., 2011). Gonçalves et al. (2002) studied sowing dates for maize in Paraná in the summer 2nd crop as well as its climatic risks and concluded that the risk of drought in the region of cluster 0 exceeded 50%.
Seventy weather stations were grouped in cluster 1, which was the most representative in the state, with an average rainfall of 1182 mm, a maximum of 2108 mm, and a minimum of 610 mm. This cluster included the west-central, east-central, and western regions and the metropolitan region of Curitiba (Figure 4). Variability was higher in the south, which can be attributed to the interaction between the relief and the constant development of frontal systems associated with subtropical jet streams. Shiroga & Gerage (2010) found that there was water deficit in maize crops in the region covered by cluster 1.
Cluster 2 comprised 47 stations, with an average rainfall of 1454 mm. This cluster had the highest maximum rainfall (2612 mm) and primarily comprised the southwestern and east-central regions (Figure 4). Cluster 3 presented the highest rainfall rates (Table 4) but contained only three weather stations, indicating that this rainfall regime was restricted to a small portion of the state. The high rainfall rate is due to the influence of the orographic factor (Silva et al., 2012).
Bergamaschi et al. (2001) analyzed soils of Rio Grande do Sul and found that early-cycle maize requires an average rainfall of 650 mm throughout the growth cycle to express its maximum potential. None of the clusters presented an average rainfall below this requirement, indicating that, on average, the lack of rainfall was not a limiting factor for maize production in Paraná during the summer 2nd crop. However, the minimum rainfall values of clusters 0, 1, and 2 were below this requirement, and rainfall might limit productivity in these areas.
Three homogeneous clusters were generated during the winter crop (Figure 5). Cluster 0, containing 46 municipalities, presented the lowest average rainfall (969 mm), the lowest maximum rainfall in one crop (1965 mm), and the lowest minimum rainfall (476 mm) (Table 5). This cluster comprised the northern pioneer, north-central, and northwestern regions (Figure 5) and was characterized by the lack of rainfall (Salton et al., 2016). Fritzons et al. (2011) showed that droughts occur in this region during winter seasons.
Cluster 1, with 52 municipalities, presented an average rainfall of 1171 mm, a maximum of 2159 mm, and a minimum of 545 mm. This cluster was considered a transition zone between clusters 0 and 2. Fritzons et al. (2011) emphasized that a transition zone is located in the central region of the state.
Fifty-nine municipalities were grouped in cluster 2 and were primarily located in the southwestern, western, and coastal regions; the latter had the highest rainfall in the state of Paraná (Figure 5) (Fritzons et al., 2011), good rainfall distribution throughout the year, and no dry season. However, rainfall was lowest in the winter crop (Vanhoni & Mendonça, 2008). The southwest is characterized by heavy rainfall (Machado et al., 2013), which is explained by cold fronts and a decrease in temperature, increase in atmospheric pressure, and presence of southerly winds (Waltrick et al., 2015). This cluster had the highest average rainfall (1498 mm), highest maximum rainfall (2517 mm), and highest minimum rainfall (662 mm), as well as the lowest coefficient of variation (19.93%), constituting the region with the highest rainfall homogeneity during the winter season. Aparecido et al. (2016) reported that only the southeast and southwest regions had annual rainfall higher than 1800 mm.

CONCLUSIONS
The EM algorithm was slightly more effective than the K-means algorithm for clustering homogeneous rainfall zones during the winter and summer 1st and 2nd crops in the state of Paraná, Brazil.
Cluster analysis showed that each growing season had a different number of homogeneous zones. Rainfall distribution in the state of Paraná was more homogeneous during the summer 1st crop, with only two clusters, and was more variable in the summer 2nd crop, with four clusters.
The clusters with the lowest rainfall were located in the northwestern, northern pioneer, and northern regions, whereas the coastal region belonged to the cluster with the highest rainfall in every growing season. Clustering was an appropriate instrument to identify geographical regions with similar rainfall regimes during different growing seasons in the state of Paraná.
These results can be used to plan growing seasons and manage water resources in the state of Paraná and can serve as support for future research.