Pattern recognition methods to study genotypic similarity in flood-irrigated rice

Genetic diversity studies are performed based on information on a set of traits measured in a group of genotypes, considering one or more environments. Pattern recognition methods allow genotypes to be classified from a set of important agronomic information. Thus, this study aimed to present and compare pattern recognition methods to investigate the similarity of environments and genotypes in flood-irrigated rice for the recommendation of cultivars. The experiments were performed in the municipalities of Leopoldina, Lambari, and Janaúba, state of Minas Gerais, Brazil. To evaluate the pattern of similarity, 25 rice genotypes belonging to the flood-irrigated rice breeding program were used in three environments. Among these genotypes, five cultivars were used as experimental controls. The traits evaluated in the 2012/2013 agricultural year were grain yield, height of plant, flowering, panicle length, grains filled by panicles, and percentage of grains filled by panicles. The methods used were mixtures of multivariate normal distributions and a density-based clustering algorithm. The genotypes were distributed in three distinct groups, with intragroup homogeneity and intergroup heterogeneity for the agronomic traits of flood-irrigated rice. The pattern recognition methods used to assess the dissimilarity of environments were efficient in classifying flood-irrigated rice environments.


INTRODUCTION
Rice (Oryza sativa L.) is one of the most produced and consumed cereals in the world and is the main food for more than half of the world population. With population growth, the demand for grain productivity has increased over the years, and it is estimated that by 2050 global rice production should increase by 60 to 110% to meet the demand of the world population (Godfray et al. 2010; Tilman et al. 2011; Ray et al. 2013; Santos et al. 2019).
Genetic diversity studies are performed using information from a set of traits measured in a group of genotypes, considering one or more environments. These studies are useful for recognizing similarity patterns and quantifying the variability available to be exploited in plant breeding. The similarity pattern is generally attributed to genetic similarity by ancestry or by sharing alleles in common, which are fixed by selection.
When experiments are performed in more than one environment, it is common to study the behavior of the genotypes, emphasizing stability and adaptability for a given trait of agronomic importance, mainly grain production. The study of the dissimilarity of environments is equally important and aims to identify the most discriminative and representative environments, in which the most stable and adapted genotypes can subsequently be analyzed.
One way of evaluating the behavior of genotypes given the dissimilarity of environments, still little explored among breeders, is the use of approaches based on pattern recognition methods. In this case, the evaluation considers a set of traits relevant to the breeder and the influence of the environments on a possible grouping pattern of the evaluated genotypes. Thus, it is assumed that genotypes can be differentiated by genetic causes, within the environment, and by environmental causes, called macrovariations, provided by the edaphoclimatic differences to which they were subjected. Pattern recognition methods allow objects to be classified into categories or classes, here expressed by the environments, from a set of important agronomic information (Bishop 2006).
Pattern recognition between genotypes provided by the dissimilarity of environments allows the breeder to identify groups of environments in which the genotype × environment (G × E) interaction may not be significant for the set of available genotypes.
Thus, this study aimed to use pattern recognition methods (mixture of normal distributions and density-based clustering algorithm) to study the similarity of environments and genotypes in flood-irrigated rice for the recommendation of cultivars.

MATERIAL AND METHODS
To investigate the pattern of similarity, 25 rice genotypes belonging to the flood-irrigated rice breeding program were evaluated in three environments. Among these genotypes, five cultivars were used as experimental controls ('Rubelita', 'Seleta', 'Ourominas', 'Predileta', and 'Rio Grande'). The following traits were evaluated in the 2012/2013 agricultural year: grain yield (kg·ha⁻¹), height of plant (cm), flowering (days), panicle length (cm), grains filled by panicles, and percentage of grains filled by panicles. The experimental design used in all experiments was randomized blocks with three replications. The value for cultivation and use (VCU) tests were conducted on floodplain soils with continuous flood irrigation. Crop management followed the recommendations for the cultivation of irrigated rice in the evaluated regions (Soares et al. 2005).
In the Leopoldina experimental field, the seedlings were previously formed in nurseries and later transplanted at a spacing of 0.20 m in the row. In the other locations, sowing was carried out in the planting row at a density of 300 seeds·m⁻². Irrigation started around 10 to 15 days after seedling emergence, in the case of planting with seeds, or when the seedlings were established in the soil. The irrigation depth was gradually increased according to the development of the plants.

Statistical analysis
The statistical model described in Eq. 1 was considered for each original observation:

$$Y_{ijl} = \mu + B/E_{jl} + G_i + E_j + GE_{ij} + e_{ijl} \quad (1)$$

where: $Y_{ijl}$ is the observation in the $l$th block, evaluated in the $i$th genotype and $j$th environment; $\mu$ is the general average of the experiments; $B/E_{jl}$ is the effect of block $l$ within environment $j$; $G_i$ is the effect of the $i$th genotype ($i = 1, 2, \ldots, g$); $E_j$ is the effect of the $j$th environment ($j = 1, 2, \ldots, k$); $GE_{ij}$ is the random effect of the interaction between genotype $i$ and environment $j$; $e_{ijl}$ is the random error associated with the observation $Y_{ijl}$.
The pattern recognition analysis was performed on adjusted values ($Y^{*}_{ijl} = \hat{\mu} + \hat{G}_i + \hat{e}_{ijl}$), which adjust the phenotypic values for the effects of block, environment, and the interaction of genotypes and environments. With the adjusted values, pattern recognition analyses were performed using mixtures of multivariate normal distributions and a density-based clustering algorithm.
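As an illustration of how adjusted values of this kind could be computed, the following minimal Python sketch assumes balanced data (every genotype present in every block of every environment) and ordinary least-squares estimates of all effects. This is a simplification, since the model treats the interaction as random. The function name `adjusted_values` and the nested-list layout `y[i][j][l]` are hypothetical choices for the example and do not describe the software used in the study.

```python
from statistics import fmean


def adjusted_values(y):
    """Return Y*_ijl = mu_hat + G_hat_i + e_hat_ijl for balanced data.

    y[i][j][l] is the observation of genotype i in environment j, block l
    (hypothetical layout; least-squares estimates are an assumption).
    """
    g, k, r = len(y), len(y[0]), len(y[0][0])

    # General mean (mu_hat).
    grand = fmean(y[i][j][l] for i in range(g) for j in range(k) for l in range(r))
    # Genotype means, used for G_hat_i = genotype mean - general mean.
    geno_mean = [fmean(y[i][j][l] for j in range(k) for l in range(r)) for i in range(g)]
    # Genotype-by-environment cell means (mu + E_j + G_i + GE_ij in balanced data).
    cell_mean = [[fmean(y[i][j]) for j in range(k)] for i in range(g)]
    # Block-within-environment means and environment means.
    block_mean = [[fmean(y[i][j][l] for i in range(g)) for l in range(r)] for j in range(k)]
    env_mean = [fmean(block_mean[j]) for j in range(k)]

    adjusted = []
    for i in range(g):
        rows = []
        for j in range(k):
            row = []
            for l in range(r):
                # Residual after removing cell mean and block-within-environment effect.
                resid = y[i][j][l] - cell_mean[i][j] - (block_mean[j][l] - env_mean[j])
                row.append(grand + (geno_mean[i] - grand) + resid)
            rows.append(row)
        adjusted.append(rows)
    return adjusted
```

Under these assumptions, the adjusted values keep the genotype means of the original data, while the means of the adjusted values within each environment all equal the general mean.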

Mixtures of multivariate normal distributions
In this analysis, it was considered that there are, in the set of environments, homogeneous subgroups whose data could be characterized by probability distributions, supposedly normal. A multivariate normal distribution is assigned to each component of the mixture. Thus, it is expected that the clusters represent sets of environments with an ellipsoidal data arrangement, centered on the mean vector $\mu_k$, and with other geometric characteristics, such as volume, orientation, and shape, determined by the covariance matrix $\Sigma_k$ of dimension $v \times v$.
Parsimonious parameterizations of the covariance matrices, for each environment group, can be obtained through Eq. 2:

$$\Sigma_k = \lambda_k D_k A_k D_k^{\top} \quad (2)$$

where: $\lambda_k$ is a scalar that controls the volume of the ellipsoid, $A_k$ is a diagonal matrix that specifies the shape of the density contours with $\det(A_k) = 1$, and $D_k$ is an orthogonal matrix that determines the orientation of the corresponding ellipsoid (Banfield and Raftery 1993; Celeux and Govaert 1995). In one dimension, there are only two models, denoted by E for equal variance and V for variable variance. In the multivariate configuration, the volume, shape, and orientation of the covariance can be constrained to be equal or variable between groups. Thus, 14 possible models with different geometric characteristics can be specified. Table 1 presents all of these models with the corresponding distribution structure type, volume, shape, orientation, and associated model names.
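To make the roles of volume, shape, and orientation concrete, the sketch below builds a covariance matrix from these three ingredients in the two-dimensional case ($v = 2$). It is only an illustration: the function name `covariance_2d` and the parameterization of the shape matrix as diag(a, 1/a) and of the orientation matrix as a rotation by a given angle are assumptions made for the example.

```python
import math


def covariance_2d(volume, shape_ratio, angle):
    """Build Sigma = volume * D * A * D^T for a 2 x 2 covariance matrix.

    volume      -- scalar lambda (> 0) controlling the ellipse size
    shape_ratio -- a > 0, so that A = diag(a, 1/a) has det(A) = 1
    angle       -- rotation angle (radians) defining the orthogonal matrix D
    """
    a = shape_ratio
    A = [[a, 0.0], [0.0, 1.0 / a]]
    c, s = math.cos(angle), math.sin(angle)
    D = [[c, -s], [s, c]]
    Dt = [[D[0][0], D[1][0]], [D[0][1], D[1][1]]]

    def matmul(X, Y):
        # Plain 2 x 2 matrix product.
        return [[sum(X[i][m] * Y[m][j] for m in range(2)) for j in range(2)]
                for i in range(2)]

    DADt = matmul(matmul(D, A), Dt)
    return [[volume * x for x in row] for row in DADt]


# With angle = 0 the ellipse is aligned with the coordinate axes (diagonal Sigma).
sigma_axis = covariance_2d(2.0, 3.0, 0.0)
# Rotating changes the orientation but not the determinant, which equals volume ** 2.
sigma_rot = covariance_2d(2.0, 3.0, math.pi / 6)
```

Because $\det(A_k) = 1$ and $D_k$ is orthogonal, the determinant of the resulting matrix depends only on the volume term, which is why the volume can be constrained independently of shape and orientation.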
The model is chosen using the Bayesian information criterion (BIC) (Scrucca et al. 2016), according to Eq. 3:

$$\mathrm{BIC}_{M,k} = 2\, l(\hat{\Psi}; y) - v_{M,k} \log(n) \quad (3)$$

where: $l(\hat{\Psi}; y)$ is the logarithm of the maximized likelihood function (Supplemental Material available); $v_{M,k}$ is the number of independent parameters to be estimated in the model $M$; $k$ is the number of components in the mixture, supposedly equal to the number of environments analyzed; and $n$ is the number of observations. According to the BIC expression presented by Scrucca et al. (2016), the higher the BIC value, the better the model.
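The criterion can be sketched in a few lines of Python. The example below is hedged: it covers only two diagonal covariance structures (VEI and VVI), counts parameters as means plus mixing proportions plus covariance terms, and uses hypothetical function names (`n_parameters`, `bic`). The log-likelihood passed to `bic` would come from a fitted mixture model, which is not shown here.

```python
import math


def n_parameters(model, k, v):
    """Number of independent parameters for a k-component mixture in v dimensions.

    Only two diagonal models are covered in this sketch:
    'VEI' (variable volume, equal shape) and 'VVI' (variable volume and shape).
    """
    means = k * v            # one mean vector per component
    proportions = k - 1      # mixing proportions sum to one
    if model == "VEI":
        covariance = k + (v - 1)   # k volumes + one shared shape with det = 1
    elif model == "VVI":
        covariance = k * v         # a free diagonal covariance per component
    else:
        raise ValueError("model not covered by this sketch: " + model)
    return means + proportions + covariance


def bic(loglik, n_params, n_obs):
    """BIC in the 'higher is better' convention: 2 * loglik - n_params * log(n_obs)."""
    return 2.0 * loglik - n_params * math.log(n_obs)
```

For example, with six traits, `n_parameters("VVI", 3, 6)` gives 38 and `n_parameters("VEI", 5, 6)` gives 44, so the five-component model needs a sufficiently higher log-likelihood to compensate for its larger penalty.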

Density-based clustering algorithm
The density-based clustering algorithm used to discover the number of clusters (environments) was created to identify groupings of different shapes and the presence of noise in databases (Ester et al. 1996). It uses the concept of center-based density, in which the density of a point in the data set is the number of points within a neighborhood radius. The algorithm has two input parameters: the radius and the minimum number of points within that radius. Center-based density thus makes it possible to classify a point as lying in a dense region (center point), at the limit of a dense region (limit point), or in a sparsely occupied region (noise point).
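The three-way classification of points can be sketched as follows. This is a minimal illustration of the center-based density idea only, not a full clustering implementation; the function name `classify_points` is hypothetical, Euclidean distance is assumed, and each point is counted in its own neighborhood. In the clustering literature, center and limit points are usually called core and border points.

```python
import math


def classify_points(points, radius, min_points):
    """Label each point as 'center', 'limit', or 'noise'.

    points     -- list of coordinate tuples
    radius     -- neighborhood radius
    min_points -- minimum number of points (including the point itself)
                  inside the radius for a point to be a center point
    """
    # Indices of all points within the radius of each point.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= radius]
        for p in points
    ]
    centers = {i for i, nb in enumerate(neighbors) if len(nb) >= min_points}

    labels = []
    for i, nb in enumerate(neighbors):
        if i in centers:
            labels.append("center")
        elif any(j in centers for j in nb):
            # Not dense itself, but inside the neighborhood of a center point.
            labels.append("limit")
        else:
            labels.append("noise")
    return labels


data = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(classify_points(data, radius=1.0, min_points=3))
# ['limit', 'center', 'center', 'limit', 'noise']
```

In this toy example the two inner points have three points within the radius and are center points, the two outer points of the chain are limit points, and the isolated point is noise.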
To evaluate the classification performance objectively, the confusion matrix was used, in which the frequencies observed on the diagonal represent the elements correctly classified. The marginal column represents the total of elements classified in a category i. The marginal row, in turn, represents the total of reference elements sampled for a category i.
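The quantities described above can be computed as in the following sketch. The counts are purely illustrative and are not the values of any table in this study. The layout (rows as classified categories, columns as reference categories) follows the description of the marginals given above, and the function names are hypothetical.

```python
def marginals(confusion):
    """Return (marginal column, marginal row) of a square confusion matrix.

    confusion[i][j] = number of elements classified in category i
    whose reference category is j (assumed layout).
    """
    classified_totals = [sum(row) for row in confusion]        # marginal column
    reference_totals = [sum(col) for col in zip(*confusion)]   # marginal row
    return classified_totals, reference_totals


def accuracy(confusion):
    """Proportion of correctly classified elements (diagonal over total)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Illustrative counts for three categories with 25 reference elements each.
example = [
    [25, 0, 0],
    [0, 24, 1],
    [0, 1, 24],
]
print(marginals(example))                 # ([25, 25, 25], [25, 25, 25])
print(round(100 * accuracy(example), 2))  # 97.33
```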
The GENES software (Cruz 2016) was used to perform the analyses, integrated with Matlab (Matlab 2011) and R (R Core Team 2019).
Table 2 shows the result of the joint analysis of variance of 25 rice genotypes evaluated in three environments. The estimate of the coefficient of variation (CV%) was low for all traits, indicating adequate experimental precision, as demonstrated in other studies with irrigated rice (Hosan et al. 2010; Silva et al. 2011; 2019; 2020; Costa et al. 2002; Streck et al. 2017; Santos et al. 2019). The effect of genotypes was statistically significant for grains filled by panicles and for the percentage of grains filled by panicles, and not significant in the joint analysis for the other traits. The difficulty in detecting differences between the general means of these genotypes can be explained by the advanced stage of genetic improvement they have reached for these traits. There was significance (p < 0.01) for the effect of the environment, except for grains filled by panicles, and for the genotype × environment (G × E) interaction, except for the height of plant and the percentage of grains filled by panicles. Consequently, the behavior of the genotypes was influenced by environmental conditions, justifying the use of methodologies capable of classifying environments through clustering methods.

RESULTS AND DISCUSSION
Among all the adjusted models presented in Table 1, the two that presented the highest Bayesian information criterion (BIC) values were VEI (diagonal distribution, variable volume, equal shape, and coordinate axis orientation; BIC = -2901.55) and VVI (diagonal distribution, variable volume, variable shape and the orientation of coordinate axes; BIC = -2918.59), associated with five and three components of mixtures, respectively (Fig. 2). Table 3 shows the number of genotypes allocated to each of the five and three components of mixtures considering, respectively, the VEI and VVI models.
The model considering a mixture with three components allocated approximately 25 genotypes to each component. This result agrees with expectation: given the edaphoclimatic differences among locations, such as temperature and humidity, a mixture composed of three components is expected. On the other hand, the mixture model composed of five components divided the genotypes into two additional groups. Table 4 shows the confusion matrix for the classification of genotypes in the different environments, which reached a prediction accuracy of 97.33% (the number of correct classifications divided by the total number of genotypes). In this table, all 25 genotypes of environment 1 were classified correctly, while environments 2 and 3 presented classification errors.
Figure 3 shows the result of the density-based clustering algorithm used to identify the number of clusters. Three different groups of environments could be observed. In addition, based on center-based density, it was possible to classify each point as lying in a dense region (center point), at the limit of a dense region, or in a sparsely occupied region (noise point).
The environment can be classified as favorable or unfavorable depending on its conditions, which can influence the classification of the genotypes. The favorable environment provides the conditions under which the genes controlling desirable traits are expressed, so the genotypes perform better under these conditions. Another issue that must be taken into account is minimizing the response to uncontrollable factors, since the breeder aims to produce cultivars with greater genetic resilience and responsiveness. In the unfavorable environment, the conditions do not allow the expected performance, that is, a given gene may not be expressed under the conditions to which it is exposed. An example is the genotype × environment interaction, in which the genotypes behave differently in the face of environmental variation, which makes it difficult to recommend a specific cultivar.
In this context, based on the results obtained, one can consider only one of the environments in the next evaluations, and the breeder should choose the environment that best meets the needs of the flood-irrigated rice breeding program. Criteria such as proximity to the research center and ease of access can be adopted. Reducing the number of environments will also lower the cost of evaluating the cultivars and allow more careful evaluations to be carried out in the remaining trials. Thus, the pattern recognition methods used to assess the dissimilarity of environments provided a better classification of the environments.

CONCLUSION
The pattern recognition methods used to assess the dissimilarity of environments were efficient in classifying flood-irrigated rice environments.