Bayesian network: a simplified approach for environmental similarity studies on maize

The current methodologies used to evaluate environmental similarity do not allow the simultaneous analysis and categorization of the environments. The objective of this study was to verify whether the Bayesian network (BN) can be used to detect similarities between environments for plant height, lodging, and grain yield in maize. Thirteen experimental varieties were grown in six environments, and plant height, lodging, and grain yield were measured. A BN was constructed for each trait using the Hill-Climbing algorithm. Results were compared with the simple part of the genotype x environment interaction, with clustering by Lin's method, and with simple correlations between environments. Lin's method clustered environments with predominance of complex interaction for all traits. The BN was efficient in analyzing environmental similarity for plant height and grain yield, since it detected the highest correlations. The BN revealed no connections among the environments that presented predominance of complex interaction.


INTRODUCTION
The primary objective of breeding programs is to release new cultivars with optimal agronomic traits. To reach this goal, genotypes must be evaluated over several years and locations, which also allows the genotype x environment interaction to be estimated. This interaction prevents the generalized recommendation of genotypes and demands the study of genotypes in specific environments, a process that requires considerable financial and human resources. The interaction can also be used to select similar environments with predominance of simple interaction. Cruz and Castoldi (1991) proposed a methodology that divides the interaction into simple and complex parts, based on the decomposition of the mean square of the genotype x environment interaction (GE). Despite being adequate to evaluate this type of experiment, this methodology does not allow the simultaneous analysis and categorization of the environments, since the result is given by pairs of environments. Another frequently used methodology is Lin's algorithm (Lin 1982), which groups the environments based on the absence of interaction.
The Bayesian network (BN) is an approach that represents cause and effect relations. Its graphical structure allows the identification of assumptions between system variables that may be obscured when using other methodologies (Borsuk et al. 2004). Currently, the BN has been studied to predict variables. Felipe et al. (2015) successfully analyzed the predictive capacity of the BN considering 31 traits, which revealed that the BN can be used even with a large number of traits, and allowed their joint analysis.
The present study raises the hypothesis that the BN can be used to predict environments instead of variables, and has the potential to be used to evaluate similarity between environments with predominance of simple interaction, simultaneously. Therefore, the objective of this study was to verify the possibility of using the BN to detect similarity between environments for plant height, lodging, and grain yield in maize.

MATERIAL AND METHODS
The experiments analyzed with the Bayesian network were carried out in two locations: Jaboticabal, SP, Brazil (lat 21° 14' 33'' S, long 48° 17' 10'' W, 565 m asl) and Campo Alegre de Goiás, GO, Brazil (lat 17° 38' 20'' S, long 47° 46' 55'' W, 884 m asl). These locations were selected due to their diverse environmental conditions. The same experiment was carried out in five different seasons in Jaboticabal. Thirteen open-pollinated synthetic varieties, obtained as described by Oliveira et al. (2016), were evaluated in each environment. All experiments were arranged in a randomized complete block design with three replications. Each plot consisted of two 5 m-long rows, and the population was adjusted to 60,000 plants ha⁻¹. The management of the experiments followed the recommendations of Fornasieri.
The following traits were measured: plant height, determined as the distance in cm between the ground and the insertion of the flag leaf, in 10 random plants per plot; number of lodged plants per plot, counting both stalk breakage below the ear and root lodging; and grain yield per plot. After physiological maturity, the ears of both rows of the plot were hand-harvested; the grains were separated and weighed, and the grain moisture was determined. The grain yield of each plot was corrected to 13% moisture and converted to kg ha⁻¹.
The BN is a graphical representation of a probability distribution over a set of variables (Felipe et al. 2015). The Directed Acyclic Graph (DAG) represents the BN using nodes connected by arrows, and is the output of the modeling approach. In this case, it is used to illustrate the association between environments. This graph characterizes a joint probability of the data, which brings scale benefits due to the factorization (Aliferis et al. 2010). In a set of variables $\{X_1, X_2, \ldots, X_p\}$ with joint distribution $\Pr(X_1, X_2, \ldots, X_p)$ and a DAG $D$ that is compatible with this joint distribution (Pearl 2000), the following factorization can be performed:

$$\Pr(X_1, X_2, \ldots, X_p) = \prod_{i=1}^{p} \Pr(X_i \mid Pa_i)$$

in which $Pa_i$ are the parents of $X_i$ in $D$. The BN analysis involves searching for a structure that is compatible with the joint distribution of the data. The selected structure has already been used as a prediction tool, as described by Felipe et al. (2015). In this study, the BN was only used in the context of environmental association.
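The factorization can be illustrated with a minimal sketch. The three-node chain E1 -> E2 -> E3, the binary states, and all probability values below are hypothetical and serve only to show that the product of each node's distribution given its parents yields a valid joint distribution; they are not taken from the data of this study.

```python
from itertools import product

# Hypothetical DAG: E1 -> E2 -> E3, each node with binary states 0/1.
# All probabilities are made-up illustration values.
p_e1 = {0: 0.6, 1: 0.4}
p_e2_given_e1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # indexed [e1][e2]
p_e3_given_e2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # indexed [e2][e3]


def joint(e1, e2, e3):
    """Pr(E1, E2, E3) as the product of each node given its parents."""
    return p_e1[e1] * p_e2_given_e1[e1][e2] * p_e3_given_e2[e2][e3]


# Summing the factorized joint over all states should give 1.
total = sum(joint(*states) for states in product((0, 1), repeat=3))
```

Each node needs only the table conditioned on its own parents, which is the source of the scale benefit mentioned above: the full joint table is never stored.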
For the present work, the Hill-Climbing algorithm ("search and score" approach) was used to construct the BN from the means of each plot. The model was fitted using the "bnlearn" package of the R software (Scutari 2009). The environmental correlation was estimated using Pearson's correlation, to quantify the relationship between environments associated by the BN. The magnitude of the correlations was interpreted according to the limits proposed by Carvalho et al. (2004), where: r = 0.0 (no correlation); 0.0 < |r| < 0.30 (weak); 0.30 < |r| < 0.60 (intermediate); 0.60 < |r| < 0.90 (strong); 0.90 < |r| < 1 (very strong); and |r| = 1 (perfect). After the network was "learned" from the environment data, additional conventional methodologies were applied to validate the BN. Therefore, joint analysis was performed considering all six environments, using the following model:

$$Y_{ijk} = m + B(A)_{jk} + G_i + A_j + GA_{ij} + E_{ijk}$$

where $Y_{ijk}$ is the phenotypic observation; $m$ is the general mean; $B(A)_{jk}$ is the effect of the k-th block within the j-th environment;

CB Amaral et al.
$G_i$ is the effect of the i-th genotype; $A_j$ is the effect of the j-th environment; $GA_{ij}$ is the effect of the interaction between the i-th genotype and the j-th environment; and $E_{ijk}$ is the random error or residual. All effects, except for the error, were considered fixed.
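The correlation step described in this section can be sketched as follows. The function names are hypothetical, and because the published limits use strict inequalities, the assignment of the boundary values (0.30, 0.60, 0.90) to the upper class is an assumption of this sketch rather than part of the classification of Carvalho et al. (2004).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def classify_correlation(r):
    """Label |r| with the classes quoted in the text.

    Assumption: boundary values fall into the upper class.
    """
    a = abs(r)
    if a == 0:
        return "no correlation"
    if a < 0.30:
        return "weak"
    if a < 0.60:
        return "intermediate"
    if a < 0.90:
        return "strong"
    if a < 1:
        return "very strong"
    return "perfect"
```

For instance, `classify_correlation(0.88)` returns `"strong"` and `classify_correlation(-0.20)` returns `"weak"`, since the classes depend only on the absolute value.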
Environmental stratification was carried out using the conventional approach proposed by Cruz and Castoldi (1991). This method can be used when the GE interaction is significant between a pair of environments, and it decomposes this interaction into two parts. The first part, called simple interaction, is determined by the difference in variability between genotypes in the environments; the second part, called complex interaction, is given by the absence of correlation between genotypes under environmental variation (Cruz et al. 2012). Moreover, this methodology allows Pearson's and Spearman's correlations to be estimated. In this method, the lowest percentages of simple interaction represent the most dissimilar environments.
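The decomposition can be sketched numerically. The sketch assumes the commonly cited form of the Cruz and Castoldi (1991) partition, in which the pairwise interaction mean square is ½(√Q1 − √Q2)² + (1 − r)√(Q1·Q2) and the complex part is √((1 − r)³·Q1·Q2); the function name and the example values are hypothetical, and the output of the Genes software may differ in scaling or rounding.

```python
from math import sqrt


def decompose_ge(q1, q2, r):
    """Split the pairwise G x E mean square into simple and complex parts.

    q1, q2: mean squares of genotypes in environments 1 and 2.
    r: correlation between genotype means in the two environments.
    Returns (simple, complex, percent_simple).
    """
    total = 0.5 * (sqrt(q1) - sqrt(q2)) ** 2 + (1 - r) * sqrt(q1 * q2)
    complex_part = sqrt((1 - r) ** 3 * q1 * q2)
    simple_part = total - complex_part
    percent_simple = 100 * simple_part / total if total > 0 else float("nan")
    return simple_part, complex_part, percent_simple


# Hypothetical values: perfectly correlated genotype means, unequal variances.
simple, complex_, pct = decompose_ge(4.0, 9.0, 1.0)
```

With r = 1 the complex part vanishes and the interaction is entirely simple; with r = 0 and equal mean squares the interaction is entirely complex. Under this form, negative correlations can push the complex part above the total, so the simple percentage may become negative.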
The simple part of the mean square of the interaction (MSGxE) was obtained for plant height, lodging, and grain yield, using the following formula:

$$S = \frac{1}{2}\left(\sqrt{Q_1} - \sqrt{Q_2}\right)^2 + \left[1 - r - \sqrt{(1 - r)^3}\right]\sqrt{Q_1 Q_2}$$

where $Q_1$ and $Q_2$ were the mean squares of genotypes in environments 1 and 2, respectively; and $r$ was the correlation between the genotype means in both environments. The percentage of the simple interaction of MSGxE is expressed as follows:

$$S(\%) = \frac{S}{MS_{GxE}} \times 100$$

Another estimation method was proposed by Lin (1982), which considers the sum of squares for the interaction between genotypes and pairs of environments, and subsequently clusters the environments with non-significant interaction. Afterward, the method estimates the sum of squares between genotypes and groups of three environments at a time, and uses the F test to evaluate the possibility of creating a new group. The sum of squares for each pair of environments was estimated from the means, according to Cruz et al. (2012). The highest values represent the most similar environments. The Genes software (Cruz 2013) was used to run the methods of Cruz and Castoldi (1991) and Lin (1982).
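The pairwise quantity used in Lin's first step can be sketched as below. The sketch assumes the standard two-way interaction sum of squares computed from genotype means for two environments, which reduces to half the corrected sum of squares of the between-environment differences; the function name and the data are hypothetical, and the value reported by the Genes software may include additional scaling (for example by the number of replications).

```python
def pairwise_ge_ss(env_a, env_b):
    """G x E interaction sum of squares for one pair of environments.

    env_a, env_b: genotype means in each environment, in the same
    genotype order. Equals 0.5 * [sum(d^2) - (sum d)^2 / g], where d is
    the per-genotype difference and g the number of genotypes.
    """
    g = len(env_a)
    diffs = [a - b for a, b in zip(env_a, env_b)]
    sum_sq = sum(d * d for d in diffs)
    return 0.5 * (sum_sq - sum(diffs) ** 2 / g)


# Hypothetical means of three genotypes in three environments.
means = {"E1": [1, 2, 3], "E2": [2, 4, 9], "E3": [2, 3, 4]}
ss_e1_e2 = pairwise_ge_ss(means["E1"], means["E2"])
ss_e1_e3 = pairwise_ge_ss(means["E1"], means["E3"])
```

Here E1 and E3 differ by a constant for every genotype, so their interaction sum of squares is zero, whereas E1 and E2 rank and space the genotypes differently and give a positive value.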

RESULTS AND DISCUSSION
The experimental coefficient of variation (Table 1) was classified as intermediate for plant height and grain yield, and as very high for lodging, according to Fritsche-Neto et al. (2012). The coefficient of variation is an adequate measure of experimental precision and of the accuracy of the estimated means (Cargnelutti Filho and Storck 2007). Lodging usually presents high phenotypic coefficients of variation, as reported by Nzuve et al. (2014), because the environment affects this trait unevenly across plots.
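For reference, the experimental coefficient of variation is conventionally computed from the residual mean square of the analysis of variance and the general mean. The sketch below assumes this usual definition; the function name and the input values are hypothetical and are not taken from Table 1.

```python
from math import sqrt


def coefficient_of_variation(ms_error, grand_mean):
    """Experimental CV (%) = 100 * sqrt(residual mean square) / general mean."""
    return 100 * sqrt(ms_error) / grand_mean


# Hypothetical residual mean square and general mean.
cv = coefficient_of_variation(4.0, 10.0)
```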
The joint analyses of variance revealed significant effects at 1% probability for environments, genotypes, and the GE for all traits (Table 1), indicating differences among environments, variability among genotypes, and different responses of genotypes to environmental conditions. Quantitative traits usually present genotype x environment interaction (Fan et al. 2007), which requires unfolding the interaction into simple and complex parts.
The unfolding of the GE into the percentage of simple effect for each pair of environments, by the method proposed by Cruz and Castoldi (1991), showed that the interaction was mostly simple between the pairs of environments for all traits (Table 2).
Synthetic varieties are composed of a high number of genotypes, leading to great variability within the population, as demonstrated by Semagn et al. (2014), who found significant genetic variability within open-pollinated varieties. This variability implies high stability, defined as the ability to maintain performance across multiple environments (Mansfield and Mumm 2014). In this case, the stability results in predominance of simple interaction; that is, the great number of genotypes in the population makes the mean performance predictable, regardless of the environmental effects.
The clustering of environments by Lin's method (Lin 1982) formed one group for plant height and lodging, and four groups for grain yield (Table 3), as expected, due to the higher percentage of complex interaction between pairs of environments for plant height and lodging. For plant height, the group was formed by environments 1, 3, 4, 5 and 6. However, the decomposition of the genotype x environment interaction indicated predominance of complex interaction between pairs 1 x 4, 1 x 6, 3 x 4, 3 x 6, 4 x 6 and 5 x 6 (Table 2), suggesting inconsistency between the results of the unfolding of the interaction and those obtained with Lin's (1982) method. The same disparity was observed for lodging, where the pairs 2 x 6, 2 x 5, 3 x 6, 4 x 5, 4 x 6 and 5 x 6 presented predominance of complex interaction and were allocated to the same group.
The groups formed for grain yield were 1-2-3-5, 2-6, 4-5 and 3-4 (Table 3), and the pair 2-6 presented 70.62% of complex interaction. Cruz et al. (2012) also reported differences between the environment clustering using Lin's algorithm and the environment clustering using the method of Cruz and Castoldi (1991), where the interaction was predominantly simple. The authors also observed that this inconsistency was not a barrier, since the interaction detected within the group was simple. However, the use of the simple interaction became a problem, since the result is given for each pair of environments, making it difficult to stratify the environments.
The clustering of environments by Lin's algorithm aims to allocate to the same group the environments that presented lack of interaction (Peluzio et al. 2012). According to Mendonça et al. (2007), this is a less selective clustering method than the simple interaction, which leads to differences between the methods. An explanation for the inconsistencies is that Lin's algorithm estimates the sum of squares between genotypes and pairs of environments to form the groups, and the lowest value is used to form the initial group. The significance of the interaction between pairs of environments is tested considering the sum of squares and the mean square of the GE interaction (Cruz et al. 2012). The division of the GE interaction proposed by Cruz and Castoldi (1991) decomposes the interaction component without considering its significance, while Lin's method uses a variance ratio to cluster the environments.
In the BN, it is possible to observe the joint distribution and conditional dependence of a data set for prediction purposes. Thus, the information provided by the BN could be used to determine the predictive capacity of the environments, allowing the clustering of environments when this information is associated with that provided by the simple or complex interaction between environments.
The BN for plant height detected the most similar pair of environments, 2 x 4, with almost 91% of simple interaction, according to the method proposed by Cruz and Castoldi (1991), and classified this pair as the most important (Figure 1). The representation of the DAG, besides the visualization of the relations between traits, allows the categorization of the parameters, since the most relevant environments are allocated in the upper part of the figure (Felipe et al. 2015).
The correlations between the connected environments were classified as intermediate or strong (Figure 1). The BN also captured the highest correlation, between environments 2 x 4 (r = 0.88), and showed no connection between the least correlated environments, 1 x 6 (r = 0.35).
The complex interaction was predominant between pairs 1 x 2, 1 x 4, 1 x 6, 3 x 4, 3 x 6, 4 x 6 and 5 x 6 for plant height (Table 2), indicating that environment 6 is the least similar, which was demonstrated by the DAG. The only discrepancy was the pair 1 x 4, which presented complex interaction of 51% and was connected by the BN. However, this could be explained by the low magnitude of the complex interaction, which is close to 50%.
Although the BN for lodging was not efficient in discriminating the pair with the highest percentage of simple interaction (pair 1 x 4), it was able to identify the most correlated environments (pair 2 x 3, r = 0.73). The correlations were classified as weak, intermediate or strong, and a discrepancy was observed between environments 5 x 6, since this pair presented the lowest correlation (r = -0.20) and complex interaction of 93.83%. Complex interaction was also detected between pairs 2 x 5, 2 x 6, 3 x 6, 4 x 5, and 4 x 6 (Table 2).
For grain yield, the pair with the highest percentage of simple interaction (pair 2 x 5) was not identified in the DAG (Figure 3). The environmental correlations were classified as intermediate and strong, and the BN did not identify the highest correlation (pair 2 x 5), probably due to the proximity of the correlation value of this pair to that of the pair 1 x 3, which was indicated by the BN. No discrepancies were observed: the pairs 1 x 4, 1 x 6, 2 x 6, and 3 x 6, which presented predominance of complex interaction, were not associated by the BN.

For plant height and lodging, environment 2 was considered the most important in the DAG, while for grain yield, the same environment was considered one of the least important. The similar results for plant height and lodging can be explained by the close association between these two traits (Shi et al. 2016). In the case of grain yield, despite the association with plant height and lodging, correlation is not always observed (Rafiq et al. 2010, Nzuve et al. 2014).

Considering all traits, the BN was efficient in indicating, directly or indirectly, the pair of environments with the highest percentage of simple interaction for plant height and grain yield, and not efficient in the case of lodging, which can be explained by the high experimental error of this trait (Table 1). A high coefficient of variation indicates high environmental variability (Keshavarzi et al. 2015), which decreases the predictive power of the BN. Therefore, the BN was effective in identifying similarities between environments for plant height and grain yield, and consequently facilitated the joint analysis of the environments per se. The BN could be advantageous to plant breeders since it allows a great number of environments to be used and does not require pairwise analysis.

CONCLUSIONS
The Bayesian network is efficient in connecting environments with predominance of simple interaction for traits with high experimental precision, while Lin's method allocated environments with complex interaction to the same group. Therefore, the Bayesian network is a practical method to analyze a network of environments and detect similarity, without the need for pairwise analysis of the environments. In addition, this method has the potential to cluster environments; however, the results must be associated with information on the type of interaction predominating between environments.