Artificial neural network analysis of genetic diversity in Carica papaya L.

The study of genetic diversity is fundamental in the preliminary selection of accessions with superior characteristics and for the successful use of these genotypes in breeding programs. The purpose of this study was to evaluate, as a strategy for genetic diversity analysis, the bioinformatics approach called artificial neural network. Eight quantitative traits were evaluated in thirty-seven papaya accessions, based on the average of three growing seasons, in a randomized complete block design with two replications. By Anderson's discriminant analysis, 91.90 % of the accessions were correctly classified in the groups previously defined by the artificial neural network. It was concluded that the artificial neural network technique is feasible for classifying the accessions. The presence of significant genetic diversity among accessions was observed.


INTRODUCTION
Plant breeding is the most valuable strategy to increase productivity in a sustainable and ecologically balanced way. Genetic variability indicates the possibility of preliminary selection of accessions with superior characteristics and a successful use of these genotypes in breeding programs (Dantas and Morales 1996).
The genetic diversity in germplasm collections can be studied based on qualitative or quantitative morphological traits. In this case, several statistical techniques can be used to predict diversity.
For papaya, such studies have been conducted focusing on the evaluation of segregating populations by estimating genetic parameters, using selection indices and estimates of correlations among traits related to fruit yield and quality. In addition, these studies were supported by classical procedures and/or assisted by DNA markers (Silva et al. 2007a, 2007b, 2007c, 2007d, Ramos et al. 2007a, b, c, Silva et al. 2008).
In other perennial species, the multivariate approach has been successfully used to detect genetic diversity in study populations.
This approach was also used to characterize the genetic structure of the sampled plants as criteria and indicators for the selection of promising genotypes for breeding programs, aside from the conservation of germplasm of the sampled species (Arriel et al. 2006, Oliveira et al. 2006, Viana et al. 2006, Oliveira et al. 2007, Silva et al. 2009).
Bioinformatics refers to the application of computational and mathematical techniques in biological analysis. The artificial neural network is a bioinformatics technique with a multivariate approach.
Bioinformatics approaches can be applied to any situation in fruit tree cultivation in which one wants to predict something, recognize a pattern, or perform a cluster analysis (Ruggiero et al. 2003).
Artificial neural network technology has been applied in agriculture in different ways, e.g., in the identification of early stages of pest or disease development, the classification of satellite images for various purposes (Chagas et al. 2007, 2009, Costa and Souza Filho 2008, Schimith et al. 2009, Vieira et al. 2009, Watanabe et al. 2009, França 2010), or the control of robots (Pessin et al. 2007), among others.
According to Galvão et al. (1999), due to their nonlinear structure, artificial neural networks can capture more complex features of the data, which is not always possible with traditional statistical techniques. For Sudheer et al. (2003), the greatest advantage of artificial neural networks over conventional methods is that they do not require detailed information about the physical processes of the system to be modeled.
The use of neural networks associated with classification methods is a promising path. These classifiers have the advantage of being nonparametric, requiring small samples for training (Kavzoglu and Mather 2003) and tolerating missing data (Bishop 1995).
In this sense, the purpose of this study was to evaluate the feasibility of artificial neural networks as a technique for genetic diversity analysis of Carica papaya L., implementing an artificial neural network (computer program), according to the model proposed by Kohonen (1982), which can propose a classification and the formation of divergent groups of the study accessions based on data extraction from the database.

Experimental design and plant material
In three growing seasons (May 2007, August 2007 and November 2008), 37 accessions of the germplasm bank were assessed at Fazenda Caliman Agrícola S/A, in Linhares, State of Espírito Santo.
The experiment was arranged in a randomized block design, suitable for germplasm evaluation, with two replications and 20 plants per plot in double rows, spaced 3.6 x 2.0 x 1.5 m. Once planted, the accessions were evaluated in three growing seasons.
Since genetic diversity is verified based on the study of plant and agronomic traits in the search for genetic variability, the following traits were evaluated: fruit weight, fruit length, fruit diameter, flesh thickness, external and internal fruit firmness, soluble solids and incidence of skin freckles.
In all measurements, three fruits at the stage of maturity (yellowing) were used from three plants per plot. The average fruit weight was determined on an electronic scale. Fruit length (longitudinal) and diameter (transversal) were measured with calipers. The flesh thickness was measured as the lateral length of the fruit flesh, using a caliper. The fruit firmness was measured after dividing the fruit in half, in the transverse direction. The firmness was determined at four equidistant points on each fruit half, at a distance of 0.5 cm from the skin, based on the resistance of the flesh to penetration. For this purpose, a penetrometer (Fruit Pressure Tester, Italy; model 53205) with an adapter (height 3.0 cm, diameter 3.0 cm) was used.
The content of soluble solids of the fruit juice was read using a portable refractometer (Atago N1), with readings ranging from 0° to 32° Brix. The juice was extracted from a sample of flesh tissue from the middle of the fruit, using a hand juicer. The incidence of physiological leaf spot was determined on a 0-5 score scale, according to the degree of incidence, and means were calculated based on three scores of each sample.
Papaya (Carica papaya L.) accessions of the groups Solo and Formosa were evaluated (Table 1).

Artificial neural network
To study the genetic diversity among accessions, a computer program for artificial neural networks was implemented. The development methodology was based on the model of Kohonen (1982).
There are many models of artificial neural networks, but the one proposed by Kohonen (1982) has the advantage of not requiring any theory for the data organization, making it suitable for this study.
The traditional Kohonen model was modified according to the characteristics and needs of this study. The methodology presented here was based on concepts proposed by Haykin (2001).
The artificial neural network consisted of an n x m input matrix, where n are accessions and m input elements or characters, which together represent the input vector X, and of k output neurons, representing the classes into which the accessions can be grouped, determined as: n = 37 accessions, m = 8 characters and k = 4 classes.

CD Barbosa et al.
The definition of the number of groups was random and is an adjustable parameter in the program developed.
For a given input, only one output neuron is activated, indicating the class to which the accession belongs. The classes should group accessions with similar characteristics. Consequently, the classification was based on similarity of values.
The process consisted of finding the best-matching neuron in terms of similarity (winner) i(X) in time step t, using the criterion of minimum distance between accessions.
An input pattern to the neural network considered the average of the three growing seasons and was expressed as:

$$X_n = [x_{n1}, x_{n2}, \ldots, x_{nm}]^T$$

It was assumed that the synaptic weight vector represented the characteristic plant of the group formed and was randomly defined, based on the input data, as follows:

$$W_k = [w_{k1}, w_{k2}, \ldots, w_{km}]^T$$

The synaptic weight vector is the criterion for acceptance or rejection of a group of accessions or plants.
The similarity between the input and the neuron was measured as the average Euclidean distance between vectors $X_n$ and $W_k$, calculated by:

$$d(X_n, W_k) = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\left(x_{nj} - w_{kj}\right)^2}$$

The output layer unit with the lowest average Euclidean distance is considered the best. When using index i(X) to identify the neuron most similar to vector $X_n$, known by the network at the moment, i(X) is expressed as:

$$i(X) = \arg\min_{k} d(X_n, W_k)$$

Subsequently, the synaptic weight vectors of neurons were adapted, according to the updating formula below.
Given the synaptic weight vector $W_k(t)$ of neuron k at time t, the updated weight vector $W_k(t+1)$ at time t + 1 was defined (Kohonen 1982) as:

$$W_k(t+1) = W_k(t) + \eta(t)\left[X_n - W_k(t)\right]$$

which was applied to the winner neuron i, where $\eta(t)$ is the learning rate parameter, which must vary in time, starting at a value close to 0.1 and decreasing gradually while remaining above 0.01. The learning rate determines the speed at which the network stabilizes.
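The competitive learning procedure described in this section could be sketched roughly as follows. This is a minimal illustration and not the program used in the study: the function names `assign` and `train_competitive`, the linear decay schedule, the number of epochs, the choice of randomly picked input rows as initial weights, and the root-mean-square form of the average Euclidean distance are all assumptions. Neighbourhood updates of a full Kohonen map are left out, since the text applies the update to the winner neuron only.

```python
import numpy as np


def assign(X, W):
    """Return, for each row of X, the index of the closest weight vector."""
    # Average Euclidean distance (assumed form): root of the mean squared
    # difference over the m traits.
    d = np.sqrt(np.mean((X[:, None, :] - W[None, :, :]) ** 2, axis=2))
    return np.argmin(d, axis=1)


def train_competitive(X, k=4, epochs=100, eta_start=0.1, eta_end=0.01, seed=0):
    """Winner-take-all training; returns (weights, labels)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumption: initial weights are k distinct, randomly chosen input rows.
    W = X[rng.choice(n, size=k, replace=False)].astype(float)
    for t in range(epochs):
        # Learning rate starts at eta_start and decreases linearly,
        # staying strictly above eta_end.
        eta = eta_end + (eta_start - eta_end) * (1 - t / epochs)
        for i in rng.permutation(n):
            d = np.sqrt(np.mean((W - X[i]) ** 2, axis=1))
            winner = int(np.argmin(d))
            # Only the winning neuron is moved toward the input pattern.
            W[winner] += eta * (X[i] - W[winner])
    return W, assign(X, W)
```

With an input matrix of 37 rows and 8 columns, a call such as `train_competitive(X, k=4)` would return four weight vectors and one class label per accession. Because the initial weights are random, the class numbering can change between runs even when the grouping itself does not.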

Anderson's discriminant analysis
The multivariate technique, called discriminant function, proposed by Anderson (1958), was used in this study to verify the consistency of grouping by artificial neural network analysis.
The groups were defined by the neural network and compared with the clustering by Anderson's discriminant technique. Based on Anderson's discriminant technique, the percentage of correct and incorrect classification of any other technique can be determined.
Using the discriminant functions and the data of the populations $p_j$ themselves, the apparent error rate (AER) was estimated, which measures the efficiency of these functions in classifying the accessions correctly in the previously established populations.
The apparent error rate was given by the ratio between the number of erroneous classifications and the total number of classifications (Cruz and Carneiro 2003), according to:

$$AER = \frac{1}{N}\sum_{j=1}^{g} m_j$$

where $m_j$ is the number of observations of population $p_j$ that were classified by the discriminant functions in another population $p_{j'}$, with $j' \neq j$ and $j = 1, 2, \ldots, g$ populations, considering:

$$N = \sum_{j=1}^{g} n_j$$

where $n_j$ is the number of observations related to population $p_j$.
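As an illustration of this ratio, the apparent error rate can be computed from a table of classification counts. The sketch below assumes a square matrix whose rows are the previously established populations and whose columns are the populations assigned by the discriminant functions. The function name `apparent_error_rate` and the example counts are hypothetical and are not data from this study.

```python
import numpy as np


def apparent_error_rate(confusion):
    """confusion[j, j2]: observations of population j assigned to population j2."""
    confusion = np.asarray(confusion, dtype=float)
    n_total = confusion.sum()                       # N, the sum of all n_j
    misclassified = n_total - np.trace(confusion)   # sum of all m_j
    return misclassified / n_total


# Hypothetical counts for three populations (illustration only).
example = [[9, 1, 0],
           [0, 8, 2],
           [1, 0, 9]]
print(f"AER = {apparent_error_rate(example):.2%}")
```

The diagonal holds the correct classifications, so everything off the diagonal counts toward the error.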

Artificial neural network
The classification by the artificial neural network, based on mean values of three growing seasons, considering eight quantitative traits in 37 accessions, is presented in Table 2.
Four groups were estimated for the classification, to ensure the distinction of the accessions in groups, based on the main characteristics.
Group 1, with most accessions, was composed of 15 accessions of the Solo group and three accessions of the Formosa group, which produced the smallest fruits. The average values of the fruit characteristics were fruit weight 378.72 g, fruit length 12.96 cm, fruit diameter 7.90 cm, flesh thickness 2.16 cm, external fruit firmness 86.87 N and flesh firmness 64.48 N.
Group 2 consisted of two accessions of the Formosa group, which produced the largest fruits. The average values of the fruit characteristics were fruit weight 2159.72 g, fruit length 29.13 cm, fruit diameter 12.57 cm, flesh thickness 3.16 cm, external fruit firmness 194.68 N and flesh firmness 146.61 N.
Group 3 comprised two Solo and seven Formosa accessions. The average values of the fruit characteristics were fruit weight 640.81 g, fruit length 17.24 cm, fruit diameter 9.21 cm, flesh thickness 2.49 cm, external fruit firmness 142.94 N and flesh firmness 104.67 N.
Group 4 consisted of eight accessions of the Formosa group. The average values of the fruit characteristics were fruit weight 1142.95 g, fruit length 22.44 cm, fruit diameter 10.47 cm, flesh thickness 2.89 cm, external fruit firmness 184.52 N and flesh firmness 139.46 N.
The presence of genetic diversity was significant, as the formation of heterotic groups demonstrated.
The groups generated by the artificial neural network facilitate the selection of divergent genotypes for improvement by the generation of hybrids, since they allow the selection of genotypes indicated for crosses from different heterotic groups. Thus, the probability of obtaining superior genotypes is greater.
In studies considering the same population evaluated in this study (Quintal et al. 2007a), the genetic parameters (variances) were estimated, observing the phenotypic variability for all traits in question, thus confirming the possibility of identifying genotypes.
By other techniques, Quintal et al. (2007b) showed the presence of genetic variability in these accessions and reported a significant difference for the genotype effect, for the traits assessed in this study.
The possibility of parent selection for breeding programs targeting superior genotypes in Carica papaya L. populations was corroborated, e.g., by Cardoso et al. (2009), who observed high genetic diversity of traits related to papaya seed quality, and in studies published by Silva et al. (2007b, 2008) that also indicate the availability of genetic variability among Carica papaya L. accessions.
In the groups formed by the estimation of genetic divergence between genotypes in the study population, one must compare the existing possibilities for the case of perennial plants with sexual propagation, with the possibility to obtain lines via selfing and/or to obtain new hybrids.
For the selection of new crosses, in order to exploit heterosis and allele diversity, the agronomic performance of genotypes should also be taken into account, to indicate crosses between the most divergent genotypes with the best agronomic performance in the traits studied. This should lead to a better allele complementation, resulting in improved genotype performance in future generations.
Another possibility is to directly recommend promising genotypes for further evaluation together with other accessions, to assess the possibility of their recommendation as new varieties.
These indications of crosses agree with Dias et al. (1994), who emphasized the possibility of considering not individual genotypes but a genotype group, in studies involving a large number of genotypes. In this case, crosses among accessions from the most distant groups and with agronomic traits of interest are to be preferred.
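One way to see which groups are the most distant is to compare the group means reported above for the six fruit traits. The sketch below is only illustrative: scaling each trait by its range across groups is an assumption made here so that fruit weight (in grams) does not dominate, the names `group_means` and `pairwise_group_distances` are hypothetical, and soluble solids and skin freckles are left out because their group means are not given in the text.

```python
from itertools import combinations

import numpy as np

# Group means from the text: fruit weight (g), fruit length (cm),
# fruit diameter (cm), flesh thickness (cm), external firmness (N),
# flesh firmness (N).
group_means = np.array([
    [378.72, 12.96, 7.90, 2.16, 86.87, 64.48],      # group 1
    [2159.72, 29.13, 12.57, 3.16, 194.68, 146.61],  # group 2
    [640.81, 17.24, 9.21, 2.49, 142.94, 104.67],    # group 3
    [1142.95, 22.44, 10.47, 2.89, 184.52, 139.46],  # group 4
])


def pairwise_group_distances(means):
    """Average Euclidean distance between group means on range-scaled traits."""
    low = means.min(axis=0)
    span = means.max(axis=0) - low
    scaled = (means - low) / span   # assumption: range scaling per trait
    return {
        (a + 1, b + 1): float(np.sqrt(np.mean((scaled[a] - scaled[b]) ** 2)))
        for a, b in combinations(range(len(means)), 2)
    }


distances = pairwise_group_distances(group_means)
most_distant = max(distances, key=distances.get)
print(most_distant, round(distances[most_distant], 3))
```

Since group 1 has the lowest and group 2 the highest mean for every trait listed, this pair comes out as the most distant under any per-trait scaling. Agronomic performance would still have to be weighed before any cross is chosen.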
Considering that the synaptic weights were randomly generated at the implementation of the artificial neural network and that these initial values could affect the final classification, two types of synaptic weights were used. Random weights were used as well as weights representing the averages of the groups proposed by the principal component technique, resulting in no difference in the classification.
In the development of the neural network, two databases were tested to feed the input layer of the network: original means and standardized means, by the technique of principal components.
It was observed that the neural network was not influenced by the scale of the input data. The classification by original data was the same as when using standardized data.
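Checking whether two runs produce the same classification requires a comparison that ignores how the classes happen to be numbered, since the class labels of a competitive network are arbitrary. A minimal sketch of such a check is given below. The function name `same_partition` is hypothetical, and the text does not state how the classifications were actually compared.

```python
def same_partition(labels_a, labels_b):
    """True if both label sequences define the same groups, ignoring label names."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


# Same grouping with different class numbers.
print(same_partition([0, 0, 1, 2], [2, 2, 0, 1]))
# One accession changes group.
print(same_partition([0, 0, 1, 1], [0, 1, 1, 1]))
```

The same check could be applied to runs with different initial weights or with original versus standardized inputs.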
An important aspect that should be emphasized is that the data are very group-specific. The neural network tends to perform best when the data are more heterogeneous, characterizing the plants with regard to their groups.

Anderson's discriminant analysis
This procedure was adopted on the assumption that the group to which the accessions belong is known. Thus, the consistency of the grouping was verified by Anderson's discriminant analysis, as described by Cruz and Carneiro (2003).
When considering the groups proposed by the artificial neural network, the apparent error rate, according to Anderson's discriminant analysis, was 8.10 %. That is, based on Anderson's discriminant analysis, 91.90 % of the accessions were classified correctly in the previously defined groups.
The percentage of correct and incorrect classification of each group, detected by Anderson's discriminant analysis, based on the classification proposed by the neural network, as shown in Table 3, should be analyzed as follows: the main diagonal contains the percentage of correct classification for each group. All other fields correspond to misclassification. To determine the percentage of misclassification of a particular group, the respective row should be consulted.
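The row-wise reading of such a table can be illustrated by converting classification counts into row percentages. In the sketch below, the function name `row_percentages` is hypothetical. The single row of counts is an assumption chosen to be consistent with a group of 18 accessions in which one accession is assigned to group 3. The remaining rows are omitted because their counts are not given in the text.

```python
import numpy as np


def row_percentages(confusion):
    """Convert counts to percentages of each row (each original group)."""
    confusion = np.asarray(confusion, dtype=float)
    return 100.0 * confusion / confusion.sum(axis=1, keepdims=True)


# Assumed counts for one group of 18 accessions: 17 kept in group 1,
# 1 assigned to group 3 (columns are groups 1 to 4).
group1_counts = [[17, 0, 1, 0]]
print(np.round(row_percentages(group1_counts), 2))
```

Each row sums to 100 %, so the off-diagonal entries of a row give the misclassification percentages of that group.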
For example, according to Anderson's discriminant analysis, the artificial neural network classified 94.44 % of the accessions correctly in group 1, while 5.56 % of the accessions that should belong to group 1 were incorrectly classified in group 3. Mariot et al. (2008) worked with accessions of M. ilicifolia and M. aquifolium and found an apparent error rate of 10.48 % when the accessions were grouped by the Tocher method, which was considered adequate by the authors.
Assis et al. (2003) verified the consistency of clusters of Brachiaria species based on Anderson's discriminant analysis and found high apparent error rates, for example, 69.33 % for B. brizantha. Sudre et al. (2006) concluded that the discriminant function of Anderson is highly promising for the characterization and management of germplasm banks, as an additional tool to measure the consistency of the classification based on the various multivariate analysis methods used.

Table 3.
Anderson's (1958) discriminant analysis, in the classification proposed by the artificial neural network, considering eight quantitative traits evaluated in 37 accessions of Carica papaya L.

Table 1.
Accessions of the Carica papaya L. germplasm bank of Caliman/UENF, of the groups Solo and Formosa evaluated

Table 2.
Papaya (Carica papaya L.) accessions grouped by the artificial neural network, based on the means of three growing seasons, considering eight quantitative traits evaluated in 37 accessions