High-efficiency phenotyping for vitamin A in banana using artificial neural networks and colorimetric data

BASIC AREA - Article
César Fernandes Aquino1*, Luiz Carlos Chamhum Salomão1, Alcinei Mistico Azevedo2
1. Universidade Federal de Viçosa, Departamento de Fitotecnia, Viçosa (MG), Brazil.
2. Universidade Federal de Minas Gerais, Instituto de Ciências Agrárias, Montes Claros (MG), Brazil.
*Corresponding author: cesarfernandesaquino@yahoo.com.br
Received: Oct. 2, 2015 – Accepted: Dec. 17, 2015

Banana is one of the most consumed fruits in Brazil and an important source of minerals, vitamins and carbohydrates in the human diet. Characterizing superior banana genotypes makes it possible to identify those with the nutritional quality needed for cultivation and for use in breeding programs. However, the identification and quantification of provitamin A carotenoids are hampered by the cost of instruments and reagents for chemical analyses, and may become unworkable when the number of samples is high. Thus, the objective was to verify the potential of indirect phenotyping of the vitamin A content in banana through artificial neural networks (ANNs) using colorimetric data. Fifteen banana cultivars with four replications were evaluated, totaling 60 samples. For each sample, colorimetric data were obtained and the vitamin A content was estimated in the ripe banana pulp. Multilayer perceptron ANNs were used to predict the vitamin A content from the colorimetric data. Ten network architectures with a single hidden layer were tested. The network selected by the best fit (lowest mean squared error) had four neurons in the hidden layer and predicted vitamin A with high efficiency (r² = 0.98). The colorimetric parameters a* and hue angle were the most important in this study. Large-scale indirect phenotyping of vitamin A in banana pulp by ANNs is possible and feasible.


INTRODUCTION
The banana tree (Musa spp.) is one of the most cultivated fruit trees in tropical and subtropical countries. In Brazil, the production of bananas and plantains was 6.89 million tons in 485,000 hectares of harvested area in 2013 (FAO 2015). Due to their good organoleptic properties and low cost, bananas are consumed by people across the social spectrum and are a good source of minerals, vitamins and carbohydrates, with high potential as a functional and nutraceutical food (Amorim et al. 2011; Aquino et al. 2014). Among the functional and nutraceutical properties, the content of carotenoids such as lutein, β-carotene and α-carotene, which play an important role in the functioning of the human body, stands out. Moreover, β-carotene and α-carotene are converted to vitamin A in the human body (Davey et al. 2009).
Vitamin A deficiency is considered a serious nutritional disease and is the most common cause of preventable blindness in the world (Santos et al. 2010). One of the sustainable ways to mitigate the problem of vitamin A deficiency is to encourage the consumption of natural foods rich in provitamin A carotenoids, such as fruits and vegetables (Ekesa et al. 2012). Thus, screening the banana accessions held in collections is important to breeding programs that focus on developing cultivars with better nutraceutical properties (Amorim et al. 2011). However, the quantification of vitamin A content is expensive and may become unfeasible if the number of samples to be analyzed is high.
The carotenoid content and, consequently, the provitamin A content can be estimated indirectly from colorimetric data, which are easily measured in the pulp or peel of the fruit with a colorimeter. This analytical approach has been used in tomato (Carvalho et al. 2005; Fernandez-Ruiz et al. 2010), pumpkin (Seroczyńska et al. 2006; Itle and Kabelka 2009; Doka et al. 2013), and potato (Lu et al. 2001). The indirect estimate of the carotenoid content can reduce the time, labor and financial resources spent in the evaluation stages.
Because artificial neural networks (ANNs) are efficient at modeling complex problems (Barbosa et al. 2011; Nascimento et al. 2013; Azevedo et al. 2015; Brasileiro et al. 2015), they may also be effective in the indirect phenotyping of vitamin A content using colorimetric data. ANNs are computational models inspired by the human brain that can recognize patterns and regularities in data, and they serve as universal approximators of complex functions (Gianola et al. 2011). Consequently, they may perform better than conventional statistical models, with the advantages of being non-parametric, not requiring detailed information about the physical processes of the system to be modeled, and tolerating data loss (Azevedo et al. 2015).
Thus, the objective of the present research was to verify the phenotyping potential of the vitamin A content in banana, using ANNs and colorimetric data.

MATERIAL AND METHODS
The banana bunches were harvested when the first signs of yellow color appeared in the fruits of each cultivar. The bananas were removed from the second, third and fourth hands, and the damaged, diseased and malformed ones were discarded. Subsequently, they were immersed in ethephon solution (1.2 g·L⁻¹) for 8 min to even out ripening. After drying in air for 15 min, they were dipped in Prochloraz fungicide solution (0.49 g·L⁻¹) for 5 min. Then, the fruits remained at room temperature until completely ripe.
A completely randomized design was adopted, with 15 treatments (cultivars) and four replications (clusters), with six fruits per sample unit. The bananas were peeled and cut longitudinally, and the colorimetric reading was taken on the inside of the fruit using a Konica-Minolta colorimeter, model CR 10. The values of L*, a*, b*, C* and hue angle (°hue) were determined. L* (brightness) ranges from 0 (black) to 100 (white); a* ranges from green (−60) to red (+60); b* ranges from blue (−60) to yellow (+60); C* is chroma (saturation or color intensity). The °hue value ranges from 0° to 360°, with 0° being red, 90° yellow, 180° green and 270° blue (McGuire 1992).
Carotenoids were extracted according to the method proposed by Rodríguez-Amaya (2001), with modifications. A 5-g sample of plant material was weighed and 60 mL of cooled 100% acetone were added. Then, the material was processed in an Ultra-Turrax homogenizer (model T18 Basic) for 6 min. Subsequently, the extract was vacuum-filtered through a Buchner funnel using filter paper; then it was transferred to a separatory funnel containing 20 mL of cooled petroleum ether and washed with distilled water to remove the acetone completely. Anhydrous sodium sulfate p.a. was added to remove the residual water contained in the extract.
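As a side note on the color coordinates described above, chroma and hue angle are related to a* and b* through the standard CIELAB relations C* = sqrt(a*² + b*²) and °hue = atan2(b*, a*). The sketch below illustrates these relations only; the function name and the input values are hypothetical and are not readings from this study.

```python
import math

def chroma_and_hue(a_star, b_star):
    """Return (C*, hue angle in degrees, 0-360) from CIELAB a* and b*."""
    chroma = math.hypot(a_star, b_star)
    # atan2 returns an angle in (-180, 180]; wrap it into [0, 360).
    hue = math.degrees(math.atan2(b_star, a_star)) % 360.0
    return chroma, hue

# Hypothetical reading: a pure yellow color (a* = 0, b* > 0).
chroma, hue = chroma_and_hue(0.0, 30.0)
print(chroma, hue)
```

With this convention, a point on the positive b* axis falls at 90° (yellow) and a point on the negative a* axis at 180° (green), matching the scale given by McGuire (1992).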
The peaks of interest were identified by comparing the retention times of the standards and samples and, especially, through the absorption spectrum. Quantification was performed using standard curves of concentration versus area, and the results are expressed in μg per 100 g of plant material, on a wet basis. The carotenoids β-carotene and α-carotene were quantified, and the vitamin A content was obtained according to the recommendations of the Institute of Medicine (2001). All steps were performed under protection from direct light to prevent degradation of the material.
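The Institute of Medicine (2001) expresses vitamin A activity in retinol activity equivalents (RAE), counting 12 µg of dietary β-carotene or 24 µg of α-carotene as 1 µg RAE. The text does not spell out the factors applied, so the sketch below assumes these two; the function name and the carotenoid values are hypothetical.

```python
def retinol_activity_equivalents(beta_carotene_ug, alpha_carotene_ug):
    """Vitamin A activity (ug RAE) from provitamin A carotenoids (ug).

    Assumed conversion factors (Institute of Medicine 2001):
    12 ug beta-carotene = 1 ug RAE; 24 ug alpha-carotene = 1 ug RAE.
    """
    return beta_carotene_ug / 12.0 + alpha_carotene_ug / 24.0

# Hypothetical contents per 100 g of pulp, for illustration only.
rae = retinol_activity_equivalents(beta_carotene_ug=120.0, alpha_carotene_ug=48.0)
print(rae)  # 120/12 + 48/24 = 12.0
```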
Thus, the colorimetric parameters (L*, a*, b*, C* and °hue) and the vitamin A content were obtained for 60 samples (four replicates of 15 cultivars). These data were analyzed by ANNs in the R software (R Development Core Team 2012). To improve the efficiency of network training, both the input (color data) and output (vitamin A content) data were normalized to the range between 0 and 1 with the "normalizeData" function of the RSNNS package (Bergmeir and Benitez 2012).
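Scaling to the 0 to 1 range is a min-max transformation, x' = (x − min) / (max − min). The study used the RSNNS function in R; the Python sketch below only illustrates the arithmetic, with hypothetical function names. The first and last values are the reported extremes of vitamin A content, and the middle value is made up.

```python
def normalize_0_1(values):
    """Min-max scale a list of numbers to [0, 1]; also return the bounds."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant variable")
    return [(v - lo) / (hi - lo) for v in values], (lo, hi)

def denormalize(scaled, bounds):
    """Map scaled values back to the original units."""
    lo, hi = bounds
    return [s * (hi - lo) + lo for s in scaled]

vitamin_a = [1.330, 50.0, 141.968]  # ug per 100 g; 50.0 is hypothetical
scaled, bounds = normalize_0_1(vitamin_a)
restored = denormalize(scaled, bounds)
print(scaled)
```

Keeping the bounds is what allows network predictions, which are produced on the 0 to 1 scale, to be converted back to µg per 100 g of pulp.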
For the ANN analysis, 70% of the data (42 samples) were used to train the network and 30% (18 samples) for validation. The samples that formed the training and validation fractions were randomly selected. Multilayer perceptron (MLP) networks were used for the analysis and developed using the "mlp" function of the RSNNS package, with the backpropagation algorithm and a learning rate of 0.1. The maximum number of training epochs was 1,000; the activation function was logistic for the hidden layer and linear for the output layer. Ten network architectures were tested to find a trained network with good fit, with 1, 2, 3, …, 9 and 10 neurons in the hidden layer. Because the free parameters are randomly generated at the beginning of training and these initial values can influence the final result (Soares et al. 2014), each ANN architecture was trained 1,000 times. The network with the best fit was selected using the mean squared error (MSE) for the validation sample.
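The training protocol described above (random 70/30 split, one logistic hidden layer, linear output, backpropagation with learning rate 0.1, repeated random restarts, selection by validation MSE) can be sketched in plain Python. This is an illustration, not the RSNNS implementation used in the study: the data are synthetic, all function names are hypothetical, and the numbers of architectures, restarts and epochs are cut down so that it runs quickly.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_inputs, n_hidden, rng):
    # Each hidden neuron holds n_inputs weights plus a trailing bias.
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    # The linear output neuron holds n_hidden weights plus a trailing bias.
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output

def forward(net, x):
    hidden, output = net
    # zip stops at the shorter sequence, so the trailing bias is added apart.
    h = [logistic(sum(w * xi for w, xi in zip(ws, x)) + ws[-1]) for ws in hidden]
    y = sum(w * hj for w, hj in zip(output, h)) + output[-1]
    return y, h

def train(net, samples, epochs, learning_rate, rng):
    hidden, output = net
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, target in samples:
            y, h = forward(net, x)
            error = y - target
            # Hidden layer first, using the output weights before their update.
            for j, ws in enumerate(hidden):
                delta = error * output[j] * h[j] * (1.0 - h[j])
                for i, xi in enumerate(x):
                    ws[i] -= learning_rate * delta * xi
                ws[-1] -= learning_rate * delta
            for j, hj in enumerate(h):
                output[j] -= learning_rate * error * hj
            output[-1] -= learning_rate * error

def mse(net, samples):
    return sum((forward(net, x)[0] - t) ** 2 for x, t in samples) / len(samples)

rng = random.Random(42)
# Synthetic stand-in data: 60 samples, 5 inputs already scaled to [0, 1].
data = []
for _ in range(60):
    x = [rng.random() for _ in range(5)]
    data.append((x, 0.1 + 0.8 * x[1] + rng.gauss(0.0, 0.02)))
rng.shuffle(data)
train_set, valid_set = data[:42], data[42:]  # 70% training, 30% validation

best_mse, best_hidden, best_net = float("inf"), None, None
for n_hidden in range(1, 4):   # the study tested 1 to 10 neurons
    for _ in range(3):         # the study trained each architecture 1,000 times
        net = init_network(5, n_hidden, rng)
        train(net, train_set, epochs=50, learning_rate=0.1, rng=rng)
        score = mse(net, valid_set)
        if score < best_mse:
            best_mse, best_hidden, best_net = score, n_hidden, net
print(best_hidden, best_mse)
```

Selecting on the validation sample rather than the training sample is the design choice that matters here: it rewards networks that generalize, not networks that memorize the 42 training samples.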
For the selected network, the diagram of the network topology was obtained using the "plotnet" function (NeuralNetTools package). In addition, the relative importance of the input traits was obtained by the Garson method (1991) using the "garson" function (NeuralNetTools package). To determine the efficiency of network training, a regression analysis of the predicted and observed vitamin A levels was performed for the training and validation samples. The multiple comparison test by bootstrapping (Ramos and Ferreira 2009) was used to compare the best network architectures, and the BCa bootstrap method was used to obtain the 95% confidence intervals. The vitamin A levels observed and predicted by the ANNs were compared by the bootstrap paired test. In all analyses using the bootstrap technique, 10,000 simulations were used.
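The idea behind the Garson method can be illustrated with a short sketch. It follows one common statement of Garson's (1991) algorithm for a network with a single output: within each hidden neuron, each input's share of the absolute input weights is scaled by the absolute hidden-to-output weight, and the shares are then summed over hidden neurons and normalized. The implementation in the package used in the study may differ in detail; the function name and the weights below are hypothetical.

```python
def garson_importance(input_hidden, hidden_output):
    """Relative importance of each input for a single-output MLP.

    input_hidden[j][i]: weight from input i to hidden neuron j (no biases).
    hidden_output[j]:   weight from hidden neuron j to the output.
    """
    n_inputs = len(input_hidden[0])
    contribution = [0.0] * n_inputs
    for weights, v in zip(input_hidden, hidden_output):
        total = sum(abs(w) for w in weights)
        for i, w in enumerate(weights):
            # Share of input i within this hidden neuron, scaled by |v|.
            contribution[i] += abs(w) / total * abs(v)
    overall = sum(contribution)
    return [c / overall for c in contribution]

# Hypothetical weights: 2 inputs, 2 hidden neurons.
importance = garson_importance(
    input_hidden=[[1.0, -1.0], [3.0, 1.0]],
    hidden_output=[2.0, -4.0],
)
print(importance)  # contributions 4.0 and 2.0, i.e. shares of 2/3 and 1/3
```

Because only absolute values are used, the result ranks inputs by magnitude of contribution and says nothing about the direction of the effect.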

RESULTS AND DISCUSSION
The vitamin A content of the studied samples varied greatly, from 1.330 to 141.968 µg per 100 g of pulp, with a coefficient of variation of 149.144% (Table 1). This high variability is essential for the trained networks to be general enough (Azevedo et al. 2015). Among the colorimetric data, the parameter a* had the greatest variability, with the highest coefficient of variation (103.527%). The red/green opponent colors are represented on the a* axis, where positive values are red, negative values are green, and 0 is neutral (Trevisan et al. 2008). On the other hand, the L* parameter displayed the lowest variability, with a coefficient of variation of 6.191%. This parameter relates to lightness, ranging from 0 (perfect black) to 100 (perfect white).
For the ten tested network architectures, the smallest MSEs were observed for the lower numbers of neurons in the hidden layer (Figure 1a). Small MSE estimates indicate that the actual values and those predicted by the ANN are close or, in other words, that the networks are highly efficient. The multiple comparison test by bootstrapping (Ramos and Ferreira 2009) showed that the average network efficiency was best when using only one neuron in the hidden layer. A similar conclusion was evident from the coefficient of determination (r²) in Figure 1b, in which smaller numbers of neurons in the hidden layer also yielded better results. Non-parametric tests such as the bootstrap for multiple comparisons (Ramos and Ferreira 2009) are suitable for studies similar to this one, in which the MSE and r² generally do not follow a normal distribution.
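The two fit criteria can be made concrete with a small sketch. It assumes that r² is the squared Pearson correlation between observed and predicted values, as obtained from a simple linear regression of one on the other; the function names and data are hypothetical.

```python
def mean_squared_error(observed, predicted):
    n = len(observed)
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

# Hypothetical values: every prediction is too high by exactly 0.5.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(mean_squared_error(observed, predicted))  # 0.25
print(r_squared(observed, predicted))
```

In this example the predictions are perfectly correlated with the observations, yet the MSE is not zero because of the constant offset. The two measures are therefore complementary, which is one reason to report both.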
Generally, increasing the number of neurons per layer does not ensure the best network performance. Similar results were found by Soares et al. (2014) and Azevedo et al. (2015). An explanation for this is that a larger number of neurons in the network may lead to overfitting, which occurs when the training process memorizes the data in the training sample and does not identify the associations between the data in the input and output layers (Silva et al. 2010). In this case, a good fit is observed for the training sample while a very poor one is found for the validation sample. Therefore, network efficiency should always be checked with a sample whose data were not used in the training process, which is the validation sample.
The evaluation of the relative importance of the explanatory variables by the Garson method (1991) showed that the parameters a* and °hue were the most important (Figure 2), with relative contributions of 28.87 and 20.08%, respectively. This information is especially useful when it becomes advantageous to exclude traits to reduce the computational effort (Paliwal and Kumar 2011). A major contribution is expected for these traits because they had the highest correlation estimates with vitamin A (Table 1), 0.821 and −0.765 for a* and °hue, respectively.
Although, on average, a good fit was obtained with only one neuron in the hidden layer (Figure 1a,b), the best single fit was observed when using four neurons in the hidden layer (Figure 3). This can be explained by the high number of training runs (1,000) for each network architecture. A large number of training runs for each network architecture is recommended because the synaptic weights are randomly generated at the beginning of training (Soares et al. 2014) and, therefore, each run yields different results for the same architecture.
For the best-fitted network, excellent fits were found, with r² = 95.11% for the training sample (Figure 4a) and 98.50% for the validation sample (Figure 4b). The high r² value estimated for the validation sample indicates that the trained network is efficient and has the power of generalization. The prediction efficiency found in this study is higher than that observed by Carvalho et al. (2005), r² = 0.90 for lycopene prediction using the colorimetric data of tomatoes. Seroczyńska et al. (2006) and Doka et al. (2013), in turn, found r² of 0.92 and 0.96, respectively, when predicting the β-carotene content using the colorimetric data of pumpkin.
The good results of this work can be explained by the good fit of neural networks to non-linear systems (Gianola et al. 2011). Also, this technique allows many explanatory variables to be considered simultaneously, which can become impractical for multiple linear regression. Fernandez-Ruiz et al. (2010) also found high r² estimates (0.99) when predicting lycopene content in tomatoes by ANNs using colorimetric data.
The actual and predicted vitamin A levels were compared by the non-parametric bootstrap paired test for the training and validation samples, with estimated p-values of 0.44 and 0.48, respectively. This means that, at the 5% significance level, there is not enough evidence to reject the null hypothesis. In this case, the null hypothesis is that the mean difference between the actual and predicted values of each observation is zero. This reinforces the conclusion on the prediction efficiency found in this study. Thus, the content of vitamin A can be easily estimated using only color data. This strategy reduces evaluation time, labor and financial costs (Fernandez-Ruiz et al. 2010).
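A paired bootstrap test of this null hypothesis can be sketched as follows. The exact resampling scheme used in the study is not described, so this is one common formulation and should be read as an assumption: the paired differences are centered so that they satisfy the null hypothesis, resampled with replacement, and the p-value is the fraction of resampled means at least as extreme as the observed mean difference. The function name and data are hypothetical.

```python
import random

def paired_bootstrap_p_value(observed, predicted, n_resamples=10000, seed=0):
    """Two-sided bootstrap p-value for H0: mean(observed - predicted) = 0."""
    rng = random.Random(seed)
    diffs = [o - p for o, p in zip(observed, predicted)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Center the differences so that the resampled population satisfies H0.
    centered = [d - mean_diff for d in diffs]
    extreme = 0
    for _ in range(n_resamples):
        resample = [rng.choice(centered) for _ in range(n)]
        if abs(sum(resample) / n) >= abs(mean_diff):
            extreme += 1
    return extreme / n_resamples

# Hypothetical vitamin A levels (ug per 100 g), for illustration only.
observed = [12.1, 30.4, 5.2, 88.0, 41.7, 19.3]
predicted = [11.8, 31.0, 5.9, 86.5, 42.1, 19.0]
p_value = paired_bootstrap_p_value(observed, predicted)
print(p_value)
```

A large p-value, as reported for both the training and validation samples, means the resampled means are often as far from zero as the observed mean difference, so the predictions show no detectable systematic bias.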

CONCLUSION
The colorimetric parameters a* and °hue were the most important in predicting the level of vitamin A in ripe banana pulp. Large-scale phenotyping of vitamin A in banana pulp using colorimetric data and artificial neural networks is feasible and reduces evaluation time, labor and financial costs.

Figure 1. Mean square error (a) and coefficient of determination (b) for different numbers of neurons in the hidden layer.

Figure 2. Relative contribution, obtained by the method of Garson (1991), of the colorimetric parameters in the input layer to predict the vitamin A level using artificial neural networks. The deviations refer to the 95% confidence intervals obtained by bootstrap BCa with 10,000 simulations.

Figure 3.

Figure 4. Regression of the predicted and observed vitamin A (µg per 100 g of pulp) in the training (a) and validation (b) samples using artificial neural networks.

ACKNOWLEDGMENTS
Thanks are due to Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for the grants and financial support.

Table 1. Descriptive analysis and Pearson correlation between colorimetric parameters and vitamin A level in ripe banana pulp.