Yield prediction of experimental plots based on the harvest of specific fruit clusters for selection of fresh market tomato hybrids

The objective of this study was to propose models based on the harvest of specific fruit clusters to estimate the plot yield of trials containing fresh market tomato hybrids. Three experiments were conducted at the Experimental Station of Syngenta Crop Protection in the municipality of Holambra, state of São Paulo, Brazil, in 2016 and 2017. The experimental design was randomized complete blocks with 12 genotypes (experiments 1 and 3) and 13 genotypes (experiment 2), with 4 replicates and 10 plants per plot. Multiple linear regression models were fitted (stepwise method) with experiment 1 data (cross-validation); the best models were selected (highest adjusted R²) and then tested with data from experiments 2 and 3 using the mean absolute percentage error (MAPE), for the traits mean weight of marketable fruit per plant of classes A (MWA), AA (MWAA), AAA (MWAAA), and AA plus AAA (MW23A), and the ratio between the marketable weight and the number of marketable fruits (MFW). The models with four fruit clusters (2, 3, 6 and 9) showed the best balance between prediction capacity and the number of fruit clusters to harvest. The traits MWAA, MW23A and MFW generated reliable predictions, with MAPE of approximately 5%. Multiple linear regression can be used to estimate plot yield, which ultimately helps reduce the cost of conducting trials of fresh market tomato hybrids.


Research
Horticultura Brasileira 39 (1) January - March, 2021
The tomato is the most popular vegetable because of its ease of use and versatility. It is also an economically important crop due to the social value of tomato production, which directly and indirectly creates jobs (Souza et al., 2010; Alvarenga, 2013; Stajčić et al., 2015).
Approximately 30 national and international tomato seed companies conduct research and breeding programs in search of cultivars that are better adapted to local conditions and have higher productivity, resistance to diseases, and quality fruits with a good color, flavor, and long shelf life (Mohamed et al., 2012; Treichel, 2016; Rothan et al., 2019). Of the several tomato types in the market, Redondo or Salada Longa Vida are the most marketed (Treichel, 2016).
Keywords: Solanum lycopersicum, tomato breeding, linear regression.
[...] will decrease travel and labor costs and make the process more efficient. In tomato, earliness differs greatly among genotypes. For this reason, this study did not evaluate yield per week: genotypes with a shorter cycle would have more yield collected, that is, more harvested clusters, than those with a longer cycle. Therefore, this research hypothesizes that a model based on the harvest data of specific fruit clusters predicts the tomato plot yield without compromising the experiment's capability of differentiating genotypes.
The yield of an experimental plot is given by the sum of the yields of its individual fruit clusters, usually 12. Because of this linear relationship between fruit cluster yield and plot yield, and because multiple fruit clusters influence the plot yield, a multiple linear regression model was chosen to test the proposed hypothesis.
However, in contrast to the hypothesis proposed in the present study, most studies relate yield prediction not only to plant yield factors but also to environmental factors (Higashide, 2009; Qaddoum et al., 2013; Hussain & Hatibaruah, 2015).
The yield of fresh tomatoes is split into different classes according to quality. Therefore, the main challenge of this study is to select a single group of fruit clusters to serve as the predictors of several models, one for each fruit class, such that the overall predictive capability is maximized.
Thus, this study proposes multiple linear regression models based on harvest data of specific fruit clusters to predict the plot yield of experiments with fresh market tomato hybrids.

MATERIAL AND METHODS
Three experiments were conducted at the Experimental Station of Syngenta Crop Protection in the municipality of Holambra, state of São Paulo, Brazil (22°38'58"S; 47°05'34"W; 600 m altitude).

Experimental conduction
Experiment 1 was conducted from February 16th to August 17th, 2016; experiment 2, from January 23rd to June 27th, 2017; and experiment 3, from March 10th to August 21st, 2017, all under field conditions.
The seedlings were produced in greenhouses at the Holambra experimental station in 200-cell polystyrene trays. Coconut fiber (type 11, Amafibra) was used as substrate, with the following characteristics: pH 5.8, electrical conductivity = 1.1 dS/m, density = 89 kg/m³ and moisture retention capacity = 308 mL/L. The irrigation was performed by micro-sprinkler systems, and fertilization with 0.5 g/L calcium nitrate, 0.4 g/L potassium nitrate, 0.6 g/L monopotassium phosphate (MKP), and 0.2 g/L magnesium sulfate was applied three times per week. Preventive plant health management was performed with the application of fungicides, insecticides, and acaricides registered for tomatoes. Weed control between tomato beds was performed by mechanical weeding as needed.
Seedlings were transplanted into the experimental area, 30 days after sowing, when the seedlings presented three to four leaves.
One plant was planted per hole, and axillary buds were removed, leaving only two stems per plant. Plants were grown until the development of the 12th fruit cluster. Then, tip pruning was performed, which is a common practice in fresh market tomato production. Tip pruning consists of eliminating the terminal bud of the tomato stems to stop the vertical growth of the plant (Alvarenga, 2013).
Top-dressing fertilizations were performed twice a week, beginning 25 days after setting the experiment in the final location. A drip fertigation system was used. Fertilization was divided into three stages based on the crop developmental stage (Alvarenga, 2013). From weeks 3 to 5 after seedling transplantation (WAT), 1.67 kg of calcium nitrate, 1.25 kg of MKP, and 0.535 kg of potassium nitrate were applied. In the second stage, from 6 to 11 WAT, 2.1 kg of calcium nitrate, 1.73 kg of MKP, and 0.945 kg of potassium nitrate were applied. In the third stage, from 12 to 17 WAT, 1.1 kg of calcium nitrate and 1.74 kg of potassium nitrate were applied.
Fruit clusters were numbered from 1 to 12 as follows: the first and second fruit clusters in the main stem were numbered 1 and 2, respectively, the first cluster of the secondary stem was numbered 3, and so forth, in a zigzagging pattern up to the 12th cluster.
The fruit clusters were labeled and harvested separately depending on the plant development and fruit ripening stage. The data were expressed as the mean value of the cluster per plot.

Evaluated traits
Tomato fruits were classified using the system typically adopted in the tomato production chain for group Salada Longa Vida (PBMH, 2003), which classifies fruits into three categories according to their transverse diameters: AAA (>85 mm), AA (65 to 85 mm) and A (<65 mm).
Fruit classification was performed using a wooden box with two sieves to separate the fruits. The following traits were evaluated during harvest: mean weight of marketable fruit A (MWA), mean weight of marketable fruit AA (MWAA), mean weight of marketable fruit AAA (MWAAA), mean weight of marketable fruit AA and AAA (MW23A), all in kg/plant; and, the ratio between the market weight and the number of marketable fruits (MFW), expressed in grams/fruit.

Experimental design
A randomized complete block design with four replicates in linear plots with 10 plants each was used. Spacing was 1.5 m between rows and 0.5 m between plants.

Yield prediction
The yield prediction method for plots of tomato hybrid experiments (Figure 1) consisted of fitting multiple linear regression models to the results from experiment 1 (cross-validation dataset with 5 folds) and selecting the group of fruit clusters that make up the best model for each evaluated trait. For each of these groups, we created five models (one for each trait), and their overall performance was evaluated with the data from experiments 2 and 3 (test dataset). Finally, we identified the group of fruit clusters with the best predictive capacity.
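As an illustration of the 5-fold cross-validation step, the sketch below fits an ordinary least squares model on four folds and scores it on the held-out fold, reporting the mean training and validation R². It is only a minimal sketch: the data are synthetic, and the function names (`r_squared`, `kfold_r2`), the random fold assignment, and the use of plain R² are assumptions, not details taken from the study or from JMP.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R2 = 1 - SS_res / SS_tot."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

def kfold_r2(X, y, k=5, seed=0):
    """Mean training and validation R2 of an OLS fit over k random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    train_scores, valid_scores = [], []
    for i, valid in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Design matrices with an intercept column.
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_valid = np.column_stack([np.ones(len(valid)), X[valid]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        train_scores.append(r_squared(y[train], A_train @ beta))
        valid_scores.append(r_squared(y[valid], A_valid @ beta))
    return float(np.mean(train_scores)), float(np.mean(valid_scores))

# Synthetic stand-in for experiment 1: 48 plots (12 genotypes x 4 replicates),
# 12 fruit clusters per plot. All numbers are invented, not the study's data.
rng = np.random.default_rng(1)
genotype_effect = np.repeat(rng.normal(0.0, 0.08, size=12), 4)[:, None]
clusters = 0.4 + genotype_effect + rng.normal(0.0, 0.05, size=(48, 12))
plot_yield = clusters.sum(axis=1)  # plot yield = sum of cluster yields

subset = [1, 2, 5, 8]  # zero-based columns for C2, C3, C6, C9
train_r2, valid_r2 = kfold_r2(clusters[:, subset], plot_yield)
print(f"training R2 = {train_r2:.3f}, validation R2 = {valid_r2:.3f}")
```

A small gap between the two scores is what the text later interprets as absence of overfitting.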
The models were fitted for each response variable studied (MWA, MWAA, MWAAA, MW23A, and MFW) through the stepwise method with forward selection of terms, using the maximum adjusted coefficient of determination (R² adj) as the criterion for entering independent variables into the model (Equation 1).
In cultivars with indeterminate growth in which the plants have two stems, the first bud immediately below the first cluster should be selected as the second stem. After bud removal, new inflorescences are emitted every two or three leaves depending on the cultivar and the environmental conditions (Alvarenga, 2013). In the field, the first cluster of the secondary stem begins developing slightly after the second cluster of the main stem. Consequently, fruit ripening and the harvest period for those two clusters are almost concomitant. This characteristic is very important when creating the model because data for the two clusters can be collected in a single visit, which reduces travel costs. For this reason, the models were always initiated with the selection of these two clusters (C2 and C3) (Equation 1):
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + e_i$$
where $Y_i$ is the dependent variable (yield) for the $i$-th plot, $X_{1i}, X_{2i}, \dots, X_{ki}$ are the independent variables (fruit clusters) for the $i$-th plot, $\beta_0, \beta_1, \beta_2, \dots, \beta_k$ are the regression parameters, and $e_i$ is the residual error for the $i$-th plot.
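The forward selection described above can be sketched as follows: start from the model with C2 and C3, then repeatedly add the cluster that gives the highest adjusted R², recording the value at each model size. This is a hedged illustration, not the JMP stepwise procedure used in the study; the data are synthetic, the function names (`adjusted_r2`, `forward_path`) are hypothetical, and the adjusted R² uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) for n plots and p predictors.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R2 of an OLS fit of y on the columns of X plus an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_path(X, y, names, forced=("C2", "C3")):
    """Forward selection that always starts with the forced clusters.

    Returns a list of (selected clusters, adjusted R2), one entry per model size.
    """
    col = {name: j for j, name in enumerate(names)}
    selected = list(forced)
    remaining = [c for c in names if c not in selected]
    path = [(tuple(selected), adjusted_r2(X[:, [col[c] for c in selected]], y))]
    while remaining:
        # Score every candidate cluster when added to the current model.
        scores = {
            c: adjusted_r2(X[:, [col[s] for s in selected + [c]]], y)
            for c in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        path.append((tuple(selected), scores[best]))
    return path

# Synthetic data (invented): 48 plots, 12 clusters, plot yield = sum of clusters.
rng = np.random.default_rng(1)
genotype_effect = np.repeat(rng.normal(0.0, 0.08, size=12), 4)[:, None]
clusters = 0.4 + genotype_effect + rng.normal(0.0, 0.05, size=(48, 12))
plot_yield = clusters.sum(axis=1)
names = [f"C{i}" for i in range(1, 13)]

path = forward_path(clusters, plot_yield, names)
for chosen, score in path:
    print(len(chosen), chosen[-1], round(score, 3))
```

The printed path corresponds to a chart of R² adj against the number of fruit clusters, which is the basis for choosing a model size.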
The selection of the group of fruit clusters that make up the best model for each evaluated trait was based on visual analysis (Le Clerg, 1967) of the chart of R² adj by the number of fruit clusters. Although predictive capacity increases with the number of terms in the model, this relation is not linear. In general, the gain in predictive capacity diminishes as the model size increases, up to a point where the gain is practically null, which marks the beginning of a plateau. The selected group was the one on the plateau with the fewest components.
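The study identified the plateau visually. Purely as an illustration, the same idea can be expressed as a numeric rule: pick the smallest model for which adding one more cluster improves R² adj by less than a tolerance. The tolerance of 0.01, the function name `plateau_start`, and the curve values below are all assumptions made for this sketch and are not the study's results or method.

```python
def plateau_start(r2_by_size, tol=0.01):
    """Smallest model size whose gain from one extra term is below tol."""
    sizes = sorted(r2_by_size)
    for size, next_size in zip(sizes, sizes[1:]):
        if r2_by_size[next_size] - r2_by_size[size] < tol:
            return size
    return sizes[-1]  # no plateau found: fall back to the largest model

# Hypothetical R2 adj curve: number of clusters in the model -> R2 adj.
curve = {2: 0.23, 3: 0.84, 4: 0.90, 5: 0.905, 6: 0.91, 7: 0.912}
best_size = plateau_start(curve)
print(best_size)  # 4 for this illustrative curve
```

A visual inspection remains more flexible than a fixed tolerance, since the shape of the curve differs among traits.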
For each group of selected fruit clusters, five models were fitted, one for each evaluated trait. A multicollinearity check for each one of these models was performed by the variance inflation factor (VIF), considering values lower than 5.0 as acceptable (Marquardt, 1980). The minimum value that VIF can assume (absence of collinearity) is 1.0.
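For reference, the VIF of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors, which is why its minimum is 1.0. The sketch below computes it with NumPy on synthetic cluster data; the function name `vif` and the data are assumptions for illustration, and the study itself obtained these values in JMP.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R2_j) = SS_tot_j / SS_res_j, where R2_j is obtained by
    regressing column j on all other columns plus an intercept.
    """
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        ss_res = float(resid @ resid)
        ss_tot = float(np.sum((target - target.mean()) ** 2))
        factors.append(ss_tot / ss_res)
    return factors

# Synthetic data (invented): 48 plots, clusters share a genotype effect,
# so they are moderately correlated with each other.
rng = np.random.default_rng(1)
genotype_effect = np.repeat(rng.normal(0.0, 0.08, size=12), 4)[:, None]
clusters = 0.4 + genotype_effect + rng.normal(0.0, 0.05, size=(48, 12))

selected = clusters[:, [1, 2, 5, 8]]  # zero-based columns for C2, C3, C6, C9
vif_values = vif(selected)
for name, value in zip(["C2", "C3", "C6", "C9"], vif_values):
    flag = "acceptable" if value < 5.0 else "check collinearity"
    print(f"{name}: VIF = {value:.2f} ({flag})")
```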
To evaluate the predictive performance of these groups of clusters, the mean absolute percentage error (MAPE) was calculated with the test dataset (Equation 2):
$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{Y_i - Y'_i}{Y_i} \right|$$
where $n$ is the number of plots, $Y'_i$ is the predicted value for the $i$-th plot, and $Y_i$ is the observed value for the $i$-th plot.
The MAPE was chosen to evaluate the prediction performance of the models because it expresses the prediction error (the difference between observed and predicted values), which is easily interpretable, in a standardized way (as a percentage). Standardization is needed because the traits are on different measurement scales.
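A minimal sketch of the MAPE calculation, using its standard definition (the mean over n plots of the absolute difference between observed and predicted values, divided by the observed value, times 100). The function name `mape` and the example yields are hypothetical and serve only to show the arithmetic.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, in percent."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    total = sum(abs((obs - pred) / obs) for obs, pred in zip(observed, predicted))
    return 100.0 * total / len(observed)

# Hypothetical plot yields (kg/plant): relative errors of 5%, 10% and 0%.
observed = [4.0, 5.0, 8.0]
predicted = [3.8, 5.5, 8.0]
error = mape(observed, predicted)
print(round(error, 2))  # about 5.0
```

Because the error is divided by the observed value, the result is comparable between traits measured in kg/plant and in grams/fruit, but it is undefined when an observed value is zero.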
The MAPE values of the groups of fruit clusters were compared by the Kruskal-Wallis test followed by the Steel-Dwass pairwise comparison test, both at 5% significance.
To illustrate the predictive performance of the best group of fruit clusters from a breeder's practical point of view, an analysis of variance followed by the Scott-Knott grouping test at the 5% significance level was performed with the observed and predicted data of the test dataset.
All analyses were performed using JMP® version 15.2 (SAS Institute Inc., Cary, NC, USA, 1989-2020) except for the Scott-Knott grouping test, which was performed using R (R Development Core Team, 2016).

RESULTS AND DISCUSSION
The adjusted coefficient of determination (R² adj) increased as more clusters were added to the model for each response variable (Figure 2). The fruit clusters C2, C3, C6, and C9 were selected for variables MWA, MWAA, and MW23A, with R² adj equal to 0.90, 0.92, and 0.92, respectively. For MWAAA, the best model (R² adj = 0.93) was the one with three parameters (C2, C3, and C6). For MFW, the best model (R² adj = 0.91) included C2, C3, C5, C6, C7, and C10.
The difference between the curves of the training and validation datasets when the plateau was reached was minimal for all variables studied except MFW (Figure 2). This small difference indicates that there is no overfitting, so the models should generalize to datasets other than the one used for training. In the case of MFW, the two curves came closer (R² adj of 0.97 and 0.91 for training and validation, respectively) from the 6th parameter onwards (best model selected), which should not cause any decrease in the performance of the model.
For MWA (smaller fruits), R² adj increased from 0.23 to 0.84 when a fruit cluster in the upper part of the plant (C9) was added to the model containing only clusters in lower positions (C2 and C3) (Figure 2). In contrast, the variable MWAAA (larger fruits) was best estimated using combinations of fruit clusters in the lower part of the plant (C2 and C3), with R² adj equal to 0.86. These findings are in accordance with the results of other studies (Streck et al., 1996; Azevedo et al., 2010), which observed that fruits from upper inflorescences were small in greenhouse crops, and also with those reporting that fruit size was closely related to several parameters, namely the position of the fruit in the cluster and the position of the cluster in the plant (Kinet & Peet, 1997; Wamser et al., 2012).
The check for multicollinearity shows that all 15 models generated with the three selected groups of fruit clusters are acceptable (VIF < 5) (Table 1). As expected, the higher the number of terms in the model, the higher the VIF values. Although fruit clusters C2 and C3 ripen concomitantly, there is no evidence of collinearity between them in any model, since their VIF values are around 2.0.
The analysis of the mean MAPE across all traits shows that the group of fruit clusters C2, C3, and C6 has an overall lower predictive capacity (16.48%) than the other two groups (C2, C3, C6, and C9: 14.16%; C2, C3, C5, C6, C7, and C10: 11.77%) (Table 2). This behavior also holds for the isolated analysis of variable MWA. The performance of the three models was similar in the case of MWAA, MWAAA, and MFW. There was no evidence to statistically differentiate the models with four and six fruit clusters for any variable or the overall mean. A fresh market tomato hybrid experiment requires, on average, seven visits during the harvest period. If the number of harvest trips can be reduced through the use of a predictive model, for example, one visit to harvest clusters 2 and 3 and two or three visits to harvest the other two fruit clusters, the total number of trial visits and their cost can be significantly reduced. Therefore, the proposed multiple linear regression models are consistent with the aims of the study and of hybrid development. Considering the balance between operational efficiency (number of fruit clusters in the model) and predictive capability (MAPE), the group of fruit clusters selected as the best for plot yield prediction in experiments with fresh market tomato hybrids was C2, C3, C6, and C9; the regression parameters for prediction at plot level for each evaluated trait are available in Table 3. In essence, only these four fruit clusters of the ten plants per plot need to be assessed, with the data fed into the regression models to predict the total yield of each plot.
Table note: C2 to C10 = identification of the position of fruit clusters on the tomato plant. The second fruit cluster in the main stem was numbered 2; the first cluster of the secondary stem was numbered 3, and so forth, in a zigzagging pattern.
The grouping (Scott-Knott test) was [...] (Figure 3). This outcome was expected because the errors associated with the value estimation were approximately 5% (Table 2). Despite the low predictive performance of the model when applied to variables MWA and MWAAA, according to CEPEA (2017), larger fruits classified as AA and AAA are more important than those classified as A for the grower to obtain a return on investment. For this reason, the models can still be considered useful for predicting the yield of plots.
For the variable MFW (MAPE equal to 3.50%), no difference in the groups was observed in experiment 2 (Figure 4). In experiment 3, the values for hybrids 214452 and 414929 were underestimated. Although MFW is not important for growers because fruits are marketed by size and not by their mean weight, it is relevant for hybrid characterization given that fruit size can be used to infer fruit weight.
Of the variables studied, reliable estimates were obtained for MWAA and MW23A, as indicated by the low MAPE (Table 2), 5.34 and 4.28%, respectively, and by their similar groupings according to the Scott-Knott test (Figure 4). Although MFW did not present the same grouping order according to the Scott-Knott test (Figure 4), this variable could also be used as a reliable deciding factor in tomato hybrid selection because its MAPE was 3.50%.
Table 3. Coefficients of the linear equations (Y = I + aC2 + bC3 + cC6 + dC9) to predict, at plot level, the variables mean weight per plant of marketable fruits AAA (MWAAA), AA (MWAA), A (MWA), AA and AAA (MW23A) and mean weight of marketable fruits (MFW). Holambra, Syngenta Proteção de Cultivos, 2016-2017.
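To show how an equation of the form Y = I + aC2 + bC3 + cC6 + dC9 would be applied to a plot, the sketch below evaluates it for one set of cluster means. The coefficient values and cluster yields are hypothetical placeholders chosen only to demonstrate the arithmetic; they are not the fitted coefficients of Table 3, and `predict_plot_yield` is an illustrative name.

```python
def predict_plot_yield(coef, c2, c3, c6, c9):
    """Evaluate Y = I + a*C2 + b*C3 + c*C6 + d*C9 for one plot."""
    return (
        coef["I"]
        + coef["a"] * c2
        + coef["b"] * c3
        + coef["c"] * c6
        + coef["d"] * c9
    )

# Hypothetical coefficients for one trait (NOT the values of Table 3).
example_coef = {"I": 0.5, "a": 2.0, "b": 3.0, "c": 2.5, "d": 3.5}

# Hypothetical mean yields (kg/plant) of clusters 2, 3, 6 and 9 in one plot.
predicted = predict_plot_yield(example_coef, c2=0.4, c3=0.4, c6=0.3, c9=0.2)
print(round(predicted, 2))  # about 3.95 with these placeholder numbers
```

In practice, one such coefficient set would be needed for each trait (MWA, MWAA, MWAAA, MW23A, and MFW), all sharing the same four harvested clusters.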