Path analysis under multicollinearity for papaya production of the Solo and Formosa groups

Correlations and path analysis help to clarify the interrelations between traits of interest to plant breeding. However, for their results to be reliable, the undesirable effect of multicollinearity must be removed. The objective of this study was to estimate the correlations and partition them into direct and indirect effects, by path analysis, on fruit production per plant (FP) of papaya from the heterotic Solo and Formosa groups, using two different strategies to circumvent multicollinearity. Eleven agronomic variables were evaluated in twelve papaya genotypes from the Solo group and nine from the Formosa group. Path analysis was performed with FP as the basic variable, and multicollinearity was handled by discarding variables and by ridge path analysis. For the Solo group, fruit length and pulp thickness had greater direct effects on FP. In the Formosa group, the number of commercial fruits had direct and indirect effects on FP. Both methodologies yielded high coefficients of determination, with better values for ridge path analysis. The results indicated that the interrelations among the characters studied differed between the Solo and Formosa groups. Thus, indirect selection strategies should be specific to each heterotic group.


Introduction
Because few papaya (Carica papaya L.) cultivars are available to producers, the crop is more susceptible to damage caused by pests, diseases and climatic adversities. Therefore, it is necessary to develop new papaya cultivars to ensure the sustainability of the crop (OLIVEIRA et al., 2010; SILVA et al., 2016). The search for new genotypes is facilitated when the relationships between the characters used in selection are known, especially when the variable of interest is polygenic, is expressed late or has medium or low heritability, such as fruit production.
Correlations indicate the direction and degree of association between a pair of characters and whether the interrelationship is due to genetic factors (pleiotropy or gene linkage) or environmental factors, increasing the possibility of genetic gains. However, correlations do not evaluate cause and effect relationships, that is, they do not indicate the direct and indirect effects of other variables on the pair of characters under study (TEIXEIRA et al., 2012).
To circumvent this limitation, Wright (1923) proposed path analysis, which consists of quantifying the direct and indirect effects of explanatory characteristics on a basic variable. Estimates for this analysis are obtained from regression equations in which the variables are previously standardized (WRIGHT, 1923). In a review on path analysis, Olivoto et al. (2016) showed that it is used in different areas of knowledge, such as plant and animal breeding and the environmental and social sciences.
For the results of path analysis to be reliable, the assumptions of the model must be met, among them the absence of multicollinearity. Multicollinearity occurs when the explanatory variables are highly correlated with one another, which becomes more likely when the basic variable is linked to a large number of explanatory variables (TOEBE; CARGNELUTTI FILHO, 2013a, b; CRUZ et al., 2014). If path analysis is performed under multicollinearity, estimates of direct and indirect effects may be biased, leading to erroneous conclusions. To eliminate it, removing the highly correlated variables from the analysis has been recommended (TOEBE; CARGNELUTTI FILHO, 2013a, b). However, when discarding characters is not in the breeder's interest, ridge path analysis can be performed instead, in which a constant k is added to the diagonal elements of the correlation matrix (CRUZ et al., 2014).
For papaya, studies that seek to understand the correlations and their direct and indirect effects through path analysis for the production, number and mass of fruits are available in the literature (OLIVEIRA et al., 2010; SILVA et al., 2016). However, these analyses do not consider the different heterotic groups of papaya, Solo and Formosa, which differ in phenotypic patterns such as fruit size and mass and plant architecture. This may cause misunderstandings in the interpretation of the results, since the characteristics considered ideal are not necessarily the same for the two groups. In addition, these studies did not consider the different alternatives to circumvent multicollinearity.
The objective of this research was to study the phenotypic, genotypic and environmental correlations and their partitioning into direct and indirect effects, by path analysis, on the fruit production of papaya of the Solo and Formosa groups, and to evaluate different methods to overcome the effect of multicollinearity.

Material and methods
The experiment was conducted in the municipality of Pinheiros, ES (18°30'59" S, 40°17'38" W), in a private area with commercial papaya planting. The climate is hot and humid, of the monsoon type (Am), with an average annual air temperature of 23.6 °C and average annual rainfall of 1,308 mm (ALVARES et al., 2013).
The seedlings were produced in commercial substrate and transplanted to a previously prepared area. At transplanting, three seedlings per hole were used and, after the emergence of the flower buds, sexing was carried out, keeping one hermaphrodite plant per hole. The spacing was 3.50 x 1.80 m, and irrigation was by center pivot. The planting area was monitored for the elimination of plants with virus symptoms. Other cultural practices were carried out according to the recommendations for the crop (MARTINS; COSTA, 2003).
The following characteristics were evaluated: fruit yield per plant (FP), in kg plant⁻¹; number of commercial fruits (NCF); number of deformed fruits (NDF); number of nodes without fruit (NWF); fruit mass (FM), in g fruit⁻¹; fruit length (FL), in cm; fruit diameter (FD), in cm; pulp thickness (PT), in cm; soluble solids content (SS), in °Brix; plant height (PH), in m, measured from the soil level to the insertion of the youngest leaf; and stem diameter (SD), in cm, measured 20 cm above the soil. For NCF, NDF, NWF, PH and SD, three plants per plot were sampled. For the other variables, a sample of five fruits per plot was used, harvested at maturation stage 2 (25% of the fruit surface yellow).
Rev. Bras. Frutic., Jaboticabal, 2018, v. 40, n. 3: (e-110)
For each heterotic group, the original data were submitted to analysis of variance. Genotypic (r_g), phenotypic (r_p) and environmental (r_e) correlations were estimated for all combinations of characters. The phenotypic correlations were partitioned into direct and indirect effects through path analysis (WRIGHT, 1923), considering FP as the basic variable and the other agronomic characters as explanatory variables.
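As an illustration of this partition, the following Python sketch computes direct effects by solving the normal equations on a correlation matrix and obtains the effect of variable i via variable j as r_ij times the direct effect of j. The function name `path_analysis` and the three-variable correlation values are made up for the example; they are not data or code from this study, whose analyses were run in the Genes program.

```python
import numpy as np

def path_analysis(r_xx, r_xy):
    """Split correlations with the basic variable into direct and indirect effects.

    r_xx: (p, p) correlation matrix among the explanatory variables.
    r_xy: (p,) correlations of each explanatory variable with the basic variable.
    """
    r_xx = np.asarray(r_xx, dtype=float)
    r_xy = np.asarray(r_xy, dtype=float)
    # Direct effects (path coefficients) solve the normal equations r_xx @ p = r_xy.
    direct = np.linalg.solve(r_xx, r_xy)
    # effects[i, j] = r_ij * p_j: effect of variable i on the basic variable via j.
    # The diagonal (r_ii = 1) holds the direct effects; each row sums to r_iy.
    effects = r_xx * direct
    # Coefficient of determination and residual effect of the path model.
    r2 = float(direct @ r_xy)
    residual = float(np.sqrt(max(0.0, 1.0 - r2)))
    return direct, effects, r2, residual

# Made-up correlations for three explanatory variables (assumption, not study data).
r_xx = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.3],
                 [0.2, 0.3, 1.0]])
r_xy = np.array([0.6, 0.7, 0.4])

direct, effects, r2, residual = path_analysis(r_xx, r_xy)
```

Each row of `effects` adds up to the total correlation of that variable with the basic variable. The residual effect is taken here as the square root of 1 - R², an assumption under which an R² of 0.94 corresponds to a residual effect of about 0.25.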
The multicollinearity diagnosis was based on the condition number (CN), which is the ratio between the largest and the smallest eigenvalue of the correlation matrix X'X (MONTGOMERY et al., 2012). According to Toebe and Cargnelutti Filho (2013a), there is multicollinearity when the CN is greater than or equal to 100. In this experiment, after multicollinearity was verified, two strategies were adopted to circumvent its effects: discarding the variables that contributed to it, that is, those most highly correlated with each other; and ridge path analysis, in which a constant (k) is added to the diagonal of the X'X matrix (CRUZ et al., 2014). The data were analyzed with the Genes software (CRUZ, 2013).
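A minimal Python sketch of the condition number diagnosis and of the ridge adjustment described above. The function names and the nearly collinear three-variable matrix are hypothetical, and k = 0.05 is used only as an example value, not as the value adopted in this study.

```python
import numpy as np

def condition_number(r_xx):
    """Ratio of the largest to the smallest eigenvalue of a correlation matrix."""
    eigvals = np.linalg.eigvalsh(np.asarray(r_xx, dtype=float))  # ascending order
    return float(eigvals[-1] / eigvals[0])

def ridge_direct_effects(r_xx, r_xy, k):
    """Path coefficients after adding the constant k to the diagonal of r_xx."""
    r_xx = np.asarray(r_xx, dtype=float)
    r_xy = np.asarray(r_xy, dtype=float)
    return np.linalg.solve(r_xx + k * np.eye(r_xx.shape[0]), r_xy)

# Made-up matrix in which the first two variables are nearly collinear
# (assumption for illustration, not data from this study).
r_xx = np.array([[1.000, 0.995, 0.300],
                 [0.995, 1.000, 0.320],
                 [0.300, 0.320, 1.000]])
r_xy = np.array([0.70, 0.72, 0.40])

cn_before = condition_number(r_xx)                    # far above the threshold of 100
cn_after = condition_number(r_xx + 0.05 * np.eye(3))  # smaller after the ridge constant
direct_plain = ridge_direct_effects(r_xx, r_xy, k=0.0)
direct_ridge = ridge_direct_effects(r_xx, r_xy, k=0.05)
```

Adding k raises every eigenvalue by the same amount, which lowers the ratio between the largest and the smallest and shrinks the path coefficients toward zero.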

Results and Discussion
There were significant differences (p < 0.05) for all traits studied, except for SD in the Solo group and FL in the Formosa group, indicating that there is variability among genetic materials within the same heterotic group. The cultivars of the Solo group had, on average, FM of 591.7 g and SS of 13.7 °Brix. Within the heterotic Formosa group, FM was 1,298.8 g and SS was 13.0 °Brix (Table 1). These values agree with Costa et al. (2013) and Luz et al. (2015), who reported FM between 450 and 900 g for the Solo group and from 1,100 to 2,700 g for the Formosa group. For the soluble solids content, the values obtained were higher than 12.0 °Brix, a value considered optimal for papaya regardless of the heterotic group (LUZ et al., 2015).
For FP, the cultivars of the Solo group ranged from 7.3 to 32.0 kg plant⁻¹. In the Formosa group, FP was between 9.5 and 43.4 kg plant⁻¹. These values indicate the wide genetic variability available for the character, justifying the need to identify the direct and indirect effects of different variables on FP that can be used in the selection of more productive genotypes.
The coefficients of experimental variation (CV) ranged from 4.2 to 34.8% for the Solo group and from 4.9 to 39.7% for the Formosa group. For the two heterotic groups, the lowest CV was obtained for SS and the highest for NDF (Table 1). The CV for FP was also high, 28.1% for Solo and 36.1% for Formosa. This result shows that this character, being the result of the expression of multiple variables, is complex and influenced by environmental variation. However, these high CVs did not prevent the identification of significant differences for the variables studied.
For the Solo group, FP had significant phenotypic correlations with NDF, FM, FL, FD and PT, reinforcing its complexity. With NDF this correlation was negative (-0.58), which was expected because deformed fruits are not counted in the evaluation of production per plant. The other significant correlations with FP were positive and ranged from 0.66 (FD) to 0.81 (PT). The characteristics FM, FL, FD and PT also had phenotypic correlations with each other higher than 0.76 (Table 2). These results agree with Oliveira et al. (2012), indicating the strong interrelation between these variables; because the correlations are of high magnitude and the same sign, selection practiced for one of them will favor the others.
For the heterotic Formosa group, significant phenotypic correlations (p < 0.05) were identified between FM and NCF, NDF, NWF and FD, with values from 0.70 (FM x NDF) to 0.95 (FM x FD). The variable with the next greatest number of significant correlations was NCF, which correlated with NWF, FM, FD and FP (Table 3). Unlike the Solo group, in the Formosa group FP had a significant correlation only with NCF (0.83). This difference between the correlations in the two heterotic groups was possibly due to differences in the fruit pattern of each group. In the Solo group, the fruits are smaller and more numerous; thus, small variations in FD, FL and PT can alter FM and FP. In the Formosa group, which has a pattern of larger and fewer fruits, NCF seems to have a greater influence on FP than the morphological characteristics of the fruit.
In the Solo group, the genotypic correlations exceeded the phenotypic correlations in 83.64% of the relationships studied. For the Formosa group, this value was 74.55%. This indicates the predominance of genetic over environmental effects in the expression of the evaluated characters and that sampling errors were less pronounced (CRUZ et al., 2014).
The multicollinearity diagnosis considering all variables gave CN = 96,978.88 for the Solo group and CN = 311,784.92 for the Formosa group, indicating severe collinearity. According to Toebe and Cargnelutti Filho (2013a, b), a high degree of multicollinearity can result in path analysis with biased direct and indirect effects, with values higher than 1 or lower than -1 and, therefore, without biological meaning. To eliminate multicollinearity from the analysis, the first strategy used was the exclusion of variables that were highly correlated with each other. However, although this is the statistical criterion, the objectives of the breeding program must be considered and, where prudent, variables of interest should not be discarded despite their correlation (CRUZ et al., 2014). In this study, FM was considered the most interesting variable; thus, it was kept in the analysis even though it was highly correlated with other variables.
For the Solo group, CN < 100, indicating weak multicollinearity, was obtained after the exclusion of FD, FL, PT and NCF, in that order (CN = 85.53). In the Formosa group, FD, PT, NWF and NCF were excluded to obtain CN = 62.31. Thus, after the exclusion of variables, the new matrix of phenotypic correlations had weak multicollinearity, making it possible to carry out the path analysis without its harmful effects.
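The discard strategy can be sketched as a loop that recomputes CN after each exclusion. The rule below (drop one member of the most correlated pair, never a protected variable) is a hypothetical simplification written for illustration: the text states only that highly correlated variables were excluded and that FM was retained, so the function name, the tie-breaking rule and the example matrix are assumptions.

```python
import numpy as np

def condition_number(mat):
    """Largest over smallest eigenvalue; infinite if the matrix is singular."""
    eig = np.linalg.eigvalsh(mat)
    return np.inf if eig[0] <= 0 else eig[-1] / eig[0]

def discard_until_weak(r_xx, names, protected=(), threshold=100.0):
    """Hypothetical greedy rule: drop variables until CN falls below threshold."""
    r = np.asarray(r_xx, dtype=float)
    keep = list(range(len(names)))
    dropped = []
    while len(keep) > 1 and condition_number(r[np.ix_(keep, keep)]) >= threshold:
        sub = np.abs(r[np.ix_(keep, keep)])
        np.fill_diagonal(sub, 0.0)
        # Candidate pairs, strongest absolute correlation first.
        pairs = sorted(
            ((sub[i, j], i, j)
             for i in range(len(keep)) for j in range(i + 1, len(keep))),
            reverse=True,
        )
        for _, i, j in pairs:
            candidates = [m for m in (i, j) if names[keep[m]] not in protected]
            if candidates:
                # Remove the droppable member most correlated with the rest.
                worst = max(candidates, key=lambda m: sub[m].sum())
                dropped.append(names[keep.pop(worst)])
                break
        else:
            break  # every remaining variable is protected
    return [names[i] for i in keep], dropped

# Made-up correlations (not study data); FM is protected, as in the text.
names = ["FM", "FD", "SS"]
r_xx = np.array([[1.000, 0.995, 0.300],
                 [0.995, 1.000, 0.320],
                 [0.300, 0.320, 1.000]])
kept, dropped = discard_until_weak(r_xx, names, protected=("FM",))
```

In this toy case FD is removed because it is almost collinear with the protected FM, and the two remaining variables give a CN well below 100.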
In the Solo group, considering FP as the main variable and NDF, NWF, FM, SS, PH and SD as explanatory variables, we observed that the greatest direct effect was that of FM (0.52), followed by SD (0.38). The NDF and NWF variables had negative direct effects (-0.49 and -0.35, respectively) (Table 4). In plant breeding, variables with greater direct and positive effects are more favorable to selection. Thus, the correlated response through indirect selection will be efficient, facilitating the breeder's work (TEODORO et al., 2016). In this sense, for the Solo group, FM, which had the greatest direct and total effects on the main variable, should be prioritized in selection, since these effects indicate a cause and effect relationship. In addition, the indirect effect of FM via NDF was 0.26, increasing the chance of success in the indirect selection of papaya genetic materials from the Solo group with higher FP.
For the heterotic Formosa group, the explanatory variables used in the path analysis were NDF, FM, FL, SS, PH and SD. The highest positive direct effect on FP was obtained for PH (0.36), which, added to the indirect effects, gave a total effect of 0.56 (Table 5). Therefore, according to this methodology, PH is the characteristic that should be prioritized to obtain greater fruit production in papaya of the Formosa group, because of its direct and indirect effects and because it allows early selection.
Negative direct effects of NDF (-0.56), FM (-0.60) and SS (-0.77) on FP were observed (Table 5). According to Cruz et al. (2014), a cause and effect relationship with the basic variable is indicated by a high direct effect and indirect effects in a favorable direction. Thus, characteristics with a large direct effect but an indirect effect in the opposite direction indicate that the auxiliary character is not the main determinant of changes in the basic variable, and others may provide greater impact in terms of selection gain. In this sense, neither NDF nor FM is a good determinant of FP variation for papaya in the Formosa group because, despite their large direct effects, their indirect effects are in the opposite direction, indicating the absence of a cause and effect relationship. On the other hand, these results indicate that, in order to obtain higher FP, it would be necessary to decrease the soluble solids content of the fruits. The negative correlation between SS and FP observed in this study and by Oliveira et al. (2010) indicates an inverse association between these variables. This correlation may be related to a dilution effect of soluble solids in the fruit, that is, the larger the fruit, the greater the amount of carbohydrates needed to raise the °Brix.
In the path analysis in which the elimination of variables was used to circumvent multicollinearity, the coefficient of determination (R²) was 94% for the Solo group (Table 4) and 89% for the Formosa group (Table 5). The residual effect, in turn, was 0.25 for the Solo group and 0.33 for the Formosa group. This indicates that, although FP is a complex characteristic, the variables kept in the analyses explained its variation. However, despite the low residual effect, some indirect effect may have acted on FP (TEODORO et al., 2016). These results agree with those obtained by Oliveira et al. (2010) in a path analysis for the number of commercial papaya fruits, whose R² was 0.87 and residual effect was 0.25, indicating that the model was well adjusted in explaining the genetic effects related to the variable under analysis.
Despite the consistent results, the exclusion of characteristics should be evaluated carefully, since the removal of variables with high explanatory power can reduce R² and increase the residual (OLIVOTO et al., 2016). On the other hand, Toebe and Cargnelutti Filho (2013b) observed that the elimination of correlated variables, besides providing better R², can save labor, since fewer variables need to be measured to explain the main characteristic. The results of this research indicate that it is possible to exclude four variables, one of them requiring destructive measurement (PT) and others difficult to measure (NCF and NWF), reducing labor costs and the loss of commercial fruits. However, the variables eliminated differed between the two heterotic groups of papaya, evaluated under the same environmental conditions, probably because of differences in plant and fruit architecture between the groups. Therefore, the direct and indirect effects on the variable of interest must be evaluated for the expected phenotypic pattern.
In some instances, the breeder cannot exclude the characteristics that generate multicollinearity because of, for example, a small number of explanatory variables or the importance of knowing their effects (OLIVOTO et al., 2016). In these cases, a second option is to perform the analysis with all explanatory characteristics, but adding a constant (k) to the diagonal elements of the X'X matrix, which is known as ridge path analysis. The value of k should be the lowest value able to stabilize most of the path coefficient estimators (TOEBE; CARGNELUTTI FILHO, 2013b; CRUZ et al., 2014). Values of k = 0.05 were previously employed by Amorim et al. (2008) and Moreira et al. (2013), and k = 0.10 by Toebe and Cargnelutti Filho (2013a), allowing path analysis with all available variables.
The coefficients of the ridge path analysis obtained for the Solo group indicated that the greatest direct effects on FP were those of NCF (0.39), PT (0.36) and FD (0.29), and FD also had a large indirect effect via PT. Other characteristics related to FP were FM and FL; however, despite their large total effects on FP, these effects are not direct but indirect, via FD and PT (Table 6). Both PT and FD had previously been excluded from the path analysis to circumvent the effects of multicollinearity (Table 4). However, the ridge path analysis shows that these variables, besides their direct effects, exert indirect effects through each other, so they can be used for both direct and indirect selection of papaya materials of the Solo group with higher FP. The two analyses, however, gave coefficients of determination above 94%, allowing the breeder to decide which of these variables to use in selection. This choice can be based on ease of measurement, earliness or higher heritability of the characteristic.
In the heterotic Formosa group, the greatest positive direct effects were obtained for NCF (0.77) and FL (0.18), and the greatest negative ones for NDF (-0.20) and SS (-0.17). Moreover, NCF, besides its large direct and total (0.83) effects, provided indirect effects for the other analyzed variables, except FL and SD (Table 7). Therefore, this characteristic should be prioritized when evaluating papaya production in the Formosa group. However, NCF can be measured only when the fruits are at harvest point, which delays the selection process. In a path analysis for NCF in papaya, Oliveira et al. (2010) observed that leaf width, plant height and number of flowers per peduncle are determinant characteristics of variation in NCF, are easy to measure and are expressed before production, making early selection possible.
The coefficient of determination of the ridge path analysis was 0.93 for papaya from the Formosa group (Table 7) and 0.97 for the Solo group (Table 6). This result agrees with that of Moreira et al. (2013), who obtained R² of 89.59% with ridge path analysis and 82.41% with the discard of variables. On the other hand, Toebe and Cargnelutti Filho (2013a, b), using k = 0.10, indicated that the elimination of variables is more adequate than ridge path analysis. For both heterotic groups, ridge path analysis explained FP variation better than the exclusion of variables, allowing the analysis to include all the evaluated characters and giving a better understanding of the relationships between them.
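Since the ridge results depend on k, one way to inspect that choice is to recompute the direct effects over a grid of k values and look at where they stop changing appreciably. The Python sketch below does this; the function names, the grid, the tolerance `tol` and the correlation values are all hypothetical, and the stability rule is only one possible reading of "the lowest value able to stabilize most of the path coefficient estimators".

```python
import numpy as np

def ridge_trace(r_xx, r_xy, ks):
    """Direct effects for each k in ks; one row per k, one column per variable."""
    r_xx = np.asarray(r_xx, dtype=float)
    r_xy = np.asarray(r_xy, dtype=float)
    eye = np.eye(r_xx.shape[0])
    return np.array([np.linalg.solve(r_xx + k * eye, r_xy) for k in ks])

def smallest_stable_k(ks, trace, tol=0.05):
    """Smallest grid value of k whose step to the next k moves no coefficient
    by more than tol (hypothetical criterion, for illustration only)."""
    steps = np.abs(np.diff(trace, axis=0)).max(axis=1)
    stable = np.nonzero(steps <= tol)[0]
    return float(ks[stable[0]]) if stable.size else None

# Made-up correlations with two nearly collinear variables (not study data).
r_xx = np.array([[1.000, 0.995, 0.300],
                 [0.995, 1.000, 0.320],
                 [0.300, 0.320, 1.000]])
r_xy = np.array([0.70, 0.72, 0.40])

ks = np.linspace(0.0, 0.2, 21)  # k = 0.00, 0.01, ..., 0.20
trace = ridge_trace(r_xx, r_xy, ks)
k_star = smallest_stable_k(ks, trace)
```

Plotting the columns of `trace` against `ks` gives the usual ridge trace, in which unstable coefficients change quickly for small k and then level off.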

Table 1. Average and coefficient of variation (CV, in %) of eleven agronomic characters in papaya from the heterotic Solo and Formosa groups.

Table 2. Estimates of phenotypic (r_p), genotypic (r_g) and environmental (r_e) correlation coefficients among eleven agronomic characters in papaya from the heterotic Solo group.

Table 3. Estimates of phenotypic (r_p), genotypic (r_g) and environmental (r_e) correlation coefficients among eleven agronomic characters in papaya from the heterotic Formosa group.
¹NCF: number of commercial fruits; NDF: number of deformed fruits; NWF: number of nodes without fruit; FM: fruit mass; FL: fruit length; FD: fruit diameter; PT: pulp thickness; SS: soluble solids content; PH: plant height; SD: stem diameter; FP: production of fruits per plant. *Significant at 5% probability by the t test.

Table 4. Estimates of direct (diagonal in bold) and indirect (off diagonal) effects of six agronomic characters on fruit yield per plant in papaya from the heterotic Solo group obtained by path analysis.
¹NDF: number of deformed fruits; NWF: number of nodes without fruit; FM: fruit mass; SS: soluble solids content; PH: plant height; SD: stem diameter.

Table 5. Estimates of direct (diagonal in bold) and indirect (off diagonal) effects of six agronomic characters on fruit yield per plant in papaya from the heterotic Formosa group obtained by path analysis.

Table 6. Estimates of direct (diagonal in bold) and indirect (off diagonal) effects of ten agronomic characters on fruit yield per plant (FP) in papaya from the heterotic Solo group obtained by ridge path analysis.
¹NCF: number of commercial fruits; NDF: number of deformed fruits; NWF: number of nodes without fruit; FM: fruit mass; FL: fruit length; FD: fruit diameter; PT: pulp thickness; SS: soluble solids content; PH: plant height; SD: stem diameter.

Table 7. Estimates of direct (diagonal in bold) and indirect (off diagonal) effects of ten agronomic characters on fruit yield per plant in papaya from the heterotic Formosa group obtained by ridge path analysis.