INTRODUCTION:

When several variables are measured in an experiment, the linear relations among them can be studied. This type of information is used mainly in plant breeding to identify variables that are suitable for indirect selection of superior genotypes (^{CRUZ & REGAZZI, 1997}). In addition, knowledge of the relation between productivity parameters and total fruit yield may assist in the improvement or choice of management practices.

The Pearson correlation coefficient is a dimensionless measure of the linear relation between two variables. Its value ranges from -1, when there is a perfect negative linear relation, to +1, when there is a perfect positive linear relation. The closer the value is to zero, the weaker the linear relation. Many other statistics are calculated from the Pearson correlation coefficient, such as partial correlations, direct and indirect effects between variables in path analysis, and canonical correlations (^{HAIR et al., 2005}). Thus, the precision of these statistics depends on the accuracy of the estimate of Pearson’s correlation coefficient.
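For reference, the coefficient described above can be written in its usual sample form. The symbols $x_i$, $y_i$, $\bar{x}$, $\bar{y}$, and $n$ (paired observations on $n$ plants and their means) are notation introduced here for illustration and do not appear in the original text.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2} \, \sum_{i=1}^{n} (y_i - \bar{y})^{2}}},
\qquad -1 \le r \le +1
```

The numerator and denominator carry the same units, which is why $r$ is dimensionless.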

Sample size has a large impact on the statistical significance and interpretation of a result. In large samples, coefficients of low magnitude tend to be statistically significant even when the relation between the parameters is of no practical importance. Conversely, when the sample size is small, the reliability of the estimates is low, and they may not represent the true relation between two variables (^{HAIR et al., 2005}; ^{CARGNELUTTI FILHO et al., 2010}; ^{CARGNELUTTI FILHO et al., 2011}; ^{CARGNELUTTI FILHO et al., 2012}). Hence, a sample size that provides acceptable accuracy is important.

The confidence interval of the Pearson correlation coefficient can be estimated from the sampling distribution obtained with Fisher’s transformation (^{FERREIRA, 2009}; ^{CARGNELUTTI FILHO et al., 2011}). Another way to obtain it is by bootstrap resampling with replacement. This methodology does not require knowledge of the variable’s probability distribution, so it is effective for variables with an unknown or non-normal distribution (^{CARGNELUTTI FILHO et al., 2010}; ^{CARGNELUTTI FILHO et al., 2012}). It is of great value because it can be used to determine the confidence interval amplitude for any variable and any cultivated plant. Studies on the sample size required to estimate the mean of parameters are common in the literature, but few studies determine the sample size needed to estimate the Pearson correlation coefficient (^{CARGNELUTTI FILHO et al., 2010}; ^{CARGNELUTTI FILHO et al., 2011}; ^{CARGNELUTTI FILHO et al., 2012}). In the specific case of vegetable crops, studies determining a sample size for estimation of the Pearson correlation coefficient are practically nonexistent; the available studies determine only the sample size for the mean of parameters (^{SILVA et al., 2009}; ^{SANTOS et al., 2010}; ^{HAESBAERT et al., 2011}; ^{SILVA et al., 2011}; ^{LÚCIO et al., 2012}).
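As an illustration of the first approach mentioned above, the following minimal Python sketch computes a confidence interval for a correlation coefficient through Fisher's z transformation. It assumes bivariate normality and uses the common large-sample standard error 1/sqrt(n - 3); the function name `fisher_ci` and its parameters are hypothetical and are not taken from the study.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation via Fisher's z.

    Assumes bivariate normal data; r is the sample correlation and n the
    number of paired observations (n must exceed 3).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                      # Fisher's z transformation
    se = 1.0 / math.sqrt(n - 3)            # large-sample standard error of z
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Back-transform the limits from the z scale to the r scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


if __name__ == "__main__":
    # Two coefficients of very different magnitude, same sample size.
    for r in (-0.083, 0.963):
        low, high = fisher_ci(r, 347)
        print(f"r = {r:+.3f}: [{low:.3f}, {high:.3f}]")
```

Because the limits are computed on the z scale and then back-transformed, the interval is asymmetric around r and never leaves the range from -1 to +1.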

The appropriate sample size depends on the accuracy required for the estimate of Pearson’s correlation coefficient. In the case of cherry tomatoes, there are no studies in the literature that make this determination. Consequently, the aim of the present study was to determine the sample size for estimation of the Pearson correlation coefficient between productivity variables of cherry tomato.

MATERIALS AND METHODS:

Two uniformity trials were carried out in plastic greenhouses at the Plant Science Department of the Universidade Federal de Santa Maria (latitude 29°43’ S, longitude 53°43’ W, and 95m altitude). The greenhouses were oriented north-south and covered with a 150-micron low-density polyethylene (LDPE) film containing an anti-UV additive. Their dimensions were as follows: 1) 20 × 10m (200m²), with a sidewall height of 3m and a height of 4m in the central part; 2) 25 × 10m (250m²), with a sidewall height of 4m and a height of 5.5m in the central part. The climate at the site of the experiments is classified as Cfa, and the soil is classified as a Paleudalf (^{EMBRAPA, 2006}).

Soil amendment was performed in both greenhouses 30 days before planting to raise the pH to 6.5 and the phosphorus level to 300mg dm^{-3}. Next, eight ridges, 0.20m high, 0.30m wide, and spaced 1m apart, were constructed in each greenhouse. These ridges were covered with a black opaque LDPE mulching film. Planting fertilization consisted of 150kg ha^{-1} N, 250kg ha^{-1} P, and 125kg ha^{-1} K in each greenhouse. Topdressing fertilization was performed every 23 days with 30kg ha^{-1} N, 15kg ha^{-1} P, and 30kg ha^{-1} K. All the procedures were based on soil chemical analysis and followed the recommendations of the Official Network of Soil and Plant Tissue Analysis Laboratories of the states of Rio Grande do Sul and Santa Catarina (^{ROLAS, 2004}).

The seedlings were Lily Hybrid Cherry tomatoes, and they were transplanted in the spring/summer season, on October 26, 2014, at first flowering. Spacing between plants was 0.5m. Plants were staked vertically with a ribbon and trained to two stems, with lateral shoots removed. Every 14 days, calcium, boron, fungicides, and insecticides were sprayed to control pests, diseases, and physiological disorders such as blossom end rot of the fruits.

All the plants in both greenhouses were evaluated, totaling 247 plants in the 200m² greenhouse and 347 in the 250m² greenhouse. All the fruits were harvested, and the following variables were determined: total fruit weight (TWF), mean fruit length (MFL), mean fruit width (MFWi), mean fruit weight (MFW), number of bunches per plant (NBP), number of fruits per plant (NFP), and number of fruits per bunch (NFB). MFWi and MFL were measured in centimeters with a caliper. MFW and TWF were measured in grams using a scale with 0.01g precision.

From the observed data, two Pearson correlation matrices were constructed, one for each greenhouse. The significance of the correlation coefficients was tested by Student’s *t* test at an error probability of 5%. A third matrix (matrix of means) was compiled from the mean values of the Pearson correlation coefficients of the two matrices previously mentioned. A total of 68 sample sizes were planned for the 250m² greenhouse and 48 for the 200m² greenhouse. The initial size was 10 plants, and the others were obtained by successively adding five plants. For each sample size, 3000 estimates of the Pearson correlation coefficient were obtained through bootstrap resampling with replacement. From the 3000 estimates, we then calculated the minimum value, 2.5% percentile, mean, 97.5% percentile, maximum value, and the amplitude of the 95% confidence interval (the difference between the 97.5% and 2.5% percentiles). The confidence interval, which represents the variation caused by the variable’s random behaviour, can be obtained from 1000 resamplings; the larger the number of resamplings, the more accurate the interval (^{FERREIRA, 2009}). Therefore, 3000 resamplings were used to obtain intervals with high precision.
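The resampling scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the R code used in the study: the helper names (`pearson_r`, `percentile`, `planned_sizes`, `bootstrap_summary`) are hypothetical, the percentile rule (linear interpolation) is one of several possible definitions, resamples in which a variable is constant are skipped, and the demonstration data are synthetic.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when a variable is constant
    return sxy / math.sqrt(sxx * syy)


def percentile(sorted_values, p):
    """Percentile p (0-100) by linear interpolation between order statistics."""
    k = (len(sorted_values) - 1) * p / 100
    lower, upper = math.floor(k), math.ceil(k)
    weight = k - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def planned_sizes(n_plants, start=10, step=5):
    """Sample sizes 10, 15, 20, ... up to the number of plants evaluated."""
    return list(range(start, n_plants + 1, step))


def bootstrap_summary(x, y, sample_size, n_resamples=3000, rng=None):
    """Summarise bootstrap estimates of r for one planned sample size."""
    rng = rng or random.Random()
    estimates = []
    for _ in range(n_resamples):
        # Draw plants with replacement, keeping each (x, y) pair together.
        idx = [rng.randrange(len(x)) for _ in range(sample_size)]
        r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(r):  # assumption: skip degenerate resamples
            estimates.append(r)
    if not estimates:
        raise ValueError("no valid bootstrap estimate could be computed")
    estimates.sort()
    p025, p975 = percentile(estimates, 2.5), percentile(estimates, 97.5)
    return {
        "min": estimates[0],
        "p2.5": p025,
        "mean": sum(estimates) / len(estimates),
        "p97.5": p975,
        "max": estimates[-1],
        "amplitude": p975 - p025,  # amplitude of the 95% bootstrap interval
    }


if __name__ == "__main__":
    # Synthetic stand-in data; these are NOT the measurements from the study.
    data_rng = random.Random(0)
    x = [data_rng.gauss(0, 1) for _ in range(100)]
    y = [0.5 * a + data_rng.gauss(0, 1) for a in x]
    for size in planned_sizes(len(x))[:3]:
        print(size, bootstrap_summary(x, y, size, rng=random.Random(size)))
```

With 347 and 247 plants, `planned_sizes` yields 68 and 48 sizes, respectively, which matches the counts given in the text.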

The amplitude of the confidence interval reflects the variability of the correlation coefficients estimated with different sample sizes. The optimal sample size for estimating the Pearson correlation coefficient was defined as the sample size at which the amplitude of the 95% bootstrap confidence interval was less than or equal to 0.4 (Figure 1a). We also calculated the mean of each of the 21 sample size estimates obtained for each of the two matrices. Data processing and statistical analysis were conducted using the R software (^{R DEVELOPMENT CORE TEAM, 2012}) and Microsoft Office Excel^{®}.
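The selection rule above can be expressed as a short Python sketch. It assumes that the optimal size is the smallest planned sample size whose interval amplitude is at or below the threshold; the text does not state how ties or later increases in amplitude were handled. The function name `optimal_sample_size` and the amplitudes in the demonstration are hypothetical, illustrative values, not results from the study.

```python
def optimal_sample_size(amplitude_by_size, max_amplitude=0.4):
    """Return the smallest sample size whose amplitude meets the threshold.

    amplitude_by_size maps each planned sample size to the amplitude of its
    95% bootstrap confidence interval. Returns None if no size qualifies.
    """
    for size in sorted(amplitude_by_size):
        if amplitude_by_size[size] <= max_amplitude:
            return size
    return None


if __name__ == "__main__":
    # Made-up amplitudes for one pair of variables (illustration only).
    amplitudes = {10: 1.31, 15: 1.02, 20: 0.88, 25: 0.41, 30: 0.39, 35: 0.36}
    print(optimal_sample_size(amplitudes))
```

Returning `None` when no planned size reaches the threshold makes it explicit that the available plants were insufficient for that pair of variables.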

RESULTS AND DISCUSSION:

The Pearson correlation coefficients of the 21 pairs of parameters varied between -0.019 and 0.970 in the 200m² greenhouse and between -0.083 and 0.963 in the 250m² greenhouse. Of these, 20 were statistically significant in the 200m² greenhouse and 12 in the 250m² greenhouse (Table 1). We observed that Pearson correlation coefficients, even those of low magnitude, were statistically significant as a consequence of the large number of observations (347 in the 250m² greenhouse and 247 in the 200m² greenhouse). This finding should be interpreted with caution because a statistical test often indicates the presence of a linear relation between variables when in fact this relation is of no practical importance (^{HAIR et al., 2005}; ^{CARGNELUTTI FILHO et al., 2010}; ^{CARGNELUTTI FILHO et al., 2011}; ^{CARGNELUTTI FILHO et al., 2012}).
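The effect of sample size on significance can be made explicit with the usual form of the *t* test for a correlation coefficient. The notation ($r$, $n$, $t_{\mathrm{crit}}$) is introduced here for illustration, and the numerical thresholds are approximations based on a two-sided 5% critical value of about 1.97 for both 345 and 245 degrees of freedom.

```latex
t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^{2}}}
\quad\Longrightarrow\quad
|t| > t_{\mathrm{crit}}
\;\Longleftrightarrow\;
|r| > \frac{t_{\mathrm{crit}}}{\sqrt{t_{\mathrm{crit}}^{2} + n - 2}}
% n = 347: |r| > 1.97 / \sqrt{1.97^2 + 345} \approx 0.105
% n = 247: |r| > 1.97 / \sqrt{1.97^2 + 245} \approx 0.125
```

Under these assumptions, a coefficient with an absolute value of only about 0.105 (347 plants) or 0.125 (247 plants) already reaches significance, even though it explains little more than 1% of the variation.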

^{1}TWF = total fruit weight (g), MFL = mean fruit length per plant (cm), MFWi = mean fruit width per plant (cm), MFW = mean fruit weight per plant (g), NBP = number of bunches per plant, NFP = number of fruits per plant, NFB = number of fruits per bunch. ^{*}Significant according to the *t* test (P<0.05), with 345 degrees of freedom in the 250m² greenhouse and 245 degrees of freedom in the 200m² greenhouse.

The association between fruit size variables (MFL, MFWi, and MFW) and fruit number variables (NBP, NFP, and NFB) differed between the greenhouses, resulting in Pearson correlation coefficients that differed in both magnitude and sign. Hence, the matrix of means may not provide adequate Pearson correlation coefficients, and consequently, the sample size calculated from it may not be reliable. Thus, we chose to determine the sample size for each of the trials separately (Table 1).

The amplitudes of the 95% confidence interval, obtained from 3000 bootstrap resamplings with replacement, revealed that the Pearson correlation coefficient estimate is more accurate for variables with a strong linear relation than for those with a weak linear relation. When the correlation coefficient is close to zero, the confidence interval amplitude remains high even in large samples (Figure 1). The estimated sample sizes for the 21 correlation coefficients were inversely related to the magnitude of the Pearson correlation coefficient in both greenhouses (r = -0.95 in the 250m² greenhouse and r = -0.91 in the 200m² greenhouse), confirming that the weaker the linear relation between variables (Pearson correlation coefficients close to zero), the larger the sample size needed for its estimation.
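A standard large-sample approximation helps explain this pattern. It assumes bivariate normality and is offered only as background; it is not the bootstrap procedure used in this study, whose empirical intervals need not match it. The symbols $\rho$ (true correlation) and $n$ (sample size) are notation introduced here.

```latex
\operatorname{SE}(r) \approx \frac{1 - \rho^{2}}{\sqrt{n}},
\qquad
\text{amplitude of the 95\% interval} \approx 2 \times 1.96 \times \frac{1 - \rho^{2}}{\sqrt{n}}
```

For a fixed $n$, the factor $1 - \rho^{2}$ equals 1 when $\rho = 0$ but only about 0.08 when $\rho = 0.96$, so under this approximation a near-zero correlation needs a much larger sample to reach the same interval amplitude.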

A suitable sample size is necessary, especially when the correlation coefficient is close to zero, because the estimates then vary greatly in magnitude and can change sign (Table 2); this situation compromises the statistical analysis. ^{CARGNELUTTI FILHO et al. (2010)}, after studying the linear relation between corn parameters, reported that an inadequate sample size may be linked to the discrepancies among results of scientific publications. Because the sign is essential to the interpretation of the Pearson correlation coefficient, the use of a suitable sample size for its determination is highly relevant, even more so than the sample size required for mean estimation.

¹TWF = total fruit weight (g), MFL = mean fruit length per plant (cm), MFWi = mean fruit width per plant (cm), MFW = mean fruit weight per plant (g), NBP = number of bunches per plant, NFP = number of fruits per plant, NFB = number of fruits per bunch.

In the case of cherry tomatoes, we observed that the use of small samples is even more detrimental to the study of linear relations between fruit size variables (MFL, MFWi, and MFW) and fruit number variables (NBP, NFP, and NFB). For example, the estimates of the correlation between MFL and NBP obtained in the 250m² and 200m² greenhouses were -0.083 and 0.133, respectively. When the sample consisted of only 10 plants, the 95% confidence interval of the 3000 bootstrap estimates ranged from -0.744 to 0.657 in the 250m² greenhouse and from -0.699 to 0.764 in the 200m² greenhouse. These results show that at small sample sizes, inferences about the relations between variables can be contradictory, compromising interpretation of the results. This problem did not occur between TWF and NFP. In this case, the correlation coefficients obtained in the 250m² and 200m² greenhouses were 0.963 and 0.970, respectively. The 95% confidence interval of the estimates ranged from 0.810 to 0.995 in the 250m² greenhouse and from 0.864 to 0.995 in the 200m² greenhouse (Table 2).

The sample size needed to estimate the Pearson correlation coefficient between productivity variables of cherry tomato, with a 95% confidence interval amplitude of at most 0.4, varied between 10 and 275 plants in the 250m² greenhouse and between 10 and 200 plants in the 200m² greenhouse (Table 1). Therefore, a sample size of 275 plants in the 250m² greenhouse and 200 plants in the 200m² greenhouse allows the correlation coefficient between the cherry tomato’s productivity variables to be estimated with a 95% confidence interval amplitude of at most 0.4, regardless of the relation between the parameters being studied. The use of this confidence interval amplitude to determine a sample size is justified because at this value, the amplitude tends to stabilize when the correlation coefficient is of low magnitude (Figure 1a).

Sample sizes determined from the same confidence interval amplitude yield different precision values (precision is greater for correlation coefficients of greater magnitude than for those of lower magnitude). Hence, if a researcher wants to estimate Pearson correlation coefficients with the same precision, the sample size for each analysis should be different. If the same sample size is used, the precision of the estimates will not be the same. The most suitable amplitude of the 95% confidence interval should be chosen by each researcher according to the precision desired for the estimates.

CONCLUSION:

The sample size should be larger when determining the linear relation between variables associated with fruit size and variables associated with the number of fruits per plant, owing to the low magnitude of this correlation.

To estimate the Pearson correlation coefficients between cherry tomato variables with a 95% confidence interval amplitude of at most 0.4, it is necessary to sample 275 plants in the 250m² greenhouse and 200 plants in the 200m² greenhouse.