Sample size affects the precision of the analysis of variance in experiments with cauliflower seedlings

ABSTRACT: This study verified whether sample size affects the precision of the analysis of variance in experiments with cauliflower seedlings. An experiment was carried out in which the number of leaves and the shoot, root, and total lengths were measured. For each variable, resampling with replacement was performed in sample scenarios of 1, 2, …, 100 seedlings per experimental unit, and the sample size was defined for the variance components through Schumacher models and maximum curvature points. The mean squares of the analysis of variance are directly affected by the number of sampled seedlings. Sampling 16 seedlings per experimental unit is enough to estimate the analysis of variance reliably, providing satisfactory precision gains compared to sampling only one seedling per experimental unit.

In previous research, four methods based on the maximum curvature point were compared to determine the optimal sample size per experimental unit for estimating the overall experimental mean of cauliflower (Brassica oleracea L. var. botrytis) seedlings (BITTENCOURT et al., 2022). In that study, the width of the 95% confidence interval (CI95%) of the statistic decreased as sample size increased, up to a stabilization point. Thus, the methods that reported values closer to the stabilization point of the curve were chosen, since the precision gain beyond this point would no longer be enough to justify increasing the number of sampled plants (CARGNELUTTI FILHO et al., 2018; SOUZA et al., 2022). That example highlighted the importance of quantifying precision gain when defining sample size, which not only facilitates the decision on the number of plants to be sampled per experimental unit but also guarantees a minimum acceptable precision for the results. However, the previous approach focused only on the overall experimental mean without exploring other components of the analysis of variance.
The analysis of variance is widely performed to summarize data in designed experiments (WELHAM et al., 2015). Nonetheless, in order to detect true significant differences through the F test that follows it, mean squares must be estimated reliably, reducing the probability of type I and II errors (ANDERSON et al., 2017). For this, sample size plays a crucial role, as verified by SOUZA et al. (2022) for soybean and as suggested by its impact on the estimation of other statistics in experiments with crotalaria and maize (TOEBE et al., 2018; CARGNELUTTI FILHO & TOEBE, 2021). Therefore, considering that studies connecting sample size and the precision gain of the analysis of variance have not been reported in the literature for horticultural crops such as cauliflower, this study verified whether sample size affects the precision of the analysis of variance in experiments with cauliflower seedlings.
Ciência Rural, v.53, n.5, 2023.
The experiment was carried out at the Federal University of Pampa (UNIPAMPA), Itaqui, Rio Grande do Sul, Brazil. Cauliflower cultivar Teresopolis Gigante was sown using three substrate mixtures (50% Mecplant® + 50% Carolina Padrão®, 75% Mecplant® + 25% rice husk, and 75% Carolina Padrão® + 25% rice husk) and trays with 72 and 128 cells, forming a 3 × 2 two-factor scheme in a completely randomized design with four replications. Seedlings were kept in a greenhouse for thirty days. At sampling, twenty seedlings were randomly collected from each experimental unit, in line with the sample sizes used in cauliflower experiments (THOMSON et al., 2013; TEMPESTA et al., 2019; COSTA et al., 2020). Then, the following traits were measured: a) Number of Leaves (NL), in units; b) Shoot Length (SL), from neck to leaflet insertion, in cm; c) Root Length (RL), from neck to root apex, in cm; and d) Total Length (TL), as the sum of SL and RL, in cm. Experiments with 1, 2, …, 100 seedlings per experimental unit were then simulated using bootstrap resampling, with 10,000 resamples with replacement (EFRON, 1979).
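The resampling step described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name `bootstrap_unit_means`, the example shoot lengths, and the choice to summarize each resample by its mean are assumptions made only for this sketch.

```python
import random
import statistics


def bootstrap_unit_means(seedlings, n, n_resamples=10_000, seed=0):
    """Draw n seedlings with replacement from one experimental unit,
    n_resamples times, and return the mean of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        statistics.fmean(rng.choices(seedlings, k=n))
        for _ in range(n_resamples)
    ]


# Hypothetical shoot lengths (cm) for the 20 seedlings of one unit.
unit = [4.1, 3.8, 4.5, 4.0, 3.9, 4.3, 4.7, 3.6, 4.2, 4.4,
        4.0, 3.7, 4.6, 4.1, 3.9, 4.8, 4.2, 4.0, 3.5, 4.3]

# Fewer resamples than in the study, only to keep the example fast.
means_n1 = bootstrap_unit_means(unit, n=1, n_resamples=2000)
means_n16 = bootstrap_unit_means(unit, n=16, n_resamples=2000)
```

Because sampling is with replacement, scenarios with more than the 20 seedlings actually collected (up to 100) can be simulated from the same unit.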
The statistical analyses were performed using native functions and packages from R software (R DEVELOPMENT CORE TEAM, 2022). First, the database was stratified into experimental units, and for each sample size an analysis of variance was performed through the following mathematical model:

$$Y_{ijk} = m + T_i + S_j + (TS)_{ij} + \varepsilon_{ijk}$$

where $Y_{ijk}$ is the value observed in the response variable in plot ijk, $m$ is the overall mean, $T_i$ is the fixed effect of level i (i = 1, 2) of the tray-cell-size factor, $S_j$ is the fixed effect of level j (j = 1, 2, 3) of the substrate factor, $(TS)_{ij}$ is the fixed effect of the interaction of level i of the tray-cell-size factor with level j of the substrate factor, and $\varepsilon_{ijk}$ is the experimental error effect. Thereafter, the mean squares of $T_i$, $S_j$, $(TS)_{ij}$, and $\varepsilon_{ijk}$ were extracted in each sample scenario per experimental unit. This process was carried out using the sample() and aov() functions.
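To make the extracted quantities concrete, the sketch below computes the four mean squares of a balanced two-factor completely randomized design by hand. It is a Python stand-in for what aov() reports, not the study's code; the function name `two_way_mean_squares`, the nested-list layout, and the example values are hypothetical, and it assumes one value per plot (for instance, the mean of the sampled seedlings).

```python
from statistics import fmean


def two_way_mean_squares(y):
    """Mean squares of a balanced two-factor design.

    y[i][j] is the list of plot values for tray level i and
    substrate level j; every cell must hold the same number of plots.
    """
    a, b, r = len(y), len(y[0]), len(y[0][0])
    grand = fmean(v for row in y for cell in row for v in cell)
    cell_mean = [[fmean(cell) for cell in row] for row in y]
    t_mean = [fmean(v for cell in row for v in cell) for row in y]
    s_mean = [fmean(v for i in range(a) for v in y[i][j]) for j in range(b)]

    # Sums of squares for main effects, interaction, and error.
    ss_t = b * r * sum((m - grand) ** 2 for m in t_mean)
    ss_s = a * r * sum((m - grand) ** 2 for m in s_mean)
    ss_ts = r * sum(
        (cell_mean[i][j] - t_mean[i] - s_mean[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_e = sum(
        (v - cell_mean[i][j]) ** 2
        for i in range(a) for j in range(b) for v in y[i][j]
    )
    # Divide each sum of squares by its degrees of freedom.
    return {
        "tray": ss_t / (a - 1),
        "substrate": ss_s / (b - 1),
        "interaction": ss_ts / ((a - 1) * (b - 1)),
        "error": ss_e / (a * b * (r - 1)),
    }


# Hypothetical plot values: 2 tray sizes x 3 substrates x 4 replications.
plots = [
    [[4.1, 4.3, 3.9, 4.2], [3.6, 3.8, 3.7, 3.5], [4.8, 4.6, 4.9, 4.7]],
    [[3.9, 4.0, 3.8, 4.1], [3.3, 3.5, 3.4, 3.2], [4.4, 4.5, 4.3, 4.6]],
]
mean_squares = two_way_mean_squares(plots)
```

Repeating this calculation on every bootstrap resample of a given sample size yields the distribution of each mean square for that scenario.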
The resamples for each planned sample scenario were subjected to descriptive analysis, obtaining the minimum, 2.5th percentile, mean, 97.5th percentile, and maximum values. The 95% confidence interval width (CI95%) was estimated as the difference between the 97.5th and 2.5th percentiles. Subsequently, the precision gain criterion was estimated in percentage, assuming that the greater the CI95%, the lower the precision of the estimates of the analysis-of-variance mean squares (SOUZA et al., 2022). Thus, the sample size of one seedling per experimental unit ($CI_1$) was taken as the reference, where the CI95% is maximum and the precision is minimum. The following formula was used to estimate precision gain:

$$PG_i = \frac{CI_1 - CI_i}{CI_1} \times 100$$

where $CI_i$ is the 95% confidence interval width obtained from the sample sizes of 2, 3, ...
Finally, the precision gain was fitted using the nls() function through Schumacher's model (SCHUMACHER, 1939):

$$PG_i = \alpha \exp(-\beta / n) + \varepsilon_i$$

where $PG_i$ is the i-th precision gain observation per statistic, in each sample size n, $\alpha$ and $\beta$ are parameters of the model, exp is the exponential function, and $\varepsilon_i$ is the random error. A maximum curvature point was defined over the fitted models through the perpendicular distances method (SILVA & LIMA, 2017), as recommended by BITTENCOURT et al. (2022) for cauliflower, using the maxcurv() function from the soilphysics package (SILVA & LIMA, 2015).
The variance components fluctuated in response to the number of seedlings sampled per experimental unit and also varied with each trait (Figure 1). In all cases, CI95% tends to decrease gradually as the number of sampled seedlings increases, which means estimates become more precise (TOEBE et al., 2018; BITTENCOURT et al., 2022; SOUZA et al., 2022). Conversely, small sample sizes (≤ 5 seedlings per experimental unit) result in greater CI95%, making the mean square estimates less reliable. These results are similar to the ones observed by SOUZA et al. (2022) when analyzing the response of variance components in soybean.
From this response, it was observed that the precision of the analysis-of-variance mean squares increased as sample size increased, establishing a direct relationship between the reliability of the results and the number of seedlings used for data collection, especially considering the influence of the analysis of variance on the determination of significant differences between treatments. In general, the sample sizes sufficient for obtaining reliable estimates of the analysis of variance varied from 13 to 16 cauliflower seedlings per experimental unit, with precision gains ranging from 76.52% to 93.42%, depending on the variance component and trait analyzed (Table 1 and Figure 2). These values were obtained through the parametrization of precise Schumacher models (SCHUMACHER, 1939), with coefficients of determination (R²) ≥ 0.78, root mean square error (RMSE) ranging from 1.43 to 4.83, and d index ≥ 0.93. Furthermore, in sample sizes ≤ 3, a considerable precision gain is observed every time the number of sampled seedlings increases. This response remains until the sample size reaches 10 seedlings per experimental unit, beyond which the precision gain becomes progressively smaller until it finally reaches the maximum curvature point, that is, the ideal sample size for each trait and variance component.
In that perspective, considering all traits and variance components jointly, a minimum of 16 seedlings per experimental unit can be recommended as sufficient to obtain accurate mean square estimates for the analysis of variance of experiments with cauliflower seedlings, corroborating the results obtained by BITTENCOURT et al. (2022), who suggested sampling at least 15 cauliflower seedlings per experimental unit to estimate the overall experimental mean. Collecting larger samples normally demands more resources and labor that are not justified by the small precision gain obtained (TOEBE et al., 2015), and in some cases, oversampling may even result in greater variation between experimental units that can inflate the error mean square (SOUZA et al., 2022). This harms the detection of significant differences between treatments due to the occurrence of type II error (ANDERSON et al., 2017). Importantly, the practical results obtained here should be applied cautiously in designed experiments with cauliflower seedlings and should not be used for other horticultural crops without preliminary studies, serving only as support for researchers who conduct experiments with other species from the Brassicaceae family.
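The chain from confidence interval width to recommended sample size, as described in the methods, can be sketched as below. This is a hedged Python illustration, not the study's R workflow: all function names are hypothetical, precision gain is assumed to be the percentage reduction in CI95% width relative to one seedling per unit, the Schumacher curve is assumed to have the form α·exp(−β/n) and is fitted by ordinary least squares on its log-linear form instead of nls(), and the curvature search is a simplified reading of the perpendicular distances idea rather than the maxcurv() implementation.

```python
import math


def percentile(values, p):
    """Percentile (0 <= p <= 100) by linear interpolation."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100
    lo, hi = math.floor(k), math.ceil(k)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def ci_width(values):
    """Width of the 95% percentile interval of the resampled values."""
    return percentile(values, 97.5) - percentile(values, 2.5)


def precision_gain(ci_1, ci_i):
    """Percentage reduction of the CI width relative to n = 1 (assumed form)."""
    return 100 * (ci_1 - ci_i) / ci_1


def fit_schumacher(ns, pgs):
    """Fit ln(PG) = ln(alpha) - beta / n by least squares; needs PG > 0."""
    x = [1 / n for n in ns]
    y = [math.log(pg) for pg in pgs]
    xm, ym = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    sxx = sum((xi - xm) ** 2 for xi in x)
    slope = sxy / sxx
    return math.exp(ym - slope * xm), -slope  # (alpha, beta)


def max_curvature_point(alpha, beta, n_min, n_max):
    """Integer n whose point on the fitted curve lies farthest from the
    straight line joining the curve's end points."""
    def curve(n):
        return alpha * math.exp(-beta / n)

    x1, y1, x2, y2 = n_min, curve(n_min), n_max, curve(n_max)
    length = math.hypot(x2 - x1, y2 - y1)

    def distance(n):
        return abs((y2 - y1) * n - (x2 - x1) * curve(n)
                   + x2 * y1 - y2 * x1) / length

    return max(range(n_min, n_max + 1), key=distance)


# Stand-in precision gains (%) for n = 2..100 from an arbitrary curve.
ns = list(range(2, 101))
pgs = [95 * math.exp(-2 / n) for n in ns]
alpha, beta = fit_schumacher(ns, pgs)
n_opt = max_curvature_point(alpha, beta, 2, 100)
```

The log-linear fit weights small gains differently from a direct nonlinear fit, and the perpendicular distance depends on the scales of the two axes, so parameter estimates and the selected point may differ from those returned by the R functions.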

Figure 1 - Minimum, 2.5 percentile, mean, 97.5 percentile and maximum values of the mean squares of the error, tray cell size, substrate, and tray cell size × substrate interaction in the number of leaves (a, b, c, and d), total length (e, f, g, and h), shoot length (i, j, k, and l), and root length (m, n, o, and p) of cauliflower seedlings.

Table 1 - Coefficient of determination (R²), root mean square error (RMSE), and d index of the Schumacher models, precision gains, and sample sizes for the analysis of variance of the number of leaves (NL), shoot length (SL), root length (RL), and total length (TL) of cauliflower seedlings.