AgroReg: main regression models in agricultural sciences implemented as an R Package

ABSTRACT Regression analysis is highly relevant to agricultural sciences since many of the factors studied are quantitative. Researchers have generally used polynomial models to explain their experimental results, mainly because much of the existing software performs this analysis and because other models are less well known. However, many natural phenomena do not follow polynomial behavior; nevertheless, the use of non-linear models is costly and requires advanced knowledge of programming languages such as R. Thus, this work presents several regression models found in scientific studies, implementing them in the form of an R package called AgroReg. The package comprises 44 analysis functions with 66 regression models such as polynomial, non-parametric (loess), segmented, logistic, exponential, and logarithmic, among others. The functions provide the coefficient of determination (R²), model coefficients and the respective p-values from the t-test, root mean square error (RMSE), Akaike's information criterion (AIC), Bayesian information criterion (BIC), maximum and minimum predicted values, and the regression plot. Furthermore, other measures of model quality and graphical analysis of residuals are also included. The package can be downloaded from the CRAN repository using the command: install.packages("AgroReg"). AgroReg is a promising analysis tool in agricultural research on account of its user-friendly and straightforward functions, which allow for fast and efficient data processing with greater reliability and relevant information.


Introduction
Agronomic experiments are generally laborious, expensive, and often take years to be performed. Moreover, they are often tricky and complex in planning and execution. They depend on many factors that affect both the efficiency and reliability of results due to the natural variability of biological and agricultural systems (Piepho and Edmondson, 2018).
Experimental design and analysis depend on the measurement structures of treatment factors, and understanding them is essential to a correct analysis (Piepho and Edmondson, 2018). In the case of qualitative factors, the levels have no specific order on a numerical scale (Montgomery, 2017). In these cases, if the experiment has adequate replication, the levels can be compared by the standard error of the differences or by tests of means (Hsu, 1996; Bretz et al., 2011; Piepho and Edmondson, 2018).
Treatment factors are quantitative when the levels can be ordered numerically (Montgomery, 2017). In this case, regression analysis is recommended, since it uses distance information on the scale of quantitative predictors and allows for the estimation of values even if they were not observed in the study (Cochran and Cox, 1986; Pimentel-Gomes, 2009; Banzatto and Kronka, 2013; Storck et al., 2016). However, an experiment for such a purpose requires at least three levels of the factor, although at least five are desirable.
Experiments that study quantitative factors in agricultural sciences have been reported in several articles, such as studies of plant density or population (Van Roekel and Coulter, 2011; Williams et al., 2021), fruit post-harvest quality (Marodin et al., 2016), seed germination (Motsa et al., 2015), weed control (Noel et al., 2018), and growth curves (Lúcio and Sari, 2017). In these studies, polynomial models were predominantly used and, although this is not incorrect, many natural phenomena do not present such behavior and are better described by specific models (Archontoulis and Miguez, 2015).
Non-linear models are usually fitted with programming languages such as SAS or R. For instance, regression analysis can be performed in the base R language using functions such as lm, nls or glm. Nevertheless, nonlinear models fitted with the nls function require the prior specification of starting values to obtain model coefficient estimates, which is time-consuming. Implementing an R package may help carry out these analyses, as it is more straightforward and accessible for users. Thus, this work presents regression models found in scientific studies and implements them as an R package called AgroReg.

Creation of the AgroReg package
The package was built using the R language (version 4.1.0) (R Core Team, 2021), and documentation and checks were generated with the devtools (Wickham et al., 2021) and roxygen2 (Wickham et al., 2021) packages to facilitate the construction and adequacy of the

Installation
All developed functions were written in the R programming language; therefore, they can be executed in the R environment or any GUI (Graphical User Interface) that uses this language, such as RStudio (https://www.rstudio.com/). R can be installed on Windows, Linux, or Mac systems. Thus, the scientific community can freely use this package regardless of the operating system. The AgroReg package can be installed from the CRAN repository using the following command:
> install.packages("AgroReg", dependencies=TRUE)
The following command must be run to load the package:
> library(AgroReg)

Data set
The collection of functions available in the AgroReg package implements several methods to describe many of the phenomena observed in quantitative studies in agricultural sciences, as mentioned in Table 1. Thus, data obtained in real experiments were included to better exemplify the applications of the package. The use of the functions available in the package and the interpretation of their results are best presented in the form of an applied example using real data.

Regression models
All regression models implemented in the package are shown in Table 1, together with their functions and descriptions, as well as applications in articles in the field of agricultural sciences. The models were grouped into non-parametric (loess), polynomial, logistic or S-shaped, logarithmic, bell-shaped, segmented, and exponential models. They were primarily extracted from scientific journals with original works, such as the one by Sadeghi et al. (2019), or from review articles, including the one written by Archontoulis and Miguez (2015), aiming to cover as many regression models as possible. Furthermore, modifications of specific models were also implemented.
Polynomial models, which are linear in their coefficients and therefore also called linear models, were implemented with the base R function lm. The same procedure was used to obtain the logarithmic curves, the Valcam model, and specific exponential models. On the other hand, the non-parametric loess regression, also known as local regression, was performed using the base R function loess.
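As a minimal sketch of the two base R approaches named above (not the AgroReg wrappers themselves), the following fits a quadratic polynomial with lm and a local regression with loess. The built-in cars data set, the polynomial degree, and the span value are illustrative assumptions and do not come from the package.

```r
# Illustrative data: stopping distance vs. speed from the built-in 'cars' data set
data(cars)

# Quadratic polynomial: linear in its coefficients, so lm() applies
fit_poly <- lm(dist ~ speed + I(speed^2), data = cars)

# Non-parametric local regression; span = 0.75 is the loess() default
fit_loess <- loess(dist ~ speed, data = cars, span = 0.75)

# Predict both fits on a common grid within the observed range
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 50))
pred_poly <- predict(fit_poly, newdata = grid)
pred_loess <- predict(fit_loess, newdata = grid)

# Estimated intercept, linear, and quadratic coefficients
print(coef(fit_poly))
```

The polynomial fit returns interpretable coefficients, whereas loess returns only a smoothed curve, which is why the latter is mostly useful for exploring the shape of the response.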
Logistic equation models, also called sigmoid curves, are S-shaped and mainly used to describe plant growth curves, seed germination over time, or herbicide dose-response studies (Archontoulis and Miguez, 2015). They are implemented in the drc (Ritz et al., 2015) and aomisc (https://github.com/OnofriAndreaPG/aomisc) packages. Thus, in AgroReg, these functions were imported and wrapped in a more straightforward function that returns more information.
Finally, nls from the stats package, which estimates the coefficients by least squares, was used for the other functions. In these cases, pre-established algorithms were used to automate the choice of initial values. Thereby, most functions do not require initial information to generate models, although convergence problems may occur when the automatically chosen initial values are poor. In such a situation, the user can specify a priori information through the "initial" argument, according to each regression model.
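The role of initial values in nls can be illustrated with a short base R sketch. It is not the algorithm used inside AgroReg; it only contrasts user-supplied starting values with a self-starting model (SSmicmen from the stats package), using the built-in Puromycin data set as an assumed example.

```r
# Illustrative data: enzyme kinetics from the built-in 'Puromycin' data set
treated <- subset(Puromycin, state == "treated")

# Michaelis-Menten model with user-supplied starting values
fit_start <- nls(rate ~ Vm * conc / (K + conc),
                 data = treated,
                 start = list(Vm = 200, K = 0.1))

# Same model with the self-starting function SSmicmen(), which
# derives its own starting values from the data
fit_self <- nls(rate ~ SSmicmen(conc, Vm, K), data = treated)

# Both routes should reach essentially the same estimates
print(coef(fit_start))
print(coef(fit_self))
```

When the starting values are far from the solution, the first call may fail to converge, which is the situation that automated initial values are meant to avoid.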

Statistical information and parameters
The functions were developed to provide estimates of the maximum and minimum predicted values obtained in the curve within the studied range. In addition, statistical parameters such as AIC (Akaike's information criterion), BIC (Bayesian information criterion), R² (coefficient of determination) or Pseudo-R² (correlation between observed and predicted outcome), RMSE (root mean square error), and the p-value from the t-test of the coefficients are also returned. In the case of polynomial models, the variance inflation factor (VIF) is also given. The root mean square error is calculated by the following formula:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2}{n}}$$

where $\hat{Y}_i$ is the response predicted by the model, $Y_i$ the observed response, and $n$ the sample size. The Akaike's Information (AIC) and Bayesian Information (BIC) Criteria are calculated by the formulas:

$$\mathrm{AIC}_i = -2 \log L_i + 2 p_i$$

$$\mathrm{BIC}_i = -2 \log L_i + p_i \log n$$

where $L_i$ and $p_i$ are the likelihood function and the number of parameters of each model, and $n$ the number of observations. The VIF is calculated using the formula:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \quad j = 1, \ldots, p$$

where $p$ is the number of predictor variables and $R_j^2$ the squared multiple correlation coefficient resulting from the regression of $X_j$ on the other $p-1$ regressors.
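The statistics described above can be reproduced by hand in base R, which may help when checking the values reported by a fitting function. The sketch below is an illustration under stated assumptions: it uses the built-in cars data set and a quadratic lm fit, and it is not the code used inside AgroReg.

```r
data(cars)
fit <- lm(dist ~ speed + I(speed^2), data = cars)

obs <- cars$dist      # observed response
pred <- fitted(fit)   # response predicted by the model
n <- length(obs)      # sample size

# RMSE: square root of the mean squared difference between predicted and observed
rmse <- sqrt(sum((pred - obs)^2) / n)

# AIC = -2 log L + 2 p and BIC = -2 log L + p log n; for lm(), the
# parameter count reported by logLik() includes the residual variance
ll <- logLik(fit)
p <- attr(ll, "df")
aic <- -2 * as.numeric(ll) + 2 * p
bic <- -2 * as.numeric(ll) + p * log(n)

# VIF of the linear term: 1 / (1 - R2) from regressing it on the other predictor
r2_j <- summary(lm(speed ~ I(speed^2), data = cars))$r.squared
vif_speed <- 1 / (1 - r2_j)

print(c(RMSE = rmse, AIC = aic, BIC = bic, VIF = vif_speed))
```

The hand-computed AIC and BIC agree with the values returned by the base R functions AIC() and BIC(). A large VIF is expected here because a raw predictor and its square are strongly correlated.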
Other goodness-of-fit statistical parameters, namely MBE (mean bias error), MBER (relative mean bias error), MAE (mean absolute error), RMAE (relative mean absolute error), SE (standard error), MSE (mean squared error), rMSE (relative mean square error), EF (modeling efficiency), SD (standard deviation of differences), and CRM (coefficient of residual mass), are provided separately from the analyses through the "stat_param" function. In the definitions of these statistics, $\hat{Y}_i$ is the response predicted by the model, $Y_i$ the observed response, $\bar{Y}$ the mean of the observed response, $\bar{\epsilon}$ the mean of the difference between the predicted and the observed response, and $n$ the sample size. Graphical analysis of residuals can be performed with the "extract.model" command, as illustrated in the applied example.
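A few of these statistics can be sketched in base R using common textbook definitions. These definitions are assumptions for illustration: sign and scaling conventions vary between authors, and the exact formulas used by "stat_param" should be checked in the package documentation. The cars data set is again only an example.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

obs <- cars$dist     # observed response
pred <- fitted(fit)  # predicted response

# Assumed textbook definitions (may differ from those in stat_param)
mbe <- mean(pred - obs)                  # mean bias error
mae <- mean(abs(pred - obs))             # mean absolute error
mse <- mean((pred - obs)^2)              # mean squared error
ef <- 1 - sum((obs - pred)^2) /
  sum((obs - mean(obs))^2)               # modeling efficiency
crm <- (sum(obs) - sum(pred)) / sum(obs) # coefficient of residual mass

print(c(MBE = mbe, MAE = mae, MSE = mse, EF = ef, CRM = crm))
```

For a least-squares fit with an intercept, the mean bias error is zero up to rounding and the modeling efficiency equals R², so these two statistics are more informative for nonlinear models or for predictions on new data.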

General information
The package has 44 regression analysis functions, which can also be accessed using the "regression" function and defining the "model" argument according to the requested regression model (Table 1). This function has the simple linear model (model="LM1") by default, as follows:
> data("aristolochia")
> with(aristolochia, regression(trat, resp, model="LM1"))
For more information, access the documentation for the function ("?regression").
In all analysis functions, the first two arguments are mandatory, representing the independent variable and dependent variable, respectively. In the case of polynomial functions (LM and LM_i), the argument "degree" defines the polynomial degree, while for logistic functions and some exponentials, such as logistic, LL, BC, CD, GP, weibull, lorentz, and MM, the argument "npar" sets the number of parameters.
Figures 1A and B show the plots of the simple linear regression analysis and the Brain-Cousens four-parameter logistic model, respectively. Curves can be joined in a single plot with the "plot_arrange" function (Figure 1C), which requires a list with the outputs of each analysis as its only mandatory argument.
Finally, the package provides essential information for selecting regression models, such as AIC and BIC. For both statistical criteria, a lower value indicates a preferable model. BIC differs from AIC only in the second term of the equation, which depends on n. Thus, as n increases, BIC favors simpler models (fewer parameters), which is why the AIC and BIC indices sometimes disagree (Archontoulis and Miguez, 2015). In addition, the coefficient of determination (R²) is also returned, for which values close to 1 are desirable. In the case of linear models ("LM" function), attention should be paid to the problem of multicollinearity, which is evaluated by the VIF in the function and should be less than 5 or 10 according to Myers and Montgomery (2002) and Petrini et al. (2012), respectively. The information can be summarized in a table using the "comparative_model" function, supplied with a list of the objects returned by each analysis function.

Applied example
To exemplify and guide the use of the AgroReg package and the interpretation of its results, an applied example with the dataset "granada" was included. The first step of any statistical analysis is to study descriptive exploratory information, obtaining, for example, position measures such as mean, median, maximum, and minimum, and measures of dispersion such as variance and standard deviation. In the case of regression analyses, the results must also be visualized graphically in advance (Archontoulis and Miguez, 2015) because, with such information, it is possible to identify patterns and thus target specific models, avoiding unnecessary processes and clearing the path to a biologically acceptable explanation. The importance of this stage is illustrated by the dataset known as Anscombe's quartet, proposed by the statistician Francis Anscombe in 1973, who observed that four datasets produced identical fitted regression lines and coefficients; however, when viewed graphically, they revealed surprisingly different patterns of covariation between x and y.
In the case of the "granada" dataset, exploratory plots were generated using the "Nreg" function (Figure 2A and B). The dataset exhibited visibly low variability and a sharp rise up to 1000 min with subsequent stability. This behavior resembles exponential models or models that behave similarly, such as Michaelis-Menten, logistic, or Mitscherlich models. Segmented models, such as linear-linear, linear-plateau, and quadratic-plateau models, are also used to explain this behavior. The appearance of the curves fitted with these functions is shown in Figure 3.
After obtaining the model, the next step is the analysis of the residuals. In AgroReg, this analysis can be performed graphically, as follows:
> a = with(granada, asymptotic_neg(time, WL))
> extract.model(a, type = "qqplot")
Based on the theoretical quantile plots (Figure 4A-J), all models presented points close to the normal distribution line, even though some models fitted better, such as the three-parameter log-logistic and the four-parameter Brain-Cousens models. Table 2 presents the statistical parameters of each model used in Figure 4A-J. In addition, the package also implements bar graphs that facilitate the visualization of the model choice parameters (Figure 5). Thus, in this example, the biexponential model had the lowest AIC (307.46) and BIC (318.44) values and was among the models with the lowest RMSE and highest R², in addition to presenting all coefficients significant by the t-test (p < 0.05). However, although some models are statistically more adequate, almost all the models used could be applied to explain the behavior observed in this study.
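The model selection logic of this example can be sketched in base R. The sketch does not use the "granada" data or the AgroReg functions: the data are simulated under an assumed asymptotic curve with arbitrary parameter values, and the three candidate models are chosen only to show how AIC, BIC, and RMSE are compared and how a normal Q-Q plot of residuals is drawn.

```r
# Simulated response that rises sharply and then stabilizes (assumed values)
set.seed(1)
time <- rep(seq(0, 2000, by = 200), each = 3)
wl <- 30 * (1 - exp(-0.004 * time)) + rnorm(length(time), sd = 0.5)
dat <- data.frame(time = time, wl = wl)

# Candidate models: two polynomials and one asymptotic exponential
fits <- list(
  linear = lm(wl ~ time, data = dat),
  quadratic = lm(wl ~ time + I(time^2), data = dat),
  asymptotic = nls(wl ~ a * (1 - exp(-b * time)), data = dat,
                   start = list(a = max(dat$wl), b = 0.005))
)

# Comparison table: lower AIC, BIC, and RMSE indicate a preferable model
comparison <- data.frame(
  model = names(fits),
  AIC = sapply(fits, AIC),
  BIC = sapply(fits, BIC),
  RMSE = sapply(fits, function(f) sqrt(mean(residuals(f)^2)))
)
print(comparison)

best <- comparison$model[which.min(comparison$AIC)]

# Normal Q-Q plot of the residuals of the selected model
res <- as.numeric(residuals(fits[[best]]))
qqnorm(res)
qqline(res)
```

Because the data were generated from the asymptotic curve, that model is expected to show the lowest criteria; with real data, the statistical ranking should still be weighed against biological plausibility, as noted above.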

Figure 1 - Exemplification of the output of a linear (A) and Brain-Cousens logistic (B) function and union of the curves in a plot (C) using the functions from the AgroReg package for the "aristolochia" dataset of the germination of seeds of Aristolochia elegans depending on the temperature.

Figure 2 - Visualization of all observations by scatter plot (A) and mean and standard deviation (B) for the "granada" dataset. WL = weight loss, Time (min) = Time after pomegranate peel begins to dry. The function returns a "not significant" label, as it can be used to represent the absence of a trend when joining plots.

Figure 3 - Regression plot with the ten regression models used to exemplify the commands in the AgroReg package for the "granada" dataset. WL = weight loss, Time (min) = Time after pomegranate peel begins to dry.

Table 1 - Functions, descriptions, mathematical models, and applications of the models implemented in the AgroReg package.