RANDOM FOREST MODEL TO PREDICT THE HEIGHT OF EUCALYPTUS

Eucalyptus (Eucalyptus urograndis) production has advanced significantly over the past few years in Brazil, especially with regard to acreage and productivity. Machine learning has also made significant advances in a wide range of fields in the agrarian sciences. In this context, this study aimed to use physicochemical variables of the soil as well as climatic and dendrometric variables of eucalyptus to predict its height using the random forest algorithm. The study was conducted in the municipality of Três Lagoas, in Mato Grosso do Sul, Brazil. The original database consisted of 49 soil physicochemical variables collected at 0–0.20 m and 0.20–0.40 m, two dendrometric and four climatic variables, and one response variable related to the height of eucalyptus. A correlation matrix was applied to select variables. Modeling was then performed using the random forest algorithm, which performed well (r = 0.98, R² = 0.96) in predicting the height of eucalyptus. Overall, the most important variables for predicting eucalyptus height were diameter at breast height (DBH), phosphorus content (P1) and gravimetric moisture (GM1) at a soil depth of 0.00–0.20 m, and exchangeable aluminum content (Al2) at a soil depth of 0.20–0.40 m.


INTRODUCTION
High demand for wood for different purposes (sawmill, lamination, charcoal, and cellulose) has led to a substantial increase in the area of planted forests. Consequently, it has contributed to the national economy by generating employment (direct and indirect) and income in primary and secondary sectors (Pichelli & Soares, 2019). In 2019, the total planted forest area in Brazil was 10 million hectares (IBGE, 2019). Of this, eucalyptus cultivation contributed 77% (6.97 million hectares), with an average productivity of 35 m³ ha⁻¹ per year (IBÁ, 2020).
Predictive models for eucalyptus growth have great potential to expand cultivated areas by aiding the decision-making process for regions where species adaptation and growth conditions are not well established (Santos et al., 2017). Thus, predictive modeling can be useful for assessing growth and production in eucalyptus-cultivated areas to support the management of forests (Castro et al., 2016). According to Scolforo et al. (2013), choosing the ideal management practice for each cultivated forest area is crucial for successful actions.
The total height of trees in forest inventories is important as it is strongly correlated with wood volume (Campos et al., 2016). However, its estimation is a long-term process that demands financial resources and is subject to errors (Souza et al., 2016; Vendruscolo et al., 2015). In this context, hypsometric models that express the relationship between tree diameter and height are commonly used to predict tree height (Martins et al., 2016). However, many factors influence hypsometric relations, including age, edaphoclimatic conditions, cultivar, management system, competition status, and productive capacity (Campos & Leite, 2013; Finger, 1992). In addition, it is difficult to find the right relationship between diameter and height. This is because tree trunks have portions that are not usable or are uneven, which leads to overestimation of diameter and underestimation of height; the opposite can also hold true (Ferraz Filho et al., 2018; Hess et al., 2014; Martins et al., 2016).
Machine learning algorithms are a promising approach in a wide range of fields in the agrarian sciences (Marçal et al., 2021; da Silva et al., 2021; Tavares et al., 2018); of these, the random forest (RF) algorithm is considered to be one of the most accurate methods (Biau, 2012; Wang et al., 2016). It is a supervised ensemble learning method that combines a set of regression trees, each grown on a different bootstrap sample of the training data, to predict the value of a given variable, and expresses the final result as the mean of the predictions of the individual trees (Breiman, 2001). RF is advantageous because of its high processing speed, easy implementation, high precision, and ability to handle a large number of input variables without overfitting (Biau, 2012).
Considering the difficulty in predicting the stem height in eucalyptus plantations using traditional methods and the high predictive potential of the RF model, this approach can be applied to predict eucalyptus growth using correlated variables. The objective of this study was to predict the growth of eucalyptus via the RF machine learning model, using physicochemical soil components with climatic and dendrometric variables.

Experiment Location
The experiment was conducted in Três Lagoas (20°27′ S, 52°29′ W), which is a municipality in the state of Mato Grosso do Sul, Brazil. According to the Köppen-Geiger climate classification system (Köppen & Geiger, 1928), the climate belongs to class Aw, characterized by rainy summers and dry winters. The region has a mean annual rainfall of 1,300 mm and a mean temperature of 23.7 °C. According to the Brazilian system of soil classification (Santos et al., 2018) and the Soil Taxonomy System (Soil Survey Staff, 2014), the soil of the experimental area is Neossolo Quartzarênico and Entisols Quartzipsamments, respectively.

Description and history of the experimental area
Fifty years ago, the experimental area was cultivated with degraded pasture. Since 2013, it has been cultivated with Eucalyptus urograndis. The present study was conducted over the crop years of 2014-2015.

Analyzed variables
The following dendrometric variables were assessed: individual height of the eucalyptus trees (HGT), diameter at breast height (DBH), and wood volume (VOL). Data on tree height were collected using a 5 m graduated ruler; DBH data were collected at a height of 1.30 m from the soil using a digital caliper. Individual VOL was estimated on standing trees, because felling was ruled out: the experimental area was located in a privately owned commercial area. Huber's formula was therefore used to estimate VOL. It assumes that the mean cross-sectional area of a tree section occurs at its midpoint; this assumption does not always hold, so its accuracy is intermediate (Campos & Leite, 2013). In this scenario, a form factor of 0.4 was used to correct individual wood volume, assuming that the stem was not a perfect cylinder (Oliveira et al., 2009). Huber's formula, adapted and described by Péllico Netto (2004), is the product of the cross-sectional area at the midpoint of the section and the section length, and VOL was determined using [eq. (1)].

$$\mathrm{VOL} = \left[\mathrm{DBH}^{2} \cdot \frac{3.14}{4} \cdot \mathrm{HGT}\right] \cdot 0.4 \qquad (1)$$

Where: VOL is the wood volume (m³); DBH is the diameter at breast height (m), and HGT is the tree height (m).
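As a worked illustration of eq. (1), the following minimal Python sketch computes the volume of a single standing tree. It is not the authors' code: the function name `stem_volume` and the example tree (DBH of 0.10 m, height of 6 m) are hypothetical, and the sketch uses `math.pi` where the text rounds to 3.14, so results differ slightly from a hand calculation with eq. (1).

```python
import math

FORM_FACTOR = 0.4  # form factor reported in the text


def stem_volume(dbh_m: float, height_m: float, form_factor: float = FORM_FACTOR) -> float:
    """Wood volume (m^3) as in eq. (1): basal area x height x form factor.

    Assumption: math.pi is used instead of the rounded 3.14 of eq. (1).
    """
    if dbh_m < 0 or height_m < 0:
        raise ValueError("DBH and height must be non-negative")
    basal_area = (math.pi / 4.0) * dbh_m ** 2  # cross-sectional area at 1.30 m
    return basal_area * height_m * form_factor


# Hypothetical tree, for illustration only: DBH = 0.10 m, height = 6 m
example_volume = stem_volume(0.10, 6.0)
print(f"VOL = {example_volume:.4f} m^3")
```

The form factor is kept as a parameter so that other values can be tried without changing the function.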
Temperature and rainfall in the experimental area were monitored using an automatic meteorological station located in the municipality of Três Lagoas-MS, which was ~50 km from the experimental area. The obtained data enabled the evaluation of climatic conditions during the experimental period.

Data mining
For each crop year, 150 plants were sampled and soil samples were collected around each tree. This process was carried out for two crop years, totaling 300 observations. The original database consisted of 49 physicochemical variables of the soil collected at two depths (0-0.20 m and 0.20-0.40 m), two dendrometric and four climatic variables, as well as one response variable related to the height of Eucalyptus urograndis (Table 1). Because the covariance between two variables measures how they vary together, a correlation matrix was established to verify the simple linear correlations for all pairwise combinations of the variables contained in the database. Positive correlations were expressed through blue shading: the more intense the blue, the stronger the positive correlation; in contrast, the more intense the red, the stronger the negative correlation. Null correlations were expressed by the absence of color. The aim was to select only variables that could contribute to the model. Variables with null variance or high correlation with each other were eliminated. In the case of two highly correlated variables, one was randomly retained and the other was eliminated because it added no further information to the model.
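The selection procedure described above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' R implementation: the function names `pearson` and `select_variables`, the 0.70 threshold applied to the absolute correlation, and the toy values are assumptions. Unlike the text, which retains one variable of a correlated pair at random, the sketch deterministically keeps the first one it encounters.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant variable
    return sxy / math.sqrt(sxx * syy)


def select_variables(columns, threshold=0.70):
    """Return the names of the variables kept after correlation filtering.

    columns: dict mapping variable name -> list of observations.
    Step 1 drops variables with null variance; step 2 keeps a variable only
    if its absolute correlation with every already-kept variable is below
    the threshold (assumed 0.70 here).
    """
    candidates = [name for name, values in columns.items() if len(set(values)) > 1]
    kept = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data with hypothetical values; only the variable names echo the text.
toy = {
    "GM1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GM2": [2.0, 4.0, 6.0, 8.0, 10.5],   # almost collinear with GM1
    "PR1": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly correlated with GM1
    "CONST": [7.0, 7.0, 7.0, 7.0, 7.0],  # null variance
}
selected = select_variables(toy)
print(selected)
```

In this toy example the constant column and the near-duplicate of GM1 are removed, which mirrors the two elimination rules in the text.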
The RF algorithm (Breiman, 2001) was applied to build the predictive model for the height of the eucalyptus plants. It consists of a set of combined decision trees that can be used for classification and regression problems. Each decision tree was built using a random sample of the initial data, and each split used a random subset of attributes from which the most informative one was selected (Breiman, 2001; Hastie et al., 2001).
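The two ensemble ingredients mentioned above, bootstrap sampling of the rows and averaging of the individual trees, can be illustrated with a minimal Python sketch. The helper names `bootstrap_indices` and `forest_predict` are hypothetical, and the "trees" below are constant stand-in functions used only to show the averaging step; they are not fitted models.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def forest_predict(trees, x):
    """Regression forest output: the mean of the individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
sample = bootstrap_indices(10, rng)  # some rows repeat, others are left out

# Stand-in "trees" returning fixed heights (hypothetical values, in m)
toy_trees = [lambda x: 5.0, lambda x: 6.0, lambda x: 7.0]
toy_prediction = forest_predict(toy_trees, x=None)
print(sample, toy_prediction)
```

Because each tree sees a different bootstrap sample, the individual predictions differ, and averaging them is what reduces the variance of the ensemble.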
In data mining applications, input predictor variables differ in relevance. Often, few variables have a substantial influence on the response, and most are irrelevant and are discarded. In this context, it is useful to learn the relative importance or contribution of each input variable to predict the response. Each tree was trained on a bootstrap sample, and the optimal variables at each split were identified from a random subset of all variables. The selection criteria for classification and regression problems were the Gini index and variance reduction, respectively (Hastie et al., 2009).
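For regression, the split criterion named above (variance reduction) can be sketched as follows. This is a simplified illustration under stated assumptions, not the routine used by the R software: the function names, the toy rows, and the choice of midpoints between sorted values as candidate thresholds are all hypothetical. In a real forest the number of variables tried at each split is smaller than the total; here all variables are drawn so that the toy result does not depend on the random draw.

```python
import random


def variance(values):
    """Population variance of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(rows, targets, feature_ids):
    """Return (feature, threshold, reduction) with the largest variance reduction."""
    parent = variance(targets)
    n = len(targets)
    best = (None, None, 0.0)
    for f in feature_ids:
        values = sorted(set(r[f] for r in rows))
        for low, high in zip(values, values[1:]):
            thr = (low + high) / 2.0  # candidate threshold: midpoint
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            children = (len(left) * variance(left) + len(right) * variance(right)) / n
            reduction = parent - children
            if reduction > best[2]:
                best = (f, thr, reduction)
    return best


# Hypothetical toy data: column 0 = DBH (m), column 1 = an uninformative variable
rows = [(0.05, 3.0), (0.06, 1.0), (0.11, 2.0), (0.12, 4.0)]
heights = [4.0, 4.5, 8.0, 8.5]  # target: tree height (m)

rng = random.Random(1)
n_features = len(rows[0])
mtry = n_features  # assumption for the toy case; usually mtry < n_features
subset = rng.sample(range(n_features), k=mtry)  # random subset of variables
feature, threshold, gain = best_split(rows, heights, subset)
print(feature, threshold, gain)
```

In the toy data the split on column 0 separates the short trees from the tall ones, so it yields the largest reduction in variance.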
The RF algorithm was implemented in the R program (R Core Team, 2017), while model validation was conducted via the hold-out method, in which 70% of the data were used for training and 30% for testing. The results were expressed graphically through a regression, and the final prediction was the mean of the predictions of all regression trees (Breiman, 2001). The model performance was assessed through the correlation between the observed and estimated values and through the coefficient of determination (R²), given by the ratio between the regression sum of squares (SSR) and the total sum of squares (TSS), using the following equation:

$$R^{2} = \frac{\mathrm{SSR}}{\mathrm{TSS}} = \frac{\sum_{i} (\hat{Y}_i - \bar{Y})^{2}}{\sum_{i} (Y_i - \bar{Y})^{2}} \qquad (2)$$

Where: R² = coefficient of determination; Yi = observed value of the dependent variable; Ŷi = estimated value of the dependent variable, and Ῡ = mean of the dependent variable.

Figure 1B illustrates the selection of the variables using a correlation matrix. Of the 49 predictive variables in the original database, 29 (59%) were eliminated.

FIGURE 1. (A) Correlation matrix before the selection of variables and (B) variables selected through the correlation matrix. Variables with null variance or high correlation with each other were eliminated. PR, GM, BD, PD, SAND, CLAY, P, OM, pH, K, Ca, H.Al, Al, and CEC represent soil penetration resistance, gravimetric moisture, bulk density, particle density, sand content, clay content, phosphorus content, organic matter, soil pH, potassium content, calcium content, potential acidity, aluminum, and cation-exchange capacity, respectively.
The variables PR1, PR2, GM1, GM2, BD2, PD1, PD2, SAND1, SAND2, CLAY2, P1, OM1, OM2, pH2, K1, K2, Ca1, H.Al1, H.Al2, Al2, and CEC2 were used in the predictive model based on the RF algorithm. It must be emphasized that all selected variables are related to the physicochemical attributes of the soil.
Considering that the correlation matrix is a symmetric matrix, that is, Corr[X, Y] = Corr[Y, X], Figure 2 shows the values of the correlations between each pair of variables in the upper triangle. The lower triangle shows the joint distribution of each pair of variables, while the main diagonal represents the correlation of a variable with itself. The analysis of data dispersion, frequency distribution, and Pearson's correlation coefficient confirmed that variables with null variance and strongly correlated variables were fully excluded and that only those with a correlation coefficient below 0.70 were retained.
FIGURE 2. Variables selected to be used in the Random Forest (RF) classification model. PR, GM, BD, PD, SAND, and CLAY represent soil penetration resistance, gravimetric moisture, bulk density, particle density, sand content, and clay content, respectively. Numbers 1 or 2 along with each attribute refer to sampling layers at a soil depth of 0.00-0.20 m and 0.20-0.40 m.
FIGURE 2 (continuation). Variables selected for use in the Random Forest (RF) classification model. P, OM, pH, K, Ca, H.Al, Al, and CEC represent phosphorus content, organic matter, soil pH, potassium content, calcium content, potential acidity, aluminum, and cation-exchange capacity, respectively. Numbers 1 or 2 along with each attribute refer to sampling layers at a soil depth of 0.00-0.20 m and 0.20-0.40 m.
Given the high level of correlation between DBH and certain soil and climatic variables, DBH was discarded in the random variable selection process through the correlation matrix (Figure 1 A and B). However, considering its high influence on eucalyptus height prediction, DBH was reincorporated into the database. This resulted in a hybrid approach, wherein the DBH variable was added to the set of variables previously selected through formal methods.
DBH was the most important variable for predicting eucalyptus height, reaching the maximum value of importance in the predictive process (100%, normalized by the attribute with the highest contribution). It was followed by P1, Al2, and GM1, with degrees of importance ranging between 15 and 19% (Figure 3). The results obtained by validating the RF model showed a high predictive capacity: the correlation between predicted and observed values was 0.98, with R² equal to 0.96 (Figure 4). The regression plot revealed the formation of two data clusters, that is, above 6 and below 6, which were related to the first and second crop years. This revealed significant potential of the RF model in predicting the height of eucalyptus using physicochemical variables of the soil and DBH. The results obtained in this study were superior to those obtained by da , who used different machine learning algorithms (based on spectral indices) to predict the eucalyptus total height, and obtained a correlation coefficient of 0.79.
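The two reported statistics are mutually consistent if R² is read as the square of the Pearson correlation between observed and predicted heights, since 0.98² = 0.9604 ≈ 0.96. The Python sketch below shows one way to compute both quantities. It is an illustration under that assumption: the function names and the observed and predicted values are hypothetical, not data from the study.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def r_squared(observed, predicted):
    """R² taken as the squared correlation (assumption stated in the lead-in).

    This equals SSR/TSS for a least-squares line fitted through the
    observed-versus-predicted scatter.
    """
    return pearson_r(observed, predicted) ** 2


# Hypothetical test-set heights (m), for illustration only
observed = [4.0, 4.5, 8.0, 8.5]
predicted = [4.2, 4.4, 7.9, 8.6]

r = pearson_r(observed, predicted)
r2 = r_squared(observed, predicted)
print(f"r = {r:.3f}, R2 = {r2:.3f}")
```

When the data form two well-separated clusters, as reported for the two crop years, a high correlation can be driven largely by the separation between the clusters, so inspecting the fit within each cluster is also informative.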
Several national and international studies have established the efficiency of the RF algorithm in other predictive analyses, with emphasis on its performance relative to other data analysis techniques (Chen et al., 2018; Parmar et al., 2020; Singh et al., 2017). For instance, in a study to predict the basal area and volume in eucalyptus stands using Landsat TM data in Brazil, dos Reis et al. (2018) observed that RF performed best among the methods compared, which included multiple linear regression, support vector machine, and artificial neural network. Likewise, to classify the growth of five species of eucalyptus and Corymbia citriodora, de Oliveira et al. (2021) reported that the RF algorithm using 24 features was the most accurate (0.76), as compared to other algorithms (0.66).
This study was able to correctly classify data for different height classes, which were induced by the formation of two clusters between the sampling of the first and second crop years. de  focused on the classification of eucalyptus species based on their growth (total height and diameter at breast height) and revealed that the development of eucalyptus trees over time induced changes in clusters. Raudys & Jain (1991) indicated that a small sample size reduced statistical power for pattern recognition. In this study, the model performed well due to the higher number of observations. The purpose of machine learning algorithms is to learn from the data (Mahesh, 2020). Therefore, the quality of training data has a significant impact on the efficiency, accuracy, and complexity of machine learning tasks (Gupta et al., 2021).
In practice, data are usually split randomly into training and test datasets, in proportions ranging from 70-30 to 80-20, respectively (Dangeti, 2017). This division is necessary to obtain greater reliability of the generated model (Camilo & Silva, 2009). In our study, 70% of the data were used for training and 30% for testing. This was a consistent way to validate the performance of the machine learning model because a portion of the data was separated before developing the model and used only for validation (Vabalas et al., 2019). In addition, the method used to divide the training and test sets is critical. Therefore, the representativeness of the original dataset in the samples should be maintained to make the model more efficient and reliable.
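A random 70-30 hold-out split of the 300 observations can be sketched as follows. This is a minimal Python illustration rather than the authors' R procedure; the function name `holdout_split` and the seed value are assumptions.

```python
import random


def holdout_split(n_rows, train_fraction=0.70, seed=42):
    """Shuffle row indices and split them into training and test subsets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


# 300 observations, as in the study (150 trees in each of two crop years)
train_idx, test_idx = holdout_split(300)
print(len(train_idx), len(test_idx))
```

A purely random split does not guarantee that both crop years are represented in the same proportion in each subset. If that representativeness matters, the same function could be applied separately to the rows of each crop year and the results merged, which is a design choice not described in the text.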
When selecting model parameters, datasets with a finite number of training samples require closer attention, including the number of variables used in decision making (Raudys & Jain, 1991). According to Speiser et al. (2019), the prediction efficiency of the model can be improved through variable selection techniques by identifying a subset of predictor variables to be included in a final, simpler model. Although the RF algorithm helps rank variables based on their predictive importance, it is difficult to distinguish relevant from irrelevant variables based only on this ranking (Degenhardt et al., 2019). Our results indicated that the correlation matrix, the variable selection method used in our study, was highly efficient in selecting a minimum dataset capable of representing the variability of the height of eucalyptus. This was supported by the high values of the correlation and determination coefficients between the predicted and observed data during model validation. These results were consistent with the findings of Everingham et al. (2016), who investigated the accuracy of RF to explain annual variation in sugarcane productivity, and observed that the variable selection process reduced the number of predictor variables in each model and improved the forecast performance.
Soil is an important component for wood production because it is responsible for water and nutrient supply to the plants (Bini et al., 2013). Lima et al. (2010) emphasized that the growth of eucalyptus can be strongly influenced by certain physicochemical characteristics of the soil. In the present study, the most important predictors of height were DBH, P1, Al2, and GM1. Azevedo et al. (2015) used genetic selection in Eucalyptus camaldulensis progenies in the savanna area of Mato Grosso State, Brazil, and reported a high correlation (r = 0.72) between DBH and plant height. However, Taylor et al. (2016) pointed out the absence of a linear relationship between height and DBH. For most species, the variation in height increased with increasing diameter, which led to precision problems in linear regression equations designed to estimate the growth of trees.
During early eucalyptus development, phosphorus (P) is directly related to wood productivity; in addition, its highest absorption rates occur during the second year of the tree, that is, at canopy closure (Barros et al., 2000; Melo et al., 2015). Graciano et al. (2006) and Fontes et al. (2013) pointed out that P is the most essential nutrient at the early development stage and for eucalyptus wood productivity. Lack of P in the soil leads to a nutritional imbalance in plants and to irreversible losses in final wood production.
High concentration of aluminum (Al) in the soil reduces the development of roots and diminishes nutrient absorption (Miguel et al., 2010). Although eucalyptus is more tolerant to exchangeable Al than annual crops (Silva et al., 2012), Brazilian forest activity is usually implemented in sandy and low-fertility soils, often with high levels of toxic elements, with emphasis on aluminum (Basso et al., 2007;Guimarães et al., 2015).
As eucalyptus is a fast-growing species, it has high energy expenditure, which leads to higher water consumption (Vital, 2007). Thus, any variation in the water supply to the crop directly affects plant growth and productivity. Jung et al. (2017) indicated that decreased soil water content reduces the plant water potential, which directly affects growth in terms of height and diameter, due to reduced cell expansion and cell wall formation. In addition, lower availability of carbohydrates influences the production of plant hormones. Melo Neto et al. (2017) carried out a study on eucalyptus cultivation and verified a high variability in the mean soil moisture, especially in the superficial layer; homogeneity was observed between 30 and 100 cm of soil depth. They concluded that a steeper moisture reduction in these layers was due to: i) quick response to rain events, ii) demand for soil evaporation being met, and iii) greater exploitation by the root system of the plants.
In addition, the surface layer had higher accumulation of organic matter, which preserved the soil structure and contributed to a higher water flow, both in terms of depth and width. Consequently, this increased the variability of soil moisture (Melo Neto et al., 2017).
Overall, this study provides promising results for forest management purposes, as it offers producers and technicians guidelines to carefully plan the viability of new production areas by allowing the height of the crop to be estimated based exclusively on physicochemical attributes of the soil and by identifying areas with high or low production potential. Moreover, the model allows predictions to be made for previously established stands for the purpose of forest inventories, considering the high cost of direct measurements of eucalyptus height as well as the difficulty of obtaining them in the field.

CONCLUSIONS
The random forest (RF) model generated in our study performed well (r = 0.98 and R² = 0.96) in predicting the height of eucalyptus using physicochemical variables of the soil and diameter at breast height (DBH). Therefore, this method can be used to support the decision-making process in the management of eucalyptus plantations.
The most important variables for predicting eucalyptus height were DBH, phosphorus content (P1) and gravimetric moisture (GM1) at a soil depth of 0.00-0.20 m, and exchangeable aluminum content (Al2) at a soil depth of 0.20-0.40 m.