ALTERNATIVES TO ESTIMATE THE VOLUME OF INDIVIDUAL TREES IN FOREST FORMATIONS IN THE STATE OF MINAS GERAIS, BRAZIL

The objective of this study was to compare different alternatives to estimate the stem volume of individual trees in four forest formations in the state of Minas Gerais, Brazil. The data were obtained in a forest inventory carried out by the Minas Gerais Technological Center Foundation. The stem volumes were computed by the Smalian formula up to an outside bark diameter of 4 cm. The outside bark volumes, diameters (DBH) and total heights were used to fit a Schumacher and Hall equation for each forest formation, considering the structures of linear fixed and mixed models. Next, 100 Multilayer Perceptron artificial neural networks (ANN) were trained in a supervised manner. In addition, we evaluated eight support-vector machine regression (SVMR) configurations. The criteria used to evaluate the performance of all the alternatives were: the correlation between the observed and estimated volumes, the root-mean-square error and the frequency distribution by percentage relative error class. After the analyses, all the alternatives were found to be suitable for estimating the volume of individual trees in the different forest formations. Although the alternatives presented close statistics in the validation process, the graphical analysis of the error distribution showed greater precision of the estimates of the mixed linear models for the four formations. Given the results, it is concluded that there is no absolute superiority of one alternative over the others, and that all of them should be evaluated to find the one which best describes or explains the dataset.

1Amapa State University, Macapá, Amapá, Brazil. ORCID: 0000-0001-9273-7533a, 2Federal University of Viçosa, Viçosa, Minas Gerais, Brazil. 3Federal University of Espirito Santo, Department of Forestry Engineering, Jerônimo Monteiro, Espírito Santo, Brazil


INTRODUCTION
One of the main purposes of forest inventories is to estimate tree volumes (Machado and Figueiredo Filho, 2009), which requires defining the unit in which the volume is to be expressed, the minimum tree inclusion diameter and the way of obtaining the estimates (Soares et al., 2011). There are several methods to estimate tree volume, among them: form factors, form quotients, volume equations, and multiple-volume or taper equations (Burkhart and Tomé, 2012).
Volume equations are commonly used to estimate the volume of trees in forest inventory procedures due to the accuracy of their estimates. They are an expression in which the wood volume is presented as a function of other quantities or variables of the tree (usually the diameter at breast height (DBH) and the height), which can be directly measured or estimated by non-destructive means (Campos and Leite, 2017). However, the volume equations in tropical native forests are fitted considering all species, thus decreasing their accuracy due to the data heterogeneity. This procedure is common due to a lack of trees of the species in all diameter classes, in most cases preventing the fit of specific volume equations for each of them.
In addition to the above-mentioned methods, tree volumes can be estimated through the use of artificial neural networks (ANN) (Silva et al., 2009; Görgens et al., 2014; Souza et al., 2018), mixed models (Hall and Clutter, 2004; Gouveia et al., 2015) and support-vector machine regression (SVMR) (Cordeiro et al., 2015; Binoti et al., 2016), which have mainly been explored in plantations of commercial species. However, these methodological alternatives are still little used for estimating tree volume in native forests in Brazil. There is therefore a gap which can be filled by studies in different natural forest typologies or biomes, where it is difficult to estimate the volume of individual trees with precision due to the great heterogeneity of species, as well as of the sizes and shapes of the stems found in these environments.
Machine learning methods can reduce most of the data heterogeneity, as this previously mentioned information can be entered into the model as categorical variables and improve the model accuracy. In addition, this information can be considered as a random effect in a mixed model structure, generating several equations depending on the random parameters.
In view of the above, the hypothesis of this study is that machine learning (ANN and SVMR) methods and the inclusion of random effects in the linear mixed models improve the volume estimate precision in comparison to the Schumacher and Hall model in different forest formations in Minas Gerais state, Brazil.

Database
The data used in this study came from the forest inventory carried out by the Minas Gerais Technological Center Foundation in four forest formations in the State. The following is a brief description of the studied formations based on Cetec (1995).

Cerrado sensu stricto
This formation comprises the cerrado "sensu stricto", meaning that it is characterized by typical cerrado vegetation, with a predominance of tree-shrub individuals which, as a rule, have tortuous stems, thick bark and a predominant height of 4 to 5 m. This formation has a wide geographical distribution, occurring significantly in the northwest, north, Jequitinhonha and central-north regions of the state.

Primary forest
Primary forests are evergreen or semi-deciduous forest formations; in the present case, however, formations located on alluvial plains, along the margins of watercourses or at their sources were not considered.
They predominantly consist of arboreal elements with tall boles and large diameters, and with a significant occurrence of noble species.

Secondary forest
Secondary forest comprises the strata of capoeirão, capoeira and capoeirinha, being constituted by evergreen or semi-deciduous vegetation formations in different regeneration stages developed from cutting or burning of preexisting virgin forest. This formation comes from the sprouting of primary forest stumps, roots and the germination of previously fallen seeds on the soil. These formations are predominantly located in the south, southeast and northeast regions of the State.

Jaíba transitional forest
Jaíba transitional forest is a forest complex comprising deciduous and semi-deciduous forms and transitional forms between these and hyperxerophilous caatinga, which occurs in the Jaíba area and its surroundings. It was distinguished from similar types because it occupies a relatively large area and has very different characteristics from the forest types found in other areas of the State. Its wood yield is approximately equal to that of the mesophilic forest, at around 240 st/hectare. The municipalities with the highest occurrence of these formations are: Manga, Itacarambi, Januária and Varzelândia.
The number of sample trees in each formation was established in proportion to the distribution of the trees across the diameter classes, totaling 1479 trees, namely: 414 in the Cerrado sensu stricto; 266 in Primary forest; 448 in Secondary forest; and 351 in Jaíba transitional forest.
In the rigorous cubing process, every sample tree was identified to species, and the following were recorded: the diameter at 1.30 m of height (DBH), the total height, the number of branches, and the outside bark diameters along the stem. The stem volumes of individual trees were obtained by successive application of the Smalian formula (Soares et al., 2011), considering sections of 1 m length and a minimum commercial outside bark diameter of 4 cm. Descriptive statistics were computed for all variables by forest formation.
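The successive application of the Smalian formula can be illustrated with a minimal sketch. It assumes diameters measured at both ends of each 1 m section and stops at the first section whose upper diameter falls below the 4 cm limit; the function names and the diameter values are hypothetical and are not taken from the inventory data.

```python
import math

def basal_area_m2(diameter_cm):
    """Cross-sectional area in m^2 for a diameter given in cm."""
    return math.pi * diameter_cm ** 2 / 40000.0

def smalian_volume(diameters_cm, section_length_m=1.0, min_diameter_cm=4.0):
    """Sum Smalian section volumes, V = (A_lower + A_upper) / 2 * L,
    stopping when the upper diameter drops below the minimum diameter."""
    total = 0.0
    for d_lower, d_upper in zip(diameters_cm, diameters_cm[1:]):
        if d_upper < min_diameter_cm:
            break
        mean_area = (basal_area_m2(d_lower) + basal_area_m2(d_upper)) / 2.0
        total += mean_area * section_length_m
    return total

# Hypothetical outside bark diameters (cm) taken every 1 m along one stem.
diameters = [20.0, 18.0, 15.0, 11.0, 6.0, 3.0]
volume = smalian_volume(diameters)
print(f"stem volume = {volume:.4f} m^3")
```

In this sketch the last section (6 cm to 3 cm) is discarded entirely; how the portion between the last measured diameter and the exact 4 cm point was handled in the original inventory is not stated in the text.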

Linear fixed and mixed models
Mixed models are used to model the random parts of forests by including a matrix of variances and covariance (Resende et al., 2014). In addition, this modeling approach analyzes hierarchically structured data more efficiently than other approaches, and can increase the accuracy of the estimates (Hao et al., 2015). These models have three fundamental aspects: the estimation and hypothesis testing of fixed effects, prediction of random effects and estimation of variance components.
The linear mixed model took the following form (Wu, 2009):

\[ y_i = x_i \beta + z_i b_i + e_i, \qquad b_i \sim N(0, G), \qquad e_i \sim N(0, R_i) \]

where y_i is the response vector of individual i, β are the fixed effects, b_i are the random effects, x_i is a design matrix containing the covariates of individual i, z_i is a design matrix for the random effects, e_i are the random errors, R_i is the n × n variance-covariance matrix of the within-individual measurements (R_i = σ² I, where I is an identity matrix), and G is the variance-covariance matrix of the random effects.
The Schumacher and Hall (1933) model in its linearized form was initially fitted for each forest formation using the outside bark volume, diameter and total tree height data (fixed effects), with the following functional relationship:

\[ \mathrm{Ln}(V) = \beta_0 + \beta_1 \mathrm{Ln}(dbh) + \beta_2 \mathrm{Ln}(H) + \varepsilon \qquad (1) \]

In which: Ln = Napierian logarithm; V = commercial volume of the stem including the outside bark, in m³; dbh = outside bark diameter at 1.30 m aboveground, in cm; H = total height, in m; β0 to β2 = model parameters; ε = random error.
The equations referring to model 1 were fitted using the Restricted Maximum Likelihood method with the glm2 package in the R software program (R Development Core Team, 2014).
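As an illustration of how the linearized Schumacher and Hall model can be fitted, the sketch below uses ordinary least squares in Python with NumPy. This is a swapped-in technique chosen for brevity and is not the R procedure used in the study. The coefficient values and the simulated trees are made up for the example, and the back-transformation is a plain exponential without any bias correction.

```python
import numpy as np

def fit_schumacher_hall(dbh_cm, height_m, volume_m3):
    """Fit Ln(V) = b0 + b1*Ln(dbh) + b2*Ln(H) by ordinary least squares."""
    design = np.column_stack([
        np.ones(len(dbh_cm)),   # intercept column
        np.log(dbh_cm),
        np.log(height_m),
    ])
    coef, *_ = np.linalg.lstsq(design, np.log(volume_m3), rcond=None)
    return coef

def predict_volume(coef, dbh_cm, height_m):
    """Back-transform the log-scale prediction to volume in m^3."""
    b0, b1, b2 = coef
    return np.exp(b0 + b1 * np.log(dbh_cm) + b2 * np.log(height_m))

# Made-up, noise-free data generated from assumed coefficients.
rng = np.random.default_rng(0)
dbh = rng.uniform(5.0, 60.0, size=50)
height = rng.uniform(3.0, 30.0, size=50)
true_coef = np.array([-9.8, 1.9, 1.0])   # hypothetical values
volume = predict_volume(true_coef, dbh, height)

coef = fit_schumacher_hall(dbh, height, volume)
print("estimated coefficients:", np.round(coef, 4))
```

Because the simulated volumes contain no noise, the fit recovers the assumed coefficients, which makes the sketch easy to check before it is applied to real cubing data.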
In order to verify the influence of the inclusion of random effects on the accuracy of the equations, the Schumacher and Hall (1933) model was modified considering the structure of a mixed linear model, including random intercept and slope coefficients and considering the species in each forest formation as random effects, defining the following model:

\[ \mathrm{Ln}(V_{ij}) = (\beta_0 + a_i) + (\beta_1 + b_{1i})\,\mathrm{Ln}(dbh_{ij}) + (\beta_2 + b_{2i})\,\mathrm{Ln}(H_{ij}) + \varepsilon_{ij} \qquad (2) \]

In which: the subscript ij refers to tree j of the i-th species; β0, β1 and β2 = fixed parameters of the model; a_i = random intercept for the i-th species; b_1i, b_2i = random slope coefficients for the i-th species.
The equations referring to the mixed models (model 2) were fitted by the Restricted Maximum Likelihood method using the nlme package in the R software program (R Development Core Team, 2014).
The error variances in this study were assumed to be homogeneous, since the logarithmic transformation of the data usually satisfies this assumption of the classical regression model, and the covariance of the errors was assumed to be equal to zero using longitudinal data in the analyses (Gujarati and Porter, 2011).
The effect of including the random effects on the intercept and slopes of the models was verified using the maximum likelihood ratio (MLR) test (Resende et al., 2014), in which the significance of the difference (D) between the deviances (-2log(L)) of the models with and without the random effect was verified by comparing the calculated value with the tabulated χ² value, with 1 degree of freedom and 5% significance.
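The likelihood ratio comparison described above can be sketched as follows. The log-likelihood values are hypothetical; the p-value uses the closed form of the χ² survival function for 1 degree of freedom, P(X > D) = erfc(sqrt(D/2)), so only the Python standard library is needed.

```python
import math

def likelihood_ratio_test(loglik_without, loglik_with, alpha=0.05):
    """Compare the deviances (-2 log L) of the models without and with
    a random effect, using a chi-square test with 1 degree of freedom."""
    deviance_without = -2.0 * loglik_without
    deviance_with = -2.0 * loglik_with
    d = deviance_without - deviance_with
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(max(d, 0.0) / 2.0))
    return d, p_value, p_value < alpha

# Hypothetical log-likelihoods of two nested fits.
d, p_value, significant = likelihood_ratio_test(-120.0, -115.0)
print(f"D = {d:.2f}, p-value = {p_value:.4f}, significant = {significant}")
```

A difference D larger than the tabulated value of about 3.84 corresponds to a p-value below 0.05, in which case the random effect would be retained.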
Thus, the model selected as the best model for each forest formation at the end of this modeling process could be the complete mixed linear model or a partial model, which means with the random effect only being associated with some parameters of the model, or the model still considering only the fixed effects due to the non-significance of the random effects.
The following evaluation criteria, calculated in the original unit of the dependent variable (volume, m³), were used in order to avoid personal judgments in evaluating the fits of the equations for the fixed and mixed effects models: the correlation coefficient ($r_{y\hat{y}}$) between the observed and estimated volumes, the root-mean-square error (RMSE) (Silva et al., 2009), and analysis of the relative error percentage.
CERNE ABREU, et al

Artificial neural networks and support-vector machine
ANNs are computational models inspired by the nervous system of living beings. They have the ability to acquire and maintain knowledge (based on information) and can be defined as a set of processing units, the artificial neurons, which are linked by a large number of interconnections (artificial synapses) represented by vectors or matrices of synaptic weights (Macukow, 2016).
First, 100 Multilayer Perceptron artificial neural networks (ANN) using the Backpropagation and Simulated Annealing training algorithms with the sigmoid activation function were separately trained in a supervised way for the four studied forest formations (70% of the data).
The input variables (inputs) in training the networks (ANN) were: DBH, total height (H), number of branches and species (categorical variable); while the output variable was the volume of the stem outside bark. The stop training criteria adopted for the ANNs were: root-mean-square error (<0.001) or number of cycles (equal to 3000).
Next, eight configurations formed from two error functions and four kernel functions were tested for training the support-vector machine regression (SVMR). The optimized error functions were the type I and type II functions. The type I function is given by:

\[ \min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_i + C \sum_{i=1}^{N} \xi_i^{*} \]

Subject to the following restrictions:

\[ y_i - w^{T}\phi(x_i) - b \le \varepsilon + \xi_i, \qquad w^{T}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \qquad \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1, \dots, N \]

In which: w = coefficient vector; C = error penalty parameter; ξ_i, ξ_i* = variables that characterize, respectively, the error above and below the ε-tube; i = training cases; N = total number of training cases; b = bias; y_i = output data; ε = maximum allowed error; and φ represents the kernel function applied to the input data; γ = shape parameter; d = polynomial degree.
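To make the role of the ε-tube and of the kernel parameters concrete, the sketch below implements the ε-insensitive error and two of the kernels named in the results (RBF and polynomial). The kernel expressions are the commonly used forms and are an assumption here, since the original formulas are not recoverable from the text; `coef0` is a hypothetical name for the constant term of the polynomial kernel.

```python
import math

def epsilon_insensitive_loss(y_obs, y_pred, epsilon):
    """Error is zero inside the epsilon-tube and grows linearly outside it."""
    return max(0.0, abs(y_obs - y_pred) - epsilon)

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * ||x - z||^2), with gamma as shape parameter."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(x, z, gamma, degree, coef0=0.0):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (gamma * dot + coef0) ** degree

# Hypothetical observed and predicted volumes (m^3) with a 0.1 m^3 tube.
print(epsilon_insensitive_loss(1.00, 1.05, epsilon=0.1))  # inside the tube
print(epsilon_insensitive_loss(1.00, 1.50, epsilon=0.1))  # outside the tube
print(rbf_kernel([10.0, 5.0], [12.0, 6.0], gamma=0.1))
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], gamma=1.0, degree=2))
```

Errors smaller than ε do not contribute to the objective, which is why the penalty parameter C only acts on observations lying outside the tube.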
The same input variables used for training the networks (ANN) were considered for training the eight support-vector machine regression (SVMR) configurations: DBH, total height (H), the number of branches and the categorical variable species, with the outside bark stem volume as the output variable.
The artificial neural networks (ANN) were trained in NeuroForest 4.06 (Neuroforest, 2017) and the support-vector machine regression (SVMR) configurations in the R software program (R Development Core Team, 2014). The evaluation criteria were the same as those used for the regression analysis.

Validation of alternatives
In order to compare the performance of the artificial neural networks (ANNs) and the support-vector machine regression (SVMR) configurations, 30% of the database was used as test samples, i.e. samples not used in the ANN training or the SVMR fitting. The following statistics were then calculated: the correlation coefficient between the observed and estimated volumes ($r_{y\hat{y}}$) and the root-mean-square error (RMSE) in percentage (%), together with a graphical analysis of the distribution of the frequencies per class of percentage error (Silva et al., 2009; Binoti et al., 2015).
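The three validation statistics can be sketched as below. The exact expressions used in the study are not recoverable from the text, so this example assumes that the RMSE is expressed as a percentage of the mean observed volume and that the relative error is 100 × (estimated − observed) / observed; the 10% class width and the volumes are also hypothetical.

```python
import math
from collections import Counter

def correlation(observed, estimated):
    """Pearson correlation between observed and estimated volumes."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    return cov / math.sqrt(var_o * var_e)

def rmse_percent(observed, estimated):
    """RMSE as a percentage of the mean observed volume (assumed form)."""
    n = len(observed)
    mse = sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n
    return 100.0 * math.sqrt(mse) / (sum(observed) / n)

def relative_error_classes(observed, estimated, width=10.0):
    """Count relative errors (%) per class, keyed by the class lower limit."""
    counts = Counter()
    for o, e in zip(observed, estimated):
        error = 100.0 * (e - o) / o
        counts[math.floor(error / width) * width] += 1
    return dict(sorted(counts.items()))

# Hypothetical observed and estimated volumes (m^3).
observed = [1.0, 2.0, 3.0]
estimated = [1.15, 1.9, 3.05]
print("r =", round(correlation(observed, estimated), 4))
print("RMSE% =", round(rmse_percent(observed, estimated), 2))
print("error classes:", relative_error_classes(observed, estimated))
```

Because the relative error divides by the observed volume, small trees produce large percentage errors for small absolute deviations, which is relevant when reading the error histograms.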
As there was no data separation for validating the equations for the fixed and mixed models, the comparison between these methodological alternatives and the ANN and SVMR was performed by separating the estimates in

Data description
Considering the total number of sample trees used in the analyses (1479), tree diameters (DBH) ranged from 3.80 to 66.20 cm and heights from 2.30 to 33.40 m (Table 2). The forest formations with the smallest and largest numbers of sample trees were the primary forest and the secondary forest, respectively. In terms of the number of species, the Jaíba Transitional Forest and Secondary Forest formations were those with the lowest (36) and the highest (112) numbers, respectively.

Fixed and mixed effects models
The Schumacher and Hall (1933) model equation (fixed effects only) fitted well in the four formations (Table 3), and the parameter estimates were all statistically significant (p-value < 0.05). The inclusion of the species as a random effect in the mixed model structure was significant in the four formations. It should be noted that, for the Cerrado, only the random coefficient associated with the height variable was not significant (p-value > 0.05). All three coefficients were significant for the other formations (p-value < 0.05).

Artificial neural networks (ANNs) and support-vector machine regression (SVMR)
The ANN and SVMR results by forest formation are presented in Table 4. The ANN and SVMR configurations which presented the best training statistics are provided. The Backpropagation algorithm was the best for the ANN related to the Cerrado formation. The best training algorithm for the other formations was Simulated Annealing, with 1 neuron in the hidden layer.
The best SVMR for the Cerrado in training used the RBF kernel function and the type II optimized error function (RBF-II) (Table 4). The kernel function selected as the best for the other formations was the Polynomial, also with the type II optimized error function (Polynomial-II).
The ANN configurations which had species as a categorical variable did not present the best training statistics, and the ANN configurations which presented the best statistics did not have a categorical variable. For the SVMR, inclusion of the species only improved the results in the Secondary forest and the Jaíba transitional forest.
The SVMR presented lower errors (RMSE) and higher correlations between the observed and estimated volumes in the Cerrado, Secondary forest and Jaíba transitional forest formations, whereas the ANN performed better only in the Primary forest formation.

Validation of the alternatives
In the validation process of the evaluated methodologies, a high correlation (Table 5) was observed between the observed and estimated volumes, with values between 96.74% and 98.68%. In terms of root-mean-square error (RMSE), the values ranged from 19.83% to 51.20% for the evaluated methodological alternatives.
The histograms of the percentage error frequencies (Figure 1) show that the mixed linear model was the alternative with the best performance in the validation process for the four formations, with errors concentrated close to zero and a narrower distribution.
In which: Y or N represents the presence or absence of the species and/or forest formation as a categorical variable.

DISCUSSION
In analyzing the significance of the coefficients of the fixed effect model in Table 3, we can confirm the importance of the explanatory variables of diameter and height in the volumetric model (Calegario et al., 2005). Similar results were found by Chicorro et al. (2003);Scolforo et al. (2008); Rufini et al. (2010) and Stolariková et al. (2014). All parameters of the equations were statistically significant (p-value < 0.05).
With the inclusion of random factors, the model predicts random coefficients for each species (BLUP): instead of a single regression curve tending towards the sample mean, it predicts one curve for each species, i.e. for each level of the random factor. Variation which would otherwise be incorporated into the model error ends up being explained by the random factors.
Equations fitted for a group of species or for a forest formation are more common due to the lack of sample trees for all forest species (Huff et al., 2018). If response variable information is available for a new species in a mixed model, then random coefficients are obtained and the estimate considers the species-specific response rather than just an average (or expected) response of the population. In the mean response of the population, it is assumed that the vector of the random coefficients for a new individual has an expected value of zero (Burkhart and Tomé, 2012).
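The difference between the population mean response and a species-specific response can be sketched with the mixed form of the Schumacher and Hall model, Ln(V) = (β0 + a_i) + (β1 + b_1i) Ln(dbh) + (β2 + b_2i) Ln(H). All coefficient values below are hypothetical, and the back-transformation is a plain exponential without bias correction.

```python
import math

def predict_volume(dbh_cm, height_m, fixed, random_effects=(0.0, 0.0, 0.0)):
    """Predict volume (m^3) from fixed parameters plus species deviations.
    With the default zero random effects, this is the population mean response."""
    b0, b1, b2 = fixed
    a_i, b1_i, b2_i = random_effects
    ln_v = ((b0 + a_i)
            + (b1 + b1_i) * math.log(dbh_cm)
            + (b2 + b2_i) * math.log(height_m))
    return math.exp(ln_v)

fixed = (-9.8, 1.9, 1.0)                  # hypothetical fixed parameters
species_blup = (math.log(1.1), 0.0, 0.0)  # hypothetical predicted deviations

population_mean = predict_volume(25.0, 12.0, fixed)
species_specific = predict_volume(25.0, 12.0, fixed, species_blup)
print(f"population mean:  {population_mean:.4f} m^3")
print(f"species-specific: {species_specific:.4f} m^3")
```

For a species with no sample trees, only the first call is available, which is the situation described above for a new individual whose random coefficients are set to their expected value of zero.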
The performance of a mixed model may be better than that of the model with only fixed effects when a sample is available to predict the random parameters (BLUP) (Temesgen et al., 2008;Huff et al., 2018). This behavior was observed for the four studied forest formations by including the species as a random effect in the volumetric model. Other variables can be inserted as random effects in addition to this variable, such as region and local quality classes (Ou et al., 2016), precipitation, soil, elevation and other geographical characteristics observed to increase the accuracy of the estimates (Meng et al., 2007).
The histogram of the distribution of residuals confirms the good performance of the mixed model (Figure 1), although there was a greater dispersion of errors for the Cerrado formation (up to -60%). However, this can be considered insignificant, since it involves a small number of estimates relative to the sample and does not strongly interfere with the model accuracy (Costa et al., 2012). This result confirms our second hypothesis, namely that the inclusion of the species as a random effect would improve the estimates of the Schumacher and Hall model.
Although the artificial neural networks (ANNs) were accurate in the training phase, their performance was not good over the test data in the validation process.
In the distribution of errors (Figure 1), it was observed that the networks overestimated the smaller volumes and that their performance in some formations was lower than that of the Schumacher and Hall model (fixed effect model). This result is not in line with one of our hypotheses, according to which machine learning methods would present the best results. Similar results were found by Görgens et al. (2015), whose scatter plots for eucalyptus plantations obtained by ANN showed overestimation of lower volumes and underestimation of larger ones when compared to Random Forest and support-vector machine regression. These results are probably due to some algorithms having difficulty learning lower values, so that they end up overestimating the lowest values when the residual is calculated in relative form. Araújo (2015) found good results for ANN with this same data set, but using different combinations of the DBH and height variables, as well as different algorithms such as NEAT and Skip Layer Connections.
The difference in ANN and SVMR performance between the training and validation processes highlights the need for careful separation of the data in these two modeling phases. In this study, the data were selected at random and, because there were no sample trees in all diameter classes in all forest formations, the data were unbalanced, such that training was good but validation was not as good. In the case of the regression models (fixed and mixed effects), which can be fitted using the least squares and maximum likelihood methods, the estimates refer to the mean (expected) values (Gujarati and Porter, 2011), and they therefore presented better results than ANN and SVMR.
Superior results with regression models to ANNs have already been found, such as by Özçelik et al. (2013) who modeled the hypsometric relationship of Juniperus excelsa in southern and southwestern Turkey using mixed models; and Mendonça et al. (2018), who adjusted fixed height models for Zeyheria tuberculosa. This shows that although ANN is an efficient and accurate method for predicting dendrometric variables, other methodologies may also sometimes present better results.
The support-vector machine regression (SVMR) was statistically the best in the training phase. However, the SVMR presented the same problems as ANN in the validation phase. Although the configurations presented good accuracy, they could not "learn" to estimate small volumes in some formations. Cordeiro et al. (2015) found more accurate volume estimates for SVMR compared to the Schumacher and Hall model in an Acacia mangium plantation in the state of Amapá; however, there was no separation of data for training and validation in that study. Costa Filho (2019) found higher precision of the SVMR to estimate height for Pinus taeda plantations in São Paulo. Binoti et al. (2016) found close values between the estimates obtained by the Schumacher and Hall model (regression) and SVMR, also for eucalyptus plantations in the South of Bahia. All these works dealt with plantations, with little variation in the data. For uneven-aged forests, we can mention Abreu et al. (2017), who found a good fit to estimate volume with a support vector machine in semi-deciduous seasonal forest in the state of Minas Gerais; and Montaño (2016), who in addition to modeling dendrometric variables for commercial plantations, also used SVMR for dry biomass in tropical forests. However, the SVMR in the latter study performed worse than the allometric models suggested by Chave et al. (2014).
As a result, we can infer that machine learning methods in artificial plantations are much better compared to traditional methods. However, this does not always happen when the data comes from native forests.
If there were more individuals per species, the results would probably improve: despite the large number of species, most of them are represented by a single individual, so that some species appear in the training data but not in the validation data, or vice versa. In addition, more categorical variables, such as relief, climate and soil, among others, can be included in future studies.

CONCLUSION
With the development of this study, it was verified that all the alternatives were precise in estimating the volume of the individual trees in the different forest formations.
Despite presenting satisfactory results, the machine learning methods were not superior to the mixed model and the conventional regression model.
The species as a random effect considerably improved the fit of the Schumacher and Hall model.
Given the results, it is concluded that there is no absolute superiority of a methodological alternative to the others, and that they should all be evaluated to find the one which best describes or explains the dataset.