Introduction
Understanding growth and yield processes in forests is important to their rational management (^{Cosenza et al., 2015}). Statistical techniques for data modelling can support decision making during forest planning, and information on site productive capacity can help delineating resource allocation strategies for management units, silvicultural treatments, and wood volume prognosis (^{Campos & Leite, 2017}).
The potential for trees to grow at a specific location can be determined by site index estimates, obtained by direct or indirect methods (^{Binoti et al., 2012}). Among direct methods, the guide-curve is the most commonly used, whose application includes the adjustment of regression models that relate dominant height and forest age data (^{Scolforo, 2006}). When this procedure is used for the classification of site productive capacity, its evaluation generally includes the stability analysis (^{Machado et al., 2011}), which deals with the estimation of the number of plots (or samples) that remain in the same site index class over time.
Databases used for dominant height modelling in forest stands are obtained from permanent or temporary plots, during a forest inventory, or from stem analyses (^{Scolforo, 1997}). These measurements, although simple, must be done carefully in order to avoid non-sampling errors (^{Soares et al., 2011}), which can impact the analysis when one or more values are out of the general trend of the data and are considered as outliers. Because of their marked difference, according to ^{Santos et al. (2015)}, the number of outliers should be as low as possible. There are several ways to identify them, among which stands out the boxplot graphic (^{Schwertman et al., 2004}).
It should be noted that sometimes outliers are not measurement errors, but values with a different distribution. In these cases, a deep evaluation about the cause of their discrepant behaviour is necessary before excluding them from database. In other situations, the set of discrepant data can result in new strata for modelling. When the analysis of their origin is not possible, due to the amount of data or to the impossibility of assessing sampling procedures, it is important to keep the outliers in the database. Therefore, mathematical techniques suitable for other types of distribution are required to incorporate these values into the database, aiming at a robust modelling process, without the possibility of phasing out real information, although discrepant.
The use of the quantile regression (QR) and artificial neural network (ANN) is common for this kind of robust modelling. The first type of analysis was proposed by ^{Koenker & Basset (1978)} and is rather robust when there are outliers in the database, since they have no effect on the distribution of the conditional median and can be used to model any specified quantile of a dataset (^{Abellanas et al., 2016}). The second is part of the artificial intelligence field. It was firstly described by ^{McCulloch & Pitts (1943)} and has been considered as an excellent alternative to the traditional regression models. Both the QR and ANN have already been used with success to fit dominant height data, with focus on site index estimates (^{Cosenza et al., 2015}; ^{Araújo Júnior et al., 2016}); however, these studies did not consider the presence of outliers in the database.
The objective of this work was to compare methods of obtaining the site index for eucalyptus stands, as well as to evaluate their impact on the stability of this index in databases with and without outliers.
Materials and Methods
The study was developed using data from 22 rectangular, permanent plots (approximately 330 m^{2}) of a continuous forest inventory carried out in forest stands in the Vale do Rio Doce region, in the east of the state of Minas Gerais, Brazil. Measurements were done in trees with ages from 20 to 83 months, with six observations per plot. Dominant height was obtained for each measurement, considering the average height of the 100 largest trees per hectare, and diameter at breast height was determined as in ^{Assman (1970)}. Descriptive statistics are shown in Table 1.
Age class (months) | Basal area (m^{2} ha^{-1}) | Dominant height (m) | Total volume (m³ ha^{-1}) | |||
Average | Standard deviation | Average | Standard deviation | Average | Standard deviation | |
18 | 6.09 | 1.40 | 10.43 | 1.40 | 25.44 | 8.48 |
30 | 12.05 | 2.44 | 17.92 | 3.23 | 90.46 | 32.45 |
42 | 17.74 | 2.59 | 22.80 | 2.60 | 166.86 | 34.71 |
54 | 20.23 | 3.06 | 25.44 | 2.94 | 209.67 | 42.30 |
66 | 22.43 | 3.73 | 27.43 | 3.32 | 257.45 | 60.62 |
78 | 23.54 | 3.49 | 28.30 | 3.92 | 284.44 | 66.10 |
Observations were grouped according to age classes, with a 12-month range, in order to identify possible discrepant data. A boxplot was built for each age class, and data out of the confidence interval were considered as outliers, as proposed by ^{Schwertman et al. (2004)}. Therefore, two databases were created: one containing outliers and another without them.
Schumacher’s model was adjusted for each database, as in ^{Demolinari et al. (2007)}, and is described as:
where n is the number of observations; ρ_{τ} is the weight for residual i, determined as 2q (quantile) if r_{i}>0 or 2 (1 - q) if otherwise; r_{i} is the residual for the i^{th} observation; y_{i} is the i^{th} observed value; k is the number of parameters to be estimated; x_{ij} is the value of the j^{th} independent variable for the i^{th} observed data; and β_{j} is the j^{th} parameter to be estimated.
For the QR method, a percentile of 50% (median) was considered. In this case, the adjustment was made using the quantreg statistical package (^{Koenker, 2013}), developed for the R software (^{R Core Team, 2014}).
The two databases were, then, used for training the ANN, considering a multilayer perceptron structure, with age as an input variable and dominant height as an output variable. ANN training was carried out using the resilient propagation algorithm, 5 neurons in the hidden layer and 3,000 epochs of training; these parameters were defined according to ^{Binoti et al. (2015)}. The NeuroForest software was used to obtain ANN parameters (^{Binoti et al., 2014}).
The average percentage relative error (E_{APR}), the mean absolute error (MAE), the root mean square error (RMSE), and the correlation coefficient between observed and estimated values (r_{yŷ}) were used to evaluate the quality of the adjustments. In addition, histograms of the E_{APR} for each method were built. The estimated parameters were obtained with the following equations:
where n is the total number of observations, ŷ_{i} is the estimated value for observation i, y_{i} is the correspondent observed value, cov is the covariance, and var is the variance. The estimated parameters for the QR and linear regression (LR) were evaluated by the t-test, at 5% probability.
The guide-curve method, applying a reference age of 72 months to Schumacher’s model, was used to determine the site index using the estimates obtained from the LR and QR models (^{Campos & Leite, 2017}), through the equation:
A new training - considering a data structure similar to the one used for Schumacher’s model - was performed to obtain the site index by a neural network. Therefore, the ANN was trained considering a multilayer perceptron structure. Dominant height at the current age, current age, and future age were considered as input variables, and dominant height at a future age, as an output variable. The trained ANN was applied using the index age of 72 months as the future age in order to obtain the site index for each measurement.
The site index estimates obtained from the LR, QR, and ANN were classified using the equation:
where int is a function that returns the integer value of a real number, S is the site index estimated for each plot in each measurement time, and R is the range of class (meters) considered. This equation returns the central value of the class as a function of the class range - in this study, an interval equal to 5.0 m was considered.
The results of the classification of productive capacity using the site index were subjected to the stability analysis, as suggested by ^{Scolforo (2006)} and ^{Chaves et al. (2016)}. In this case, five intervals of last measurements were considered: interval 1, measurements from 1 to 6; interval 2, from 2 to 6; interval 3, from 3 to 6; interval 4, from 4 to 6; and interval 5, from 5 to 6. The amount of plots that remained in a same class, in all measurements, was counted for each range. Stability was obtained from the percentage of stable plots.
Results and Discussion
Using a boxplot, six outliers, all located bellow data distribution, were detected for each dominant height measurement period (Figure 1). When the presence of outliers is as strong as the one observed here, considering them in modelling generally enhances database consistency. It should be highlighted that, on the one hand, their use may cause a loss in the explanatory power of regression models (^{Schwertman et al., 2004}), whereas, on the other hand, it can completely change the analysis if the causes for their presence cannot be determined.
The boxplot analysis did not evidence discrepant observations after the exclusion of the outliers (Figure 1). The comparison of databases with and without outliers indicated little discrepancy for the averages and medians, in each class of dominant height (Table 2).
Dominant height class (m) | With outliers | Without outliers^{(1)} | ||
Average | Median | Average | Median | |
CL18 | 10.43 | 10.54 | 10.43 (0.00%) |
10.54 (0.00%) |
CL30 | 17.92 | 18.27 | 17.92 (0.00%) |
18.27 (0.00%) |
CL42 | 22.80 | 22.72 | 23.14 (1.49%) |
22.87 (0.66%) |
CL54 | 25.44 | 25.76 | 25.84 (1.57%) |
26.17 (1.59%) |
CL66 | 27.43 | 27.87 | 27.80 (1.35%) |
27.87 (0.00%) |
CL78 | 28.30 | 28.77 | 29.15 (3.00%) |
28.81 (0.14%) |
^{(1)}Values within parenthesis define the variation percentage between databases with and without outliers.
Considering Schumacher’s model for the two databases, the adjusted equations generated parameters statistically different from zero (p<0.05), both for the LR and QR (Table 3). It is possible to note that the values obtained for the parameters using the different methods differed in the database with outliers, but were similar in the one without them. This result is due to the fact that the LR is less tolerant to the presence of noise than the QR (^{Araújo Júnior et al., 2016}), mainly when the discrepant values promote a change in average values, but interfere little in the median, as was the case for some dominant height classes (Table 2).
Regression type | Parameter | Value | Standard error | p-value |
Database with outliers | ||||
Linear | β_{0} | 3.7430 | 0.0282 | < 0.05 |
β_{1} | -29.7093 | 1.0902 | < 0.05 | |
Quantile | β_{0} | 3.7747 | 0.0312 | < 0.05 |
β_{1} | -30.2833 | 1.1967 | < 0.05 | |
Database without outliers | ||||
Linear | β_{0} | 3.7886 | 0.0224 | < 0.05 |
β_{1} | -30.9034 | 0.8490 | < 0.05 | |
Quantile | β_{0} | 3.7807 | 0.0304 | < 0.05 |
β_{1} | -30.4042 | 1.1436 | < 0.05 |
Graphically, it is possible to observe a discrepancy between the curves for the values estimated by the LR and QR only when there are outliers in the database (Figure 2). This is in alignment with the results from ^{Araújo Júnior et al. (2016)}, who reported a dislocation of the curve in the direction of the outliers. These variations can promote imprecise values, mainly when there are no guarantees that the outliers really must be part of the sampling.
The statistics for regression and for the ANN (Table 4) reveal better MAE, RSME, and r_{yŷ} values for the latter, which showed greater robustness in estimating dominant height values. The QR had a slightly superior adjustment than the LR, which was also found by ^{Araújo Júnior et al. (2016)}.
Statistic^{(1)} | Linear regression | Quantile regression | Artificial neural network |
Database with outliers | |||
E_{APR} | 1.0259 | 2.8944 | 1.5875 |
MAE | 2.0960 | 2.0931 | 1.8407 |
RQEM | 2.9424 | 2.9850 | 2.7128 |
r_{yŷ} | 0.8984 | 0.8981 | 0.9133 |
Database without outliers | |||
E_{APR} | 0.5613 | 0.9764 | 0.9737 |
MAE | 1.7256 | 1.7128 | 1.5267 |
RMSE | 2.1710 | 2.1557 | 1.9739 |
r_{yŷ} | 0.9468 | 0.9471 | 0.9553 |
^{(1)}E_{APR}, average percentage relative error; MAE, mean absolute error; RMSE, root mean square error; and ryŷ, correlation coefficient between observed and estimated values.
When the database had no outliers, the statistics were improved, compared with the previous analysis (Table 4). However, the order of the quality of the adjustment remained the same, i.e., the ANN results were superior to the ones obtained by the QR and LR. Indeed, recent literature shows accuracy gains due to the adoption of neural networks instead of classic regression models (^{Binoti et al., 2015}; ^{Miguel et al., 2015}).
The histograms of the estimated values from the database with outliers showed errors above 50% (Figure 3). Even when techniques considered robust to outliers, such as the QR and ANN, were used, errors still occurred at high levels. This can be explained by the behaviour of the techniques that consider outliers - or give them a great importance - when modelling a denser set of data.
Considering the second ANN, trained to obtain the values of dominant height at a future age, errors between -20 and +30% were observed (Figure 4), as well as E_{APR}, MAE, and RSME values near zero for databases with and without outliers (Table 5). The difference between the estimates of the two ANNs can be attributed to the projection of the dominant height in the second one, where the neural network output (dominant height at a future age) was dependent on the dominant height at the current age. This might provide a better learning to the ANN, compared with the patterns of inputs and outputs of the system. These results agree with those found by ^{Cosenza et al. (2015)}, who pointed out that the classification of productive capacity using neural networks provided consistent results, superior to those observed with the application of a support machine vector.
Outliers | E_{APR} | MAE | RMSE | r_{yŷ} |
With | 0.4678 | 1.4336 | 1.9099 | 0.8942 |
Without | 0.3131 | 1.2623 | 1.7327 | 0.8840 |
^{(1)}E_{APR}, average percentage relative error; MAE, mean absolute error; RMSE, root mean square error; and ryŷ, correlation coefficient between observed and estimated values.
The analysis of site classification stability showed that the ANN provided good results, both for databases with and without outliers, especially when six or five measurements were considered (Figure 5). This reinforces the quality of the estimates obtained with this technique, mainly for the dominant height ranges, for which a lower stability is expected. The LR provided good estimates when less than four measurements were considered. The worse results were observed for the QR (six measurements) when the outliers were excluded; however, when they were considered, this regression provided good results. When low amounts of measurements are available, the percentage of stable plots increases for all methods, which does not depend on the presence or absence of outliers. This pattern is commonly reported in similar studies (^{Chaves et al., 2016}) and occurs because stability is greater in more advanced ages, represented here by the two last measurements.
Although better results were expected for the QR with the presence of outliers in the database, the opposite occurred when five and four measurements were considered. These discrepancies were caused by one plot, in each case. In the first one, the site index estimates for the QR were 30.27 m for class 32.5 m, and, for the LR, they were 29.95 m for class 27.5 m; in the second, for the QR, they were 30.10 m for class 32.5 m and, for the LR, 29.96 m for class 27.5 m.
Conclusions
The artificial neural network (ANN) is a robust technique to cope with the presence of outliers in databases, and can be used for the classification of the productive capacity of even-aged eucalyptus (Eucalyptus spp.) stands.
A better stability in the classification of forest sites can be obtained using the ANN, both with the presence or absence of outliers.