Anthropogenic Disturbances Affect the Relationship Between Spectral Indices and the Biometric Variables of Brazilian Savannas

According to previous studies involving biometric variables modeling using remote sensing (RS), data did not consider the effects of anthropogenic disturbance as relevant factor. The main objective of this study was to model aboveground biomass (AGB) and total wood volume (TWV) of Brazilian Savanna biome using vegetation indices (VI) from LANDSAT 5 TM. Multiple linear regression (MLR) and random forest (RF) algorithm were applied across 641 field plots of cerrado sensu stricto of the state of Minas Gerais, Brazil, comparing two models: non-stratified, and stratified according to plot’s anthropization degree. AGB and TWV obtained from non-anthropized plots presented linear relation with VIs (R2 = 0.82 and 0.74, respectively) and, on the other hand, presented nonlinear relation when plots were affected by anthropogenic disturbances or were not stratified. This finding helps improving estimates by stratifying plots into their anthropization degree, mainly in the Brazilian Savanna biome undergoing anthropogenic disturbances.


INTRODUCTION
The Savanna biome represents a large area in Brazil, occupying the central part of the country, totaling about 204 million ha (Sano et al., 2010). In the state of Minas Gerais, savanna is a significant portion of the territory, with estimated area of 33 million ha (Scolforo et al., 2015). This biome is among the most endangered eco-regions in the world due to high conversion rates (Bueno et al., 2019;Silveira et al., 2019a) and few protected areas (Hoekstra et al., 2005). Since the 1970s, this biome has suffered large losses of its natural vegetation due to agricultural expansion (Silva et al., 2006). Currently, the deforestation rate is about 1.6% per year (Arantes et al., 2016), leading to large-scale conversion of native vegetation into agricultural areas, which has already affected more than 40% of the original area of this biome (Sano et al., 2010). In addition to anthropogenic disturbance, as the state has large dimensions, the physiognomy grows in places with different climatic and soil conditions. These two sources of variation make it even more difficult to accurately evaluate biometric data.
Several studies have been carried out mainly aimed at knowing the structural formation of this biome, with a qualitative view, in which species are identified, their forms of propagation and cultivation, multiple and alternative uses for various fruit, medicinal and energy species (Aguiar & Monteiro, 2005;Cabral et al., 2015;Veenendaal et al., 2015). Quantitative studies have also been developed in order to identify growth and productivity parameters in the various regions of occurrence of this vegetation (Rufini et al., 2010;Reis et al., 2015;Silveira et al., 2019b). Volumetric reviews of timber potential, biomass and carbon stocks are being held either by environmental agencies and academic institutions, while the private sector carries out quantitative surveys designed to support plans for sustainable management and operations for exploration of timber resources (Alvarenga et al., 2012;Morais et al., 1995;Silveira et al., 2019b). Estimating wood volume and biomass accurately and establishing suitable models can provide a scientific basis for determining reasonable stocking capacity (Yang et al., 2018), with implications to regional and global reports on biomass, carbon stocks, and their changes.
There are several approaches to estimate biometric variables in large areas using remote sensing (RS) data (Lu et al., 2016). In general, these data are empirically linked to measurements of field plots, ranging from simple linear regression (Scolforo et al., 2015) to complex machine learning algorithms (MLA) (Gleason & Im, 2012;Wang et al., 2016). LANDSAT5 TM and 8 OLI images are the most medium spatial-resolution data commonly used due to the large historical series available with 30-meter resolution (Wulder et al., 2008;Zhu & Liu, 2015;Aslan et al., 2016). The normalized difference vegetation index (NDVI) is one of the most often used in applications relevant to the analysis of vegetation biophysical parameters. It has been used in various applications such as estimation of deforestation trends (Silveira et al., 2018a, b), phenological development monitoring (Verbesselt et al., 2010), vegetation re-growth assessment after fire (White et al., 2017), and estimation of biometric variables (Zhu & Liu, 2015). In addition to NDVI, other vegetation indices (VIs) have been used, including but not limited to the soil-adjusted vegetation index (SAVI) (Huete, 1988), the enhanced vegetation index (EVI) (Garroutte et al., 2016), and the LAI (leaf area index) (González-Sanpedro et al., 2008). However, most of previous studies involved statistical models based on a single vegetation index, and each of them has limitations and uncertainties (Zhao et al., 2014). The effects of anthropogenic disturbance that impacts the correlation between biometric variables and spectral indices were also not studied. Anthropogenic disturbance can change the stand successional pattern, subsequent diversity and forest biomass (Pawar et al., 2014).
Thus, the main objective of this study was to analyse how anthropogenic disturbances affect the relationship between biometric variables and remote sensing vegetation indices. Specifically, NDVI, EVI, SAVI and LAI derived from LANDSAT 5 (thematic mapper) TM were used as independent variables to apply a multiple linear regression (MLR) and random forest (RF) algorithm to estimate aboveground biomass (AGB) and total wood volume (TWV). Two models were generated: (1) non-stratified model, considering all plots, and (2) stratified model, classifying plots according to their anthropization degree. The research question that motivated this study was: (1) Do field stratification of plots according to their anthropization degree improve the relationship between biometric variables and vegetation indices derived from LANDSAT5 TM images?

Field data and study area
The state of Minas Gerais (MG) is located in south-eastern Brazil (Figure 1a), and has Savanna, Atlantic Forest and Semi-arid woodland biomes (Figure 1b). The Brazilian Savanna is a heterogeneous biome, comprising vegetation types ranging from grasslands to woodlands (Ribeiro & Ferraz, 2013;Silveira et al., , 2018c, located between latitudes -14.25° and -21.50° 51.52 S and longitudes -41.80° and -50.91° W. The climate is seasonal Tropical, presenting dry winter and rainy summer (AW by Koppen classification). The average annual temperature is between 22 and 23 C. The average monthly rainfall shows a clear seasonality, occurring more concentrated between October and March, with annual average between 1.200 and 1.800 mm (Scolforo et al., 2015).
Field measurements of 641 plots (10 × 100 meters), from 49 forest remnants identified as cerrado sensu stricto vegetation type were obtained, which demonstrate well-defined arboreal and shrub-herbaceous layers with arboreal coverage varying from 10% to 60% , belonging to the "Forest Inventory of Minas Gerais" project conducted by the Federal University of Lavras -UFLA from 2006 to 2008 (Figure 1b). This project generated abundant information regarding qualitative and quantitative analysis of forest remnants. The map provided the delimitation and spatial distribution, and the forest inventory provided quantitative data of vegetation types present in Savanna, Atlantic Forest and Semi-arid woodland biomes in the state.
During field surveys, the diameter at breast height (DBH, 1.3 m) and total height of all trees with minimum DBH of 5 cm were measured. Data used in this study were total wood volume and AGB of plots as dependent variables. Total wood volume for all the trees was estimated by applying Equation 1 (Rufini et al., 2010) for cerrado sensu stricto vegetation type, with R 2 (determination coefficient) = 98.64% and S xy (residual standard error) = 0.12 m 3 . The trees used to determine AGB were all from destructive sampling campaigns as described by Scolforo et al. (2015). where: V is the volume (m 3 ); e is the base of the natural logarithm; ln is the natural logarithm; DBH is the diameter measured at 1.3 meters above the ground (cm); H is the total tree height (m).
Field plots were stratified into two classes according to their anthropization degree: (1) non-anthropized -218 plots ( Figure 2b) and (2) anthropized -423 plots ( Figure 2c). The criteria for this classification are described in Ribeiro & Ferraz (2013). The methodology consists in applying data collected in quantitative forest inventories (number of individuals, basal area, quadratic mean diameter, ratio between number of individuals in the first diameter class -5 to 10 cm and total number of individuals), in addition to visual analysis. The methodology is also described in Scolforo et al. (2008).

Remote sensing data
Thirteen LANDSAT 5 TM satellite scenes (Figure 2a) from the same collection date of forest inventory data were used (from 2006 to 2008), acquired by the US government institute that supports research involving geological surveys and Earth observation, the United States Geological Survey for Earth Observation and Science (USGS\EROS). Images were acquired in the CDR processing level (Landsat Surface Reflectance Climate Data Record), with the necessary geometric corrections and reflectance values at ground level. One image by scene completely free of clouds was selected (Table 1).
From images, NDVI, EVI, SAVI and LAI spectral indices were generated. The NDVI index (Rouse et al., 1973) expresses the ratio between bands that capture electromagnetic wavelength in the red and near infrared frequencies (Equation 2). This equation generates an index that varies from -1 to 1.
where: NDVI is the Normalized Difference Vegetation Index; NIR is the reflectance values for the near infrared wavelength; RED is the reflectance values corresponding to the red wavelength.
In order to improve the received vegetation signal, the Enhanced Vegetation Index (EVI) was  used, whose purpose was to reduce atmosphere influences and improve sensitivity at high biomass density regions (Nakaji et al., 2008). Its calculation is shown in Equation 3. The coefficients adopted by the MODIS program algorithm are: L = 1, C 1 = 6, C 2 = 7.5 and L = 2.5.
where: G is the gain factor; NIR is the reflectance of the near infrared spectral band; RED is the reflectance of the red spectral band; L is the adjustment factor for components below the canopy; C 1 and C 2 are adjustment coefficients of aerosol effects of the atmosphere (dust, smoke, air pollution particles); BLUE refers to reflectance in the blue band, which shows the influence of aerosols. SAVI (Soil Adjusted Vegetation Index) was proposed by Huete (1988), aiming to improve the vegetation response in relation to soil interference, especially in areas of low vegetation cover. This index is obtained by Equation 4. The L value varies according to soil reflectance and vegetation density. In areas with low vegetation density, it is recommended to adopt L=1, for intermediate vegetation density, L = 0.5 and for high vegetation density, L = 0.25. Since our analysis is focused on cerrado sensu stricto vegetation type, L = 0.5 was adopted.
where: NIR is the reflectance of the near infrared spectral band; RED is the reflectance of the red spectral band; L is the soil adjustment constant.
Based on values obtained for SAVI, the Leaf Area Index (LAI) was calculated, which expresses the proportionality between m 2 of leaf per m 2 of soil

Multiple linear regression and random forest algorithm
TWV (m 3 /ha) and AGB (t/ha) from sample plots were used as dependent variable and the NDVI, EVI, SAVI and LAI vegetation indices were used as independent variables. MLR and RF implemented in the R software (R Core Team, 2014) were applied and the stratified method and non-stratified methods were tested. Among parametric methods, linear regression is the simplest one. It uses ordinary least squares to estimate parameters of the linear relationship between predictor variables and field measured variables (Equation 6). This method is often challenged when there is nonlinear relationship (Zhu & Liu, 2015). The Pearson´s correlation was also calculated.
where: i y are the dependent variables (TWV-m 3 /ha; AGB-t/ha); i X are the independent variables (NDVI; EVI; SAVI; LAI); 0 B and 1 B are the parameters to be estimated by regression.
The RF algorithm, initially proposed by Breiman (2001) is an ensemble method that generates a set of individually trained decision trees and combines their results. RF is a robust non-parametric classifier and has the ability to accommodate many predictor variables (DeVries et al., 2016). The advantages of RFs include excellent accuracy, efficient implementation on large datasets, and a structure that enables the future use of pre-generated trees. The number of decision trees (Ntree) was set to 1,000. About 70% of data were randomly selected and used to fit the models. The remaining 30% was used to test the model performance. To compare the performance between stratified versus the non-stratified method modelling by MLR and RF, the determination coefficient was computed (R 2 , in %), which indicates the part of the observed variability that is explained by the model; the root-mean-square error (RMSE), which measures the average difference between values predicted by the model and those observed; the mean absolute error (MAE), which indicates an average bias of model over or underestimation (Vieilledent et al., 2016). Table 2 shows the Pearson´s correlation coefficients (R) between AGB and TWV with the NDVI, EVI, SAVI and LAI remote sensing vegetation indices derived from LANDSAT 5 TM. Values are positively correlated to AGB and TWV. Weak correlations in non-stratified and stratified methods in anthropized areas were found. This result clearly shows how anthropogenic disturbances impact the correlation between biometric variables and spectral indices. This is an indication that the disturbance degree in a biome such as the Savanna, specifically in cerrado sensu stricto vegetation type, affects the correlation between remote sensing vegetation indices with aboveground biomass (Figure 3) and total wood volume (Figure 4), leading to problems in the modelling process. The highest correlation values were found using NDVI and the stratified non-anthropized class, reaching R = 0.81 and 0.82 for AGB and TWV, respectively. The lowest R values were obtained using the stratified anthropized class, mainly with EVI vegetation index, reaching 0.12 of correlation.

Relationship between biometric variables and vegetation indices
The low correlation values found using non-stratified and stratified anthropized classes do not mean that variables are not correlated, but may mean that the relationship is not linear. Figure 3 and Figure 4 show the scatterplots between AGB and TWV and the individual's remote sensing vegetation indices, helping the visualization of its relationship. Figure 5 shows scatterplots of the multiple linear regression (Figure 5a) and random forest algorithm (Figure 5b) for both AGB and TWV, considering the non-stratified method. Comparing the results, it was observed that using all plots together, both MLR and RF algorithm did not present good regression performances, with low R 2 values. For both biometric variables, RF presented better results compared to MLR, decreasing MAE and RMSE, and increasing R 2 values. These results indicate that using the non-stratified method, the relationship between biometric variables and vegetation indices are better explained using nonlinear algorithms, such as random forest regression. Figure 6 shows scatterplots of the multiple linear regression ( Figure 6a) and random forest algorithm Table 2. Correlation coefficients (R) values between biometric variables and spectral indices extracted from the LANDSAT 5 TM images.       Figure 6b) for both AGB and TWV considering anthropized stratified plots. Compared with MLR, both AGB and TWV presented better results using the RF regression model. R 2 values ranged from 0.20 and 0.18 to 0.4, respectively. MAE and RMSE values also decreased. Figure 7 shows scatterplots of the multiple linear regression (Figure 7a) and random forest algorithm (Figure 7b) for both AGB and TWV considering non-anthropized stratified plots. In terms of R 2 , MLR can explain more the AGB and TWV variance than the random forest algorithm, indicating that MLR is effective to improve the relationship between biometric variables and vegetation indices when the area of plots is non-anthropized. Table 3 summarizes the statistics computed using MRL and RF algorithm considering non-stratified, anthropized and non-anthropized plots. Specifically, both AGB and TWV, considering MLR, MAE and RMSE decreased and R 2 increased from anthropized and non-stratified plots to non-anthropized plots. In relation to RF, the lowest MAE and RMSE statistics were found using non-anthropized plots, and the highest were found using non-stratified plots. These results show how anthropogenic disturbances impact the relationship between biometric variables and remote sensing vegetation indices.

DISCUSSION
NDVI, EVI, SAVI and LAI remote sensing vegetation indices are adequate to model TWV and AGB using linear regression in non-anthropized cerrado sensu stricto vegetation. Using random forest (RF) regression algorithm to estimate AGB and TWV had the advantage of estimating any complex non-linear relationship. Thus, acceptable R 2 values were found (0.4), modelling these biometric variables with vegetation indices through plots under anthropogenic disturbances or when they are not stratified. This value is low; however, according to the high area heterogeneity e and anthropic impacts, it is in accordance with previous studies carried out in the Brazilian Savanna biome (Alvarenga et al., 2012;Reis et al., 2015;Scolforo et al., 2015Scolforo et al., , 2016. Prabhakara et al. (2015) evaluated the relationship between biomass and remote sensing indices across six winter cover crop fields in Maryland, United States. They reported that spectral indices are more sensitive to green living vegetation, whereas field sampled biomass included both living and dead plant material. Therefore, the index performance for all sampling dates was poor (R 2 about 0.26). Reis et al. (2018) analysed the correlation between TWV from eucalyptus plantation and vegetation indices derived from LANDSAT 5 TM. In their analysis, NDVI and SAVI presented the highest correlation, reaching r = 0.49 and r = 0.47, respectively, indicating thatthe spatial diversity of forest canopies makes the relationship between forest parameters and remote sensing data a major challenge.
R 2 , as well as MAE and RMSE, are considered acceptable due to the wide variation found for the target variable in this region (see Table 3). This wide variation reflects the heterogeneity of the natural conditions of cerrado sensu stricto remnants in the state of Minas Gerais, which directly impacts biometric variables (Scolforo et al., 2015). In addition, it is important to note that the total wood volume (TWV) and aboveground biomass (AGB) values do not only respond to the anthropogenic disturbances considered, but also to environmental variables, such as chemical and physical soil characteristics (Berner & Law, 2016), as well as the structural conditions of remnants, such as different vegetation successional stages with different age structures (Schwieder et al., 2016). There are forest fragments in different sites, different successional stages, and trees with different diameters and heights, leading to an increase in the vegetation variability (Scolforo et al., 2015).
In general, process-based models assume homogeneous stands and lack the ability to provide data on spatial variability in forest biomass (Lu et al., 2016). Forest disturbance can be natural (e.g., tornados) or anthropogenic (e.g., deforestation and land conversion, fire) (Frolking et al., 2009). In a given location, the reflectance of images and therefore vegetation indices, are expected to change due to disturbance or successional processes (Hermosilla et al., 2018).
For example, in forest plots not affected by disturbance, NDVI is around 0.65 in areas with high tree density. Medium vegetation cover sites have NDVI around 0.49 and low vegetation cover sites have NDVI around 0.17 (Garrigues et al., 2006). In our study, it was concluded that plots with NDVI values lower than 0.49 presented mean AGB values around 19 t/ha, and plots with NDVI values between 0.49 and 0.65 showed mean AGB values around 30 t/ha and finally, plots with NDVI values higher than 0.65 exhibited mean AGB values around 62 t/ha. This pattern indicates a clear positive liner relationship (R = 0.81) between NDVI vegetation index and AGB inside plots not affected by anthropogenic disturbances (see Figure 3a).
On the other hand, when plots are under anthropogenic disturbances, the linear relationship between reflectance values and biometric variables is poor. Vegetation indices have increased spatial variability due to the mosaic of exposed soil with low vegetation index values combined with trees with high vegetation index values (see Figure 2c). For example, harvest activities result in a mosaic of post-disturbance land cover types mainly composed of exposed/barren land, herbs and shrubs, immediately following the change event. Plots undergoing disturbances were analyzed and plots with 10.85 t/ha were found (low AGB); however, presenting mean NDVI values equal to 0.72. This pattern can be explained by the use of images prior to disturbance events, due to gaps in image acquisition originating from cloud cover and LANDSAT temporal resolution (16 days). This pattern can explain the poor performance of models from anthropic plots compared to non-anthropized plots, leading to a weak relationship between biometric variables and vegetation indices.

CONCLUSIONS
Aiming at investigating how anthropogenic disturbances affect the relationship between total wood volume (TWV) and aboveground biomass (AGB) with remote sensing vegetation indices, these biometric variables were modelled stratifying field plots according to their anthropization degree using multiple linear regression and random forest algorithm.
By performing the stratification method, the results revealed that TWV and AGB from non-anthropized plots are better explained using remote sensing vegetation indices performing multiple linear regression instead of nonlinear regression. On the contrary, these biometric variables presented nonlinear relationship with vegetation indices when plots are affected by anthropogenic disturbances or are not stratified, indicating that anthropic activity directly affects the relationship between biometric variables and NDVI, EVI, SAVI and LAI vegetation indices derived from LANDSAT 5 TM.
This finding helps improving the modelling approaches that use remote sensing data to estimate biometric variables in highly heterogeneous areas, affected by anthropogenic disturbances such as the Brazilian Savanna biome, improving estimates by stratifying field plots into their anthropization degree.