Estimating Stand Height and Tree Density in Pinus taeda plantations using in-situ data , airborne LiDAR and k-Nearest Neighbor Imputation

Accurate forest inventory is of great economic importance to optimize the entire supply chain management in pulp and paper companies. The aim of this study was to estimate stand dominate and mean heights (HD and HM) and tree density (TD) of Pinus taeda plantations located in South Brazil using in-situ measurements, airborne Light Detection and Ranging (LiDAR) data and the nonk-nearest neighbor (kNN) imputation. Forest inventory attributes and LiDAR derived metrics were calculated at 53 regular sample plots and we used imputation models to retrieve the forest attributes at plot and landscape-levels. The best LiDAR-derived metrics to predict HD, HM and TD were H99TH, HSD, SKE and HMIN. The Imputation model using the selected metrics was more effective for retrieving height than tree density. The model coefficients of determination (adj.R) and a root mean squared difference (RMSD) for HD, HM and TD were 0.90, 0.94, 0.38m and 6.99, 5.70, 12.92%, respectively. Our results show that LiDAR and k-NN imputation can be used to predict stand heights with high accuracy in Pinus taeda. However, furthers studies need to be realized to improve the accuracy prediction of TD and to evaluate and compare the cost of acquisition and processing of LiDAR data against the conventional inventory procedures.


INTRODUCTION
Light Detection and Ranging (LIDAR) is an optical remote sensing technology which measures properties of scattered light to find range and/ or other information of a distant target.LiDAR measurements is usually acquired at airborne level and is usually also referred to Airborne laser scanning (ALS).It has been widely used in mapping the Earth's surface and especially in forest applications due its capacity to retrieve threedimensional information of the vegetation (Yao et al. 2014).

296
CARLOS ALBERTO SILVA et al.
The resulted cloud points acquired by a LiDAR system allows the reconstruction of the vertical structure of forests at different strata that cannot be obtained from any other remote sensing system.Indeed, several LiDAR-derived metrics can be derived out from these point clouds (McGaughey 2015).These metrics can be used indirectly to predict several other parameters of the forests using either regression or classification approaches to spatially represent these selected attribute over large areas (Dubayah and Drake 2000).Applications and advantages of LiDAR technology can be seeing over both natural and planted forest worldwide (Nilsson 1996, Naesset 1997, Naesset and Bjerknes 2001, Popescu et al. 2003, Popescu 2007).
Pine plantations are the most important long fiber source for pulp and paper production in South Brazil.It covers nowadays an area of 1.59 million hectares, accounting for approximately 20.54% of the country's total reforested area (Ibá 2015).Most of the pine plantations are concentrated in South Brazil, with 34.1 and 42.4% of the total reforested area located in Paraná and Santa Catarina States (Ibá 2015).Pinus taeda L., also known as loblolly pine, is the most planted forest specie in these regions.It has high economic importance due to its high volumetric increment in the colder regions of the southern plateau (Kohler et al. 2014).It has fast growing rates presenting increments up to 50 m 3 •ha -1 •year -1 (Ibá 2015).
Usually, forest companies conduct their own permanent forest inventory in planted forest at annual basis in order to monitor the forest growth, to identify problematic conditions during initial growth stages, and to predict the optimal harvesting time.Important attributes describing the state of the current stand that affect thinning decisions includes both stand height and tree density.On the other hand, stand height is a useful information wefor both dominant heights and site index determination.Whereas on the other hand, tree density is a key information for thinning management.Initially, the seedlings are planted using a higher density configuration and afterwards thinning is conducted in order to proportionate primary and secondary growths, respectively.
Therefore, accurate forest inventory is of great economic importance to optimize the entire supply chain management in pulp and paper companies (Falkowski et al. 2008, Silva et al. 2016a).Although the use of LiDAR technology is relatively new in Brazil, different investigations have been made to evaluate the ability of airborne LiDAR for forest inventories studies.They were mostly restricted to natural forest environments and/or Eucalyptus spp.plantations (e.g., Packalén et al. 2011, Carvalho et al. 2015, Silva et al. 2015, 2017a).As far as we know, Zandoná et al. (2008) were the first and only authors to use airborne LiDAR data for forest inventory purposes in a Pinus taeda plantation located in South Brazil.In their study, they derived individual tree height and crown area measurements using an individual tree approach to model the diameter at the breast height (DBH).However, early studies using both individual and plot based approaches also have demonstrated the potential of airborne LiDAR data to quantify forest attributes in conifer and deciduous forests in Northern Hemisphere (e.g., Nilsson 1996, Naesset 1997, Naesset and Bjerknes 2001, Popescu et al. 2003, Roberts et al. 2005, Popescu 2007).Therefore, a better understanding of how airborne LiDAR data can be used to extract forest attributes from planted forest of Pinus taeda in South Brazil is still necessary.
Currently, the most common approach to retrieve forest attributes from LiDAR data has been the area-based approach in which forest properties for an area of interest are inferred based on the relationship between field measurements and the LiDAR-derived metrics (Nilsson 1996, Naesset 1997, Hudak et al. 2006, Silva et al. 2017b).These relationships are usually linear multiple regression models, which afterwards, can be applied to the m or 2.5 x 2.5 m grid configuration, resulting in an average tree density of 1,667 and 2,000 trees per ha, respectively.
The climate of the region is characterized as warm and temperate (Cfa) (Köppen and Geiger 1928), with annual average precipitation approximately of 1378 mm and the annual average temperature of 18.4 ºC.The topography in the study ranges from mildly to very hilly with elevation ranging of 618 m to 905 m.The Pinus taeda stands are on the plateau where the topography is relatively flat.The plantations are managed by Klabin S.A., a pulp and paper company.

FIELD DATA COLLECTION
A total of 53 rectangular plots of approximately 500 -620 m 2 each were established across the four plantations for stand measurement ranging in ages from three to nine years old.All plots were georeferenced with a geodetic GPS with differential correction capability (Trimble Pro-XR), and in each sample plot, individual trees were measured for DBH (diameter at breast height) and a subsample (15%) of trees for tree height (Ht).For trees in the plot that were not directly measured for Ht, the inventory team of Klabin S.A. predict heights from hypsometric equations.The hypsometric models were adjusted according to Curtis (1967), employing as independent variables DBH, and as dependent variables the Ht, following the model below (Eq.1).
where: ln(Ht) is the logarithm of tree height (m); and are the intercept and the slope of the model; DBH is the diameter at breast height (1.30 m) and is the random error of the model.The coefficients of the hypsometric models were not available, however, according to the company, the adj.R 2 and Syx% of the models ranged from 0.96 to 0.98 entire LiDAR dataset for predicting forest attributes at stand level.One emerging method for forest attribute modeling from LiDAR data is the k-nearest neighbor (k-NN) imputation.It is a nonparametric machine learning method which has widely been used to predict forest inventory attributes in un-inventoried areas at either plot-or stand-levels (Falkowski et al. 2010, Hudak et al. 2014, Racine et al. 2014, Silva et al. 2016b).The non-parametric k-NN imputation method uses a set of predictor feature variables (X) to match each target pixel to a number (k) of most similar (nearest neighbors or NN) reference pixels for which values of response variables (Y) are known (McRoberts 2012).It allows estimating a variable of interest through a weighted average of the known variables of the k-nearest neighbouring plots.The weighted average could be done using either an inverse distance weighting or the square of the inverse distance (McRoberts 2009).Examples of forest applications using k-NN imputation, including mathematical formulation may be found in Hudak et al. (2008), Racine et al. (2014), and Fekety et al. (2014).
Therefore, the aim of this study was to evaluate the use of airborne LiDAR data and k-NN imputation for predicting and mapping standlevel forest variables, which included mean height (henceforth HM), dominant height (HDOM) and stem density (TD).This work is based on the hypothesis that LiDAR data and k-NN analysis can produce precise and accurate inferences of HM, HDOM and TD in Pinus taeda plantations in South Brazil.

STUDY AREA DESCRIPTION
The study area consisted of four Pinus taeda stands located within the Telêmaco Borba municipality in the State of Paraná, Southern Region of Brazil (Fig. 1).The trees were planted using a 3.0 x 2.0 and 5.1 to 6.5, respectively, according to the farm location.
The mean height (HM; m), dominant height (HDOM; m) and tree density (TD; N/ha) were summarized for each sample plot, and the summary statistics of HM, HDOM, and TD per stand ages are presented in Table I.

LIDAR DATA ACQUISITION AND DATA PROCESSING
LiDAR data were obtained by a o Harrier 68i sensor mounted on a CESSNA 206 aircraft.The characteristics and precision of the LiDAR data are listed in Table II.LiDAR data processing consisted of several steps that ingested the LiDAR point cloud data and provided two major outputs: the digital terrain model (DTM), and the LiDARderived canopy structure metrics.All of the  Initially the catalog function in FUSION/LDV was used to generated a descriptive report of the point cloud, which was used to identify the LiDAR coverage extent and outliers.When detected outliers, we removed them using the ClipData tool.A filtering algorithm based on Kraus and Pfeifer (1998) and available in the Groundfilter function was applied to differentiate between ground and vegetation points.DTMs were generated using the classified ground points with a spatial resolution of one meter using GridSurfaceCreate.Afterwards, the ClipData function was applied to normalize heights and to assure that the z coordinate for each point corresponded to the height above ground and not the orthometric elevation of the single point.The PolyClipdata function was then used to subset of the LiDAR points within each of the 53 measured sample plots, and the Cloudmetrics tool was applied to compute the LiDAR-derived metrics using all the available returns of the point cloud.Finaly, the GridMetrics function were used to generate the same LiDAR metrics computed in CloudMetrics, but now within grid cells of 5 m spatial resolution, across the landscape.The list of the LiDAR metrics used in this are presented in the Table III.

LIDAR METRICS SELECTION AND IMPUTATION MODELING DEVELOPMENT
The LiDAR metrics selection was performed through two major steps.First, Pearson's correlation (r) was used to identify highly correlated predictor variables (r > 0.9) as presented in Hudak et al.   (2012), Silva et al. (2014Silva et al. ( , 2017c)).If a given group (2 or more) of LiDAR metrics were highly correlated, we retained only one metric by excluding the others that were most highly correlated with the remaining metrics.Second, the varSelection function from the yaImpute (Crookston and Finley 2008) package in the R statistical software (R Development Core Team 2015) was used to select the most important not highly correlated metrics for the imputation modeling.The varSelection function computes the generalized root mean square distance (grmsd) as variables are added to an k-NN imputation model.By adding variables, the varSelection function keeps variables that strengthen imputation and exclude variables that weaken the imputation the least (Crookston and Finley 2008).

HMIN
In this study, k-NN imputation was conducted using the yai and impute.yai,both also from in the yaImpute package in the R.Many imputation methods can be used for associating target and reference observations, however, recent studies have shown that the Random Forest (Breiman 2001) approach generally produces better results compared to other imputation methods (Hudak et al. 2008, Nelson et al. 2011, Waske et al. 2012).Therefore, for this study we used Random Forest based k-nearest neighbours (RF-kNN) to characterize the relationships between predictor (LiDAR-derived metrics) and response (HM, HDOM and STP) variables used for NN (k=1) imputation.

MODEL ASSESSMENT
The precision of the model predictions was evaluated in terms of adjusted coefficient of determination (adj.R²), Root Mean Square Difference (RMSD) and BIAS (both absolute and relative): where n is the number of plots, is the observed value for plot i, and is the predicted value for plot i.The RMSD is analogous to the RMSE used to assess regression model accuracy (Stage and Crookston 2007).The relative RMSD (%) and Bias (%) were calculated by dividing the absolute values (Eqs. 2 and 3) by the mean of the response variable and multiplied by 100.Ninetyfive percent confidence intervals (CI) for the line of best fit were also plotted to determine whether the relationship between observed and imputed values was significantly different.The acceptable model precision was set up a relative RMSD and Bias of ≤ 15% to have a model precision equal or higher to the conventional forest inventory in the P. taeda plantations in Brazil.We also performed a cross validation on the RF k-NN imputation model with bootstrapping simulations using 1000 iterations.
For each iteration we built an imputation model from randomly selected of 53 observations with replacement.The remaining observations (out of bag) were used to validate the model and to compute the RMSD and Bias.After all iterations, the average and stand deviation of the RMSD and Bias were computed.Finally, we used the AsciiGridPredict function also from the yaImpute package in R (Crookston and Finley 2008) to apply the model across to the landscape to map the spatial distribution of HM, HDOM and SD of pine at the stand level for the benefit of forest managers.An overview of the methodology is outlined in Fig. 2.

LIDAR-DERIVED CANOPY HEIGHT STRUCTURE OF THE Pinus taeda PLANTATIONS
Young forest stands of Pinus taeda are normally defined by a lower canopy coverage, high tree density and low height values.By reaching maturity, intermediate and advanced stands, present higher canopy coverage values, higher height values and lower tree density when compared with young stands due to forest thinning and mortality.LiDAR derived metrics such as H99TH was high correlated (r > 0.9) with field heights (HM, HDOM).An example of variations in height ranging from early (i.e., 4.0 years) to intermediate (i.e., 6.0 and 8.0 years) ages of development are shown in Fig. 3.Although located in different farms and therefore under distinct site index conditions, the LiDAR derived height increased from early (Fig. 3a) to intermediate ages (Fig. 3b, c).

LIDAR METRICS SELECTION
Pearson's correlation test (r) applied to the 30 candidate LiDAR metrics determined that 24 were highly correlated each other (r > 0.9).We kept one of the highly correlates metrics (i.e., H99TH), and the six remaining metrics not highly correlated to each other were also used as input in the VarSelection function.Some LiDAR-derived metrics were positively correlated, such as H99TH and HSD (r = 0.81), whereas both HSKE and COV (r = -0.67)were negatively correlated as shown in Table IV.
The most important metrics selected from the VarSelection to be included in the imputation   model as predictor variables for estimating HM, HDOM and TD according to the gmsd statistics were H99TH, HSD, SKE and HMIN (Fig. 4).When added the HKUR and COV metrics to the model, they did not improve the fit, and therefore, they were automatically removed by the algorithm.

IMPUTATION MODEL PRECISION AND BIAS
The LiDAR-derived metrics were better predictors of HM and HDOM as compared to TD.The imputed model produced a relative RMSD of 6.99, 5.70 and 12.92%; relative Bias of -0.36, -1.02 and -1.00, and adj.R 2 of 0.90, 0.94 and 0.61 for the HM, HDOM, and TD attributes respectively (Table V).With the exception of TD, the RMSD and Bias were less than 10% in the cross validation procedure.The adj.R 2 of the cross validation for the imputation model of TD was markedly lower at 0.40 (±0.17).The Figs. 5,6,and 7 show the relationship between the observed and imputed values of the    forest attributes studied herein.The regression intercepts differed significantly from zero (p ≥ 0.7), and the 1:1 line fell within the 95% regression confidence envelope for all regressions, with the exception of the TD regression.
According to the cross validation performance, the LiDAR metrics were also better predictors for HM and HDOM rather than TD (Table VI).The HM and HDOM imputation models produced estimates that were strongly correlated (r > 0.93) with the validation dataset, whereas the TD imputation model were weakly correlated (r = 0.62).This resulted consequently in high values of RMSD and Bias, respectively.On the other hand, for both CARLOS ALBERTO SILVA et al.
HM and HDOM, the adj.R 2 values were high and RMSD and Bias values were low (Table VI).

IMPUTED FOREST ATTRIBUTES
Imputed HM, HDOM and TD of Pinus taeda for the 53 sample plots ranged from 6.1 to 13.4 m; 8.9 to 15.5 m and 1266 to 1585 N/ha, respectively.In general, imputed values are overestimated compared to the reference forest attribute values.Forest stands containing younger trees (3 to 5 years) showed the highest TD values, while the plantations with intermediated stands ages (6-9 years) contained the lowest TD values (Table VII).
The k-NN imputed model created was applied across the extent of the study area to map the forest attributes at the landscape level.The Fig. 8 is showing the spatial distribution of the imputed HM, HDOM and TD values at three stands with ages ranging from 4 to 8 years old.The stand "A" with the age of four years and area of 14.4 ha (Fig. 8a1-a3) has imputed average values of 7.18 (±0.35) m, 8.02 (±0.32) m, and 1574 (±27) N/ha for HM, HDOM and TD, respectively; and the stand "B" with the age of six years and area of 8.5 ha (Fig. 8b1-b3) imputed average values of 10.80 (±0.56) m; 11.78 (±0.75) m, and 1260 (±93) N/ha for HM, HDOM and TD, respectively; while the stand "C" with the age of eight years and area of 15.4 ha (Fig. 8c1-c3), presented imputed average values of 11.44 (±0.66) m; 12.66 (±0.84) m, and 1253 (±88) N/ha for HM, HDOM and TD; respectively.The predicted total number of tree in the stands a), b) and c) were approximately 22,651; 10,710; 19,296 trees, respectively.Differences in HM, HDOM and TD are mainly related to the ages, site index and management practices of each stand.Moreover, differences also depend on other factors such as the soil type and intensity of land use before the establishment of the stands.

DISCUSSION
LiDAR is increasingly being used as a means for obtaining forest structural measurements over large areas, and LiDAR-derived metrics can be used to  predict forest attributes using imputation modeling (Fig. 8).A proper data mining is required in order to select the variables that have a physical mean or relationship with the desired forest attributes.In this investigation, LiDAR-derived metrics such as H99TH, HSD, HSKE and HMIN were found to be the most relevant variables to predict HM, HDOM and TD.These variables were the most efficiently to describe the canopy structure of the forest stands by capturing the majority of variation contained in the reference dataset.Moreover, these LiDAR metrics have been frequently reported as predictor variable for forest attributes modeling from LiDAR data (Hudak et al. 2012, Silva et al. 2014).
LiDAR is an important tool for forest inventory when combined with appropriately designed sample plots in the field and with appropriate modeling tools (Hudak et al. 2008, Silva et al. 2017d).Since this is most probably the first time that k-NN has been applied to predict forest attributes from airborne LiDAR data in Brazil, the obtained results strong corroborates with those presented by Roberts et al. (2005), Hudak et al. (2008, 2014), Falkowski et al. (2009Falkowski et al. ( , 2010)), Fekety et al. (2014), and Gagliasso et al. (2014) in loblolly pine environments.
The main advantage of imputation modeling is the simultaneous coherent prediction of multiple response variables simultaneity (Crookston and Finley 2008).This is a significant asset in forest inventory where typically multiple forest attributes, in particular the HM, HDOM and TD, are of interest.
In this study, the prediction of TD based on LiDAR-derived predictors has proven to be more difficult than predicting HM and HDOM.The lower performance to retrieve this parameter over planted forest is due to the lower variability among forest stands even under different ages.The difficulties in predicting TD has been also highlighted in other studies (Maltamo et al. 2004, Woods et   Although showing great potential for forest inventory procedures, the use of airborne LiDAR in Brazil is still considered experimental due to its expensive cost (Hummel et al. 2011).This is because the number of companies is low, when compared with other countries where this technology is already being used intensively.LiDAR has the advantage of capturing variability across forest stands and mapping forest attributes at the landscape level with high accuracy (Silva et al. 2017a).
In in-situ conventional forest inventory procedures, the variability of the forest attributes at stand level demands time and is costly, and is not always perceptive and therefore less studied.LiDAR demonstrated to be an established technology that is increasingly and being used to characterize spatial variation (Watt et al. 2014).In addition, LiDAR permits an accurate production of digital elevation models, such as DTM and digital surface model (DSM) that are used for hydrological, geomorphological, and other applications (Hudak et al. 2009).In particular, LiDAR-derived DTM has attracted great interest to support forestry operations industrial plantation, like in Brazil, since it provides thorough and detailed information about terrain topography, which in turn is used to road planning and to choose the best skidding system in complex forest areas (Sterenczak and Moskalik 2014).

CONCLUSIONS
This paper describes a framework for predicting forest inventory data in unsampled areas via an RF-k-NN imputation modeling procedure.This procedure incorporates airborne LiDAR-derived predictor metrics.Airborne LiDAR has proven to be an accurate method to spatially estimate stand heights at regular basis.The tree density was not totally explained by the LiDAR-derived metrics into an imputation model in the area-based approach presented herein, and therefore, future research will be carried out to test new approaches using individual tree framework.In many countries, LiDAR has already moved from the research arena to operational reality in the area of forest resource inventory.We hope that the promising results for forest inventory modeling presented herein will stimulate to operational this technology in Brazil.

Figure 1 -
Figure 1 -Location of study area.The circles indicate the location of the Pinus taeda plantations.
percentileCOVCanopy Cover (Percentage of first return above 1.30m)CARLOS ALBERTO SILVA et al.

Figure 2 -
Figure 2 -Flowchart of the LiDAR data processing and forest attributes modeling.

Figure 4 -
Figure 4 -LiDAR-derived metrics selection according to grmsd (generalized root mean square distance).The gray boxplots represent the four best LiDAR-derived metrics for the imputation modeling.The solid black and the doted gray lines represent the mean and standard deviation of the grmsd.

Figure 8 -
Figure 8 -Imputed forest attributes at landscape level.Maps are with 25 meters of spatial resolution.Examples of Pinus taeda stands with 4.0 years (a); 6.0 years (b); and 8.0 years (c).The numbers 1, 2 and 3 represent in order HM, HDOM and TD, respectively.

TABLE VII Summary of the statistics for the observed and imputed forest atributes at different age intervals.
2008, Hirata et al. 2009).The main reason for this discrepancy are most probably because the LiDAR data distribution accounts more for vertical profile what confirms the need of other methodologies for predicting TD rather than plot-area based approach in this type of environment.The methodology already demonstrated by Zandoná et al. (2008) or more recently by Strîmbu and Strîmbu (2015) based directly on point cloud segmentation have shown that the individual tree based approach provides accurate estimates of tree density in pine plantations.Therefore, future research is required to evaluate new approaches using individual tree detection.