A PERCEPTRON-BASED FEATURE SELECTION APPROACH FOR DECISION TREE CLASSIFICATION

Abstract: The use of object-based image analysis (OBIA) for high spatial resolution image classification can be divided into two main steps: segmentation, and the labeling of the resulting objects according to a particular set of features and a classifier. Decision trees are often used to represent human knowledge in the latter step. The difficulty lies in selecting a small number of features from a feature space of spatial, spectral and textural variables to describe the classes of interest, which raises the question of which feature selection (FS) method is the best or most convenient. In this work, an approach for FS within a decision tree was introduced using a single perceptron and the Backpropagation algorithm. Three alternatives were compared: a single input, a double input, and multiple inputs using a sequential backward search (SBS). Test regions were used to evaluate the efficiency of the proposed methods. Results showed that it is possible to use a single perceptron in each node, with an overall accuracy (OA) between 77.6% and 77.9%. Only SBS reached an OA larger than 88%. Thus, the quality of the proposed solution depends on the number of input features.


Introduction
The increasing development of multi and hyperspectral sensors, as well as of object-based image analysis techniques for classifying high spatial resolution satellite imagery, has led to a large amount of data, as illustrated by the substantial set of features available to describe the classes of interest in the classification step. As claimed by Haertel and Landgrebe (1999), a high dimensional feature space might cause problems in the estimation of the classes' covariance matrices. When dealing with a parametric classifier, as the feature space dimensionality increases, so does the number of samples required to provide a reliable estimate of the covariance matrix, which is known as the Hughes phenomenon. In his work, Hughes (1968) concluded that classifier accuracy rises with the number of features only until a maximum accuracy value is reached. Beyond that point, adding new features to the classification algorithm might reduce the accuracy instead of improving it. In addition, employing all the available features, such as the spectral, spatial and textural attributes of image objects, means a higher computational cost. Methods to reduce the feature space dimensionality have therefore been studied by several authors (Guo et al., 2006; Gasca et al., 2006; Zhang and Chau, 2009; Bartenhagen et al., 2010; Geng et al., 2014; Xie et al., 2018; Habermann et al., 2018). Decreasing the number of descriptors in the feature space also lowers the storage capacity required. Some studies have proposed the use of decision trees as an alternative to the statistical approaches commonly used for image classification, and decision trees became popular mainly through their use in object-based image analysis algorithms (Van Coillie et al., 2007; Mahmoudi et al., 2013; Hamediantar and Shafri, 2016; Wang et al., 2016).
In object-based image classification, the system first divides the image into segments according to partitioning parameters and then classifies the segments using spatial, spectral or textural features. The classification of the segments can be performed with a decision tree proposed by the user. This tree is supposed to represent human knowledge on recognizing different targets as well as the user's experience, employing rules and class descriptors derived from the feature space. Once the tree is defined, that is, once the classes to be represented are set, the next step lies in proposing the rules to be applied in each tree node. The question addressed in this work is how to choose the features to be analyzed in each node of a given decision tree.
Another point to be highlighted is that these classification techniques belong to the family of Machine Learning algorithms. Arthur Samuel introduced the Machine Learning (ML) concept in 1959 around a simple question: "How can computers learn to solve problems without being explicitly programmed?". Classification within the context of ML algorithms, such as random forests and decision trees, can thus be described along these lines: a mathematical model is trained on a training set of samples, and its accuracy is then checked on a separate test set.
Regarding the focus of this work, a search strategy was designed to choose the single feature, or the combination of two or more features, that best describes and distinguishes the two classes in each tree node. The proposed method is based on the principles of a single perceptron. One advantage of this technique is that it does not require a specific statistical distribution, unlike the Transformed Divergence, for instance. Another advantage is that it can still propose a linear combination of two or more features when necessary. Beyond these advantages, one of the most important goals of this work is to improve the classification of high spatial resolution satellite imagery using a simple and yet effective Machine Learning algorithm.

Literature Review
Over the last decade, several scientific studies have used object-based image analysis along with feature selection methods to improve high spatial resolution image (HSRI) classification. Among the examples found in the literature are Van Coillie et al. (2007), who used genetic algorithms as feature selection methods; Mahmoudi et al. (2013), who applied multi-agent recognition systems; Hamediantar and Shafri (2016), who relied on the C4.5 algorithm; and Wang et al. (2016), who used the support vector machine and random forest classifiers. Meanwhile, even though the use of spatial information (objects/segments) is widely recommended for classifying HSRI, some authors, e.g., Jung and Ehlers (2016), Persello and Bruzzone (2016) and Pu and Bell (2017), still used the pixel-based approach along with feature selection.
Using the object as a processing element opens up a wide array of possibilities in image classification, since both spectral and spatial features can be combined to describe the classes of interest. However, choosing the features that characterize the classes as reliably as possible can be a challenging task, because of the large number of combinations that can be made from all the available spatial and spectral descriptors, which results in a large dataset.
Techniques aiming to reduce the dimensionality of datasets can be divided into two main groups: feature extraction and feature selection. In feature extraction, the original variables are combined to create a smaller set of new features that preserve the information of the initial feature space. A typical feature extraction example is to apply the Principal Components transformation to the original feature space and then select the most significant new features based on their percentage of representation in the entire dataset (Weinberger and Saul, 2006). Other well-known approaches to extract information, such as Locally Linear Embedding and Multidimensional Scaling, are also described in Zhang and Chau (2009). In recent years, research on deep learning techniques has also brought alternative solutions, as seen in the use of the restricted Boltzmann machine (Sohn et al., 2013). A drawback of such approaches is that they aim to represent the original information of the dataset with fewer variables without considering the purpose of the selection. As a result, the same set may not be suitable for different classification schemas.
On the other hand, feature selection methods are used to select, from an original set of features, the subset with the most informative and distinctive variables to solve a particular problem according to a search strategy and an evaluation criterion (Tang et al., 2014). Firstly, the search strategy is based on looking for the most suitable subset among all the features available. Then, the evaluation criterion measures the success of each selected feature subset evaluating, for instance, the classification results. More examples of how the search strategy and evaluation criterion work can be found in Aguilar et al. (2012), Guo et al. (2016) and also Xiurui et al. (2014).
Within this context, feature selection techniques require a defined aim as well as a corresponding evaluation criterion and search strategy (Xie et al., 2018). The search strategy can be simple, such as the evaluation of all possible feature combinations, or may consist of optimized search algorithms based, for example, on genetic algorithms (Ribeiro, 2006). A comparison of feature selection methods in remote sensing is available in Serpico et al. (2003). The most popular feature selection methods can be arranged in two groups: sequential backward search (SBS) and sequential forward search (SFS). The SBS method is an iterative process that starts from the original set of variables. In each iteration, the least significant variable is discarded, until a desired number of variables is reached or the derived set meets a satisfactory stopping criterion for the given problem. The SFS method, for its part, starts by selecting the most significant variable. Subsequently, the most significant of the remaining variables is added in each iteration.
Over recent decades, studies have introduced advances and new methods in feature selection. Gasca et al. (2006) and Ruck et al. (1990) proposed the use of a multilayer perceptron to solve the problem of high dimensionality in the feature space. Such algorithms select a subset of relevant variables by estimating the relative contribution of the input variables of the classification problem to the output neurons (Gasca et al., 2006). Another recent example using the perceptron in feature selection was described in Habermann et al. (2018).
The perceptron is an essential learning algorithm in neural networks and machine learning. Although its formulation is relatively simple, it has proved to be very efficient. A perceptron is also described as a "binary classifier", since it defines a function that decides whether or not an input represented by a feature vector belongs to a specific class (1 in the positive case, 0 in the negative case). Therefore, its input is a set of values (the feature vector) and its output is either 1 or 0. The algorithm is considered supervised, as it develops its classification rule from inputs whose outputs are known and given by the user. Formally, the classification rule combines the inputs (feature values) to produce the output according to equation 1.
When the input vector contains two values, equation 1 can be written as displayed in equation 2. Note that the decision border is a line in a bidimensional feature space.
When more input variables are used, the function becomes the equation of a plane or a hyperplane, that is, a first-degree polynomial with n weights and one bias b for n input variables. The weights and the bias depend on the given training dataset and are computed with the Backpropagation algorithm. This algorithm corrects the weights according to the difference between the computed output and the desired one.
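The equations referred to above as equations 1 and 2 are not reproduced in this text. As a reading aid, the block below gives the textbook form of a single perceptron that is consistent with the description (output 1 or 0, a line as decision border for two inputs, and a correction driven by the difference between desired and computed output). The symbols \eta (learning rate), d (desired output) and y (computed output) are assumptions introduced here, and the original equations may differ in notation.

```latex
% Assumed standard form of equation 1: decision rule for n input features
y = f(\mathbf{x}) =
\begin{cases}
1, & \text{if } \sum_{i=1}^{n} w_i x_i + b > 0 \\
0, & \text{otherwise}
\end{cases}

% Assumed standard form of equation 2: decision border for two inputs (a line)
w_1 x_1 + w_2 x_2 + b = 0

% Weight correction from the difference between desired (d) and computed (y) output
w_i \leftarrow w_i + \eta \,(d - y)\, x_i ,
\qquad
b \leftarrow b + \eta \,(d - y)
```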

Method
Within the framework of object-based image analysis and classification, the first step is image segmentation. The algorithm used in this study was the well-known multiresolution segmentation (FNEA, fractal net evolution approach), which can be regarded as a special kind of region growing algorithm. This method depends on homogeneity definitions in combination with local and global optimization techniques and requires the choice of three main parameters: scale, shape and compactness. Definitions and more information about these parameters and the segmentation approach can be found in Baatz and Schäpe (2000). Besides spectral features, the resulting regions also allow spatial and textural features to be computed. These features describe the segments and can be used to label them according to previously defined classes of interest, which leads to the next processing step in OBIA: classification.
When a decision tree is proposed, the next problem is to find the best feature (or set of features) that can be used in each node to support the decisions. A decision tree uses an organized tree-like model of decisions and relations between these decisions to obtain the classification result of the branches. The principle is to break up a complex decision (the desired classes of interest) into several more straightforward choices (nodes), hoping that the final product is the desired and most correct classification (Safavian and Landgrebe, 1991). Although the principle is simple, the algorithm requires applying rules in each node, which should be able to perform the binary classification based on a reduced set of variables and thresholds.
In this work, the idea was to use a single perceptron to find the best variable (or combination of variables) for each node of the classification network, using only one variable, a pair of variables, and the SBS. It is relevant to highlight that the decision tree has nodes with binary solutions only. As described in equation 3, the decision is taken based on a polynomial whose size depends on the number of input variables. The perceptron offers a way to compute the a priori unknown parameters w, given a set of input features and a set of training samples for which the expected classification output is known. The Backpropagation algorithm is used to estimate the optimal set of parameters (wi) of the linear function used in the classification node, together with the selected features (xi).
The evaluation criterion is the partial accuracy achieved with the candidate solution. As there are just two possible classes in each node, the accuracy is measured by the percentage of correctly classified elements.
In a first attempt, the perceptron was used to select the best single feature for each node. For this purpose, all the features (Tables 1 and 2) were evaluated. This was a relatively simple task, because just two parameters had to be computed for each feature: the weight w1 and the bias b in equation 4. The feature that best separated the two proposed classes was taken as the solution, and the value of the bias gave the desired threshold for separating the classes.
Finally, the perceptron was used to select the least significant feature within an SBS. The search started with all possible variables and computed the set of weights that enabled an optimal solution. The feature associated with the lowest weight was then discarded and the procedure was repeated. This iterative process was carried out until the binary classification accuracy of each node was above a given value. The results were finally evaluated using test samples. For the evaluation, the final classification was considered, not the results in each node. A confusion matrix was also computed for comparison purposes.
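A minimal sketch of how the parameters of one binary node could be estimated is given below. It is not the implementation used in this study: it uses the classic perceptron learning rule (an error-driven weight correction with a step activation), which for a single unit plays the role the text assigns to Backpropagation. The function names, the learning rate, the number of epochs and the sample values are all assumptions made for illustration, and the features are assumed to be scaled to comparable ranges.

```python
import random


def predict(weights, bias, x):
    """Step-activation output: 1 if w.x + b > 0, otherwise 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=200, learning_rate=0.1, seed=0):
    """Estimate (weights, bias) for one binary node with the perceptron rule."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        mistakes = 0
        for i in order:
            # error is +1, 0 or -1: desired output minus computed output
            error = labels[i] - predict(weights, bias, samples[i])
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, samples[i])]
                bias += learning_rate * error
        if mistakes == 0:  # every training sample is classified correctly
            break
    return weights, bias


def node_accuracy(weights, bias, samples, labels):
    """Fraction of correctly classified elements (the node evaluation criterion)."""
    hits = sum(predict(weights, bias, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)


# Hypothetical training segments for one node, two scaled features each
segments = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 0.8]]
classes = [0, 0, 1, 1]
w, b = train_perceptron(segments, classes)
print(w, b, node_accuracy(w, b, segments, classes))
```

For linearly separable samples such as these, the rule stops as soon as an epoch produces no mistakes; for overlapping classes it simply runs for the given number of epochs.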
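The single-input selection described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in the study: each feature is fitted separately with a one-dimensional perceptron, the best weights seen during training are kept (a "pocket" variant, added here so that non-separable features still return a usable result), and the class threshold is read from the decision border w1*x + b = 0, that is, x = -b/w1. The feature names and values are hypothetical.

```python
def train_perceptron_1d(values, labels, epochs=100, learning_rate=0.1):
    """Fit w1*x + b with the perceptron rule; return the best (accuracy, w1, b) seen."""
    w1, b = 0.0, 0.0
    best = (0.0, w1, b)
    for _ in range(epochs):
        for x, y in zip(values, labels):
            predicted = 1 if w1 * x + b > 0 else 0
            w1 += learning_rate * (y - predicted) * x
            b += learning_rate * (y - predicted)
        hits = sum((1 if w1 * x + b > 0 else 0) == y
                   for x, y in zip(values, labels))
        accuracy = hits / len(values)
        if accuracy > best[0]:
            best = (accuracy, w1, b)
        if accuracy == 1.0:
            break
    return best


def select_single_feature(table, labels):
    """table maps a feature name to its values, one value per training segment."""
    results = {}
    for name, values in table.items():
        accuracy, w1, b = train_perceptron_1d(values, labels)
        # Decision border of w1*x + b = 0, expressed as a threshold on the feature
        threshold = -b / w1 if w1 != 0 else None
        results[name] = {"accuracy": accuracy, "threshold": threshold}
    best_name = max(results, key=lambda n: results[n]["accuracy"])
    return best_name, results


# Hypothetical, scaled feature values for six training segments of one node
labels = [0, 0, 0, 1, 1, 1]
table = {
    "brightness": [0.5, 0.9, 0.3, 0.4, 0.8, 0.6],   # classes overlap
    "ndvi": [0.10, 0.15, 0.20, 0.70, 0.80, 0.75],   # classes separable
}
best_name, results = select_single_feature(table, labels)
print(best_name, results[best_name])
```

Because every feature is evaluated independently, the cost grows only linearly with the number of features, which is why this step is cheap compared with the pairwise search.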
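The backward search for one node might look like the sketch below. Several details are assumptions, since the text does not fix them: the "lowest weight" is taken as the smallest absolute weight, the features are assumed to be scaled to comparable ranges so that weights can be compared, and the search stops when discarding a further feature would push the node accuracy below a hypothetical `min_accuracy`. The trainer is a simple perceptron rule that keeps the best weights seen, used here in place of the Backpropagation step.

```python
def train(samples, labels, epochs=200, learning_rate=0.1):
    """Perceptron rule; returns the best (weights, bias, accuracy) seen after any epoch."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    best = (list(weights), bias, 0.0)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            out = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
            weights = [w + learning_rate * (y - out) * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * (y - out)
        hits = sum(
            (1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0) == y
            for x, y in zip(samples, labels))
        accuracy = hits / len(samples)
        if accuracy > best[2]:
            best = (list(weights), bias, accuracy)
        if accuracy == 1.0:
            break
    return best


def project(samples, columns):
    """Keep only the selected feature columns of every sample."""
    return [[row[c] for c in columns] for row in samples]


def sequential_backward_search(samples, labels, names, min_accuracy=0.9):
    active = list(range(len(names)))
    weights, bias, accuracy = train(project(samples, active), labels)
    while len(active) > 1:
        # Candidate for removal: the feature with the smallest absolute weight
        weakest = min(range(len(active)), key=lambda k: abs(weights[k]))
        candidate = active[:weakest] + active[weakest + 1:]
        cand_w, cand_b, cand_acc = train(project(samples, candidate), labels)
        if cand_acc < min_accuracy:
            break  # discarding this feature would cost too much accuracy
        active, weights, bias, accuracy = candidate, cand_w, cand_b, cand_acc
    return [names[i] for i in active], weights, bias, accuracy


# Hypothetical scaled features: one informative, one constant, one near-noise
names = ["ndvi", "constant", "noise"]
samples = [
    [0.10, 0.0, 0.01], [0.20, 0.0, -0.01], [0.15, 0.0, 0.00],   # class 0
    [0.80, 0.0, 0.01], [0.70, 0.0, -0.01], [0.90, 0.0, 0.00],   # class 1
]
labels = [0, 0, 0, 1, 1, 1]
selected, weights, bias, accuracy = sequential_backward_search(samples, labels, names)
print(selected, accuracy)
```

In this toy case the uninformative columns are discarded one after the other and only the informative feature remains; with real segment features the loop would typically stop earlier, leaving a linear combination of several features.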

Experiment
The study was developed using a mosaic of level 3A images from the RapidEye satellite sensor. This product is radiometrically, sensor and geometrically corrected and aligned to a cartographic map projection. The mosaic images were acquired in 2014 over the Vossoroca basin, which covers around 136.58 km² and contains the Vossoroca dam, in the city of Tijucas do Sul, located in the state of Paraná, Brazil (figure 1). The images are georeferenced to the WGS84/UTM zone 22 coordinate system and have a spatial resolution of five meters and five spectral bands: blue (440-510 nm), green (520-590 nm), red (630-685 nm), red edge (690-730 nm) and near-infrared (760-850 nm). The Vossoroca dam is used for power generation and the basin is predominantly rural. The main land cover classes present in the scene are "forest cover", "agriculture", "bare soil" and small "urban areas". Figure 1 displays the basin in RGB composition as well as its geographic location.
This work is a branch of Mudak, a research project on hydrology and non-point pollution sources in the Vossoroca basin. The project is a cooperation between universities and enterprises from Brazil and Germany. Therefore, the land cover classes of interest inserted into the hierarchical semantic network were defined based on the knowledge of both the hydrology and the remote sensing researchers involved in the project, according to the desired inputs for hydrological models. Thus, the main classes of interest were: "agriculture", representing the portions of the land cover with agricultural activity; "woods", representing the portions with grass and some single trees; "forest", covering natural forest and reforestation areas; "urban area", representing roads and buildings; "bare land", representing the agricultural portions without vegetation cover; and "water", representing the reservoir, which is the main water body, as well as the other water bodies.
To ease the classification processing steps, the hierarchical semantic network was used. The first node comprised the "water" and "non-water" classes, to distinguish water bodies from the other classes. From the "non-water" node, the classes "vegetation" and "non-vegetation" were used to split segments with and without vegetation coverage. From the "vegetation" node, "low vegetation" was named as the final "agriculture" class, while the "high vegetation" part was used to separate "forest" and "woods". The "non-vegetation" node (named "bare" in Figure 3) was used to distinguish, among the segments without any vegetation coverage, the class "bare land" from the impervious areas, which were designated as "urban". As the main classes are water, agriculture, woods, forest, bare land and urban, these are the classes used to analyze the results in the confusion matrices.
For the image segmentation, only one level was used. This decision was made to test the feature selection algorithm without further significant influence of the user, who could optimize the segmentation using their own knowledge. The segmentation parameters were 100 for the scale parameter and 0.5 for both the shape and compactness parameters. All the spectral bands were considered in the segmentation step, with a weight of 1 for the red, red edge, green and blue bands, and a weight of 2 for the near-infrared (NIR) band. The NIR band received more emphasis because the scene land cover was mainly composed of vegetated areas and water bodies, and the NIR plays an essential role in describing them and their spectral curves.
The spectral and spatial features were computed and stored in a database for each training segment. The samples were separated into training and test sets. It is important to highlight that the number of samples is not the same for every class, so that more samples are available for some classes. Table 1 displays the list of spectral variables, which include the mean value in each spectral band, some spectral indexes and the maximal standard deviation of the region in each band. Table 2 displays the spatial variables computed for each segment. Figures 2a, 2b, 2c, 2d and 2e display the spectral response of the classes of interest for the normalized indexes and the spectral bands used as features, based on the training sample data. Table 2 contains examples of the additional spatial features used during the feature selection step along with the spectral features. They were computed using the eCognition software. More detail on how they are calculated, including their mathematical formulas and descriptions, can be found in Gonzalez and Woods (2002). The resulting decision tree for the classification is displayed in figure 3 and follows the hydrological relation between the groups of classes.
Training areas were selected for each class using the image objects (segments) as samples, ensuring more than 30 segments per class, as is usual practice in statistics (Shewhart, 1939). The training samples collected by the analyst were based on the objects that could best describe the classes of interest. The quantities of each class are presented below (table 3):
The samples were collected following the logic of selecting objects well distributed across the image scene, based on image interpretation and on the analyst's knowledge of the area. Less than 8% of the pixels of the total area were used for the set of training samples. The test samples followed the same rule of well-distributed objects but were collected separately, respecting the classes' proportions in the scene after classification as well as a minimum number of 30 samples. The confusion matrices using the test samples are presented in tables 5, 6 and 7.
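The hierarchy of binary nodes described above can be represented as a small nested structure in which every node holds a linear rule and two outcomes. The sketch below only illustrates that structure: the feature names, weights and biases are hypothetical placeholders, not the rules selected in this study.

```python
def make_node(features, weights, bias, if_positive, if_negative):
    """A binary node: a linear rule over selected features plus two outcomes."""
    return {"features": features, "weights": weights, "bias": bias,
            "positive": if_positive, "negative": if_negative}


def classify(node, segment):
    """Walk down the tree until a leaf (a class name) is reached."""
    while isinstance(node, dict):
        activation = sum(w * segment[f]
                         for f, w in zip(node["features"], node["weights"]))
        activation += node["bias"]
        node = node["positive"] if activation > 0 else node["negative"]
    return node


# Hypothetical rules; the leaves are the six final classes of the study
forest_or_woods = make_node(["ndvi_re"], [1.0], -0.6, "forest", "woods")
high_or_low = make_node(["ndvi"], [1.0], -0.5, forest_or_woods, "agriculture")
urban_or_bare = make_node(["brightness"], [1.0], -0.5, "urban", "bare land")
vegetation = make_node(["ndvi"], [1.0], -0.2, high_or_low, urban_or_bare)
root = make_node(["nir"], [-1.0], 0.1, "water", vegetation)

segment = {"nir": 0.4, "ndvi": 0.7, "ndvi_re": 0.3, "brightness": 0.2}
print(classify(root, segment))
```

Each node only needs the features chosen for it, so the feature selection can be carried out node by node, independently of the rest of the tree.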

Results and Discussion
In the first experiment, the perceptron concept was applied in each node to select the best features to distinguish the classes; the set of features for each node is displayed in the second column of Table 3. The third column of Table 3 shows the variables selected for each node, and their weights, when two features were combined at a time. In some cases, in this part of the experiment, the perceptron converged very quickly, demonstrating that most of the time only one feature would be enough to separate two classes. This happened when one of the feature weights was very small compared to the other one. Otherwise, the solution was a balanced combination of two variables. Figure 4a displays an example in which only one variable is necessary. In this case, the spatial feature "border length" was used, and the decision border is simply a horizontal line. On the other hand, in some nodes the decision border is only achieved when using a linear combination of two variables, as shown in figure 4b. There, it is clear that a horizontal line was not enough to distinguish the two classes, hence the need for two variables to separate low from high vegetation.
Figure 4: Examples of the computed solutions. (a) The classes can be separated using only one feature (class 1 in black and class 2 in red); (b) A linear combination of two features is necessary to separate classes 1 (blue) and 2 (red). Source: The author.
Finally, the variables selected with the SBS appear in the last column of Table 3. The accuracies of the solutions ranged between 77% and 88% (table 8). Although the water index NDWI was expected to be the best option to separate water from other objects, the methods proposed other features. The SBS approach selected various optimal solutions and the red channel was chosen as the best option. As the expected feature NDWI was not selected, a plot (Figure 5) of this feature for water and non-water samples was produced. It showed an evident overlap between areas with water and the other classes, which justifies the choice of the SBS algorithm.
Figure 5: Chart using the NDWI feature for the classes water and non-water. Source: The author.
Normalized difference vegetation indices are considered to be a reasonable solution for separating areas covered by vegetation. Still, such indices were not selected for most of the nodes with vegetation classes when only one variable was used as input. For these nodes, the solutions proposed by the pairwise method, as well as those recommended by the SBS, seemed more reasonable, since they select a vegetation index. For example, the separation of vegetation/non-vegetation using the SBS proposes to combine the NDVIr with the NDVIre, which are very similar and reflect the contrast between near-infrared and red. The water index is also included, with a relatively lower weight.
A significant difference concerning the chosen features was found in node three, which was used to separate low from high vegetation. For this node, the SBS solution was based on the normalized difference indexes, whereas the pairwise selection includes the green channel and the NDWI. As the NDWI formula adopted was based on the difference between the near-infrared and the green bands, healthy and dense vegetation would present higher values, which is reasonable, since this index can also be called GNDVI (Green NDVI) (Gitelson et al., 1996). The single input perceptron selected the green channel, but with lower performance.
When considering woods and forests, the SBS proposes the use of the normalized difference indexes and the NDWI, which is computed with the NIR band, mainly because of the leaf water content. The other methods chose the NDVIre, calculated with the red edge band instead of the red one and commonly used to separate vegetation species based on the leaf structure (Sims and Gamon, 2002; Centeno, 2009). This is a quite appropriate choice in terms of the spectral properties of the leaf.
In the last node, urban areas were separated from regions where the soil had no cover (bare land). The SBS selected the maximum difference between the bands, the green and the blue channels, while the pairwise selection opted for the difference between bands and the standard deviation. The choice of bands in the visible spectrum is suitable, since urban areas are better identified in these regions of the spectrum. The single input approach did not achieve a satisfactory solution in the last node (bare land and urban classes): the brightness was selected, with very low classification accuracy as a result.
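The normalized difference indexes discussed here share one form, (a - b) / (a + b). The sketch below computes them from the mean band values of a segment. The index definitions follow the descriptions in the text (NDVIr with the red band, NDVIre with the red edge band, and the NDWI variant based on the near-infrared and green bands); the function names, dictionary keys and reflectance values are assumptions for illustration, and the exact formulas used to build Table 1 may differ.

```python
def normalized_difference(band_a, band_b):
    """(a - b) / (a + b); returns 0.0 when both bands are zero."""
    total = band_a + band_b
    if total == 0:
        return 0.0
    return (band_a - band_b) / total


def segment_indices(mean_bands):
    """mean_bands maps band names to the mean value of the segment in that band."""
    nir = mean_bands["nir"]
    return {
        "NDVIr": normalized_difference(nir, mean_bands["red"]),
        "NDVIre": normalized_difference(nir, mean_bands["red_edge"]),
        # NIR/green variant described in the text, also known as GNDVI
        "NDWI": normalized_difference(nir, mean_bands["green"]),
    }


# Hypothetical mean reflectances of a vegetated segment
indices = segment_indices({"nir": 0.5, "red": 0.1, "red_edge": 0.3, "green": 0.1})
print(indices)
```

With this definition, dense vegetation (high NIR, low green) yields high NDWI values, which is consistent with the index being selected in the vegetation nodes rather than in the water node.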
Once the variables for each node were selected, the classification was performed by applying the decision tree with each of the three sets of computed decision rules. The accuracies of the methods were evaluated by applying the decision tree to other samples of the same area, the test samples. The number of samples was not equal for all the classes, because the classes were not homogeneously distributed in the image. Tables 5, 6 and 7 show the confusion matrix of each experiment.
The single input method and the pairwise selection performed about equally, with an overall accuracy around 78% and a Kappa index around 0.73, which is not considered a suitable result, according to Anderson et al. (1976). On the other hand, the overall accuracy of the thematic image obtained with the rules selected by the SBS method was 88.20%, a value that characterizes a good result according to the same authors. There was significant confusion between forest and woods, a fact that affects the producer's accuracy of these classes. The Kappa value computed from the confusion matrix was 0.806, which can also be considered adequate, as stated by Jensen (2005).
The second method, the pairwise selection, had lower performance. There was considerable confusion between urban areas and bare land. As a consequence, there was a decrease in both the user's and the producer's accuracy of "urban areas" and in the user's accuracy of "bare land". The overall accuracy of the pairwise selection reached only 77.50% (moderate agreement) and the Kappa index was 0.72, values below those obtained in the SBS experiment. Although the values were lower, it must be taken into account that, in the second experiment, a maximum of two features was imposed on the selection; this limit was set for performance reasons. Given the considerations made in the earlier paragraphs, the pairwise selection results were not worthless.
Similar results were obtained using a single feature as input.
With a single feature as input, the overall accuracy was 77.86% (moderate agreement) and the Kappa index was 0.73. The results were similar to those obtained using two variables as input. Although the producer's accuracy values were lower, the user's accuracy values were better. The classes "urban" and "bare land" showed higher confusion. This result was expected, since the accuracy obtained when selecting the features was already low.
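The accuracy measures reported in this section can be derived from a confusion matrix as sketched below. The sketch assumes that rows hold the reference (test sample) classes and columns hold the classified classes, so that the producer's accuracy is computed along rows and the user's accuracy along columns. The matrix values are hypothetical and are not taken from tables 5, 6 or 7.

```python
def accuracy_measures(matrix):
    """Overall accuracy, Kappa, producer's and user's accuracies.

    Assumed layout: matrix[i][j] counts samples of reference class i
    that were assigned to class j by the classifier.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(n))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]

    overall = diagonal / total
    # Agreement expected by chance, from the marginal totals
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (overall - chance) / (1 - chance)

    producers = [matrix[i][i] / row_totals[i] for i in range(n)]
    users = [matrix[j][j] / col_totals[j] for j in range(n)]
    return overall, kappa, producers, users


# Hypothetical two-class confusion matrix
matrix = [[50, 10],
          [5, 35]]
overall, kappa, producers, users = accuracy_measures(matrix)
print(round(overall, 4), round(kappa, 4))
```

The Kappa index is lower than the overall accuracy because it discounts the agreement that would be expected by chance given the class totals, which matters here because the classes are not equally represented in the test samples.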

Conclusion
In this work, three feature selection approaches based on the perceptron concept were compared. The first advantage of using the perceptron in the methods developed here is that this technique does not require any prior statistical assumption on the data distribution, such as the premise of normality within the classes. Not making any statistical assumptions is a crucial point mainly because, when using the OBIA approach for image classification purposes, pixels are grouped in the segmentation step, which reduces the number of available samples.
The pairwise selection demands high computational effort, because it consists of an exhaustive search within a bidimensional space. As the size of the feature space increases substantially, including the spatial features, the number of possible combinations also increases significantly. The advantage of these methods is that they can find solutions to the class separation problem by combining more than one feature in a linear discriminant function. Moreover, the use of the perceptron with a single input achieved results comparable to those obtained using a pair of variables as input, but with lower effort.
The SBS algorithm produced the best results and enabled the identification of the most significant variables within a set of spatial, spectral and textural features, whereas the single perceptron allows new variables to be computed through a linear combination of the original features, which enables the classification in a decision tree.
The use of the perceptron to find the best variables for a decision tree proved to be efficient and can be an alternative for feature selection. Not only does it avoid a human visual comparison of features, but it also offers the advantage of proposing linear variable combinations as a solution. More sophisticated systems, such as a multilayer perceptron, could also be used to estimate more complex discriminant functions.