K-NEAREST NEIGHBORS METHOD FOR PREDICTION OF FUEL CONSUMPTION IN TRACTOR–CHISEL PLOW SYSTEMS

Most important farm operations require a significant amount of energy, which consumes a major portion of the farm’s budget. Consequently, analyzing the fuel consumption of agricultural machinery for farm operations of different sizes makes it possible to predict fuel consumption and to set an appropriate energy budget. The main purpose of this study was to determine the ability of the k-nearest neighbors (KNN) algorithm to predict the fuel consumption of tractor–chisel plow systems. Of 173 data points obtained from the literature, 139 were used as the training set and the remaining 34 as the test set. The input parameters were tractor power, plowing width, depth and speed of plowing, soil percentages of sand, silt, and clay, initial soil moisture content, and initial soil bulk density. The predictive power of the KNN method was compared with that of multiple linear regression (MLR), using experimental data to assess both methods. The KNN method generated better results than the MLR method: the test dataset correlation coefficients were 0.817 for the KNN (k = 2) method and 0.422 for the MLR method. This study suggests that the KNN method with k = 2 (two nearest neighbors) is suitable for estimating the fuel consumption of tractor–chisel plow systems for input values within the studied range.


INTRODUCTION
Developing the ability to predict the fuel consumption of tractor-machinery systems is extremely beneficial for farm budgeting and management; fuel consumption is measured as the amount of fuel used during a specific time period (Grisso et al., 2010). Furthermore, efficient planning of mechanized farming operations is a complex task, because it involves multiple factors related to the soil composition, the machine used, and the decision-making personnel (Borges et al., 2017). Additionally, predicting tractor fuel consumption can lead to more appropriate tractor management decisions (Karparvarfard & Rahmanian-Koushkaki, 2015). Thus, predictive models capable of forecasting the fuel consumption of tractor-machinery systems under different conditions can help farmers optimize their fuel expenditure. Considerable research has addressed the prediction of fuel consumption during tillage in selected regions, using traditional, statistical, and modern computational methods, or combinations of these (Karparvarfard & Rahmanian-Koushkaki, 2015; Tayel et al., 2015; Almaliki et al., 2016; Borges et al., 2017; Ranjbarian et al., 2017). Successful prediction of the fuel consumption of tractor-chisel systems can aid in selecting tractors that minimize the cost of fuel for tillage.
The chisel plow is considered a primary tool for tillage, because it is mainly used in initial soil working operations (Kheiry et al., 2017). The performance parameters of chisel plows include measurements of draft, drawbar power, actual field capacity, field efficiency, and fuel consumption rate (Bashir et al., 2015). However, prediction of fuel consumption is one of the most challenging tasks; thus, scientists, manufacturers, and users have shown great interest in developing methods that would predict fuel consumption. Although many algorithms have been proposed, accurate prediction of fuel consumption during tillage remains very difficult. Because the price of diesel fuel is high, the ability to predict fuel consumption accurately is potentially advantageous for controlling the cost of crop production and would enable farmers to adjust their equipment for optimal fuel utilization.
Engenharia Agrícola, Jaboticabal, v.39, n.6, p.729-736, nov./dec. 2019
Tillage is among the most important farm operations, because it requires significant energy and consumes a major portion of the farm's energy budget (Rashidi et al., 2013; Mari et al., 2014). Farmers typically record large amounts of data related to the operation of agricultural machinery, and processing, analyzing, and extracting significant information from this abundance of farm machinery data is necessary. Utilization of information and modern computational methods, such as machine-learning algorithms, provides knowledge and reveals trends related to the rate of fuel consumption, which in turn affects the choice of appropriate tractor-implement systems and operating conditions, thus reducing the cost of production.
Machine-learning methods are used to solve problems in which the relationship between input and output variables is not known or is difficult to derive. The term "learning" denotes the automatic acquisition of structural descriptions from examples of what is being described (McQueen et al., 1995). Unlike traditional statistical methods, machine-learning methods do not make assumptions about the structure of the model that describes the data. This characteristic is useful for modeling complex nonlinear behavior (Gonzalez-Sanchez et al., 2014). Many methods, such as artificial neural networks (ANNs), radial basis function networks (RBFs), k-nearest neighbors (KNN), and self-organizing maps (SOMs), have been developed for prediction of time series given large datasets with many explanatory variables. The KNN method can be used on nonlinear data for which classical assumptions cannot be made. The KNN method is considered a simple method for analysis of multidimensional data (Alkhatib et al., 2013). Although this method is simple, it is advantageous compared with other methods, allowing the user to generalize based on relatively small training sets (Rokach, 2010).
Originally, KNN was used for classification; however, in the past few decades, this method has also been used for prediction. In the classification approach, a dataset is divided into training and testing datasets. The KNN method uses a similarity measure for comparing the testing data with the training data. To predict an output variable, it chooses the k data points from the training dataset that are closest to the test point. It is also regarded as a lazy learning method, which does not build a model or a function, but yields the closest k records of the training dataset that are the most similar to the points that are to be categorized (Alkhatib et al., 2013). In the KNN approach, it is especially important to choose the number of nearest neighbors, k, properly, because this choice can strongly affect the predictive power of the method. Small values of k lead to overfitting (high variance), while large values of k result in very biased models. The KNN method has been used, for example, for weather prediction; the resulting system was relatively accurate for forecasts months into the future (Jan et al., 2008). The performances of the nearest neighbors (IBk), regression by discretization, and isotonic regression classifiers were compared for the prediction of predefined precipitation classes over Voi, Kenya (Mwagha et al., 2014). The study revealed that the nearest-neighbors method is suitable for prediction of precipitation, given historic rainfall data as a training set. Also, the predictive accuracies with respect to crop yield of multiple linear regression, M5-Prime regression trees, perceptron multilayer neural networks, support vector regression, and KNN methods were compared (Gonzalez-Sanchez et al., 2014). Real data for an irrigation zone in Mexico were used for building the models. The models were tested on two consecutive-year samples, and the M5-Prime and KNN methods yielded the lowest average root mean square errors (RMSEs).
To assist investors, management, decision makers, and users in making correct and informed investment-related decisions, the KNN algorithm and nonlinear regression were also applied to predict stock prices for a sample of six major companies listed on the Jordanian stock exchange. The results showed that the KNN algorithm was robust, with a small error ratio; consequently, the results were rational and also reasonable (Alkhatib et al., 2013).
A KNN classifier was utilized for prediction of daily energy consumption by analyzing historical data on hourly consumption of 520 apartments in Seoul, Republic of Korea. The data were divided into training and testing subsets, with different training and testing ratios, and different qualitative and quantitative measures were used to determine the performance and efficiency of the predictor. The highest accuracy of 95.96% was obtained for the 60%/40% training/testing-set ratio (Wahid & Kim, 2016).
In light of increasing diesel fuel prices, and because existing predictive models of fuel consumption are not satisfactory, collecting large amounts of data on tractor-machinery systems is crucial for economic and farm management analysis. The main objective of this research was to evaluate the machine-learning KNN method for prediction of fuel consumption for tractors of different sizes operating an implement (chisel plow) of different specifications. The input parameters were tractor power, plowing width, plowing depth, plowing speed, sand, silt, and clay percentages in the soil, initial soil moisture content, and initial soil bulk density. In addition, the KNN predictive ability was compared with that of the multiple linear regression method.

Construction of the fuel consumption rate model
The fuel consumption rate model was constructed using a machine-learning algorithm, KNN (IBk), implemented in the WEKA environment. WEKA is an application for performing data-mining tasks that was originally developed at the University of Waikato in New Zealand (Hall et al., 2009). It contains a large collection of state-of-the-art machine-learning and data-mining algorithms written in Java. WEKA has been widely used for many purposes and contains tools for regression, classification, clustering, association rules, visualization, and data preprocessing (Naik & Samant, 2016). The Explorer is the main graphical user interface of WEKA. It has six different panels, accessed by the tabs at the top, which correspond to the various data-mining tasks that are supported by WEKA. In the Preprocess panel, data can be loaded from a file or extracted from a database using an SQL query. The data file can be in the CSV format, or in the system's native ARFF file format. Once a dataset has been read, various data preprocessing tools called "filters" can be applied. The input parameters are the tractor power, the plowing width, the depth and speed of plowing, the soil percentages of sand, silt, and clay, the initial soil moisture content, and the initial soil bulk density. Through the Explorer's second panel, called "Classify", classification and regression algorithms can be applied to the preprocessed data. This panel also enables users to evaluate the resulting models, both numerically, through statistical estimation, and graphically, through visualization of the data and examination of the model (if the model structure is amenable to visualization). Users can also load and save models. In this study, the KNN algorithm was evaluated with the default parameters defined in the WEKA application, except that the k value was varied from 1 to 5.

KNN method
The KNN algorithm is a machine-learning algorithm that is considered a lazy learning algorithm, with a low computational cost and very simple implementation (Alkhatib et al., 2013). It supports classification and regression problems. When making a prediction, it stores the entire training dataset and queries it to locate k data points in the training set that are most similar to the data point to be classified. Therefore, there is no model other than the raw training dataset, and the only computation performed is querying of the training dataset.
When the KNN method is used for regression, the response value is calculated as a weighted sum of the responses of the k neighbors, where the weight is inversely proportional to the distance from the input record. This distance is generally Euclidean. The Euclidean distance function is defined by Wilson & Martinez (2000) as follows:
$$D(x, p) = \sqrt{\sum_{i=1}^{m} (x_i - p_i)^2} \tag{1}$$
where x and p are the query point and a case from the set of examples, respectively, and m is the number of input variables (attributes).
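The Euclidean distance and the inverse-distance weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the normalization of the weights so that they sum to 1, the small `eps` guard against division by zero, and the two-attribute example values are all hypothetical and are not taken from the study or from WEKA.

```python
import math


def euclidean_distance(x, p):
    """Euclidean distance between a query point x and a stored example p."""
    if len(x) != len(p):
        raise ValueError("x and p must have the same number of attributes")
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))


def inverse_distance_weights(distances, eps=1e-12):
    """Weights proportional to 1/distance, normalized to sum to 1.

    eps is an assumed guard so that a zero distance does not divide by zero.
    """
    raw = [1.0 / (d + eps) for d in distances]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_prediction(neighbor_outputs, distances):
    """Weighted sum of the neighbors' responses."""
    weights = inverse_distance_weights(distances)
    return sum(w * y for w, y in zip(weights, neighbor_outputs))


# Hypothetical example with m = 2 attributes and k = 2 neighbors.
query = [15.0, 4.0]
neighbors = [[12.0, 4.0], [15.0, 8.0]]   # distances 3 and 4 from the query
outputs = [10.0, 17.0]                   # hypothetical responses

dists = [euclidean_distance(query, p) for p in neighbors]
prediction = weighted_prediction(outputs, dists)
print(dists, round(prediction, 6))
```

In this toy case the closer neighbor receives weight 4/7 and the farther one 3/7, so the prediction lies nearer to the response of the closer neighbor.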
The algorithm is sensitive to the selection of k. The KNN method has several attractive properties. Beyond the choice of k and the distance metric, no optimization or training is required (Hand et al., 2001). The method takes full advantage of local information and can yield highly nonlinear and highly adaptive decision boundaries. The method's disadvantages are its high computational and memory costs, because all the available data points (i.e., samples) must be scanned to determine the most similar neighbors. The calculation of distances becomes more problematic for higher-dimensional datasets. Despite these issues, the method is popular because of its ease of implementation and the above-mentioned properties (Gonzalez-Sanchez et al., 2014).
When using the KNN algorithm, the dataset should be divided into two subsets: the training dataset, on which the algorithm bases its predictions; and the testing dataset, which is used to test the algorithm's performance on previously unseen data (Imandoust & Bolandraftar, 2013). The training dataset is represented as vectors; then, for each point in the testing dataset, the distance from the data point to its neighbors in the training dataset is calculated using the Euclidean distance measure in the WEKA tool (Vainionpää & Davidsson, 2014). In the present study, the training dataset contained 139 instances and the testing dataset contained 34 instances.
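A random division of 173 instances into 139 training and 34 testing instances can be sketched as follows. The helper name `train_test_split`, the fixed seed, and the use of integer placeholders in place of the real records are assumptions made for illustration; the study itself performed this step in WEKA, and the exact randomization it used is not reproduced here.

```python
import random


def train_test_split(records, n_train, seed=0):
    """Shuffle a copy of records and split it into training and testing parts."""
    if not 0 <= n_train <= len(records):
        raise ValueError("n_train must be between 0 and len(records)")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder record identifiers standing in for the 173 literature data points.
records = list(range(173))
train, test = train_test_split(records, n_train=139, seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which matters when several values of k are compared on the same testing set.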
After selecting the value of k, predictions can be made from the k nearest examples: the prediction is the average of the outcomes of the k nearest neighbors, as specified in eq. (2) (Imandoust & Bolandraftar, 2013):
$$y = \frac{1}{k} \sum_{i=1}^{k} y_i \tag{2}$$
where $y_i$ is the outcome of the i-th example, and y is the prediction (outcome) for the query point.
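A minimal sketch of KNN regression by simple averaging, as described above, is given below. The function name `knn_predict` and the small two-attribute training set are hypothetical; the sketch uses a brute-force scan of the training data and is not the IBk implementation used in the study.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Predict the outcome for query as the mean outcome of its k nearest examples."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training examples")
    # Pair every training example's distance to the query with its outcome.
    ranked = sorted(
        ((math.dist(query, x), y) for x, y in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical training examples with two attributes each.
train_X = [[1.0, 1.0], [2.0, 2.0], [5.0, 5.0], [9.0, 9.0]]
train_y = [4.0, 6.0, 11.0, 18.0]
query = [1.5, 1.4]

for k in (1, 2, 3):
    print(k, knn_predict(train_X, train_y, query, k))
```

With k = 1 the prediction simply copies the closest example, whereas larger k smooths the prediction over more distant examples.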

Multiple linear regression
Multiple linear regression (MLR) was applied using WEKA. The linear regression captures the variation in the fuel consumption as a function of tractor power, plowing width, depth and speed, soil percentages of sand, silt, and clay, initial soil moisture content, and initial soil bulk density.
Wahab, 1994; Al-Taieb, 1998; Gomaa, 1998; Abd Alla et al., 1999; El Raie et al., 1999; Metwally et al., 2000; Younis et al., 2000; Badawy et al., 2001; El Sayed & El Kilani, 2002; Al-Jebory, 2011). The collected data were from field experiments in which different chisel plows were used (only one pass over soil) in different sites with different moisture levels, bulk densities, and textures, and with different changeable working conditions. The dataset contained 173 instances, each with nine attributes. The data were randomized and divided into two datasets. The first dataset, consisting of 139 data points (inputs and output), was used as a training dataset, and the remaining 34 data points were utilized as a testing dataset. The output variable (Y) in the present study was the rate of fuel consumption. The input variables in this study were tractor power, plowing width, plowing depth and plowing speed, soil percentages of sand, silt, and clay, initial soil moisture content, and initial soil bulk density. Descriptive statistics for the collected literature data on the fuel consumption rates of tractor-chisel plow systems are listed in Table 1. Meanwhile, descriptive statistics for performance parameters that were used as inputs to predictive models are listed in Table 2.
TABLE 1. Descriptive statistics for the fuel consumption rate data of a tractor-chisel-plow system, collected from the literature.

Accuracy metrics
To evaluate the accuracy of the predictive model, different accuracy metrics were used: the correlation coefficient (R), mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE), and root relative squared error (RRSE). These metrics are reported together in the WEKA results panel. The metrics are defined as follows.
For the metric definitions, if $Y_t$ is the actual observation for period t and $F_t$ is the prediction for the same period, the correlation coefficient measures the strength of the linear relationship between the two variables. It is defined in eq. (3) (Makridakis et al., 1998):
$$R = \frac{\sum_{t=1}^{n} (Y_t - \bar{Y})(F_t - \bar{F})}{\sqrt{\sum_{t=1}^{n} (Y_t - \bar{Y})^2 \, \sum_{t=1}^{n} (F_t - \bar{F})^2}} \tag{3}$$
where $\bar{Y}$ and $\bar{F}$ are the means of the actual observations and of the predictions, respectively. The correlation coefficient takes values from -1 to +1. A positive correlation coefficient implies that the two variables vary in the same direction with respect to their means. A negative correlation coefficient implies that the two variables vary in opposite directions with respect to their means. A value close to 0 implies that the two variables have little linear dependency. The error $e_t$ between the two variables is defined as:
$$e_t = Y_t - F_t \tag{4}$$
If there are observations and predictions for n periods, then there are n error terms, and the following standard statistical measures can be defined as shown in eqs. (5)-(8):
$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |e_t| \tag{5}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t^2} \tag{6}$$
$$\mathrm{RAE} = \frac{\sum_{t=1}^{n} |e_t|}{\sum_{t=1}^{n} |Y_t - \bar{Y}|} \times 100\% \tag{7}$$
$$\mathrm{RRSE} = \sqrt{\frac{\sum_{t=1}^{n} e_t^2}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}} \times 100\% \tag{8}$$
Here, MAE is the mean absolute error, and it refers to the sum of individual absolute errors normalized by the number of samples. The quantity RMSE is the root mean squared error; it is analogous to the MAE, with the absolute value of each individual error term replaced with its square and the square root taken of the resulting mean. Both MAE and RMSE measure the average difference between the predicted and actual values. However, RMSE is more commonly used to measure the model's goodness of fit. RMSE pays more attention to large errors, owing to its square term. The RAE measure expresses the total absolute error relative to that of a simple predictor that always outputs the mean; because it is a ratio, it is useful for comparing models when the units of measurement are not important. The RRSE measure compares the model prediction against the mean. For this metric, a value below 100% indicates a better performance than the average.
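The five accuracy metrics described above can be computed as in the following sketch. It is an illustration rather than WEKA's own code: the function name `regression_metrics` and the toy values are hypothetical, and the relative metrics here take the mean of the actual values in the evaluated set as the baseline, which is an assumption (a tool may instead use the training-set mean).

```python
import math


def regression_metrics(actual, predicted):
    """Return R, MAE, RMSE, RAE (%) and RRSE (%) for paired observations."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    mean_y = sum(actual) / n
    mean_f = sum(predicted) / n
    errors = [y - f for y, f in zip(actual, predicted)]

    cov = sum((y - mean_y) * (f - mean_f) for y, f in zip(actual, predicted))
    ss_y = sum((y - mean_y) ** 2 for y in actual)
    ss_f = sum((f - mean_f) ** 2 for f in predicted)

    abs_err = sum(abs(e) for e in errors)
    sq_err = sum(e ** 2 for e in errors)
    return {
        "R": cov / math.sqrt(ss_y * ss_f),
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        # Baseline (assumed): always predicting the mean of the actual values.
        "RAE": 100.0 * abs_err / sum(abs(y - mean_y) for y in actual),
        "RRSE": 100.0 * math.sqrt(sq_err / ss_y),
    }


# Hypothetical actual and predicted values.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [3.0, 3.0, 7.0, 7.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RAE and RRSE values below 100% in this sketch mean that the predictions beat the mean-value baseline on the evaluated data.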

Data analysis
The descriptive statistics of the fuel consumption rate data for the dataset of 173 data points are shown in Table 1. The small difference between the average and the median values of fuel consumption rates indicates that the distribution of data is close to normal. Skewness is a lack of symmetry in a probability distribution, and kurtosis measures how far the peakedness of a probability distribution departs from that of a normal distribution (Everitt & Skrondal, 2010). In Table 1, for fuel consumption data, the skewness coefficient is a negative value, which indicates that data are skewed left; also, the kurtosis coefficient is a negative value, which indicates that data have a platykurtic distribution. Borges et al. (2017) also obtained negative values for skewness and kurtosis for tractor fuel consumption data. In addition, the coefficient of variation is rather high (31.3%), because the aforementioned fuel rate data were collected from different sources. The maximal fuel rate was 20.8 L h⁻¹, and the minimal was 2.4 L h⁻¹; such a wide scatter indicates that different parameters likely affect the rate of fuel consumption. Table 2 lists the descriptive statistics of the performance parameters used as inputs to the predictive models corresponding to the 173 data points in the dataset. In this table, slight differences between the averages and medians of parameters can be seen for the plowing speed, initial soil moisture content, and initial soil bulk density parameters. Additionally, the kurtosis and skewness coefficients take both negative and positive values. However, values of asymmetry and kurtosis between -2 and +2 are considered acceptable as evidence of a normal univariate distribution (George & Mallery, 2010). Excel was used to calculate skewness and kurtosis.
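The descriptive coefficients discussed above can be reproduced with a short script. The sketch below uses the bias-corrected sample formulas that, to the best of our understanding, correspond to Excel's SKEW and KURT functions; that correspondence, the function names, and the toy data are assumptions for illustration and are not values from Table 1 or Table 2.

```python
import statistics


def sample_skewness(data):
    """Bias-corrected sample skewness (assumed equivalent to Excel's SKEW)."""
    n = len(data)
    if n < 3:
        raise ValueError("at least 3 values are required")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


def sample_excess_kurtosis(data):
    """Bias-corrected excess kurtosis (assumed equivalent to Excel's KURT)."""
    n = len(data)
    if n < 4:
        raise ValueError("at least 4 values are required")
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    z4 = sum(((x - mean) / s) ** 4 for x in data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * z4 - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(data) / statistics.mean(data)


# Hypothetical left-skewed sample (long tail toward small values).
sample = [1.0, 5.0, 6.0, 6.0, 7.0]
print(sample_skewness(sample))
print(sample_excess_kurtosis(sample))
print(coefficient_of_variation(sample))
```

A negative skewness indicates a longer tail toward small values, and a negative excess kurtosis indicates a flatter (platykurtic) shape than the normal distribution, matching the interpretation given in the text.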
Table 2 also shows that the tractor power, plowing depth, plowing speed, silt percentage, sand percentage, and initial soil moisture content variables have the highest variation coefficients, because these data were collected from different sources, under different experimental conditions.

Performances of the KNN and multiple linear regression algorithms
The simplest KNN method assumes k = 1. With this value, the predictive power of the model is rather unsatisfactory: because the model is characterized by high variance, it overfits the training-set data and performs poorly on the testing-set data. Increasing the value of k reduces the variance but may increase the bias. Thus, the algorithm is sensitive to the selection of k (Hand et al., 2001). In this study, k was varied from 1 to 5. For the testing set, the results showed that the correlation coefficient of prediction was high for k = 2, as shown in Figure 1. The KNN method generated better results than the multiple linear regression method. The test dataset correlation coefficients were 0.817 for the KNN method with k = 2 (two nearest neighbors) and 0.422 for the multiple linear regression method. Meanwhile, the MAE, RMSE, RAE, and RRSE were small for k = 2, as shown in Figures 2, 3, 4, and 5, respectively. Table 3 tabulates the results obtained for the different algorithms on the training-set data. The purpose was to compare comprehensively the performance of multiple linear regression with that of the KNN algorithm for different values of k. Clearly, the KNN algorithm yields a significantly higher prediction accuracy of fuel consumption, compared with multiple linear regression. The comparison shows that the KNN algorithm improves the accuracy of fuel consumption prediction, as indicated by the higher correlation coefficient and by noticeable reductions in the MAE, RMSE, RAE, and RRSE measures, as shown in Table 3. This implies that predictions generated by the KNN algorithm-based model exhibit a relatively small deviation from the actual fuel consumption data.
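The procedure of varying k from 1 to 5 and comparing the testing-set error can be sketched as follows. The data here are synthetic and one-dimensional, the helper names are hypothetical, and RMSE is used as the single selection criterion for brevity; the study compared several metrics in WEKA, so this is only an outline of the idea.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Mean outcome of the k training examples closest to query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(query, pair[0]))
    return sum(y for _, y in ranked[:k]) / k


def rmse(actual, predicted):
    """Root mean squared error between paired values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Synthetic, purely illustrative data (not from the study): y = x squared.
train_X = [[float(x)] for x in range(10)]
train_y = [x[0] ** 2 for x in train_X]
test_X = [[2.4], [5.6], [7.5]]
test_y = [x[0] ** 2 for x in test_X]

rmse_by_k = {}
for k in range(1, 6):  # k = 1, 2, 3, 4, 5
    predictions = [knn_predict(train_X, train_y, q, k) for q in test_X]
    rmse_by_k[k] = rmse(test_y, predictions)

best_k = min(rmse_by_k, key=rmse_by_k.get)
for k, value in rmse_by_k.items():
    print(k, round(value, 3))
print("best k:", best_k)
```

The same loop could record the correlation coefficient, MAE, RAE, and RRSE for each k to produce a comparison of the kind summarized in the figures and in Table 3.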
Figure 6 shows the relationship between the actual and predicted fuel consumption rates for data in the testing set, while Table 4 lists the straight-line equations of the "slope-intercept" form (Y = mX + b) for estimating the fuel consumption rate yielded by the KNN method for different k (X corresponds to the actual fuel consumption rate). It is clear from Figure 6 and Table 4 that the best fit is obtained for k = 2, manifested as a near-unity slope of the straight line.
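A straight line of the slope-intercept form relating predicted to actual values can be obtained by ordinary least squares, as in the sketch below. The function name `fit_line` and the four data pairs are hypothetical and are not the values behind Figure 6 or Table 4.

```python
def fit_line(x, y):
    """Least-squares slope m and intercept b of the line y = m * x + b."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical actual (X) and predicted (Y) fuel consumption rates.
actual = [4.0, 8.0, 12.0, 16.0]
predicted = [5.0, 8.0, 12.0, 15.0]

m, b = fit_line(actual, predicted)
print(f"Y = {m:.2f} X + {b:.2f}")
```

A slope close to 1 together with an intercept close to 0 indicates that the predictions track the actual values closely, which is the criterion applied to the fitted lines in the text.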

CONCLUSIONS
The results of this study indicate that successful prediction of fuel consumption by tractor-chisel systems can aid farmers in selecting appropriate tractors and implements, thus helping to reduce the cost of tillage. A k-nearest neighbors (KNN) algorithm with k = 2, based on the training dataset, was adopted for this purpose. The KNN algorithm was stable and robust, and it exhibited a relatively small error ratio, thus providing rational and reasonable results. The model predictions were close to the actual fuel consumption values. This implies that this data-mining technique can help farmers to select proper equipment and operation parameters to reduce fuel consumption during tillage with chisel plows.