Determining the geographical origin of lettuce with data mining applied to micronutrients and soil properties

ABSTRACT Lettuce (Lactuca sativa) is the main leafy vegetable produced in Brazil. Since its production is widespread all over the country, lettuce traceability and quality assurance are hampered. In this study, we propose a new method to identify the geographical origin of Brazilian lettuce. The method applies a data mining technique called support vector machines (SVM) to the elemental composition and soil properties of the samples analyzed. We investigated lettuce produced in São Paulo and Pernambuco, two states in the southeastern and northeastern regions of Brazil, respectively. We assessed the efficiency of the SVM model by comparing its results with those achieved by traditional linear discriminant analysis (LDA). The SVM models outperformed the LDA models in the two scenarios investigated, achieving an average prediction accuracy of 98 % in discriminating lettuce from both states. A feature evaluation formula, called F-score, was used to measure the discriminative power of the variables analyzed. The soil cation exchange capacity, the soil content of poorly crystallized Al, and the Zn content in lettuce samples were the most relevant components for differentiation. Our results reinforce the potential of data mining and machine learning techniques to support traceability strategies and authentication of leafy vegetables.


Introduction
Lettuce is among the most consumed vegetables worldwide and is considered the most produced and consumed leafy vegetable in Brazil. Lettuce is low in calories, fat, and Na, while being a good source of fiber, Fe, folate, vitamins and several other bioactive compounds that are beneficial to human health (Kim et al., 2016). Since the per capita consumption of fruits and vegetables in Brazil is about 40 kg yr⁻¹, much less than the 143 kg yr⁻¹ consumed in a developed country such as the United States (Mainville and Peterson, 2005), lettuce is an important source of vegetable-based nutrients for the Brazilian population.
According to the most recent survey conducted by IBGE (Brazilian Institute of Geography and Statistics) in 2006, the states of São Paulo (SP) and Pernambuco (PE) are the main lettuce producers in the southeastern and northeastern regions of Brazil, respectively. Growers can sell to a variety of buyers, including intermediaries (purchase at farm gate), small supermarkets, large supermarket chains, wholesale markets, processors, and directly to consumers. This fragmented production chain makes the traceability and quality assurance of lettuce a difficult task. Moreover, most farmers neglect methods and techniques that add value to the product, such as food safety, traceability of inputs, and improvements in handling and planting, among others (Carvalho et al., 2014).
ii. In order to ascertain the efficiency of the SVM model, we developed simple linear discriminant models for the same data and compared the results and performance measurements.
iii. We also investigated the discriminative power of each variable and identified a subset of variables that most affect differentiation through the use of a feature selection technique called F-score, a novel approach to lettuce discrimination.
iv. We expect to show the potential of data mining and machine learning techniques to support traceability strategies and authentication of leafy vegetables.

Lettuce and soil samples analyzed
We collected 194 lettuce samples and soil samples from farms in São Paulo (n = 72) and Pernambuco (n = 122), Brazil. Coordinates of the sampling sites are shown in Table 1. Soil samples were dried in the shade and then sieved (2 mm mesh). Lettuce leaves were washed in running water to remove impurities, dried (45-60 °C) and ground in a stainless-steel mill (< 1 mm). Soil texture was obtained by the densimeter method (Gee and Or, 2002), and pH was obtained by potentiometry using a combined electrode immersed in a soil:water suspension (1:2.5) and in a soil:1 mol L⁻¹ KCl solution suspension (1:2.5). Potential acidity (H + Al) was obtained by extraction with 1 mol L⁻¹ calcium acetate (pH 7.0) and titration with NaOH using phenolphthalein as indicator. The organic carbon (OC) content of soils was obtained by dry combustion in an elemental analyzer. Exchangeable Ca, Mg and Al were extracted with a 1 mol L⁻¹ KCl solution. Available K and P were extracted with a double acid solution (Mehlich-1), following the protocol of Anderson and Ingram (1992). Based on these extractions, we obtained the following values: ΔpH = pH(KCl) - pH(H₂O); CEC_T = Ca²⁺ + Mg²⁺ + K⁺ + (H + Al); CEC_e = Ca²⁺ + Mg²⁺ + K⁺ + Al³⁺; SB = Ca²⁺ + Mg²⁺ + K⁺; V % = (SB × 100)/CEC_T; and m % = (Al³⁺ × 100)/CEC_e. Levels of well-crystallized Fe and Al (Fe₂O₃ DCB and Al₂O₃ DCB) were extracted with Na dithionite-citrate-bicarbonate (DCB) (Inda Junior and Kämpf, 2003; Mehra and Jackson, 1960), while amorphous Fe and Al were extracted with acidic ammonium oxalate (Fe₂O₃ OXA and Al₂O₃ OXA) (McKeague and Day, 1966).
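The derived soil values listed above are simple arithmetic on the extracted quantities. The following is a minimal Python sketch of those formulas, offered only as an illustration: the function name, argument names and input numbers are hypothetical (chosen for easy arithmetic, not measured data), and the reading of CEC_T as total CEC, CEC_e as effective CEC, V % as base saturation and m % as Al saturation follows common soil science usage rather than an explicit statement in the text.

```python
def soil_indices(ca, mg, k, al, h_al, ph_kcl, ph_h2o):
    """Derive SB, CEC and saturation indices from exchangeable cations.

    All cation inputs must share one charge-based unit (e.g. mmol_c dm^-3).
    """
    sb = ca + mg + k        # sum of bases (SB)
    cec_t = sb + h_al       # CEC_T: SB plus potential acidity (H + Al)
    cec_e = sb + al         # CEC_e: SB plus exchangeable Al
    return {
        "delta_pH": ph_kcl - ph_h2o,
        "SB": sb,
        "CEC_T": cec_t,
        "CEC_e": cec_e,
        "V%": 100.0 * sb / cec_t,   # share of CEC_T occupied by bases
        "m%": 100.0 * al / cec_e,   # share of CEC_e occupied by Al
    }


# Hypothetical inputs, not values from the study.
example = soil_indices(ca=40.0, mg=10.0, k=5.0, al=5.0, h_al=45.0,
                       ph_kcl=5.0, ph_h2o=5.5)
print(example)
```

With these made-up inputs, SB is 55, CEC_T is 100 and CEC_e is 60, so V % is 55 and m % is about 8.3.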
The pseudo-total concentrations of Cu, Ni, and Zn in the soil were obtained by acid extraction in a microwave oven using the EPA 3051A method (1:3 HCl:HNO₃, v/v). Plant material digestion followed Araújo et al. (2002), using HNO₃ and H₂O₂ in microwave-assisted digestion. Contents of Cu, Ni and Zn were determined by inductively coupled plasma optical emission spectroscopy (ICP OES) using the conventional sample introduction system. Data quality was controlled using a standard reference material (SRM 2709a, San Joaquin Soil) from the National Institute of Standards and Technology (NIST, USA) and an analytical blank in triplicate. The concentrations of analytical blanks were below the quantification limit (0.01 mg L⁻¹ for Cu and Ni; 0.05 mg L⁻¹ for Zn). Precision (n = 3), expressed as relative standard deviation (RSD), was < 10 % for all elements. More details can be found in Santos-Araujo and Alleoni (2016).

Data mining for prediction of food origin
In the past decade, authenticity and traceability of foodstuffs became a desirable feature for consumers and producers worldwide (Baroni et al., 2015), and the search for methods that ensure the authenticity of food has received great attention from researchers. A strategy that emerged in recent literature is the use of data mining and multivariate data analysis to discriminate the geographical origin of foodstuffs and vegetables based on their chemical components. Successful applications of this and similar methodologies were reported for rice (Maione et al., 2018), honey (Maione et al., 2019), Italian and Turkish lemon (Potortì et al., 2018), ([...]-Piñeiro et al., 2003), chocolate (Cambrai et al., 2010), alcoholic beverages (Ceballos-Magaña et al., 2012; Coetzee et al., 2005), coffee (Oliveira et al., 2015; Serra et al., 2005), tomato (Mahne Opatić et al., 2018), and others. Therefore, we proposed the use of data mining techniques, namely SVM and feature selection algorithms, to determine the geographical origin of lettuce samples based on their chemical composition.
Data mining is an efficient process to find hidden patterns and information in large and complex data sets that simple multivariate data analysis techniques and statistical methods, such as principal component analysis and discriminant analysis, are often unable to model efficiently. Data mining techniques combine concepts and methods from artificial intelligence, mathematical optimization, linear algebra, and statistical analysis in order to perform either predictive or exploratory analysis on labeled or unlabeled data (Kotsiantis et al., 2006). Although data mining processes emerged from multivariate data analysis and statistical techniques to handle larger and more complex data sets (Izenman, 2008), these processes can also be applied to smaller data sets to extract meaningful information, and they are often preferred because their sophisticated algorithms are capable of performing probabilistic reasoning. Furthermore, these algorithms are constantly evolving.
Classification models can be described as data mining tools that predict information, represented by a categorical variable, in data samples. These models observe similar and previously labeled samples and use the information learned from this observation to build a function or model that is capable of generalizing the learned information to new and unknown data samples, as long as they are described by the same set of variables as the observed samples. This learning process is known as supervised learning in the field of artificial intelligence. Support vector machines, created by Cortes and Vapnik (1995), are an example of a popular classification model in the recent data mining literature and have been successfully employed to discriminate and classify data from different fields for various purposes.
Our previous literature search revealed that this is the first attempt to discriminate the geographical origin of Brazilian lettuce samples using a machine learning technique for data mining, such as support vector machines, applied to chemical composition and soil parameters. In order to ascertain the efficiency of the SVM model, we developed simple linear discriminant models for the same data and compared the results and performance measurements. We also investigated the discriminative power of each variable and identified a subset of variables that most affect differentiation through the use of a feature selection technique called F-score, a novel approach to the discrimination of lettuce.

Support vector machines (SVM)
SVM is described as an optimization problem that finds the decision boundary with the largest possible margin separating the data, which minimizes the risk of overfitting and improves generalization performance. The decision boundary is given by Eq. 1:

$$ w \cdot x + b = 0 \tag{1} $$

where: x holds the values obtained from the variables of the training samples, w refers to the weights whose linear combination computes the class label, and b is a model parameter. The decision boundary with the largest possible margin is achieved by the minimization in Eq. 2:

$$ \min_{w,\,b} \; \frac{1}{2} \lVert w \rVert^{2} \quad \text{subject to} \quad y_i \, (w \cdot x_i + b) \geq 1 \tag{2} $$

for every training sample x_i with class label y_i ∈ {-1, +1}.
Classification models based on SVM are widely used in the literature to perform predictive analyses on data from several problems and fields. In the last two years alone, SVM has been successfully employed to solve problems in domains such as geology (García-Nieto et al., 2019; Huang et al., 2017; Jung et al., 2018; Kumar et al., 2017; Mahvash and Hezarkhani, 2018; Pu et al., 2019), hydrological sciences (Choubin et al., 2019b; Kisi et al., 2019; Rahmati et al., 2019; Sajedi-Hosseini et al., 2018), climate and weather (Fan et al., 2018; Kundu et al., 2017; Yu et al., 2017, 2018), fault detection in various systems and processes (Ali et al., 2018; Fazai et al., [...]), and many others. In addition, we chose to work with SVM models in this project because of two main advantages. First, the models can apply the kernel trick and project the data into higher dimensions to better classify non-linearly separable data such as ours. Second, SVM models are known to perform relatively better on small data sets than other machine learning algorithms, such as neural networks, which are heavily dependent on the amount of data available for training.
In this study, we employed SVM with the radial basis function (RBF) kernel. The use of kernel functions allows SVM to project the original data into a new dimensional space and find a linear decision boundary to separate the transformed samples when they cannot be linearly separated in the original dimensional space. In addition to the required parameter C, which can be described as the cost imposed by the SVM model on a misclassification, the RBF kernel also requires a γ parameter, namely the value used by the kernel to perform the kernel trick and handle non-linear classification. Both parameters must be chosen carefully, since increasing their values indiscriminately potentially results in overfitting, high variance, and low bias, while very restrictive values lead to an under-fitted model that cannot capture patterns in the data. We determined these values through a grid search on the values C = {0.25, 0.5, 1, 1.5, 2, 3} and γ = {3, 2, 1, 0.5, 0.1, 0.01, 0.02, 0.03, 0.05, 0.06} for each SVM model developed. The model with the best performance was selected.

The geographical origin of lettuce Sci. Agric. v.79, n.1, e20200011, 2022

Linear discriminant analysis (LDA)
The linear discriminant analysis (LDA) is a classification technique that maximizes the ratio of between-class variance to within-class variance to achieve maximal separability. The LDA creates a decision boundary called discriminant function (DF), which is a linear combination of the variables that describe the data and that best separates the classes. Considering a problem with classes y₁ and y₂, the linear DF is defined as Eq. 3 (Duda et al., 2001):

$$ g(x) = w^{\top} V + w_0 \tag{3} $$

where: x is an arbitrary sample, V is the set of variable values for sample x, w is the weight vector, and w₀ is a bias value. We aim to find values of w and w₀ such that g(x) > 0 when the class label associated with x is y₁, and y₂ is assigned otherwise. The LDA has been widely used recently in several classification problems and, although traditional, it is still a well-known and popular method to discriminate food data, widely reported in literature reviews (Abbas et al., 2018; Berrueta et al., 2007; Callao and Ruisánchez, 2018; Cavanna et al., 2018; Esteki et al., 2018a, 2018b, 2019; Granato et al., 2018; Jiménez-Carvelo et al., 2019; Kemsley et al., 2019; Medina et al., 2019a, 2019b; Oliveri, 2017; Peris and Escuder-Gilabert, 2016; Ropodi et al., 2016; Valdés et al., 2018; Wadood et al., 2020). Therefore, we expect this model to perform well on our data set and its use to attest the efficiency of the SVM model through comparison of the results obtained by both methods.

Performance measures
The data available for analysis must be divided into a training set, used to build the classification model, and a test set, used to verify the prediction performance of the model. This procedure is called the holdout method.
The default holdout method has two main disadvantages. First, it requires the data set to be divided into two subsets for training and testing the model, respectively. When the available data set is relatively small, similar to that analyzed in this study, dividing the data set can be unfeasible, as the resulting subsets can be too small to effectively train a reliable classification model (Varma and Simon, 2006). Second, since only a single subset is used for training the model and this subset is commonly generated by random selection, meaningful information possibly contained in data samples assigned to the test set is not considered and is thus wasted. In order to tackle these issues, we used a training and validation method called k-fold cross validation, a solution to the lack of sufficiently large training and testing sets (Duda et al., 2001). This method divides the data set into k mutually exclusive subsets (folds) of similar size. The classification model is trained and tested for k iterations. In each iteration, one subset, different from the subsets previously used, is selected to test the model while the others are used for training. Therefore, all the data samples available for analysis are eventually considered in the construction of the classification model. The final accuracy of the model is computed as the average of the accuracies obtained in each iteration.
After the test phase, the tested samples can be categorized as true positives, true negatives, false positives or false negatives. True positives (TP) and true negatives (TN) are the numbers of positive and negative samples correctly classified, respectively. False positives (FP) are the number of negative samples incorrectly classified as positive, and false negatives (FN) are the number of positive samples incorrectly classified as negative. Performance measurements of accuracy, sensitivity and specificity (Tan et al., 2005) are computed based on these values. Accuracy refers to the overall probability of the model correctly classifying an arbitrary sample (Choubin et al., 2019a). Sensitivity refers to the probability of the model correctly classifying an arbitrary sample that originally belongs to the positive class. Specificity is the probability of the model correctly classifying an arbitrary sample that originally belongs to the negative class. Therefore, the three performance measurements are computed with Eq. 4-6:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} $$

$$ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{5} $$

$$ \text{Specificity} = \frac{TN}{TN + FP} \tag{6} $$

Estimating the relevance of the parameters analyzed
One of our objectives was to evaluate the discriminative power of the descriptive variables and try to build classification models capable of discriminating lettuce from two distinct locations with high performance using only variables considered relevant for the decision-making of the classifier. Disregarding variables with low or null influence on the information mapped by the class label could also provide advantages, such as improvement of prediction accuracy, dimensionality reduction, and reduction of the time needed to build and run classification models, among others.
Filter methods are variable selection methods applied to the training data prior to the learning phase of the models, allowing irrelevant variables to be identified and discarded before training occurs. Filter methods evaluate variables by computing the intercorrelation between them and the correlation between the variables and the class label. The best rated variables present little dependence on other variables while presenting the highest possible dependence on the class label. Since filter methods are algorithmically simple and operate with low computational cost, a common strategy is to use them to evaluate all the variables individually, to set up subsets with combinations of the best ranked variables, to apply them to a classification model and to check the final prediction accuracy obtained to attest their discriminative power. There are several examples of popular filter methods for multivariate data analysis, such as information gain, chi-square, random forest importance (Izenman, 2008), mutual information, Correlation-based Feature Selection (CFS), and others (Bommert et al., 2020).
In this study, we used a variable selection algorithm called F-score. This function, presented by Chen and Lin (2006), measures the discrimination of two sets of real numbers. For a single descriptive variable from our data set, we can divide its measurements into two distinct sets called positive and negative sets, which hold the variable measurements for lettuce samples from SP and PE, respectively. The value produced by this function, when applied to a variable to measure the discrimination between its positive and negative sets, can be used as a score for measuring the contribution of the variable to the class label. Given the training samples x_k, k = 1, ..., m, if the numbers of samples belonging to the SP and PE classes are n_SP and n_PE, respectively, the F-score value of the i-th variable, which reflects the discrimination between positive and negative samples, is calculated by Eq. 7:

$$ F(i) = \frac{\left(\bar{X}_i^{(SP)} - \bar{X}_i\right)^2 + \left(\bar{X}_i^{(PE)} - \bar{X}_i\right)^2}{\dfrac{1}{n_{SP} - 1} \sum_{k=1}^{n_{SP}} \left(x_{k,i}^{(SP)} - \bar{X}_i^{(SP)}\right)^2 + \dfrac{1}{n_{PE} - 1} \sum_{k=1}^{n_{PE}} \left(x_{k,i}^{(PE)} - \bar{X}_i^{(PE)}\right)^2} \tag{7} $$

where: $\bar{X}_i$, $\bar{X}_i^{(SP)}$ and $\bar{X}_i^{(PE)}$ are the averages of the i-th variable over the whole, SP and PE data sets, respectively, and $x_{k,i}^{(SP)}$ and $x_{k,i}^{(PE)}$ are the values of the i-th variable for the k-th SP and PE sample, respectively. The numerator indicates the discrimination between the positive and negative sets, and the denominator indicates the discrimination within each set. The larger the F-score, the more discriminative this variable.

Balancing the data set with the K-means clustering algorithm
Imbalanced data sets are inconvenient and present various challenges for data mining and multivariate data analysis (Chawla, 2005; Haixiang et al., 2017; He and Garcia, 2009; Jo and Japkowicz, 2004; López et al., 2013; Prati et al., 2004). Overall, classification models trained on imbalanced data tend to show good prediction performance for samples of the majority class and lower performance for samples of the minority class. This decrease occurs not necessarily due to the difference in the class proportion, but due to other natural factors of imbalanced data, such as the presence of small disjuncts, low density of data, data overlapping and others (He and Garcia, 2009; López et al., 2013). In this study, we tackled the imbalanced data issue with the aid of a clustering algorithm called K-means.
Clustering algorithms are considered a branch of unsupervised learning and are basically employed in exploratory data analyses, where neither a hypothesis about the data nor previously known class labels exist. These techniques help identify natural groupings existing within the data based on a similarity (or dissimilarity) pattern. Partitional clustering algorithms, such as K-means, divide the data set into mutually exclusive clusters in such a way that samples assigned to the same cluster are as similar as possible to each other and as different as possible from samples associated with other clusters.
The K-means algorithm can be summarized in the following steps (Jain, 2010): (i) randomly select k data samples and name them centroids, each centroid being associated with a different cluster label; (ii) for each non-centroid sample in the data set, find its nearest centroid and associate the sample with the same cluster as the centroid found; (iii) for each cluster formed, update the centroid to be the center of mass of the cluster; and (iv) repeat the previous steps until no new changes are made to the clusters or a stopping criterion is reached.
The K-means algorithm is not new, yet it is still widely reported in the literature due to its simplicity, low computational cost, and good performance (Jain, 2010). In this study, we used the K-means algorithm to balance the data through undersampling. Considering that we want to discard m samples of a certain class containing n samples, we divide the data labeled as this class into (n - m) clusters with the K-means algorithm and keep only the resulting centroids as data samples. Because the centroids found for each cluster can be considered the most representative samples in a data partition, this strategy reduces the information loss that naturally occurs in undersampling.
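The K-means undersampling strategy described in the balancing section can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation (the paper reports using RStudio): the function names, the fixed random seed, the iteration cap and the toy two-dimensional points are all hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import random


def kmeans_centroids(samples, k, iterations=100, seed=0):
    """Plain Lloyd-style K-means; returns the k final centroids."""
    rng = random.Random(seed)
    # (i) pick k samples at random as the initial centroids
    centroids = [list(s) for s in rng.sample(samples, k)]
    for _ in range(iterations):
        # (ii) assign every sample to its nearest centroid
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(s, centroids[c])),
            )
            clusters[nearest].append(s)
        # (iii) move each centroid to the mean of its cluster
        updated = []
        for c, members in enumerate(clusters):
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        # (iv) stop when nothing changes
        if updated == centroids:
            break
        centroids = updated
    return centroids


def undersample_majority(majority, keep):
    """Reduce a class to `keep` representatives: the K-means centroids."""
    return kmeans_centroids(majority, k=keep)


# Hypothetical toy data: six points forming two well-separated groups.
majority = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
reduced = undersample_majority(majority, keep=2)
print(sorted(reduced))
```

Note that the retained centroids are cluster means, so they are generally synthetic points rather than original samples, which is consistent with the description of keeping the centroids as data samples.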

Analysis strategy
The entire statistical and predictive analysis was conducted in the RStudio software, version 1.1.463. Our analysis methodology is presented in Figure 2 and summarized in the following steps:
1. Data samples obtained from Pernambuco (PE) state are undersampled in order to match the number of available samples from São Paulo (SP) state. The goal of this step is to create a balanced data set that can be reliably used to develop classification models without biases. The K-means clustering algorithm is applied only to the samples obtained from PE, and 36 clusters are determined. The centroids computed for each cluster are retained in the data set, while the other samples from PE are discarded from the analysis. Therefore, the final data set comprises all 36 samples originally collected in SP and 36 samples from PE retained as centroids from the clustering algorithm.
2. In order to perform five-fold cross validation, the balanced data set is randomly divided into five mutually exclusive subsets (folds), keeping the original proportion of the two states (50 % - 50 %) in each fold.
3. For each fold:
a. The selected fold is used as a test set while the remaining folds are merged and used as a training set;
b. The training and test sets are standardized to avoid potential biases caused by the different units of measurement and ranges of values of the variables. The variables are centered by subtracting their means from their values, and then the centered variables are divided by their standard deviations;
c. F-score values are computed for each variable of the training set. The selection threshold is set as the maximum F-score value obtained by the variables divided by 3;
d. The SVM and LDA models are developed using the entire training data and the training data with only the variables that received F-score values higher than the threshold, resulting in a total of four models developed;
e. Accuracy, sensitivity, and specificity values are computed for the four models.
4. The average accuracy, sensitivity, and specificity obtained by the models are determined and presented as final performance measurements.
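The fold construction, standardization and F-score thresholding steps above can be sketched as follows. This is a hedged Python illustration rather than the authors' R code: all function names and the toy data are hypothetical, the F-score follows the definition of Chen and Lin (2006), and the sketch assumes that the test set is standardized with the training means and standard deviations, a detail the text does not specify. Model fitting (step d) is omitted.

```python
import random
from statistics import mean, stdev, variance


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin
    return folds


def standardize(train, test):
    """Z-score both sets with the training means and standard deviations."""
    cols = list(zip(*train))
    mu = [mean(c) for c in cols]
    sd = [stdev(c) for c in cols]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, mu, sd)] for r in rows]

    return scale(train), scale(test)


def f_score(pos, neg):
    """F-score of one variable from its positive and negative value sets."""
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within


def select_variables(rows, labels, positive):
    """Return indices of variables whose F-score exceeds max(F) / 3."""
    pos = [r for r, y in zip(rows, labels) if y == positive]
    neg = [r for r, y in zip(rows, labels) if y != positive]
    scores = [f_score(list(p), list(n)) for p, n in zip(zip(*pos), zip(*neg))]
    threshold = max(scores) / 3
    return [i for i, s in enumerate(scores) if s > threshold], scores


# Hypothetical toy data: column 0 separates the classes, column 1 does not.
rows = [[10, 1], [11, 2], [12, 3], [13, 4], [14, 5],
        [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
labels = ["SP"] * 5 + ["PE"] * 5

folds = stratified_folds(labels, k=5)
kept_per_fold = []
for test_idx in folds:
    train_idx = [i for i in range(len(rows)) if i not in test_idx]
    train, test = standardize([rows[i] for i in train_idx],
                              [rows[i] for i in test_idx])
    kept, _ = select_variables(train, [labels[i] for i in train_idx], "SP")
    kept_per_fold.append(kept)  # models would be fitted on these columns
print(kept_per_fold)
```

Because the F-score is a ratio of squared differences to variances, it is unchanged by the standardization in step b, so computing it before or after scaling selects the same variables.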

Properties and micronutrients in lettuce and soil samples
Metal uptake by plants is influenced by several soil properties (Kumpiene et al., 2017). Therefore, we evaluated 25 soil variables (Table 2). We designated letters to represent the variables analyzed in lettuce and soil samples to simplify visualization in our graphics and specific mentions throughout the rest of the paper. Soil samples collected from lettuce farms in SP had considerably higher amounts of nutrients and other beneficial traits than soil samples obtained from PE. Soil samples from SP farms presented mean pseudo-total concentrations of Ni, Cu and Zn of 9.45, 41.34 and 72.81 mg kg⁻¹, respectively, while samples from PE presented 2.95, 14.43 and 38.43 mg kg⁻¹, respectively. Ni, Cu, and Zn are essential metals for plants. Biondi et al. (2011) demonstrated that soils from PE have a low capacity to release Cu and Ni to plants. Their study also indicated a significant association of most metals with clayey soils. Considering that the soil samples from PE are mostly composed of sand (approximately 70 %), the low clay content of these soils may help explain the relatively low metal concentration in the pseudo-total fraction of soils from PE.
Amorphous and crystalline Al and Fe oxide minerals play a major role in stabilizing soil structure, and their presence in soils has a favorable effect on soil physical properties (Goldberg, 1989). Furthermore, kaolinite, Fe and Al oxides compose the dominant mineralogy in the clay fraction of most Brazilian soils (Fink et al., 2014), responsible for chemical reactions that control the availability of essential and nonessential elements.
Phytoavailable metal forms are sorbed to amorphous metal oxides (Rodrigues et al., 2010). Levels of well-crystallized Fe and Al were higher in soil samples from SP than in those from PE, with mean values of 22.13 g kg⁻¹ and 10.77 g kg⁻¹, respectively, against 7.94 g kg⁻¹ and 5.04 g kg⁻¹, respectively, for soils from PE. Soil samples from SP also showed considerably higher levels of amorphous Al than soil samples from PE, with mean values of 14.66 mg kg⁻¹ and 1.21 mg kg⁻¹, respectively. As for amorphous Fe, mean values were 4.82 mg kg⁻¹ and 1.28 mg kg⁻¹ for soils from SP and PE, respectively. Mean levels of available P in soil samples from SP and PE states were 530.82 mg kg⁻¹ and 311.67 mg kg⁻¹, respectively. Although samples from PE presented higher levels of exchangeable Ca and Mg, the soil samples from SP state showed higher CEC (for both CEC_T and CEC_e), which depends on the levels of Ca, Mg, K, Al and potential acidity in the soil samples. Mean values of CEC_T were 104 mmol_c dm⁻³ and 29.17 mmol_c dm⁻³ for soils from SP and PE states, respectively, while mean values of CEC_e were 80.67 mmol_c dm⁻³ and 17.68 mmol_c dm⁻³ for soils from SP and PE states, respectively. Finally, values for the sum of bases were also considerably higher in soils from SP (mean 79.53 mmol_c dm⁻³) than in soils from PE (mean 17.07 mmol_c dm⁻³). Soils on SP farms may be better fertilized than those in PE.

Statistical and predictive analysis
Using the standardized measurements of all variables shown in Table 2 as input values, the SVM models reached an average value of 98.67 % for accuracy, 97.14 % for sensitivity, and 100 % for specificity. The LDA models trained with the same input values performed considerably worse than SVM for all performance measurements, presenting 66 % average accuracy, 71.43 % average sensitivity, and 60.71 % average specificity. Although more complex to build than linear discriminant models and designed to handle large and complex databases, SVM models are excellent tools to determine the geographical origin of lettuce, even when trained on a relatively small amount of data.
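The averages reported above are means of per-fold accuracy, sensitivity and specificity, each computed from the confusion counts of one test fold. A minimal Python sketch of that computation follows; the function names are hypothetical and the counts are invented for illustration, not taken from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # correct share of the positive class
        "specificity": tn / (tn + fp),   # correct share of the negative class
    }


def average_over_folds(per_fold):
    """Average each metric over the cross-validation folds."""
    return {
        name: sum(m[name] for m in per_fold) / len(per_fold)
        for name in per_fold[0]
    }


# Hypothetical confusion counts for two test folds.
fold_a = classification_metrics(tp=7, tn=7, fp=1, fn=0)
fold_b = classification_metrics(tp=8, tn=7, fp=0, fn=0)
averaged = average_over_folds([fold_a, fold_b])
print(averaged)
```

Averaging per-fold ratios, as done here, can differ slightly from pooling all counts before dividing when folds have different sizes.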
In order to determine the individual importance of each component for the discrimination of the lettuce samples from both regions, we applied the F-score equation to each training set during the cross-validation process. The F-score values achieved by each variable in each iteration are presented in descending order in Figure 3. The variables referring to the sum of bases (Ca²⁺ + Mg²⁺ + K⁺) and the soil cation exchange capacity (CEC_T and CEC_e) were retained in all training sets with very high F-score values. Levels of exchangeable Ca, well-crystallized Al, amorphous Al, sand and the pseudo-total concentration of Ni measured in the soil, plus the Zn levels in the plant, were retained in four of the five training sets considered. Overall, we conclude that these eight factors were the most relevant variables for the discrimination of lettuce samples from both states according to the F-score metric.
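Statements such as "retained in all training sets" or "retained in four of five" amount to counting, per variable, the folds in which its F-score exceeded the threshold. The short Python sketch below illustrates that tally; the function name and the per-fold selections are hypothetical and do not reproduce the selections behind Figure 3.

```python
from collections import Counter


def retention_counts(kept_per_fold):
    """Count in how many folds each variable passed the F-score threshold."""
    return Counter(name for kept in kept_per_fold for name in set(kept))


# Hypothetical per-fold selections; names and memberships are illustrative.
kept_per_fold = [
    {"SB", "CEC_T", "CEC_e", "Ca"},
    {"SB", "CEC_T", "CEC_e", "Ca"},
    {"SB", "CEC_T", "CEC_e"},
    {"SB", "CEC_T", "CEC_e", "Ca"},
    {"SB", "CEC_T", "CEC_e", "Ca"},
]
counts = retention_counts(kept_per_fold)
always = sorted(v for v, c in counts.items() if c == len(kept_per_fold))
print(counts, always)
```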
To ascertain the real relevance of the best rated components, we also built SVM and LDA models using only values for variables that achieved F-score values higher than the determined threshold (Table 3), while the others were discarded. The average performance measurements, namely accuracy, sensitivity and specificity, achieved by the models with and without variable selection are summarized in Table 4. The SVM model trained with only the best rated variables achieved 95.81 % average accuracy, 94.29 % average sensitivity and 97.14 % average specificity, a very small decrease in performance in comparison to the values achieved when all 28 variables were used for training. On the other hand, the LDA model improved with variable selection, presenting 86.29 % average accuracy, 94.29 % average sensitivity, and 78.57 % average specificity when only the five best rated variables were used for training. Although the SVM and LDA models achieved the same average sensitivity values when combined with variable selection, the SVM models still clearly outperformed the LDA models in both scenarios for all other performance measures. The best classification model achieved was the SVM model trained on all variables available from the data set. The accuracy value achieved is approximately 3 %, 32 % and 12 % higher than the accuracy values obtained from the SVM model with feature selection, the LDA model without feature selection and the LDA model with feature selection, respectively. The slight increase in error for the SVM model with feature selection is expected: since several variables were discarded, the model was possibly deprived of meaningful information contained in them. However, the SVM model with variable selection still presented a high average accuracy value for predicting the geographical origin of lettuce when only approximately eight out of 28 variables were used. This is a substantial decrease in the dimensionality and size of the data set, and consequently a reduction of the effort required from researchers to gather and prepare the necessary data. Mean values of the sum of bases, CEC_T, CEC_e, exchangeable Ca, well-crystallized Al, sand, amorphous Al, Ni in soil and Zn in plant for the lettuce samples produced in SP and PE are shown in Figure 4. Samples collected in SP presented relatively higher values for almost all of these components than the samples obtained from PE. França et al. (2017) studied lettuce production in PE, and although the Zn content in their soil samples was higher than those observed by Santos-Araujo and Alleoni (2016), the Zn content in the plant was much lower. Because different varieties of lettuce present different Zn uptakes even when cultivated under the same soil conditions (França et al., 2017), lettuce varieties grown in SP may differ from varieties cultivated in PE, resulting in different levels of soil-plant transference.
The strong effect of soil variables on the plant classification could be explained by the relevance of soil properties to plant uptake. A previous study carried out by Santos-Araujo and Alleoni (2016) showed that the most important covariates for predicting the Zn content in vegetables sampled in SP were CEC_e, pH, organic carbon, and the pseudo-total content of Zn and Cu. As production is the result of cultivated area multiplied by yield, it is possible to use soil productivity to infer the incorporation of agricultural technology (Camargo Filho and Camargo, 2017). Therefore, the inclusion of soil parameters in the model for plant classification complements the assortment and may give insights into the geographical origin of lettuce.

Conclusion
The Sustainable Capitol Hill (SCH) and Michigan State University list several reasons for customers to buy and consume locally produced food (Klavinski, 2013; SCH, 2019). Because local food involves a shorter time and less transportation effort from harvest to the customer's table, it is likely to be safer to consume, fresher, less contaminated, more flavorful, and higher in nutritional value. It is also easier for customers to monitor the food origin and investigate practices and substances used to grow and harvest the crops. Purchasing local food also benefits the local economy, as the money is retained within the community and reinvested in local businesses and services, and it supports local farmers, which is of considerable importance in economic and food supply crises.
Verifying the geographic origin of food is a substantial matter to ascertain that this important kind of food was produced by a trusted source that takes quality and safety into account. In this study, we proposed a novel methodology to determine the geographical origin of Brazilian lettuce based on its elemental composition and soil properties through the use of SVM, LDA, and feature selection. We analyzed the contents of several chemical variables and soil properties determined for 72 lettuce samples obtained from São Paulo and Pernambuco States in Brazil. Through the use of a filter method for feature selection, we estimated that soil cation exchange capacity, exchangeable Ca, well-crystallized Al, sand, amorphous Al and Ni in soil, Zn levels in the plant and the sum of bases (Ca²⁺ + Mg²⁺ + K⁺) were generally the most important variables for differentiating lettuce samples produced in both regions. We developed classification models based on SVM, which were capable of discriminating lettuce samples from both regions with a high accuracy level, presenting approximately 98.67 % correct predictions when all 28 chemical variables were used for training, and 95.81 % correct predictions when only the most important variables were used for training. These values surpass those obtained by the LDA model, a well-known, reliable and widely employed model for classification of food samples, which scored 66 % and 86.29 % prediction accuracy when all variables and only the best rated variables were used for training, respectively. The values achieved show that, when combined with the chemical composition of lettuce samples determined by ICP OES and certain soil properties, classification models based on SVM can successfully determine the geographical origin of lettuce samples with excellent accuracy, and that data mining techniques can support traceability strategies and help ensure vegetable authenticity. Our previous literature search reveals that this is the first attempt to discriminate the geographical origin of Brazilian lettuce samples using a machine learning technique for data mining, such as SVM, applied to chemical composition and soil parameters.