1. Introduction

Individuals choose their travel mode considering diverse factors, classified into three groups: (1) Characteristics of the trip maker; (2) Characteristics of the journey; (3) Characteristics of the transport facility. Thus, travel behavior involves household and personal characteristics, travel variables and spatially correlated factors (^{Ortúzar and Willumsen, 2011}).

Several works corroborate the assertion that travel behavior, especially for the case of mode choice, is strongly related to locations, characterized by urban density - compact vs. spread- out cities -, distribution of economic activities and presence of Traffic Analysis Zones (TAZ) with mixed activities (^{Cervero and Radisch, 1996}; ^{Kitamura et al., 1997}).

With the advances of technology, geo-referenced information has become more widely available. Travel demand spatial analysis has been identified as an emerging research area (^{Páez et al., 2013}), hence it is now possible to take in account the variables spatial influence and, consequently, incorporate space more effectively into travel.

Recently, researchers have found that travel behavior exhibits signs of spatial interdependence. ^{Bhat and Zhao (2002}) identified the spatial issues that need to be recognized in demand modeling, proposing a multi-level, mixed logit, formulation to address these spatial issues in the context of activity stop generation in the Boston Metropolitan area. ^{Páez et al. (2013}) introduced a new indicator of spatial fit that can be applied to the results of discrete choice models. ^{Peer et al. (2013}) applied geographically weighted regression for the approximation of door-to-door travel times in departure time choice models.

Among the techniques of spatial analysis, geostatistics is to be highlighted. Geostatistics enables the development of studies involving spatial autocorrelation, allowing mainly estimating the value of a variable in locations where values are unobserved.

The objective of this study is to propose a sequential method to estimate the mode choice in known geographical coordinates (sampled households) and also in non-sampled households.

The method is composed of a sequential application of Decision Tree (DT) analysis and Ordinary Kriging (OK). The DT model estimates probabilities of mode choice in known coordinates. Besides, DT analysis determines the variable to be kriged in the later stage. The OK application can be conducted only with numerical variables - the probabilities estimated by DT model. The OK was applied in one region of the study area. Geo-referenced disaggregated data were used in this work.

The proposed method can be an alternative to traditional approaches. It consists of two main steps: a non-spatial and a spatial model. Decision Tree analysis is presented here as an alternative to traditional econometric models. In spatial modeling stage, the Ordinary Kriging works as a technique that presented a major advantage over other spatial confirmatory techniques as Geographically Weighted Regression, for example. The OK allows estimating probabilities of choice in places not surveyed.

The next section summarizes the recent literature regarding transport and geostatistics and data mining and travel mode choices. The section 3 describes the rationality of the proposed method. In Section 4 DT and OK results are presented, as well as detailed discussions. Finally, the last section presents the work's general conclusions.

2. Background

2.1 Geostatistics and Transport

There are few applications of geostatistical methods on transportation data and all of them are recently. Is to be noted that most papers available refers to traffic engineering studies (^{Ciuffo and Punzo, 2011}; ^{Mazzella et al., 2011}). ^{Miura (2010}) presents an approach for predicting car travel time by kriging with good results, indicating that 95% prediction limits are between ±10 minutes and ±30 minutes for travel between two arbitrary points. This prediction method is effective for urban districts with links having changeable travel time.

^{Zou et al. (2012}) proposed an improved distance metric called approximate road network distance (ARND), for solving the problem of the invalid spatial covariance function in kriging caused by the non-Euclidean distance metric. ^{Wang et al. (2012}) proposed to solve the traditional research method of statistics on the floating car speed with geostatistics. The spatial structure and the interpolation of floating car speeds are analyzed by exact floating car speed data of the study area in Beijing in May 2005.

However, the use of geostatistics in problems concerning the transportation demand or travel behavior is extremely recent (^{Ji and Gao, 2010}; ^{Peer et al., 2013}; ^{Pitombo et al., 2010}; ^{Pitombo et al., 2015}). So, the focus of this paper is to present not so trivial techniques in the study of travel behavior, for estimation of mode choice, incorporating spatial factors.

2.2 Data mining and Travel Mode Choices

For decades, until nowadays, authors have been investigating the factors that influence mode choice, through different models, such as logit, probit and techniques of data collection as stated preference and revealed preference (^{Sen et al., 1978}.; ^{Ahern and Tapley, 2008}).

Most traditional mode choice models are based on the principle of random utility maximization derived from econometric theory. In addition to the econometric techniques used to estimate travel behavior and mode choice, there are several studies that used data mining techniques to investigate the travel behavior. Mode choice modeling can be regarded as a pattern recognition problem in which multiple human behavioral patterns reflected by explanatory variables determine the choices between alternatives (^{Xie et al., 2007}).

^{Xie et al. (2007}) investigates the performance of two emerging pattern recognition data mining methods, decision trees (DT) analysis and neural networks (NN), for work travel mode choice modeling. ^{Shmueli et al. (1996}) explored the application of neural networks to a behavioral transportation planning problem. The transportation issue explored is a comparison of travel demand patterns of men and women in Israel. ^{Pitombo et al. (2011}) analyzed relationships between socioeconomic, land use, activity participation and travel patterns with Decision Tree (Classification and Regression Tree algorithm) modelling.

The first step of the method here presented is to estimate mode choice, considering socioeconomic variables and measures of the transport system quality, by applying the technique of Decision Tree Analysis (Classification and Regression Tree algorithms). Analysis of DT may be an alternative approach to traditional econometric models because, despite being a non-parametric technique, has no constrains related to input variables (categorical, numeric or dummy) and do not have assumptions such as normality, linearity and multicollinearity. Moreover, it has a high percentage of success even in conditions which are known only socioeconomic variables, with information relating to individuals and not to alternatives.

3. Material And Method

To reach the goal of this work, the stages of the proposed method, summarized in the subsequent subsections, have been fulfilled.

3.1 Data Base Treatment

The study area of this work is the city of São Carlos (São Paulo/Brazil), with 221,936 inhabitants. 96% of the population lives in urban areas, which covers, approximately, 105 km² of 1137.30 km² of total area (^{IBGE, 2010}).

The data used for the development of this work is from the Household interview and Urban Transportation Evaluation Survey applied together in Origin-Destination Survey of 2007/2008 in the city (^{Rodrigues da Silva, 2008}). 5% of households was randomly selected. The Urban Transportation Evaluation Survey, which contains qualitative data of the transportation system of São Carlos, was applied to a resident of each household interviewed, in a total of 2,791 cases. This database, associated with the household interview was used as the basis for data processing and obtaining the final sample of this study.

The database preliminary analysis process led to the elimination of some samples when one, or more, of the following situations occurred: (1) inconsistent or missing data, (2) people who did not travel and (3) households with repeated geographic coordinates. Thus, the final sample contains 1,216 individuals, 22 categorical variables and 4 numeric variables. Such information was then associated with geographic coordinates (latitude and longitude in meters) relative to the households. The main variables in the sample and its description can be found in Table 1.

3.2 Decision Tree Analysis application - Determining the Numeric Variable

Decision Tree (DT) analysis, especially the CART algorithm, is a simple representation for the existing relationship within a data set (dependent variable - independent variables). It consists of a sequential binary partitioning of the dataset considering the values of the variables. Tree models are fitted by successively splitting the data to form homogeneous subsets, being the result a hierarchical tree of decision rules useful for prediction or classification (^{Breiman et al., 1984}).

CART is a segmentation modeling technique that satisfies the following properties: (a) The hierarchy is called tree and each segment is known as node; (b) The root node contains the complete database; (c) The root node is divided sequentially, generating child nodes; (d) When no further data subdivision is possible, the final subgroups are considered terminal nodes or leaves;(e) For construction of the CART, three main elements should be determined: a set of questions delimiting data division, a criterion for evaluation of the best division and a rule for termination of the further subdivisions (stop-splitting rule).

The CART application in this work was used to investigate the variables that contribute to understand individual mode choice. Another key objective for the use of DT analysis was the achievement of the continuous variable for application of Ordinary Kriging- mode estimated probabilities of travel modes.

The geostatistics technique used in the next step of this work, consists of using numeric variables, avoiding the use of the original categorical variable "main travel mode" for the spatial interpolation of mode choice.

Accordingly to the above described, the DT analysis was generated with a sample of 1,216 cases, using the CART algorithm and adopting the parameter of a minimum of 25 observations per terminal node (the stop-splitting rule considering the sample size and the desired homogeneity of the groups). The dependent variable was "Main travel mode", consisting of three categories (1 transit, 2 car/motorcycle and 3 non-motorized). The independent variables were socioeconomic, travel characteristics and the qualitative measures of transport system, as shown in Table 1.

3.3 Spatial Patterns Analysis

Following the steps of the proposed method, exploratory maps of probabilities of travel mode choice, obtained by DT model, were generated in order to observe spatial patterns. Good estimates for spatial interpolation depend, mainly, on spatial structure of the variable to be kriged. Figure 1, Figure 2 and Figure 3 present the spatial distribution of the probabilities of choosing the private motorized mode, transit and non-motorized travel mode, respectively, derived DT modelling.

From the exploratory spatial analysis (Figure 1, Figure 2 and Figure 3), was verified that the variables to be kriged did not present apparent spatial pattern. Thus, it was decided to segregate the city of São Carlos (São Paulo - Brazil) into small regions, considering the criteria of income, with the objective of defining spatial patterns on smaller regions.

As a result, the Cluster Analysis technique was applied considering the variable income (categorical, by Minimum Wages) and geographic coordinates of households (latitude and longitude in meters). A Two-Step Cluster method was adopted for a total of six groups / regions (Figure 4). The legend shows the percentage of households in the lowest income range (0-2 Minimum Wages) in each region. A detailed analysis of the probability maps for each of the six regions separately, revealed that Region 2 was the one that presented a more diffusive spatial pattern, ranging from the center to the periphery of the area, as shown in Figures 5 (a, b and c).

This region consists of both low-income households (center of the region) and higher income (peripheral neighborhoods). In addition, Region 2, is a particular one in the city hence it includes the University of São Paulo second campus, an important factor that affects the city dynamics.

Therefore, in order to estimate all the area of Region 2 (Santa Felicia), including the locations where mode choice was not observed, a database composed by 110 points was submitted to Ordinary Kriging.

For purposes of geostatistics modelling validation, around 30% of points were randomly selected for model test. Resulting in 30 points selected for regions 2 to be used in cross validation. Cross-validation allows the validation of the estimated values at the same time as assesses the goodness of the fitted parameters used.

3.4 Ordinary Kriging (OK) Application

a) Choice of Regionalized Variables

Geostatistics, in general, works with data that has a spatial structure. If, for example, measurements are taken at two different points, differences in the measured values decreases as the two points come closer to each other (^{Matheron, 1971}). The variables, called regionalized, are distributed in space and composed as random functions having a given spatial structure, or, in other words, having a given spatial correlation. The study of regionalized variables starts from the ability to interpolate a given field starting from a limited number of observations, but preserving the theoretical spatial correlation (^{Goovaerts, 1997}).

In this paper, three regionalized variables were used, which have been obtained from Decision Tree Modelling: i) probability of choosing the car, ii) probability of choosing a non-motorized travel mode; iii) probability of selection of transit.

b)Variogram

The empirical variogram allows the quantitative representation of the variation of a regionalized variable in space. The variogram generates information used in kriging algorithm. The variogram function is defined as half the average square difference between points separated by a distance h (Matheron, 1963). The variogram function is calculated as:

Where N(h) is the set of all pairwise; z(xi) and z(xi+h) are data values at spatial locations i and i+h, respectively.

After obtaining the experimental variograms, a mathematical function is to be fitted, the one that best represents the variability in study. Of the various theoretical models for adjustments of variogram, the most frequently used are Spherical, Gaussian and Exponential. In this step, the experimental variogram is replaced by a theoretical variogram function, from which is possible to obtain the main parameters for spatial modelling: nugget effect (C0), Range (a) and Sill (C+C0), see Figure 6.

In this work, experimental variograms (constructed based on observed points) for the three variables obtained by DT analysis were constructed and theoretical models were adjusted with spherical functions.

Table 2 summarizes the parameters of the theoretical variograms for each regionalized variable for Region 2. Figure 7 (a, b and c) illustrates only the theoretical variograms in the main direction for the three regionalized variable for Region 2: Probabilities of private motorized travel mode; Probabilities of transit and Probabilities of non-motorized travel mode.

c) Cross Validation

Cross Validation is a simple way to compare various assumptions either about the model (e.g. type of function to be adjusted, parameters of variograms) or about the data. In the cross validation procedure, each sample value Z(xi) is removed in turn from the data set and a value Z*(xi) at the location is estimated using the remaining n-1 samples. The difference between a data value and the estimated value (Z(xi) - Z*(xi)) gives an indication of how well the data value fits into the neighborhood of the surrounding data values (^{Wackernagel, 2010}).

In order to assess the accuracy of the model determined in theoretical variograms, parameters such as correlation coefficient, mean of residuals and variance of errors, were calculated taking in account the observed and estimated values (30 points randomly selected for Regions 2).

d) Ordinary Kriging (OK)

Ordinary Kriging is the most widely used kriging method. Its main goal is to estimate a value at a point of a region, for which the correspondent variogram is known, using data in neighborhood (^{Wackernagel, 2010}). OK is a method that is often associated with the acronym B.L.U.E. for "best linear unbiased estimator". OK is linear because its estimates are weighted linear combinations of the available data. It is unbiased since it tries to have the mean residual equal to zero. It is best because it aims at minimizing the variance of errors (^{Isaaks and Srivastava, 1989}).

For the prediction of the variable Z at a location x0, {Z(x0)}, the estimator Z*(x0) is defined as (^{Goovaerts, 1999}):

where the (i are weights found by solving the system of equations,

With *γ* (h) being the theoretical model for the variogram of the variable Z (fitted to the sample variograms) and ( being a Lagrange multiplier.

For the interpolation by OK of the three choice probabilities of travel mode, a grid of 100 x 100 meters was established, being its dimensions based on the distance between households in the study area. Kriging maps were generated by interpolation for the three travel modes.

4. Discussions And Conclusion

4.1 Results of DT: Determining the Numeric Variables to be Kriged

The DT, shown in Figure 8, illustrates that the majority of respondents use private motorized mode (60.6%), then the non-motorized mode (21.2%) and transit (18.2%). Here is also possible to observe the variables selected for DT model and their relationship with mode choice. The most important variable (which best explains the data variability considering mode choice) was "Driver's License", splitting data into 2 main branches:

(1) Individuals that have driver's license (Node 1- 74.6% - uses car/motorcycle, 13.0% uses non-motorized mode and 12.4% uses transit); (2) Individuals that do not have driver's license (Node 2 - 44.6% uses non-motorized mode, 34.7% transit and 20.7% uses car/motorcycle).

The selection of "Driver license" as the most important variable in travel mode decision could be justified by the strong correlation of this variable and income and car ownership. These variables are known to have robust influence on mode choice.

Subsequently, CART algorithm fragmented dataset into groups. The groups were then divided successively considering the independent variables values. At the end of data segregation 11 terminal nodes were obtained. The terminal nodes are the basis to evaluate travel behavior.

Therefore, one can observe the relationship of the following variables on car/motorcycle use: having driver license, having car or motorcycle at home, being below 59 years old, being male and being worried with the lack of parking and high cost car travel. Table 3 summarizes the relations (effects) of each variable on choosing travel mode, considering the 11 terminal nodes obtained. DT models also presented a good accuracy: 78% for private motorized mode, 83% for transit and 80% for non-motorized mode.

For the analysis with Ordinary Kriging, the results of the DT model were used as regionalized variables: Probability of choosing the car/motorcycle, Probability of choosing transit and Probability of choosing the non-motorized travel mode. These data are linked to geographic coordinates of 110 samples of Regions 2.

4.2 Results of Ordinary Kriging

As previously mentioned, approximately 30% of the samples of each region was randomly selected for cross-validation purposes, which allowed obtaining for 30 surveyed points in regions 2, the correspondent estimated values according to probability of travel mode based on the determined variograms.

In order to access quality of obtained models some statistical parameters, such as correlation coefficient, mean of residuals and variance of errors, were calculated and can be observed in Table 4. In addition, the percentage of correct estimation of travel mode, considering the estimated and observed probabilities of travel mode, was also obtained.

Upon analysis of cross-validation results, one can see that, despite the low correlation between observed and estimated values, there were also low values of the mean of the residuals and variance of errors. However, is to highlight that the percentage of correct estimation values is, with exception for Private Motorized Travel Mode Probabilities of Region 2, reasonably good. The travel mode probability was estimated for a 100 x 100 meters grid, for the referred region. Maps generated by Ordinary Kriging interpolation for the three travel modes are illustrated in Figure 9.

Results in the case of the probability of using private motorized travel mode, show that this travel mode is more likely to be used in the periphery of the region. Thus, the predisposition for car usage decreases considering the center proximity. Is to be highlighted that the main direction of this variogram (N30E) is clearly translated to the respective map. The kriging results for the transit and non-motorized travel mode also could be observed in the same figure. The predisposition to use other travel modes is opposing to the car usage. The tendency to use transit and non-motorized travel mode increases from the periphery to the center area.

The results of spatial interpolation are consistent with the reality of the region. Locations with higher probability of car/motorcycle use are exactly those corresponding to neighborhoods of higher income population. Conversely, the center of Region 2, which is the least probable to use the car, corresponds to low-income households.

4.3. Conclusion

The results obtained in this study allowed determining the probabilities of mode choice in household locations where choices are unobserved through this procedure (sequential application of Decision Tree analysis and Ordinary Kriging).

Through the application of DT, relationships between socioeconomic and transportation system variables and mode choice were found as expected, and found in literature (^{Bhat, 1997}; ^{Pas, 1984}). The eleven terminal nodes, found in this work, synthesize groups of individuals susceptible to use a particular travel mode, considering socioeconomic and transportation system characteristics, selected by the data partition algorithm.

The resulting maps, obtained from Ordinary Kriging, allowed determining that there is a trend in the use of private motorized travel mode, which increases from the center to periphery. While, opposite trend was observed for the other travel modes in the same area. Cross-validation showed good results considering mean of residuals and variance of errors. Besides, a percentage of correct responses over 70%, in case of non-motorized travel mode and transit, were achieved.

An important aspect to be taken into consideration is that the regionalized variables are unnatural (not directly measured), were produced by a nonparametric model, the DT model. Moreover, the process developed in this study probably suffers from the influence of different errors from this type of data and the use of sequential estimation models (spatial and non-spatial).

Nevertheless, the innovative characteristic of this study should be taken into account. The two-step method presented is based in unusual techniques in the analysis of mode choice. However, the results show the success in this combination, which allowed a preliminary assessment of spatial particularities of the study area, and, it also emphasized the necessity of robust/solid information basis, special when dealing with questionnaires.