MODELING SPATIAL EFFECT ON TRAVEL MODE CHOICE USING A SYNTHETIC SPATIALLY CORRELATED DATA SET

: Urban dynamics can be characterized more effectively by considering spatial aspects in studies. This paper, using a synthetic spatially correlated data set, aims to model the spatial effect on travel mode choice based on geostatistics precepts. A method was proposed based on three main steps. The first step consists of building synthetic spatially correlated data, using the intrinsic spatial dependence on travel demand data and mathematical principles of bilinear interpolation. The following two steps correspond to the modeling approach. The Exploratory Spatial Data Analysis stage aimed to attest the existence of spatial autocorrelation of the data set using two indicators: Moran and G-SIVAR (Global Spatial Indicator Based on Variogram). The Confirmatory Spatial Data Analysis stage proposed the calibration of two Binomial Logit models. The first model includes only the original database variables (non-spatial model). The second one is analogous to the original but added to spatial covariates obtained by geostatistical concepts (spatial model). A 15% increase in cross-validation hit rates is achieved when spatial variables are included. This paper presents three significant research contributions: (1) The methodological procedure to model spatial effect on travel mode choice; (2) The proposal of spatial covariates based on geostatistical assumptions; and (3) The suggestion of a simple procedure to propose a simulation of a spatially correlated database.


Introduction
Discrete choice models are a traditional tool, widely used in travel demand studies, mainly in travel mode choice problems (Ben-Akiva, 1974;McFadden, 1974;Bhat, 1995;Hess, 2005;Bhat et al., 2008;Ahern and Tapley, 2008;Qin et al., 2017). However, it is worth mentioning that it is challenging to know all the factors that affect individual decisions since there is a heterogeneity underlying travel behavior. Moreover, home and destination locations, the spatial distribution of activities in the urban environment, and land-use variables can influence travel mode choices, as well as personal and alternative attributes.
So, the spatial analysis of travel demand has become a potential line of research, especially given the requirement of including spatial effects on mathematical models (Páez et al. 2013). Hence, many studies (Miyamoto et al., 2004;Zhou, 2012;Páez et al., 2013;Pitombo et al., 2015;Lindner and Pitombo, 2018;Assirati, 2018;Lindner et al., 2021) started to incorporate variables related to geographic location to travel demand forecasting studies, promoting the improvement of estimates. So, this paper mainly proposes to model the spatial effect on travel mode choice, taking into account the spatial analysis of the travel demand research field.
The spatial analysis consists of a set of techniques that are designed to analyze the geographical position of observations of a variable, evaluating the possible relationships between the values of the variable and different locations. Several questions can be answered through spatial analysis. The spatial analysis of a phenomenon comprises a preliminary, exploratory step, in which the degree of spatial dependence, either locally or globally, of a data set is visualized or measured. After confirming the spatial dependence of the data, the confirmatory spatial analysis stage is indicated. Consequently, models started to be employed to include the spatial structure of the data.
Many studies carried out only Exploratory Spatial Data Analysis (ESDA), considering different research fields (Anselin, 1996;Unwin and Unwin;1998;Messner et al., 1999;Le Gallo and Ertur;2003). Among the ESDA procedures, the degree of spatial dependence can be measured by spatial indicators (Naizer et al., 2019). Over many decades, different authors proposed spatial indicators to corroborate the existence of spatial autocorrelation (Moran, 1950;Geary, 1954;Getis and Ord, 1992;Anselin, 1995). Global indexes qualify the database, while local indexes characterize the spatial association of observations, identifying spatial association pockets. Most indexes could be applied to quantitative variables (Anselin, 1995;Naizer et al., 2019).
In this paper, the global indicators Moran and G-SIVAR (Global Spatial Indicator Based on Variogram) were used in the ESDA methodological stage. The Moran global indicator is based on correlograms (also known as autocorrelation diagrams, are graphs of the sample's autocorrelations versus distance units). It indicates the similarity of the data and ranges from -1 to 1. Negative values indicate negative spatial autocorrelation, while positive values indicate positive spatial autocorrelation. The value 0 indicates the absence of spatial correlation (Moran, 1950;Anselin, 1995).
The global indicator G-SIVAR (Naizer et al., 2019) is based on assumptions of geostatistics, more specifically, on the semivariogram tool (theoretical and experimental). The G-SIVAR indicator is based on the standardized values of the theoretical variogram (ordinate axis), associated with a hypothesis test for spatial randomness. It indicates the dissimilarity of the data and ranges from 0 to 1, where value 1 shows the complete absence of spatial autocorrelation, and value 0 indicates the entire presence of spatial autocorrelation. The original studies of Moran (1950) and Naizer et al. (2019) should be consulted if the reader wants more details about the mathematical formalism used.
The Confirmatory Spatial Data Analysis (CSDA) includes the set of estimation models, in addition to validation procedures. In recent years, different spatial models have been used in the travel demand field. There are various studies regarding Geographically Weighted Regression application (Chow et al., 2006;Nowrouzian and Srinivasan, 2014;Pitombo et al., 2010) or spatial regression models that consider the spatial autocorrelation including an explanatory spatial variable (Wang, 2001;Lopes et al., 2014). This paper presents a kind of spatial regression model for discrete choice models. The CSDA stage of this study took place applying Binomial Logit models based on the inclusion of spatial covariates.
Decisions between travel modes are a classic example of discrete choice, where one must choose an alternative within a finite set of possibilities (car, public transportation, walking, etc.). The attributes of travel mode options (such as travel time, cost, etc.) and the attributes of individuals (income, age, gender, etc.) are used to calculate the probability of choosing the possible travel mode. It is known, however, that the neighborhood can influence the attributes involved in the discrete choice models. Many techniques have been proposed to deal with discrete choice estimation when spatial dependence is present, and the well-known Spatially Dependent Discrete Choice Models have been applied in recent years (Fleming, 2004;Dugundji and Walker, 2005;Sener et al., 2011;Paez et al., 2013). Then, regionalized or spatial variables started to be incorporated into discrete choice models to promote better estimates. Using an approach based on geostatistics precepts, this research proposes to extract spatial covariates providing increments to parametric modeling. The proposal of an independent spatial variable, based on the geostatistical approach, could be considered a significant methodological contribution.
Thus, this study seeks to attest the spatial association of the data set in the ESDA stage using a synthetic spatially correlated data set. Next, the comparison of two Binomial Logit models is made in the CSDA stage. The first model, with only traditional covariates and the second one incorporating spatial covariates based on geostatistics assumptions.
Additionally, despite the observed advances in population synthesizers for travel data microsimulation, none of the approaches recognize the spatial correlation of data as a relevant input to reproduce travel behavior. Recently, Sequential Gaussian Simulation was presented as a promising simulation tool in Travel Demand Modeling (Lindner and Pitombo, 2019). Then, the specific objective of this paper is to propose a simulation of a spatially correlated database, using the intrinsic spatial dependence on travel demand data and mathematical principles of bilinear interpolation.
This paper presents three notable research contributions: (1) The methodological procedure to model spatial effect on travel mode choice; (2) The proposal of spatial covariates based on geostatistical assumptions; and (3) The suggestion of a simple procedure to propose a simulation of a spatially correlated database. This article has five sections, in addition to this introduction. Section 2 provides a brief definition of the tools used in the methodological procedure. Section 3 describes the initial database used for data simulation. Section 4 describes the method adopted. Section 5 presents the results and discussions. Finally, Section 6 presents the main conclusions and contributions.

Linear and bilinear interpolation
The linear interpolation method is used to extrapolate new values within a discrete range of a previously known dataset. When having a discrete mesh containing some known values, it is useful to infer a value for a location that has no given value. Assuming that to estimate the value A 1 at a point located at the coordinates (x, y 1 ) and previously knowing values V 11 and V 21 , respectively located at the coordinates (x 1 , y 1 ) and (x 2 , y 1 ) as shown in Figure 1a, the value of A 1 from Equation 1 can be inferred: This expression can be interpreted as a weighted average where the weights ( 2 − 2 − 1 ) and ( − 1 2 − 1 ) are inversely related to the distance from endpoints V 11 and V 21 to point A 1 whose value is unknown. Thus, point V 11 , being closer to A 1 , has more influence than V 21 in the interpolation process. The weights should be normalized by the distance ( 2 − 1 ) between V 11 and V 21 since their sum should be 1.
From the example presented in Figure 1a, one can note that the interpolation for point A 1 involved only V 11 and V 21 values and their x-axis coordinates. However, many areas of knowledge deal with data spread in a two-dimensional mesh. Therefore, it is desirable to extend data interpolation to a bidimensional scenario known as bilinear interpolation. Bilinear interpolation can be interpreted as the performance of linear interpolation in one direction, followed by a second application of linear interpolation in the other direction. Although calculations are applied in linear terms at each step, the bilinear interpolation, as a whole, is a quadratic metric at the location whose value is needed to be inferred. In addition to points V 11 , V 21 and A 1 of the previous example (Figure 1a), let us now consider points V 12 , V 22 , A 2 and P located respectively at the coordinates (x 1 ,y 2 ), (x 2 ,y 2 ), (x,y 2 ) and (x,y) presented in Figure 1b.
The bilinear interpolation method must be used to obtain the value of P in the (x, y) coordinate. It implies that for the x-axis, a linear interpolation should be applied to obtain the value of A 1 (Equation 1), and similarly to obtain the value of A 2 (Equation 2): Then, for the y-axis, linear interpolation is applied one more time to obtain the value of P finally, according to Equation 3: At first, points A 1 and A 2 were obtained by a weighted average for elements only belonging to the x-axis and values V 11 , V 21 , V 12 , and V 22 . In a second moment, point P was obtained by a weighted average for elements only belonging to the y-axis and values A 1 and A 2 . It is important to note that the final result of bi-linear interpolation does not depend on which axis was used in the first and second calculations. If one initially uses the y-direction for the calculations, followed by the x-direction, an analogous result must be obtained.

Variograms
Adjusting experimental variograms (obtaining theoretical variograms) is the procedure to measure spatial dependence before and after data simulation. Additionally, the semivariogram was the tool used for the G-SIVAR (Global Spatial Indicator based on the Variogram). G-SIVAR is based on the standardized values of the theoretical variogram (ordinate axis), associated with a hypothesis test.
The variogram is the mathematical description of the relationship between the variance of pairs of observations and the distance separating them (h). In the experimental determination of the variogram, for each value of "h", all pairs of observations z (x) and z (x + h) are considered, according to Equation 4 (Krige, 1951;Matheron, 1971;Cressie, 1993): where: γ (h) is the variance; N(h) is the number of measured value pairs; and z(x) and z(x+h) are pairs of observations separated by the vector "h".
After the calculation of the experimental variograms, a mathematical model that best represents the variability under study should be established. From the various theoretical models available for variogram adjustments, the most applied in geostatistics are the spherical (Figure 2), Gaussian, and exponential. There are some parameters of the theoretical variogram, illustrated in Figure 2, that must be observed to determine the spatial structure of the variables.  (Cressie, 1993).
(1) Range (a): the distance where the samples are spatially correlated; (2) Sill (C 0 +C): the maximum value of γ (variance) on the range curve (a). It is obtained by the sum of the nugget effect (C 0 ) and the contribution (C); (3) Nugget Effect (C 0 ): is the starting point of the range curve (a) touching the γ-axis. The nugget effect reflects the random variance for short distances. A high value of this parameter indicates that significant variations are found in short distances.
In this paper, taking into account the step of building synthetic spatially correlated data to analyze the spatial structure of the variables (total of six) that compose the initial and simulated data samples, the main parameters of the theoretical variograms of the six variables will be observed.

Initial database
Initially, we had a dataset representing a fictitious locality. The set consisted of 60 hypothetical coordinate points that ranged from 0 to 10 in steps of 0.5 for the x-axis and from 0 to 5 in steps of 0.1 for the y-axis. Associated with each of these 60 points were seven variables: Travel distance, Bus travel time, Car travel time, Bus fare, Car travel cost, Income indicator and a dichotomous choice variable that represented the use of bus (0) or car (1) by the individual represented by each of the 60 locations. Figure 3 shows the original points within the fictitious city, with the marks representing the travel mode choices. Table 1 shows the descriptive measures of the variables in the initial database.

Methodological procedure
A method was proposed based on three main steps. The first step consists of building synthetic spatially correlated data, using the intrinsic spatial dependence on travel demand data and mathematical principles of bilinear interpolation. The following two steps correspond to the modeling approach. The ESDA stage aimed to attest the existence of spatial autocorrelation of the data set using two indicators: Moran and G-SIVAR (Global Spatial Indicator Based on Variogram). The CSDA stage proposed the calibration of two Binomial Logit models. The first model includes just the original database variables (non-spatial model). The second one is composed by the original variables and some spatial covariates obtained by geostatistical concepts (spatial model). Figure 4 illustrates the methodological procedure. The following subsections describe the methodological stages.
Linear interpolation can be achieved by any method, from manual calculations to computational packages (which we suggest python). The spatial dependency calculations are made using the G-SIVAR R package accessible via https://github.com/pedreirajr/g-sivar.

Step 1: Synthetic spatially correlated data
The present article proposes a method based on bilinear interpolation to supply absences and simulate a spatially correlated data set from pre-existing values. This simulation is analogous to a disaggregation or even filling gaps left by eventual data loss. The accuracy of the procedure is assessed by nonparametric statistical tests (Kolmogorov-Sminorv and Mann-Whitney) that indicate the similarity of population distributions of the new data generated compared to the original data used as the basis for the simulation. Once desirable accuracy is obtained, experimental variograms of the pre and post-simulation bases are modeled to evaluate whether the spatial structure of the data has been maintained.

Step 2: Exploratory Spatial Data Analysis (ESDA)
The ESDA of this study consisted of calculating two indicators of global spatial association: Moran and G-SIVAR. To do this, a division of data into performance area ranges was required. Initially, for ESDA, and later, for the proposal of the spatial covariates.
The dimensions of the sample space are ten units on the X-axis and five units on the Y-axis. Thus, the longest possible distance in this space is the main diagonal, with approximately 12 units. This concept is inherited from geostatistics and is called cut-distance. Thus, four neighborhood strips were formed through the equal division of the diagonal into four plots: (I) 0-3 units; (II) 3.1-6 units; (III) 6.1-9 units; (IV) 9.1-12 units. The constitution of the bands is also another concept inherited from geostatistics and is called lag-distance. It refers to the segments, delimited by the cut-distance. All lag-distances are commonly the same size and equidistant.
After calculating both indicators for the given distance ranges, not only the values of the indicators should be evaluated, but also the statistical significance for spatial autocorrelation. Thus, the neighborhoods considered for determining the spatial covariates will be those statistically significant (p-value ≤ 0.05) for the hypothesis tests associated with the calculated global indicators.

Proposal of Spatial Covariates
This step was one of the most important of the methodological procedures. It constituted an essential contribution, taking into account the importance of representing spatial dependence through an independent variable. This stage followed the process proposed in previous study (Assirati, 2018).
Based on the geostatistical precepts, the procedure was established to measure the probability that an individual belongs to a particular category (analyzed variable) according to the values of their neighbors (regional character). The approach aims to complement the traditional parametric analysis by including spatial covariates on discrete choice models (Binomial Logit models in this study).
In this article, the spatial covariates are the probabilities of the neighbors (determined from the four neighborhoods) to choose the two categories of travel mode adopted in this study: bus or car.
To calculate the probabilities, the first step consists of establishing the number of neighborhoods. We suggest measuring the maximum distance between the two points most distant from each other and divide this distance into equal portions to determine n neighborhoods of interest, which should not overlap.
Once neighborhoods are established, probabilities of belonging to the categories are calculated by individuals, given the categories of their neighbors. For each individual and neighborhood, the number of elements included in a region belonging to each category is counted. The probability of this individual belonging to each of the categories is the ratio between the number of elements in a given category and the total number of elements found in the considered neighborhood. Figure 5 shows an example of the procedure. If one desires to characterize an individual, located at point X ( Figure 5), according to the probabilities of belonging to a certain category based on neighborhoods, the procedure should be as follows: a) Given the maximum dimensions of the data distribution, space is divided into a number n of neighborhoods of interest. For the example in Figure 5, the data are arranged in a quarter circumference with a radius of size 4L. Thus, it is possible to delimit n = 4 bands of size L each. Each range will, therefore, be a neighborhood for analysis.
b) The number of elements from all categories present in the first neighborhood is counted. For example, in the first strip, there are four red elements of category A and no black elements of category B. Therefore, an individual located in X has the following according to the first neighborhood (n = 1): -P 1 (A) = 1 (4 elements out of a total of 4) of being category A; and -P 1 (B) = 0 (0 elements out of a total of 4) of being category B. c) It works similarly for the other neighborhoods (n = 2, 3, and 4). Thus, individual X is characterized as: -P 2 (A) = 0.7 (7 elements out of 10) of being category A; -P 2 (B) = 0.3 (3 elements out of a total of 10) of being category B; -P 3 (A) = 0.5 (8 elements out of a total of 16) of being category A; -P 3 (B) = 0.5 (8 elements out of a total of 16) of being category B; -P 4 (A) = 0.2 (4 elements out of a total of 20) of being category A; and -P 4 (B) = 0.8 (16 elements out of a total of 20) of being category B.
d) The previous procedures are repeated for all points in the sample.
Thus, once the spatial dependence is proven in the ESDA stage and the global spatial association indicators are calculated, the spatial covariates are proposed, based on the neighborhoods considered statistically significant in the ESDA stage.

Binomial Logit Calibration (non-spatial model)
This stage corresponds to the calibration of the non-spatial discrete choice model. The determination of utility functions, in this step, is based on the variables that represent the travel mode alternatives (bus travel time, car travel time, bus fare and car travel costs), the trip (travel distance), as well as individual variables (income). The response variable is the binary mode choice (0) bus and (1) car. Equations 5 and 6 represent the utility functions of the non-spatial model.
The variables "income" and "distance" were included in both utility functions V 0 and V 1 , although they did not vary between alternatives (only between individuals). Therefore, to guarantee the assumption of distinction between alternatives, it was necessary to have specific parameters β 0_ . β 0_ and β 1_ . β 1_ . for functions V 0 and V 1 , respectively. The parameters β and β are generic since the assumption of distinction is guaranteed.

Binomial Logit Calibration (spatial model)
If the ESDA shows that there is a spatial dependence on the concerned variable (travel mode choice), it is expected that the inclusion of spatial information increases the potential for accuracy and quality of the parametric modeling. Then, the calculation of the probabilities of belonging to the categories of the travel mode is applied based on the neighborhood values.
The establishment of probabilities, by category, in each of the neighborhoods provides the informational addition of a regionalized character that was sought to increase the parametric analyses of the discrete choice. Previously, four travel mode attributes were taken into account (bus travel time, car travel time, bus fare, and car travel costs), an individual attribute (income), and a travel attribute (trip distance). Now, in addition to these, there are eight new attributes of individuals: PI (0); PI (1); PII (0); PII (1); PIII (0); PIII (1); PIV (0); PIV (1), which are, respectively, probability P of an individual choosing to travel by bus (0) from the values of the elements in neighborhood I, probability P of an individual choosing to travel by car (1) from the values of the elements in neighborhood I, and so on, alternating probabilities P of the mode choice until neighborhood IV.
It should be noted that using all neighborhoods is an ideal scenario where they were all identified as statistically significant by the global indicators. The existence of covariates is, therefore, conditioned to the result obtained in the exploratory analysis. Only the spatial covariates related to the regions where at least one indicator pointed to the significance (p-value ≤ 0.05) should be included in the parametric equation. Otherwise, the region (neighborhood) must be dismissed. Locational informational addition now defines new utility functions, represented by Equations 7 and 8.

Comparison of the two models
For the comparison of the Binomial Logit models (spatial and non-spatial), three metrics will be used: The adjusted rho-square values; the Akaike criterion value; the hit rate by cross-validation and likelihood values for validation sample. Such metrics are well known, however, the reader who seeks more details can consult Hosmer and Lemeshow (2000) and Assirati (2018). Equation 9 defines the adjusted rho-square metric. If the models had the same number of parameters, the rho-square metric would suffice. However, we intended to compare models with "original" variables versus models that add spatial covariates to "original" variables. The models present a different number of parameters. They, therefore, require evaluation by the adjusted rho-square metric as this allows us to compare models estimated from the same sample of observations, but with a different number of parameters.
Where L 0 is the likelihood value obtained by assuming all parameters beta of the model as zero and L* is the maximum likelihood value obtained when the parameters β correspond to the estimated values. K is the number of estimated parameters.
The Akaike criterion is defined by Equation 10. The formulation causes the criterion to penalize overfitting (the act of adding too many variables to the equations to obtain better adjustments).
= 2 − 2 ln * (10) Lastly, cross-validation consists of segregating a portion of the sample to be used in the parameter estimation process and another part used in the validation process. Then, the modeling is applied along with the calibrated parameters coming from the calibration group in the elements of the validation group. Thus, obtaining estimated values for that group. As the values of the validation group are known, the hit rate can be measured when comparing the real values with the estimated values. Accurate models tend to have high hit rates, and the quality of this value can be measured by calculating the likelihood (Equation 11).
For the total number n of elements considered in the test, there is a quantity y of correctly evaluated elements. Thus, the ratio p = ny between the number of elements is successfully assessed, and the total of elements considered is defined. When the value of p tends to unity (100% accuracy), the value of L also tends to unity, and consequently, log L tends to the null value.

Step 1: Synthetic spatially correlated data
After applying the bilinear interpolation technique, we obtained new locations from values for the seven variables involved. This technique allowed the change from 60 starting points to 580 new points, and now the x-axis, set at the original range of 0 to 10, has a step of 0.25 and the y-axis, set at the original limits of 0 to 5, now has a step of 0.05. Figure 6 illustrates the geographical distribution of the binary dependent variable. Table 2 presents the descriptive measures of the simulated data sample variables. The following results were obtained (Table 3) from non-parametric statistical tests. These results corroborate the efficiency of the simulation, attesting the similarity of population distribution between the two samples (initial and synthetic spatially correlated data) for all seven variables under analysis.   Finally, the modeling of experimental variograms, a fundamental step for investigating the spatial structure of the variables before and after the simulation of the data, is illustrated in Figure 7. The figure shows examples of adjusted semivariograms for two concerned variables. The parameters of the theoretical variograms (for the seven variables) are shown in Table 4. Similar parameters can be appreciated regarding the theoretical variograms of the initial and simulated data, confirming the similarity of the spatial structure between the two samples.  Figure 7: Variograms for the initial and simulated data.

Modal choice
Bus travel time Initial data Simulated data

Step 2: Exploratory Spatial Data Analysis (ESDA)
Initially, an Exploratory Spatial Data Analysis was performed, using the synthetic spatially correlated data, based on the calculation of the Moran and G-SIVAR indicators, shown in Table 5. It is worth noting here that the neighborhoods chosen were those where the indicators were considered statistically significant by hypothesis tests, associated with at least one of the indicators. The null hypothesis, for the hypothesis tests of both indicators, is spatial randomness. In this case, the neighborhoods of interest are those where the null hypothesis was rejected in at least one of the indicators (Neighborhoods I, II, and III). It can be seen, from Table 5, that the indicators are almost complementary quantities. The proximity of value 1, for the Moran index, indicates high spatial autocorrelation, while value 1 indicates very low spatial autocorrelation for the G-SIVAR indicator. Note that the values for the statistically significant coefficients are not high (for Moran) or low (for G-SIVAR). This indicates that, even if small, there is some spatial autocorrelation in these locations and it is statistically significant. Thus, the inclusion of covariates extracted from these neighborhoods will provide informational addition to the Binomial Logit model.

Step 3: Confirmatory Spatial Data Analysis (CSDA)
Subsequently, the probabilities of neighbors who belong to neighborhoods I, II, and III to choose the bus (category 0) or car (category 1) were calculated. It is reinforced that these spatial covariates are individual. Each individual has six probability values from its neighbors: P (1) I; P (1) II; P (1) III; P (0) I; P (0) II; P (0) III. The next step, related to the calibration of the utility functions of the original variables, was based on Equations 5 and 6. When performing the calibration using the maximum likelihood method, the estimated beta parameters were obtained. However, not all were statistically significant. Therefore, those variables associated with non-significant parameters ( in both Equations 5 and 6) were removed from the original utility functions. Consequently, it is necessary to recalibrate the model, dismissing the variables that are not relevant to the study. The results are shown in Table 6a. When segregating the sample in a 7/3 proportion and promoting cross-validation tests with the calibrated coefficients, 74% of hit rates was obtained for the validation sample, with a likelihood value L= 2.26 × 10−44 and log (L) = −100. 49. Then, the calibration of the spatial model is performed by incorporating the spatial covariates related to neighborhoods I, II and III, per individual. As the hypothesis tests did not show the significance for region IV, the variables related to that region and their respective parameters were taken from original equations. Once more, as in the first modeling, after calibration, there was a non-significant coefficient (coefficient relative to the third neighborhood for V 1 ) demanding the elimination of its variable from the equation and a new calibration. Therefore, a new calibration is carried out, ignoring the non-significant coefficients and their associated variables. The results are shown in Table 6b. Once more, by segregating the sample in a 7/3 ratio for cross-validation tests with the calibrated coefficients, an 89% hit rate was obtained for the second modeling, with likelihood value L = 1.82 × 10−39 and log (L) = −79. 98.
When comparing the first model, which considered the original significant variables of the database (Table  6a) with the second model that considered the original significant variables and the spatial covariates (Table 6b), one can corroborate the hypothesis that if there is spatial association of variables through ESDA, the spatial aspect can be included in the model as it will bring greater precision to the results.
The informational increase is positive, providing, for this case study, an increase of 15% in the cross-validation hit rates. The values that measure the quality of the model were also positive: adjusted rho-square increased, approaching the unit considerably. The Akaike criterion decreased from 210.455 to 77.323. The importance of such a reduction is emphasized since the Akaike criterion penalizes overfitting, and it is for this reason that lower values are sought for this criterion. The second model is shown to be compatible because, despite having five coefficients more than the first, it presents the Akaike criterion approximately 2.72 times smaller.

CONCLUSIONS AND RESEARCH CONTRIBUTIONS
This paper aimed to model the spatial effect on travel mode choice using a synthetic spatially correlated data set. This research presented three major academic contributions mentioned previously. The first one is related to the methodological procedure to model the spatial effect on travel mode choice. The authors presented a sequential spatial method for forecasting mode choice. The procedure is based on an Exploratory Spatial Data Analysis, followed by a Confirmatory Spatial Data Analysis, regarding discrete variables. This procedure could be easily replicable in different study fields regarding correlated spatial data. Additionally, the methodological steps for G-SIVAR calculation are available in an R -code for any interested user (https://github.com/pedreirajr/g-sivar).
The proposal of spatial covariates, based on geostatistical assumptions, is an essential contribution considering the importance of representing spatial effects through an independent variable. In this article, the spatial covariate represents the probability of choosing some alternative based on the choice of different neighbors. The neighborhoods are delimited based on lag distance concepts. Additionally, it is essential to mention that the spatial covariate proposal is dependent on the ESDA stage. By observing the hypothesis tests, associated with the indicators, the neighborhoods to be added to the analysis were determined.
It is also worth mentioning the role of geostatistics in the methodological procedure. The geostatistics tool was fundamental for the composition of the G-SIVAR, to determine the neighborhoods (based on the values of the lags/distances between observations of the variograms), and, indirectly, on the establishment of spatial covariates (in determining the angle and distances from neighbors -based on variograms). Additionally, the adjusted of experimental variograms was applied to attest similarities of spatial structure considering original and simulated samples.
The suggestion of a simple procedure to propose a simulation of a spatially correlated database is the last research contribution, taking into account that, despite the observed advances in population synthesizers for travel data microsimulation, none of the approaches recognizes the spatial correlation of data as an extraordinary input to reproduce travel behavior. Although there is a recent application of Sequential Gaussian Simulation to simulate travel demand data (Lindner and Pitombo, 2019), the procedure proposed in this article requires less conceptual effort.
Finally, the main contribution of this paper is related to a methodological demonstration on how to take into account the spatial characteristics of the data, along with the usual variables. Assirati (2018) applicated this methodology considering a real data and a trip-chaining study case regarding panel data collected by smartphones. This application corroborates that the methodological procedure is feasible to real data, with different urban configuration.