PROPOSAL OF A GEOSTATISTICAL PROCEDURE FOR TRANSPORTATION PLANNING FIELD

The main objective of this study is to estimate variables related to transportation planning, in particular transit trip production, by proposing a geostatistical procedure. The procedure combines the semivariogram deconvolution and Kriging with External Drift (KED). The method consists of initially assuming a disaggregated systematic sample from aggregate data. Subsequently, KED was applied to estimate the primary variable, considering the population as a secondary input. This research assesses two types of information related to the city of Salvador (Bahia, Brazil): an origin-destination dataset based on a home-interview survey carried out in 1995 and the 2010 census data. Besides standing out for the application of Geostatistics in the field of transportation planning, this paper introduces the concepts of semivariogram deconvolution applied to aggregated travel data. Thus far these aspects have not been explored in the research area. In this way, this paper mainly presents three contributions: 1) estimating urban travel data in unsampled spatial locations; 2) obtaining the values of the variable of interest deriving out of other variables; and 3) introducing a simple semivariogram deconvolution procedure, considering that disaggregated data are not available to maintain the confidentiality of individual data.


Introduction and brief literature review
The transportation planning aims at adjusting the supply with transportation demand, in order to minimize the problems caused by the travel needs.In the field of transportation planning, a few concepts are widely used and some of them are object of study of this paper.According to diverse authors, a trip may be defined as the movement of a single direction leaving from an origin point and heading to a certain destination; trip production (focus of this research) is the amount of trips originated in a specific Traffic Analysis Zone (TAZ) or region; trip attraction is the amount of trips attracted to a particular TAZ or region; trip generation is the sum of trip production and trip attraction per TAZ or region (Papacostas and Prevedouros, 1993;Ortúzar and Willumsen , 2011).
Trip generation model is the first step of the classic four step transport model: trip generation, distribution, travel mode choice and trip assignment.Trip generation models aim to estimate the production and attraction of trips in each TAZ or trips generated by household.This first step is essential to transportation planning, since it is analyzed the main factors that determine the trip generation.The most common models for representing trip generation are Multiple Linear Regression and Cross Classification (Mahmoud and Abdel, 2004;Al-Taei and Taher, 2006;Ortúzar and Willumsen, 2011).
However, these traditional models of trip generation forecasting assume that the observed data have no association with the spatial location (Lopes et al., 2014).It is understood that urban travel issues are associated with individual and household features, as well as the spatial location of each household, destination and the activities distribution in the urban environment (Páez et al., 2013).Therefore, the incorporation of spatially correlated variables and the spatial position into studies of travel demand become important to enhance the quality of the estimations (Bhat and Sener, 2009;Páez and Scott, 2005;Ben-Akiva et al., 2004;Pitombo et al., 2015;Lindner et al., 2016;Gomes et al., 2016;Rocha et al., 2016).
The influence of spatially correlated data and the increasing availability of georreferenced data enabled the incorporation of spatial data into studies on travel modeling.Such types of researches have been identified as emerging lines of study (Páez et al., 2013).Bhat and Zhao (2002) identified spatial factors that must be incorporated in travel demand models.The authors formulated a multilevel mixed logit to estimate trip generation and activities in the Boston Metropolitan Area.Adjemian et al. (2010) demonstrated that the travel mode choice is associated and depends on the spatial location.Páez et al. (2013) introduced a spatial indicator that was incorporated to discrete choice models for household-based travel estimation.Bhat and Sener (2009) introduced a multivariate logistic distribution copula-based approach to address spatial dependency and endogeneity issues in binary discrete choice models.
Among the spatial statistical techniques, geostatistical methods stand out as they enable the study of features in which the variables are spatially correlated.This makes possible to estimate values of a specific variable when considering a coordinate where there is no prior knowledge of the values.Kriging has been an accepted tool in spatial statistics since it moved from geology and geochemistry into other applications in the 1990s.Although in the 1990s there were some studies in the literature on travel demand forecasting with applications of estimation techniques related to Kriging (Ickstadt, K. et al., 1998), its application is still recent and has not yet been fairly explored in the line of research of Transportation Planning (Yoon et al., 2014;Chen et al., 2015).However, it is possible to find diverse studies in "Transportation Engineering" field (Ciuffo et al., 2011;Mazzella et al., 2011;Zhang and Wang, 2013) and also Emission of vehicular gases (Pearcea et al., 2009;Kassteele and Velders, 2006;Kassteele and Stein, 2006).
Different levels of data aggregation, known in the field of spatial analysis as the change of support, influence the results of geostatistical analysis.The support defines the method, spacing and/or volume of data acquisition (Journel and Huijbregts, 1978).The change of support is related to the adopted scale and the existence of the Modifiable Areal Unit Problem (MAUP) and ecological fallacy (when occurring the misinterpretation of results).
In the field of Transportation Planning, disaggregated data (e.g.household's or individual data) is the best geometric support for forecasting urban travel.That is, this paper has two important steps towards the travel data adequacy to geostatistical modeling: proposing a simple method of disaggregated systematic sampling from two types of aggregated datasets (the 1995 origindestination dataset and the 2010 census data -IBGE 2010)the semivariogram deconvolution; and estimating and mapping the transit trip production by using Kriging with External Drift (KED).
Hence, this paper has three main objectives: 1) to use geostatistics to estimate urban travel data in locations where the values are unknown or unobserved; 2) to obtain the values of the variable of interest deriving out of a secondary variable (more easily available) by using Kriging with External Drift and; 3) proposing a disaggregated systematic sampling that uses aggregate data (a simple procedure of semivariogram deconvolution).This article is organized into four sections besides this Introduction.Section 2 presents the materials (techniques, study area and dataset) and the method.Section 3 presents the results, and finally, Section 4 describes the main conclusions.

Sampling in Geostatistics: Systematic Sampling
A sample is a set of values from a spatial event that must have a good representation of reality, in the sense to present certain characteristics, in order to allow the modeling of studied phenomenon.Geostatistical approaches calculate the uncertainty from estimations based on samples.Three types of samples are identified in Geostatistics: Simple Random Sampling (SRS), Stratified Sampling and Systematic Sampling (Webster and Oliver, 2007).
In the SRS, the random components are the geographic coordinates.The selection is arbitrary and all the components must have the same probability of occurrence.The Stratified Sampling works by partitioning the population into groups and randomly collecting the samples in each stratum.
When considering the Systematic Sampling, the intervals are regular and established in advance.
In the case of Geostatistics, the Systematic Sample has a particular advantage, since it covers the entire study area and avoids empty areas or grouping of samples.Hence, a Systematic Sampling may produce more accurate estimates (Wang et al., 2012).This paper introduces a Systematic Sampling aiming at enhancing the geostatistical estimations in the field of Transportation Planning.The procedure consists of considering the geographic coordinates of TAZ's centroids and building a regular grid mesh that turns the dataset regularly distributed in space.The use of the Systematic Sampling is relevant for softening the spatial discontinuity effects of travel variables in terms of the aggregated data, for instance.

Multivariate Geostatistics
The main point of using Geostatistics is to characterize the spatial (and/or spatial/temporal) dispersion of an event, assessing uncertainty parameters, concerning its spatial variability and obtaining a continuous surface estimation.Geostatistics is better defined as the following steps: 1) variographic analysis, 2) cross validation, and 3) kriging.

Variographic Analysis
The first concept regarding the use of Geostatistics is the definition of the regionalized variable theory.If a particular variable is distributed in space and suggests various trends in a set of different points, it consists of a regionalized variable.Matheron (1970) first showed that the spatial variation of a regionalized variable could be expressed by: a structural component, having a constant trend that is spatially dependent; a random, but spatially correlated component; and a spatially uncorrelated random noise.
The investigation of the spatial dispersion of each variable makes it possible to predict the structure of a random process, the main direction of spatial continuity (case of anisotropy) or if there is an omnidirectional property.This step of the proposed geostatistical approach aims at investigating if the study variables are indeed regionalized by plotting an experimental semivariogram.
An experimental semivariogram represents the spatial variation of a regionalized phenomenon in quantitative terms.The semivariogram is the mathematical description of the relationship between the variance among each pair of observation (points) and the distance between these observations (h).The experimental semivariogram function is given by Equation 1, where N(h) is the set of all pairwise, and z(xi) and z(xj) are data values at spatial locations xi and xj, respectively (Equation 2).
Following the calculation of the experimental semivariogram, a theoretical model must be set.This model is the one that better fits the calculated variance function (γ (h)) (Figure 1).Various semivariogram models are available in different computational software products that provide geostatistical tools.The most popular theoretical models among the literature are the spherical, exponential and Gaussian, as they are usually able to explain the spatial variability of most spatial events.The parameters presented in Figure 1 are described as follows: (1) Range (r): distance in which the observations are spatially correlated; (2) Sill (C0+C): the maximum variance value (γ) of the model.The range (r) represents the correspondent abscissa value; (3) Nugget effect (C0): the first point of the curve, i.e., where the curve reaches the ordinate axis (γ).The nugget effect reflects how short distances are alike or not.A high value to this parameter means that there is a high variance between close observations; (4) Partial sill (C): the difference between the sill and the nugget effect.
These graphical parameters are used together with the distance matrix to estimate the variables' value at unsampled locations by Kriging with External Drift, for instance.

Cross-Validation
The cross-validation, also known as fictitious test point, is the technique that predicts values in the sampled locations using the corresponding values of the neighboring points and the theoretical semivariogram model.Thus, for each point there will be an original value (sampled) and the estimated value (Dubrule, 1983).In this way, statistical measures such as errors can be calculated.In this study, the validation step was carried out by splitting the sample into two sets.
A set consisted of 60% of the sampled data and aimed at calculating a theoretical semivariogram model; 40% of the data was separated to validate the geostatistical estimations.

Kriging with External Drift (KED)
Many studies have a multivariate nature and the interdependency between the attributes must be incorporated into the analysis.In various situations, a variable may be assessed by means of an easier or an inexpensive way when compared to other means.That is, this interdependency between the attributes may be used as tools to predict variable's values from another variable.
KED enables the estimation of a primary variable given a secondary variable (Matheron, 1982).
In this case, in addition to the observed points of the primary variable, the values of the secondary variable are previous known for the entire surface to be estimated (Olea, 1999).
Considering Z(x) the primary variable and Y(x) the secondary one, both represent the linear dependency (Equation 3) according to Wackernagel (2010).Equation 4provides the kriging estimator with external drift.
Where a0 and b1 are the coefficients to be implicitly estimated together with the variable Z(x), and Y(x) is the value of the secondary variable.
KED is adequate when a main variable (primary variable) presents a relationship of dependence with the auxiliary variable (secondary variable).This study presented the estimation of transit trip production obtained from information of the population in the Salvador Metropolitan Area (Brazil).Therefore, this paper considers the parameters given by the theoretical semivariogram models of both the primary and the secondary variable.

Semivariogram Deconvolution
The method of obtaining a point-semivariogram (ℎ) using a regularized model   (ℎ) (area or blocks) is known as the deregularization or deconvolution of a model   (ℎ) (Journel and Huijbregts, 1978).
After choosing an initial point of a support model  0 (ℎ), the deconvolution method is iterative and seeks to calculate the support model that minimizes the difference between the theoretical model of a regularized semivariogram and the point-support semivariogram model to be estimated (Goovaerts, 2008).The optimization criterion is the relative difference () between both semivariogram models.The  1 (ℎ) model is chosen, if  1 <  0 .This procedure is repeated until the lowest value of D is achieved (Goovaerts, 2006).
Techniques as the semivariogram deconvolution may be of interest to the application of geostatistics in the transportation planning field, considering the availability of aggregated data.This study presents a simplified method of semivariogram deconvolution applied to the estimation of aggregated travel data in Traffic Analysis Zones (TAZs).

Study area and dataset
The Salvador Metropolitan Area (SMA) has an extension of 4,375 km², 13 municipalities and a population of over 3.6 million inhabitants; however, approximately 2.7 million live in the city of Salvador (IBGE, 2010).That is, Salvador represents the city in the SMA where most of the trips are concentrated.
The first dataset consisted of the origin-destination dataset based on a home-interview survey carried out in 1995.This survey indicated 3,691,889 daily trips in the study area, with 22% concentrated in the morning peak (between 6 and 8 am).Approximately 55% of the trips referred to public transportation.In this first dataset, the variable of interest was the transit trip production and the explanatory variables referred to the socioeconomic characteristics of each unit area.
The 1995 origin-destination survey used the TAZs as unit areas.These TAZs were classified into four main regions: Consolidated Urban Area, Central Area, Seafront and Suburb.The Central Area and the Suburb are the most populated areas in Salvador (Figure 2), corresponding to 46% of the total population of the city.Standardized values of the regarded variables were considered, as presented in the next section.The second dataset consisted of the 2010 census from the Instituto Brasileiro de Geografia e Estatística (IBGE, 2010).The variable of interest was also the transit trip production.However, variables of travel demand are more particular and are not usually available in traditional census surveys.In order to forecast the transit trip production in 2010, the authors collected information on the socioeconomic features and used a calibrated model obtained from the first dataset (1995).

Method
The first step of method follows the idea of calibrating a linear model that predicts transit trip production by using an origin-destination survey from 1995.The outcome is an equation that uses as independent variables the most significant socioeconomic attributes.This calibrated equation is applied in the second step aiming at predicting the variable transit trip production using only socioeconomic data from 2010. Figure 3 demonstrates the proposed method, with five main steps identified.

Figure 3: Proposed method
Considering that the dataset consists of different types of variables and different scale of values, the third step suggests that the data should be rescaled.The procedure first considered standardizing the variables as shown in Equation 5. Secondly, as the values could be negative, a normalization was conducted by using the smallest observed value and converting it to zero.Hence, the smallest value was added to all the values and caused the absence of negative values.
Where z is the standardized value from x, μ is the mean and σ is the standard deviation.
In the fourth step, the systematic disaggregated sampling is proposed (Semivariogram deconvolution -Simple procedure).This method consists of: (1) defining a grid mesh with uniform spacing of 2,000 meters; (2) defining the centroid of each cell; (3) obtaining, for each centroid, the standardized values for the transit trip production and the variable with the greatest association with the former.
Considering that the input data consist of information related to TAZs, it is necessary to measure the occupation of each TAZ in each cell, in order to weigh the value of the transit trip production and the secondary variable in each grid: • From the values of transit trip production in 2010 per TAZ, previously calculated in the former steps, a weight was assigned for each TAZ considering its area into each cell.
• Afterwards, a multiplication of each TAZ's weight by the transit trip production value of each respective TAZ and summing all the multiplications belonging to a particular cell provide the transit trip production value for this considered cell.
• This final value represented the cell's centroid.
The systematic sampling is an attempt to adequate the original data of the origin-destination survey to the application of geostatistics, given that one of the assumptions of such technique is the spatial continuity of the studied event.Aggregated data per unit area are associated to their centroids and, hence, are spatially discrete.Moreover, the geostatistic technique has been developed for regular geographic units.However, in general, travel data are associated to area with different shapes and sizes, that in most cases it is associated to census tracts in order to maintain the confidentiality of individuals.
The fifth step is the multivariate geostatistical procedure.It consisted of using KED as means of mapping the transit trip production also based on a secondary variable.The secondary variable is defined in the second step as the variable with the greatest association with the study variable.
It is important to highlight that the procedure intends to predict the transit trip production by only aggregated data from 2010 and a calibrated model of 1995.The use of available information, such as census data, makes the proposed methodology interesting in terms of Transportation Planning, especially for developing countries.
The software used in this research were the IBM -Statistical Package for the Social Sciences (SPSS) version 24 to model a linear equation and GeoMS 1.0 for the geostatistical processes of the semivariograms and kriging.The software ArcGIS 9.3 was used to handle with the systematic sampling as well as obtaining graphical representations of the results.

Calibrating a linear model
The model was initially calibrated with the 1995 origin-destination data and its socioeconomic variables.The Linear Regression Model pointed out two significant variables to predict transit trip production: the population and the average income of the head of the family.The calibrated equation corresponds to Equation 6.
The Tables 1 and 2 show the main results of the linear model calculated from the 1995 data.The estimated parameters make sense when explaining the transit trip production.The population was the most significant variable to estimate the dependent variable.

Estimating the transit trip production in 2010
In order to use the calibrated equation (Equation 5) to predict the transit trip production in 2010, an adaption regarding the aggregation level of the second dataset was necessary, since the information was at a census tract level.Initially, the census districts from 2010 were made compatible with the TAZs from the origin-destination dataset from 1995.
According to Ortúzar and Willumsen (2011), the traffic zoning must be compatible with other administrative divisions.The areas must be as homogeneous as possible in relation to the land use and the population characteristics.
The compatibility of the aggregation level was conducted in the ArcGIS 9.3 software.This Geographic Information System made it possible to overlap the census districts of 2010 and the TAZs from the origin-destination survey (1995).Afterwards, the census areas from each TAZ were selected and the population and average income of the head of the family were outlined.
The values of the average income of the head of the family were adapted with regard to the updated minimum wage value, in which a 170% increase had occurred from 1995 to 2010.
Having defined the values of the population and the average income of the head of the family for the 2010 dataset, Equation 5provided the estimated values of transit trip production.The next step addressed the standardization and normalization of the transit trip production and its most correlated variable (population).

Creating a mesh grid for a disaggregated systematic sample (Semivariogram deconvolution -A simple procedure)
In order to disaggregate the dataset, a systematic sampling was adopted.Figure 4 presents the cells and their respective centroids.

Kriging with External Drift
In this step of the research, it is known that the population is the most correlated variable with the transit trip production.Thus, the population was interpolated for 2010 and used as auxiliary variable in the kriging procedure.
Aiming at verifying the spatial structure of the transit trip production, Figure 6 presents the standardized values and the spatial distribution.It can be noticed that the transit trip production tends to increase according to the proximity with the Suburb.This area has the largest population and lowest per capita income.Some parameters were established to calculate the experimental semivariograms of the transit trip production and the population variables.Afterwards, the model fitting provided the nugget effect ( 0 ) and the range () parameters.When fitting the theoretical curve, no sill was considered, as the semivariogram presented no stationary through the cut distance.That is, the spatial variability keeps increasing with greater distances between the pair of observations and, thus, they have higher variance values.
Table 3 presents a summary of the modeling parameters of each semivariogram and Figure 7 shows the experimental and theoretical models.The omnidirectional semivariogram was chosen and the spherical theoretical model was the one that best fitted the data.Finally, the travel demand data were estimated by kriging in locations where there was no prior knowledge of the values as well as in sampled locations.The interpolated map of the secondary variable (population) was an auxiliary to interpolate the primary variable.Figure 8 presents both maps and the similarity between both is clearly noticed.This is due to the strong association of the population and transit trip production features.Another point to be considered is that there is a great concentration of trip production in the Suburb Area.This tendency decreases insofar as the Consolidated Urban Area is considered.
Figure 9 presents the spatial pattern distribution of observed values of transit trip production by areal support through choropleth maps.The representation can lead to an important limitation of choropleth maps, which concerns the biased visual interpretation.For instance, when it is understood that larger areas have more importance than smaller areas, it occurs an ecological fallacy.Such problems can be solved by creating continuous maps of the studied attribute (Goovaerts, 2008).From the maps presented on Figure 9, a similar tendency of trips values between observed data by Traffic Analysis Zones and the estimated surfaces is observed, for example, in the Suburb Area.Ten classes of equal intervals were used in the legend of the maps presented.Despite the mentioned problems regarding the usage of area data and interpretation of choropleth maps, the visual comparison through maps (Figure 8 and Figure 9) is another form of validating, methodologically, the results of geostatistics approach.

Conclusions
Methods that consider the spatial correlation of the sampled values have been often used in the area of transportation planning.However, some specific data characteristics, related to this study area, need special attention.Different levels of data aggregation, as well as the presence of irregular areas, are some of the problems identified in the line of health research and transportation planning, for example.
Changes in the level of data aggregation, i.e. change of support, influence the results and cause the Modifiable Areal Unit Problem and ecological fallacy.These two phenomena can occur in spatial analysis due to the modification of the scale or the level of zone aggregation.Moreover, the usual unit areas in travel demand analysis, such as TAZs and census tracts, have irregular shapes.This is an obstacle to the application of Geostatistics in transportation studies, since this technique was developed to be applied in regular supports.Considering this, the present paper proposed to combine a disaggregated systematic sampling method (a simple procedure for semivariogram deconvolution) and a multivariate geostatistical technique (Kriging with External Drift) to estimate transit trip production.
Despite the methodological challenges, the results yielded in this research demonstrate a method that has the advantage of estimating values in sampled and non-sampled positions using KED, with the transit trip production as primary variable and the population as secondary variable.
Additionally, the proposed systematic sampling enabled each cell of the created mesh grid (with a size of 2,000 x 2,000 meters each) to be associated with transit trip production and population values that were originally at an aggregate level (TAZs and census tract level).This advantage is a significant contribution in the field of Transportation Planning as the data collection may be costly and time-consuming activity.
The kriging procedure presented good results considering the error measures and the correlation coefficient between the estimated and observed values.Besides, the positive results are in line with the study area context: 1) a higher number of transit trip production is seen from the Central Area to the Suburb; and 2) locations with lower income and larger have more tendency to produce transit trips.
This paper is a first step towards the data disaggregation when considering travel demand and spatial association.The proposed systematic sampling is a basic procedure that still needs to be refined and improved in order to be effectively implemented.The authors suggest, for future studies, the use of simulation techniques.However, the results indicated that the kriging technique is adequate for the intended purposes in this research, especially when considering variables that have similar spatial distribution, such as the case of the population and the transit trip production.

able 1 :
Main statistical measures for the calibrated model

Figure 4 :
Figure 4: Mesh grid derived from the systematic disaggregated sample Source: Rocha et al. (2016)Figure5shows the scatter plot of the estimated variables in each cell's centroid (Transit trip production and Population).It is possible to note that there is a strong correlation between both variables.

Figure 5 :
Figure 5: Scatterplot of the estimated values (Transit trip production and population)

Figure 6 :
Figure 6: Distribution of standardized values at the Salvador Metropolitan Area: Transit trip production in 2010 Source: Rocha et al., 2016

Figure 7 :
Figure 7: Theoretical semivariogram for the population and the transit trip production in 2010 -omnidirectional case

Figure 8 :
Figure 8: Kriging map for the population and the transit trip production in 2010

Table 3 :
Graphical parameters from the variographic analysis

Table 4 .
It is noted that the model resulted in low values of errors.Besides, the observed and estimated values are highly correlated, considering the Pearson correlation.

Table 4 :
Results for the cross-validation