A COMPUTATIONAL TOOL TO EVALUATE THE SAMPLE SIZE IN MAP POSITIONAL ACCURACY

In many countries, positional accuracy control by points in cartography or spatial data consists of comparing the coordinates of a set of well-defined points with the coordinates of the same points taken from a more accurate source. Usually, each country determines a maximum number of points that may present error values above a pre-established threshold. In many cases, the standards define the sample size as 20 points, with no further consideration, and fix this threshold at 10% of the sample. However, the sampling dimension (n), considering the statistical risk, especially when the percentage of outliers is around 10%, can lead to a producer risk (rejecting a good map) and a user risk (accepting a bad map). This article analyzes this issue and allows the sampling dimension to be defined considering the risk of the producer and of the user. As a tool, a program developed by us allows the sample size to be defined according to the risk that the producer or user can or wants to assume. This analysis uses 600 control points, each with a known error. We performed the simulations with a sample size (n) of 20 points and calculated the associated risk. Then we changed the value of n, using smaller and larger sizes, and calculated for each situation the associated risk for both the user and the producer. The program draws the operational curves, or risk curves, which consider three parameters: the number of control points; the number of iterations to create the curves; and the percentage of control points above the threshold, which can follow the Brazilian standard or parameters from other countries. Several graphs and tables created with different parameters are presented, supporting a better decision for both the user and the producer and opening possibilities for other simulations and research in the future.
Nero, M., et al. Bull. Geod. Sci, Articles section, Curitiba, v. 23, n° 3, p. 445-460, Jul-Sept, 2017.

The major contribution of this paper, not present in previous research, is a program that simulates and analyzes maps with different positional accuracy, allowing us to define the sample size in the process and to calculate the user's and the producer's risk.
The quality control process is well known and has been applied to the positional accuracy of spatial data, as in Ariza-López et al. (2001), Ariza-López (2002a), Atkinson-Gordo (2005), Nero (2005), Atkinson-Gordo, Ariza-López and García-Balboa (2007) and Cintra and Nero (2015), generating mappings with different percentages of points whose errors in a given coordinate lie above (or below) a limit established by a specific set of standards (CAS, Circular Accuracy Standard; LMAS, Linear Accuracy Standard; among others). Quality control guidelines for cartography are employed by several countries. The standards set by CONCAR (Brazilian Commission of Cartography, 2011, 2016), for example, allow up to 10% of the points to present an error greater than 0.5 mm at the map scale for class B cartographic documents.
In summary, many countries use 20 points as the sample without further consideration. We agree that this number of control points represents a minimum for statistical purposes and that it reduces the cost of field work. But there is an error associated with this number: a risk for the producer and another for the user. Therefore, studying how to reduce those risks could be useful. In the present research, the sampling dimension (n) is studied in association with these risks.
The proposal is to analyze the point-based positional accuracy (positional quality control) of maps by means of computational simulation using different numbers (n) of control points, and to calculate the user's risk (accepting a map with low quality) and the producer's risk (having a map with good quality rejected).
The program, based on an actual 600-point case, controls and varies the number of points with errors slightly greater or smaller than allowed. At the same time, it analyzes the effect of the sample size (n) on the producer's and the user's risk.
The general aim of this paper is to present a computer program developed for the quality control analysis of spatial data (maps and others) and for the construction of operational curves that help users define the sample size according to the risk a user or a producer is willing to take. This risk is associated with the standard of each country; the program also allows the appropriate parameters to be entered for the calculations.

Available resources
To help readers evaluate the performance of the program developed, or develop a similar one, we list the main features of the equipment used. The following resources were used:
- Computers and peripherals: Pentium IV PC, 4.2 GHz, 60 GB HD, 1 GB RAM; Pentium III notebook, 30 GB HD, 512 MB RAM; external HD unit with a fast 80 GB disk.
- Programs: Delphi 7.0®, for advanced programming; PostgreSQL 8.1®, for database manipulation; Microsoft Excel XP®, for generating the final simulation program with tables and graphs.

Specifications of the program developed
The program input parameters are shown in Figure 1 and described below.
- "Type" box: the user chooses between "Fictitious Data" and "Real Data". If "Fictitious Data" is chosen (for studies and research), the output graphs contain operational curves generated by a process that simulates the differences between homologous points in the cartographic base and in a more accurate source. These differences follow a normal distribution (further explanation is presented later). If "Real Data" is chosen, the operational curves are built from the real differences between homologous control points in the cartographic base and in the more accurate source. This requires a large sample and knowledge of the error at each point.
- "Parameters" box: the following inputs are provided:
- "Number of control points": the number of data points for which the errors are known. In the example of Figure 1, it corresponds to 600 control points.
- "Admissible Error" (CAS): the maximum error (in map units) that can be accepted, according to the specific legislation adopted in each country and for one specific map class. In the example, the value of 5 m is considered.
- "Percentage of CPs above the CAS": percentage of control points allowed to have errors above the acceptable value (5 m), according to the CAS. In the example, it corresponds to 10%, as in the standards of most countries.
- "Above interval percentage": the step used to generate populations whose percentage of errors above the admissible error is larger than the value given in the "Percentage of CPs above the CAS" parameter. This allows the operational curves to be built with percentages changing by constant increments (2% by 2%, for example) above the acceptable value (10%), that is, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, etc., up to an upper limit of 40%. This limit was chosen based on practice: mappings usually do not present such poor quality, and the results do not change much beyond it.
- "Below interval percentage": analogous to the previous parameter, it allows the operational curves to be built below the 10% value, that is, for 8, 6 and 4%, a lower limit also defined by practice.
- "Number of iterations": the number of times the program runs the precision test and the point percentage test.
Thus, for each problem, we simulate cartographic documents (maps or spatial data) of different qualities, from 4% to 40% of points with errors above the limit. To build the curves, we incremented the sample size in steps of 5 points (5, 10, 15 points, and so on) up to 60% of the points. At least 1,000 iterations are usually suggested (Hardin, 1997).
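The grid of simulated situations described above can be sketched in a few lines of Python. This is only an illustration of the parameter ranges given in the text (qualities from 4% to 40% in steps of 2%, sample sizes in steps of 5 up to 60% of 600 points); the function name `simulation_grid` and its argument names are hypothetical and do not come from the original program.

```python
def simulation_grid(n_points=600, base_pct=10, step_above=2, max_pct=40,
                    step_below=2, min_pct=4, size_step=5, max_size_fraction=0.60):
    """Return (quality percentages, sample sizes) covered by the simulation."""
    # Qualities below the base percentage: 8, 6, 4 (built downward).
    below = list(range(base_pct - step_below, min_pct - 1, -step_below))
    # Base percentage and the ones above it: 10, 12, ..., 40.
    above = list(range(base_pct, max_pct + 1, step_above))
    qualities = sorted(below + above)
    # Sample sizes: 5, 10, 15, ... up to 60% of the control points.
    max_size = round(n_points * max_size_fraction)
    sample_sizes = list(range(size_step, max_size + 1, size_step))
    return qualities, sample_sizes


qualities, sample_sizes = simulation_grid()
print(qualities)
print(sample_sizes[:4], "...", sample_sizes[-1])
```

With the default values, the grid contains 19 quality levels and 72 sample sizes, so each run of the program covers 19 × 72 combinations, each repeated for the chosen number of iterations.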

Operational curves, test and analyses
After the parameters have been supplied, the user can start the data processing, which follows the stages below, according to the diagram in Figure 2.

Data Creation
The processing starts with the program generating a base table of control point errors that follow a normal distribution, simulating a real situation. In the example, the table is a 600-record list ("Number of control points"). No more than 10% of the records ("CP Percentage above the CAS" = 90%) may have errors higher than the "Admissible Error" (according to the CAS provided by the user as an input parameter).
The number of 600 points was taken from a real test available in Nero (2005), in which a mapping at 1:10,000 scale was classified as class B according to the Brazilian standard. This test compared two databases: the reference one, with higher accuracy, produced at 1:2,000 scale, and the map being evaluated, at 1:10,000 scale. Both were produced by EMPLASA (São Paulo Metropolitan Region Planning Agency, 1981, 1997).
To proceed, fictitious data were created following a normal distribution. An example was generated with the errors in increasing order of absolute value (Table 1), and the histogram confirmed the normal distribution, as observed in Figure 3.
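As an illustration of this data creation step, the sketch below builds a fictitious base table of 600 normally distributed errors and rescales it so that 10% of the absolute errors exceed the 5 m admissible error. It is an assumption-laden sketch, not the original Delphi code: the names `make_base_table` and `percent_above`, the fixed seed, and the choice of placing the limit halfway between two neighbouring records are all hypothetical.

```python
import random


def make_base_table(n_points=600, admissible_error=5.0, pct_above=10.0, seed=42):
    """Fictitious errors: normal shape, pct_above % of |errors| above the limit."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    ranked = sorted((abs(e) for e in raw), reverse=True)
    n_above = round(n_points * pct_above / 100.0)
    # Assumption: put the limit halfway between the last error that should
    # exceed it and the first one that should not, to avoid borderline ties.
    pivot = 0.5 * (ranked[n_above - 1] + ranked[n_above])
    k = admissible_error / pivot
    # Multiplying every value by the same factor keeps the normal shape.
    return [e * k for e in raw]


def percent_above(errors, admissible_error):
    """Percentage of records whose absolute error exceeds the limit."""
    above = sum(1 for e in errors if abs(e) > admissible_error)
    return 100.0 * above / len(errors)


table = make_base_table()
print(len(table), percent_above(table, 5.0))
```

Sorting `table` by absolute value gives a listing comparable in layout to Table 1, although the values themselves are random and are not those of the paper.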

Tests
At this point, the user is able to extract a first random sample of 5 control points (in the example presented), which is submitted to the following tests:
a) Precision Test: calculation of the mean value, $\mu$ (equation 1), of the standard deviation, $S_x$ (equation 2), and of the sample chi-square parameter, $\chi^2_x$ (equation 3), for the one-dimensional variable.
The first two values are used in many formulas (for example, to test for the existence of tendency) and the third assesses the precision: for this, the calculated chi-square parameter is compared with a reference value from the table, as shown below. The adoption of the Mapping Linear Accuracy Standard (MLAS), which is one-dimensional, is recommended instead of the Mapping Accuracy Standard (MAS).

$$\mu = \frac{1}{n}\sum_{i=1}^{n} e_i \qquad (1)$$

Where: $\mu$ = error average in the X direction; $n$ = sample size; $e_i$ = error at control point $i$ (the difference between $x_i^{r}$ and $x_i^{c}$); $x_i^{r}$ = coordinate of the reference control point in a given direction, obtained, for example, by GPS or from a more precise document; $x_i^{c}$ = coordinate of the corresponding point in the product being evaluated (cartographic base).
In the case of fictitious data, the values of $e_i$ are provided by a list based on the random sample.

$$S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(e_i - \mu\right)^2} \qquad (2)$$

Where: $S_x$ = standard deviation in a given coordinate; $e_i$ = error at control point $i$, in this case taken from the table created; $\mu$ = error average in the X direction; $n$ = sample size.

$$\chi^2_x = (n-1)\,\frac{S_x^2}{\sigma_x^2} \qquad (3)$$

Where: $\sigma_x$ = EP (standard error) associated with the admissible error for the mapping class.
Example: consider a situation with: a) a sample of 5 points (n); b) an admissible error of 5 m (CAS); c) 10% of CPs above the CAS and a standard deviation ($S_x$) of 2.9 m. With these values in equation (3) we have $\chi^2_x = 3.7380$. The limit value from the chi-square table can also be obtained in Excel with INV.QUI((n-1),(α)), where n-1 = the number of control points minus one (degrees of freedom) and α = 1 - [Probability/100], which corresponds to the point percentage above the CAS (5 m) divided by 100. Since 3.7380 < 7.7790, the cartographic document meets the precision requirement in the X direction and the mapping is not rejected.
b) Direct Test: the sample is analyzed to check whether 10% or less of the control points present an error above the admissible error. For this, the program analyzes the error column in the table and counts the values above the specified limit.
For example, in a 5-point sample, no point may have an error above the admissible error, since a single such point would already correspond to (1/5) × 100 = 20%.
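Both tests can be sketched in Python as follows. The sketch assumes the usual form of the chi-square precision statistic, χ² = (n - 1)·S²/σ², and assumes a standard error of σ = 3.0 m, a value not stated in the example but consistent with the reported result of 3.7380. The function names are hypothetical, and `upper_tail_4dof` is a closed form valid only for 4 degrees of freedom, included as a check on the tabulated limit.

```python
import math
import statistics


def chi_square_statistic(n, s_x, sigma_x):
    """Sample chi-square parameter, assuming chi2 = (n - 1) * S^2 / sigma^2."""
    return (n - 1) * s_x ** 2 / sigma_x ** 2


def precision_test(errors, sigma_x, chi2_limit):
    """Return (chi2, passed) for a list of signed errors in one coordinate."""
    # statistics.stdev uses n - 1 in the denominator.
    chi2 = chi_square_statistic(len(errors), statistics.stdev(errors), sigma_x)
    return chi2, chi2 <= chi2_limit


def direct_test(errors, admissible_error, max_pct_above=10.0):
    """Return (percentage of points above the limit, passed)."""
    above = sum(1 for e in errors if abs(e) > admissible_error)
    pct = 100.0 * above / len(errors)
    return pct, pct <= max_pct_above


def upper_tail_4dof(x):
    """P(chi-square with 4 degrees of freedom > x); closed form for 4 dof only."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


# Worked example from the text: n = 5, S_x = 2.9 m; sigma_x = 3.0 m is assumed.
chi2 = chi_square_statistic(n=5, s_x=2.9, sigma_x=3.0)
print(round(chi2, 3), chi2 <= 7.779)
print(round(upper_tail_4dof(7.779), 3))
# Hypothetical 5-point sample with one error above 5 m.
print(direct_test([1.2, -0.8, 5.6, 2.1, -3.3], admissible_error=5.0))
```

Under these assumptions the statistic evaluates to about 3.738, below the tabulated 7.779, whose upper-tail probability for 4 degrees of freedom is about 0.10. The hypothetical sample fails the direct test because one point in five is already 20%.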

Random sampling analysis
In the following step, the program analyzes the results for different random selections of the five-point sample.
As the number of selected samples increases, the random effect in the selection of points decreases. For this reason, both the precision test and the direct test are repeated as many times as specified by the user in the "Number of iterations" input field. In the example, they were repeated 3,000 times. The program internally generates a listing similar to that presented in Table 2. After the iterations for the first sample size, the software calculates the mapping rejection percentage (PRM): the number of "NO" results divided by the total number of iterations, expressed as a percentage, as in equation (4):

$$PRM = \frac{N_{NO}}{N_{total}} \times 100 \qquad (4)$$

Using the base table population, these operations are repeated for other sample sizes, in increments of 5 up to 360 control points. A second column of the final tables is then created for the Precision Test and the Direct Test, as in Table 3.
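The repeated random sampling can be sketched as a small Monte Carlo loop. The sketch below applies only the direct test and uses a made-up population (60 records of 6.0 m and 540 records of 1.0 m, that is, 10% above a 5 m limit) as a stand-in for the base table; the function names and the seed are hypothetical.

```python
import random


def direct_accepts(sample, admissible_error=5.0, max_pct_above=10.0):
    """Direct test: accept when at most max_pct_above % of errors exceed the limit."""
    above = sum(1 for e in sample if abs(e) > admissible_error)
    return 100.0 * above / len(sample) <= max_pct_above


def rejection_percentage(population, sample_size, iterations, accepts, seed=0):
    """PRM: share of iterations (in %) in which the sampled map is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(iterations):
        sample = rng.sample(population, sample_size)  # drawn without replacement
        if not accepts(sample):
            rejected += 1
    return 100.0 * rejected / iterations


# Hypothetical stand-in for the base table: 10% of 600 errors above 5 m.
population = [6.0] * 60 + [1.0] * 540
prm = rejection_percentage(population, sample_size=5, iterations=3000,
                           accepts=direct_accepts)
print(f"PRM for n = 5: {prm:.2f}%")
```

Calling `rejection_percentage` in a loop over the sample sizes 5, 10, ..., 360 would give one column of a table laid out like Table 3. Passing a different `accepts` function would do the same for the precision test.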

Map accuracy variation
The previous steps are then repeated to generate tables similar to Table 3 while varying the mapping quality, that is, for populations with error percentages higher than 10% (12, 14, 16, …, 40%) or lower than this limit (8, 6 and 4%). For this, it is necessary to know the values of the errors, so that a predetermined number of points can be placed below or above the limit established by the standards. For this purpose, a multiplication factor k was applied to the base table (600 points), after arranging the absolute values of the error list in decreasing order (as in Table 1). Because every value is multiplied by the same factor k, the sample keeps its normal distribution. The next step is to find the position that corresponds to the desired percentage of points with error above the admissible limit. For example, for the 12% case, we need 600 × 0.12 = 72 points. The record in this position is 4.864, and k is the value that, multiplied by that number, leads to the limit value (5.0). Hence k × 4.864 = 5 and k = 1.028.
After multiplying all the values of the base table by k, the Precision and Direct tests are run with 3,000 iterations for each sample size (n) of 5, 10, 15, 20, 25, up to 360 points.
Successively, new k factors were calculated and tables with percentages above and below the base value were generated, depending on what the user defined in the input data.
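The calculation of the k factor can be sketched as below. The function follows the description in the text (sort the absolute errors in decreasing order, take the record at the position given by the target percentage, and scale it to the admissible error); its name and the toy population used to reproduce the 4.864 record are hypothetical.

```python
def scale_factor(errors, admissible_error, pct_above):
    """Factor k that brings the record at the target position to the limit."""
    ranked = sorted((abs(e) for e in errors), reverse=True)
    # Position in the decreasing list, e.g. 600 * 0.12 = 72.
    position = round(len(ranked) * pct_above / 100.0)
    pivot = ranked[position - 1]  # 1-based position -> 0-based index
    return admissible_error / pivot


# Hypothetical 600-record population whose 72nd largest absolute error is 4.864.
population = [6.0] * 71 + [4.864] + [1.0] * 528
k = scale_factor(population, admissible_error=5.0, pct_above=12.0)
scaled = [e * k for e in population]
print(round(k, 3))
```

The scaled list then plays the role of the base table for the 12% case. Repeating the call with other percentages (4, 6, 8, 14, ..., 40) gives the remaining populations.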
Finally, two summary tables were generated (Tables 4 and 5). The first one is to be used in the Precision Test and the other in the Direct Test. In both, the first column corresponds to the sample size and the others to the rejection percentages for a map with a certain positional accuracy (from 40% to 4% of points above the limit).
To understand the tables, see the line with n = 20 (sample size) in Table 4, which corresponds to the Precision Test. For a map in column 40 (40% of points with errors above the threshold), the rejection percentage is 99.93% (marked in bold). The risk that a low-quality map (low positional accuracy) could be mistaken for a good one is therefore about 0.07%. In the other example, in the same line and in the 8% column, the rejection percentage found was 2.20% (bold italic). Thus, the risk that a good map could be mistaken for a bad one is 2.2%.
In the second table (Table 5), to be used in the Direct Test as previously detailed, under the same conditions, the risk that a low-accuracy map could be mistaken for a good one is 0.47%, and the risk that a good map could be mistaken for a low-quality one is 79.70%, which is a high risk.

Creation and use of operational curves
The data in the latter tables (Tables 4 and 5) can be converted into operational curves for practical purposes. The graph incorporates the continuous variation of the sample dimension and is more practical to use. The user identifies the sample size on the x-axis and draws a vertical line that intersects the curve (corresponding to the quality of the map: its positional accuracy), then draws a horizontal line and reads the rejection percentage of the mapping on the y-axis.
Yet the best use of these curves is in the reverse direction: the user or the producer can begin by defining the risk they consider prudent to assume. They can then enter this value on the y-axis and, by finding the corresponding curve (expected quality of the map), define the sample size on the x-axis.
Finally, it is worth noting that the user can also supply real data and build a curve from them. After this process, the operational curves are always generated for the Direct or the Traditional test.
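The reverse reading of an operational curve can be sketched as a lookup. The sketch below first simulates one curve for the direct test on a made-up population with 20% of errors above 5 m, then searches for the smallest sample size whose user's risk (100 minus the rejection percentage) does not exceed a chosen value. The population, the function names, the seed and the 10% risk are hypothetical choices and are not taken from Tables 4 and 5.

```python
import random


def direct_test_accepts(sample, limit=5.0, max_pct=10.0):
    """Direct test: accept when at most max_pct % of errors exceed the limit."""
    above = sum(1 for e in sample if abs(e) > limit)
    return 100.0 * above / len(sample) <= max_pct


def operational_curve(population, sizes, iterations=1000, seed=1):
    """Map each sample size to the simulated rejection percentage."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        rejected = sum(
            not direct_test_accepts(rng.sample(population, n))
            for _ in range(iterations)
        )
        curve[n] = 100.0 * rejected / iterations
    return curve


def smallest_size_for_user_risk(curve, max_risk_pct):
    """Smallest n whose acceptance rate for this (bad) map is within the risk."""
    for n in sorted(curve):
        if 100.0 - curve[n] <= max_risk_pct:
            return n
    return None  # no simulated size meets the requested risk


# Hypothetical low-quality map: 20% of 600 errors above the 5 m limit.
population = [6.0] * 120 + [1.0] * 480
curve = operational_curve(population, range(5, 101, 5))
n_needed = smallest_size_for_user_risk(curve, max_risk_pct=10.0)
print(n_needed, curve[n_needed])
```

The same lookup applied to a good-quality curve, reading the rejection percentage itself as the risk, would give the producer's side of the decision.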

Conclusion
A great advantage of the program is its flexibility in the input parameters: with the appropriate adaptations, it can be applied in most countries to the point-based positional quality control of maps or of any other spatial dataset.
The tests showed that the sample size and the positional accuracy of the cartographic document influence the shape of the operational curves and the rejection percentages, resulting in different values for the user's and the producer's risk. As presented, the Direct Test is simpler than the Traditional Test, and some conclusions can be drawn about the user's and the producer's risk: 1) The traditional methodology favors the producer because it requires a smaller sample, while the Direct Test requires a larger one; 2) This new situation does not result in excessive costs for the producer, because technological evolution has provided more modern, accurate and economical processes; 3) The greater effort of a larger sample guarantees a more secure evaluation and a smaller risk of a bad project based on bad maps.
Where, in equation (3): $n$ = number of control points; $\sigma_x$ = EP, depending on the admissible error for the mapping (CAS: A, B or C); $S_x$ = standard deviation in the X direction.

PRM = Percentage of Rejection of the Mapping.
Converting the data of Tables 4 and 5 into operational curves for practical purposes generates Figures 4 and 5, from the Traditional Test and from the Direct Test, respectively.

Figure 4 - Example of graphics resulting from the Traditional Test.

Figure 5 - Example of graphics resulting from the Direct Test.

Table 1 - Example of base table.

Table 2 - Rejection data for the precision and the direct tests, in a set of 600 control points, according to the Brazilian standard, with only 5 control points.

Table 3 - Rejection percentage of the mapping (PRM) of a population of 600 control points for random samples from 5 to 360 control points for the Direct Test.

Table 4 - Rejection percentage of the mapping (PRM) obtained by the Traditional Test in relation to the size of the sample (n) and the quality of the mapping.

Table 5 - Rejection percentage of the mapping (PRM) obtained by the Direct Test in relation to the size of the sample (n) and the quality of the mapping.