APPLICATION OF MEDIAN-EQUATION APPROACH FOR OUTLIER DETECTION IN GEODETIC NETWORKS

In geodetic measurements, outliers may occasionally occur in data sets for various reasons. There are two main approaches to detecting outliers: tests for outliers (Baarda's and Pope's tests) and robust methods (the Danish method, the Huber method, etc.). These methods use Least Squares Estimation (LSE). Outliers affect the LSE results; in particular, LSE smears the effects of the outliers onto the good observations, and wrong results may sometimes be obtained. To avoid these effects, a method that does not use LSE should be preferred. The median is a high-breakdown-point estimator, and reliable results can be obtained if it is applied to outlier detection. In this study, a robust method is proposed that applies the median, with 3σ or 3σ_med as a threshold value, to the median residuals obtained from median equations. If the a priori variance of the observations is known, the reliability of the new approach is greater than in the case where the a priori variance is unknown.


INTRODUCTION
The least squares estimation (LSE) is very sensitive to deviations from the model assumptions (HAMPEL et al. 1986). The LSE spreads the effect of the outliers onto the residuals of the good observations, which do not contain any outlier (HEKIMOGLU et al. 2011a). There are two main reasons for wrong results of outlier detection methods: the spreading effect of the LSE and the weakness of the configuration of the given geodetic network (HEKIMOGLU et al. 2011b). It has been shown that an outlier in the observations of a geodetic network cannot be identified reliably by any method if the network configuration is weak. Owing to this configuration deficiency, the outlier badly affects the LSE residual of another observation that lies close to the bad observation.
Generally, statistical procedures for detecting outliers work well in practice only in the case of one single outlier, but can fail in the case of multiple outliers (BAARDA 1968, POPE 1976, HEKIMOGLU and KOCH 1999 and 2000, XU 2005). In addition, Baselga (2007) showed that only one outlier in a geodetic network may be detected by using the test for outliers (Pope's test) when the a priori variance of the observations is not known. In the case of more than one gross error, the test becomes inefficient. Moreover, even if the sample includes a single outlier, the test may fail when the observations are correlated. Tests for outliers are based on the assumption of a (possible) single outlier, frequently with the unjustified hope that they will also succeed if multiple outliers appear. It is impossible to detect multiple outliers without additional hypotheses. The single-outlier hypothesis is also proven to be sufficient, except in the case where the degree of freedom is one. These results of Baselga (2007 and 2011) confirm the properties mentioned above.
The median has the highest breakdown point, namely 50%. This means that in the one-dimensional case (i.e. when the population can be defined by a single random variable) the median can isolate good observations making up 51% of the sample from bad observations of any kind making up the remaining 49%. The median is used for estimating the location parameter (μ). However, the median is not as efficient as the mean, i.e. the standard deviation of the median is greater than that of the mean under the Gaussian distribution (MARONNA et al. 2006, p. 20). Youcai (1995) applied the median and the MAD (Median Absolute Deviation) to a triangulation network, identifying outliers under criteria such as |r_i| > 2σ_med or 3σ_med, where σ_med = 1.4826 MAD and r_i are the differences between the coordinates and their median. The coordinates of the new points are calculated from all possible combinations of observations (angles), and the median is then applied to these coordinate values. Duchnowski (2010) applied an R-estimator to identify unstable reference marks, where the median and the MAD play a main role. Hekimoglu et al. (2011b) proved that failures in outlier detection depend not only on the ability of the outlier detection method but also, and mostly, on the weakness of the network configuration. Median equations were used to detect the probable configuration weakness. In this paper, a new robust approach based on the median and the MAD of the observations of geodetic networks is introduced. To apply the median estimator to outlier detection, the median equations are used. The main idea is to perform outlier detection in geodetic networks based on the observations, without using LSE.
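The breakdown property described above can be illustrated with a minimal sketch. The sample values below are made up for illustration only (11 good values and 10 gross errors, i.e. slightly less than half of the sample is contaminated); they are not data from this study.

```python
from statistics import fmean, median

# Illustrative sample (assumption): 11 good values in mm and 10 gross errors.
good = [0.9, -0.5, -0.5, -1.0, 0.0, 1.4, 0.3, -0.2, 0.6, -0.8, 0.1]
bad = [1000.0] * 10
sample = good + bad

med = median(sample)   # stays inside the range of the good values
mean = fmean(sample)   # is dragged far away by the gross errors

print("median:", med)
print("mean:  ", round(mean, 1))
```

With fewer than half of the values contaminated, the median remains within the range of the good values, whereas the mean does not.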

MEDIAN EQUATION APPROACH FOR LEVELING NETWORK
Geodetic networks are established for two main purposes: estimating the coordinates of the new points optimally based on the coordinates of the datum points, and checking the reliability of the network, i.e. whether the observations include outliers or not.
The height differences are measured for the leveling network. A minimum configuration that resists one outlier is given in Fig. 1 (HEKIMOGLU et al. 2011b).
Figure 1 - A leveling network with a minimum configuration that resists one outlier.
The following median equations, Eq. (1a), can be written, where each observation must appear only once in these equations (HEKIMOGLU et al. 2011b). To make this clear, suppose an extra equation is written in which h_3 and h_5 appear a second time, so that they appear twice in the four equations. If h_3 or h_5 has an outlier, the median cannot separate it from the good observations. Therefore, the last equation cannot be considered a median equation. Similar median equations can be written for the other observations, Eq. (1b), and the median of each set of equations can then be estimated, Eq. (2). The median can separate only one outlier in these three median equations. For example, let h_1 include an outlier. Med_1 can separate it from the two other median equations. The outlier cannot affect Med_1, nor Med_2, Med_3, …, Med_6. Consequently, the median can separate only one outlier in the observations of this leveling network. For example, if h_1 and h_5 were both contaminated, the median could not separate these two outliers.
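A minimal sketch of the median of the three median equations for h_1 follows. It assumes that the equations have the form h_1 = h_1, h_1 = h_3 - h_4 and h_1 = h_2 - h_5; this form, and the line directions behind the error-free values, are inferred here from the residuals reported in the simulation section and are not taken from Fig. 1 or Eq. (1a). The random errors are the ones listed in the simulation section.

```python
from statistics import median

# Error-free height differences in mm under hypothetical line directions (assumption).
TRUE = {1: 5276.0, 2: 4388.0, 3: 3055.0, 4: -2221.0, 5: -888.0}
ERRORS = {1: 0.90, 2: -0.52, 3: -0.50, 4: -0.97, 5: 0.00}  # random errors in mm


def med_1(h):
    """Median of the three estimates of h_1 (assumed form of the median equations)."""
    return median([h[1], h[3] - h[4], h[2] - h[5]])


def observe(outliers):
    """Observations in which the random error is replaced by an outlier where requested."""
    return {i: TRUE[i] + outliers.get(i, ERRORS[i]) for i in TRUE}


one = med_1(observe({1: 5.0}))            # only h_1 is contaminated
two = med_1(observe({1: 5.0, 5: -5.0}))   # h_1 and h_5 are contaminated

print("error of Med_1, one outlier: ", round(one - TRUE[1], 2))
print("error of Med_1, two outliers:", round(two - TRUE[1], 2))
```

In this sketch the second outlier is given a sign that pushes its equation to the same side as the first one; with the opposite sign the two contaminated equations would lie on either side of the good one and the median would still return the good estimate.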
The question now arises of how this outlier can be detected when the median is used as an estimator. If the variance σ² of the population is known beforehand, the outlier may be detected by using the 3σ-rule (KUTTERER et al. 2003, LOON 2008).
Let h_2 be contaminated by an outlier Δ. The five median equations that contain h_2 are then damaged. They include both the outlier Δ and the random error ε of the second observation, except for one of them, which includes only Δ. Δ+ε > 3σ when Δ and ε have the same sign (both + or both -), or Δ+ε < 3σ when they have opposite signs (+ and -, or - and +). Therefore, more than one flagged value may exceed the threshold value of 3σ. How can we decide which one of them is the true outlier? If these five median equations are considered, it is seen that they all include the common value h_2. Hence, the true bad observation must be h_2. Thus, if the observations contain only one outlier and several outlier candidates are detected, the true outlier can still be found.
Let us look at the leveling network given in Fig. 1. Instead of the observations, we can obtain the values r_ij by using Eq. (3), where the r_ij are defined here as "median residuals". They are treated here as measurements.
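The computation of the median residuals can be sketched as follows. Two assumptions are made here and are not taken from the source: the line directions in `LINES` are hypothetical, and the residual is taken as r_ij = Med_i - (j-th estimate of h_i), a sign convention inferred from the residuals reported in the simulation section. The heights and random errors are those of the simulation section; the order of the two indirect estimates within each group may differ from the numbering used in the text.

```python
from statistics import median

# Hypothetical line directions (from_point, to_point); an assumption, not taken from Fig. 1.
LINES = {1: (1, 2), 2: (1, 3), 3: (1, 4), 4: (2, 4), 5: (2, 3), 6: (3, 4)}
# Error-free heights of the four points in mm (values from the simulation section).
HEIGHTS = {1: 100000.0, 2: 105276.0, 3: 104388.0, 4: 103055.0}


def signed_obs(obs, a, b):
    """Observed height difference from point a to point b, reversing the sign if needed."""
    for i, (p, q) in LINES.items():
        if (p, q) == (a, b):
            return obs[i]
        if (p, q) == (b, a):
            return -obs[i]
    raise KeyError((a, b))


def median_residuals(obs):
    """Return {i: [r_i1, r_i2, r_i3]} with r_ij = Med_i - (j-th estimate of h_i)."""
    residuals = {}
    for i, (a, b) in LINES.items():
        estimates = [obs[i]]  # direct observation
        for c in HEIGHTS:
            if c not in (a, b):  # indirect estimate through the third point c
                estimates.append(signed_obs(obs, a, c) + signed_obs(obs, c, b))
        med = median(estimates)
        residuals[i] = [med - x for x in estimates]
    return residuals


# Case II of the simulation: e_1 is replaced by a +5 mm outlier.
errors = {1: 5.0, 2: -0.52, 3: -0.50, 4: -0.97, 5: 0.00, 6: 1.39}
obs = {i: HEIGHTS[q] - HEIGHTS[p] + errors[i] for i, (p, q) in LINES.items()}
for i, r in median_residuals(obs).items():
    print(f"h_{i}:", [round(x, 2) for x in r])
```

Under these assumptions the computed residuals agree with the values listed for the second case of the simulation to within about 0.1 mm.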
When h_2 is contaminated by an outlier, the median equations that contain h_2 are contaminated as well. As a result, 5 of the 18 median equations are ruined by h_2. The median of the r_ij is not affected by these five contaminated values. The main question is how we can detect the outlier in h_2. According to the median equations, we can form a matrix, called here the "decision matrix", which is given in Table 1. In the decision matrix, "0" means that there is no relation between the observations, and "1" means that the marked observations form one median equation (for example, in the second line of the matrix, r_12 is formed by h_3 and h_4). When h_2 is contaminated, five median residuals (r_13, r_21, r_32, r_53 and r_63) are contaminated, too. According to the contaminated median equations, the following observations can be flagged as outliers: h_2 and h_5 for r_13; h_2 for r_21; h_2 and h_6 for r_32; h_2 and h_1 for r_53; h_2 and h_3 for r_63. "Flagged" means that the observation is set as a candidate for a contaminated observation. If the total number of flags is counted, h_2 is flagged five times and the other observations only once, so that the outlier can be detected by considering the decision matrix and the total flagged number (k). If the total flagged number of an observation is greater than one (k > 1), this observation includes an outlier.
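The flag counting on the decision matrix can be sketched as below. The rows r_12, r_13, r_21, r_32, r_53 and r_63 follow the text; the remaining rows of `DECISION` are filled in by analogy and should be read as assumptions, not as a reproduction of Table 1. The function names are illustrative.

```python
from collections import Counter

# Decision matrix as a mapping: median residual -> observations marked "1" in its row.
# Rows not mentioned in the text are assumptions made for illustration.
DECISION = {
    "r11": {1}, "r12": {3, 4}, "r13": {2, 5},
    "r21": {2}, "r22": {1, 5}, "r23": {3, 6},
    "r31": {3}, "r32": {2, 6}, "r33": {1, 4},
    "r41": {4}, "r42": {1, 3}, "r43": {5, 6},
    "r51": {5}, "r52": {4, 6}, "r53": {1, 2},
    "r61": {6}, "r62": {4, 5}, "r63": {2, 3},
}


def flag_counts(flagged_rows):
    """Total flagged number k of every observation that occurs in the flagged rows."""
    counts = Counter()
    for row in flagged_rows:
        counts.update(DECISION[row])
    return counts


def detected_outliers(flagged_rows):
    """Observations whose total flagged number k is greater than one."""
    return sorted(i for i, k in flag_counts(flagged_rows).items() if k > 1)


# The five median residuals contaminated by an outlier in h_2, as listed in the text.
flagged = ["r13", "r21", "r32", "r53", "r63"]
print(dict(flag_counts(flagged)))
print(detected_outliers(flagged))
```

Counting per observation instead of per residual is what allows a single bad observation to be separated from the good observations that merely share an equation with it.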
If the a priori variance σ² is not known, σ_med is proposed instead of MAD because it is more efficient than MAD (HAMPEL et al. 1986, ROUSSEEUW and LEROY 1987, MARONNA et al. 2006), with σ_med = 1.4826 MAD as defined in the Introduction.

SIMULATION OF THE LEVELING NETWORK
To apply the new median approach, the network given in Fig. 1 is considered. The heights of the four points are H_1 = 100.000 m, H_2 = 105.276 m, H_3 = 104.388 m and H_4 = 103.055 m, respectively. These heights are free of random errors. The error-free height differences (h_oi, i = 1, 2, …, 6) are computed from them. To obtain the measurements of the height differences, we assume that the random measurement errors have the same variance σ² (i.e. σ = 1 mm). Thus, the measurements of the height differences are computed as h_i = h_oi + e_i, where e_i ~ N(μ=0, σ²=1), i.e. the measurement errors are assumed to be Gaussian with an expected value of zero and a variance of 1.
Here the e_i are 0.90, -0.52, -0.50, -0.97, 0.00 and 1.39 mm. To generate one contaminated height difference, the random error e_i is replaced by the outlier dh_i, so that the contaminated value is h_oi + dh_i. In this section we have tested the following cases:
I. The observations do not include any outlier.
II. The observation h_1 is contaminated with an outlier of +5 mm.
III. The observation h_1 is contaminated with an outlier of +10 mm.
IV. The observation h_1 is contaminated with an outlier of +1000 mm, which is called a wild observation.
For the first case: The median equations for each height difference h_i are formed and their medians are taken according to Eqs. (1a), (1b) and (2). Then, the differences (r_ij) are computed according to Eq. (3). Their σ_med is found to be 1.0 mm. The method did not flag any value greater than 3σ for the case when the a priori variance is known, nor any value greater than 3σ_med for the case when the a priori variance is unknown.
For the second case: The median residuals (r_ij) according to Eq. (3) are [-4.5, 0.0, 0.9, 0.0, -5.5, 1.4, 1.4, 0.0, -3.2, 0.0, 4.5, -2.3, -2.4, 0.0, 3.2, -1.4, 0.9, 0.0] mm. If we look at the r_ij values, we can see that the outlier is not spread onto the adjacent observations as in LSE. Their σ_med is 2.0 mm. With the 3σ rule, the threshold value for r_ij is 3 mm. We see that there are five values (r_11, r_22, r_33, r_42 and r_53) greater than 3σ. These median equations are contaminated by h_1. The height difference h_1 is the joint value among these five contaminated values. In the decision matrix h_1 is flagged five times and h_2, h_3, h_4, h_5 and h_6 are flagged only once; since the flagged number of h_1 is greater than one, h_1 is the outlier. If the a priori variance is unknown, 3σ_med is used as the threshold value. Since none of the median residuals exceeds 3σ_med (6.0 mm), the method cannot detect the outlier.
For the third case: The differences (r_ij) according to Eq. (3) are [-9.5, 0.0, 1.0, 0.0, -10.5, 1.4, 1.4, 0.0, -8.2, 0.0, 9.5, -2.4, -2.4, 0.0, 8.2, -1.4, 1.0, 0.0] mm. Their σ_med is again 2.0 mm. If the a priori variance σ² of the height differences is known, 3σ is used as the threshold value, which is the same as in the second case. We see that there are five values (r_11, r_22, r_33, r_42 and r_53) that are greater than 3σ, and h_1 is the joint observation. In the decision matrix h_1 is flagged five times and h_2, h_3, h_4, h_5 and h_6 are flagged only once; since the flagged number of h_1 is greater than one, h_1 is the outlier. If the variance σ² of the height differences is unknown, 3σ_med is used as the threshold value, which is the same as in the second case. We see that there are five values (r_11, r_22, r_33, r_42 and r_53) that are greater than 3σ_med. If we look at these median equations, there is one joint value among them, i.e. h_1. Therefore, h_1 must be contaminated.
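The thresholding of the second and third cases can be retraced with the short sketch below, which uses the residual vectors listed above. It assumes σ_med = 1.4826 MAD, with the MAD taken about the median of the residuals; the function names are illustrative.

```python
from statistics import median

CASE_II = [-4.5, 0.0, 0.9, 0.0, -5.5, 1.4, 1.4, 0.0, -3.2,
           0.0, 4.5, -2.3, -2.4, 0.0, 3.2, -1.4, 0.9, 0.0]
CASE_III = [-9.5, 0.0, 1.0, 0.0, -10.5, 1.4, 1.4, 0.0, -8.2,
            0.0, 9.5, -2.4, -2.4, 0.0, 8.2, -1.4, 1.0, 0.0]


def sigma_med(r):
    """Robust scale estimate 1.4826 * MAD of the median residuals."""
    m = median(r)
    return 1.4826 * median([abs(x - m) for x in r])


def flagged(r, sigma):
    """Labels r_ij of the residuals whose magnitude exceeds three times sigma."""
    return [f"r{k // 3 + 1}{k % 3 + 1}" for k, x in enumerate(r) if abs(x) > 3.0 * sigma]


for name, r in (("II", CASE_II), ("III", CASE_III)):
    print(name, "known variance:  ", flagged(r, 1.0))           # sigma = 1 mm
    print(name, "unknown variance:", flagged(r, sigma_med(r)))  # sigma_med
```

The flagged labels can then be passed through the decision matrix to count how often each observation is flagged.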
For the fourth case: The differences (r_ij) according to Eq. (3) are [-999.5, 0.0, 1.0, 0.0, -1000.5, 1.4, 1.4, 0.0, -998.2, 0.0, 999.5, -2.4, -2.4, 0.0, 998.2, -1.3, 1.0, 0.0] mm. If we look at the r_ij values, we can see that the outlier is not spread onto the adjacent observations as in LSE. Their σ_med is again 2.0 mm. There are five values (r_11, r_22, r_33, r_42 and r_53) that are greater than 3σ. If we look at these median equations, there is one joint value, h_1. Thus, we can detect this outlier in h_1. If the a priori variance σ² of the height differences is unknown, 3σ_med is used as the threshold value, which is the same as in the second case. We see that there are five values (r_11, r_22, r_33, r_42 and r_53) that are greater than 3σ_med. If we look at these median equations, there is one joint value, h_1. Therefore, h_1 must be contaminated.
We see that the σ_med value for cases II, III and IV does not change as the magnitude of the outlier changes. This demonstrates the robustness property.
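As a small check of this robustness property, the sketch below computes σ_med for the residual vectors of cases II, III and IV and, for contrast, the ordinary sample standard deviation. The comparison with the standard deviation is added here for illustration and is not part of the original study; σ_med = 1.4826 MAD is assumed.

```python
from statistics import median, stdev

CASES = {
    "II": [-4.5, 0.0, 0.9, 0.0, -5.5, 1.4, 1.4, 0.0, -3.2,
           0.0, 4.5, -2.3, -2.4, 0.0, 3.2, -1.4, 0.9, 0.0],
    "III": [-9.5, 0.0, 1.0, 0.0, -10.5, 1.4, 1.4, 0.0, -8.2,
            0.0, 9.5, -2.4, -2.4, 0.0, 8.2, -1.4, 1.0, 0.0],
    "IV": [-999.5, 0.0, 1.0, 0.0, -1000.5, 1.4, 1.4, 0.0, -998.2,
           0.0, 999.5, -2.4, -2.4, 0.0, 998.2, -1.3, 1.0, 0.0],
}


def sigma_med(r):
    """Robust scale estimate 1.4826 * MAD of the median residuals."""
    m = median(r)
    return 1.4826 * median([abs(x - m) for x in r])


for name, r in CASES.items():
    print(f"case {name}: sigma_med = {sigma_med(r):.2f} mm, stdev = {stdev(r):.2f} mm")
```

The robust estimate stays close to 2 mm in all three cases, while the sample standard deviation grows with the magnitude of the outlier.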

SIMULATION RESULTS
We know that the success of robust methods and of tests for outliers varies from one sample to another, since the random errors are different (HEKIMOGLU and KOCH 1999, HEKIMOGLU and KOCH 2000). Therefore, the success of a method cannot be evaluated from the result of one sample, which may be chosen subjectively.
For the simulation, the network given in Fig. 1 was considered. The random errors, the 6 measurements and the outlier are generated as in the previous section. The random errors come from a Gaussian distribution N(μ=0, σ²=1 mm²). One hundred random error vectors e, and thus one hundred good samples, are generated. In addition, each sample is contaminated by a single outlier 100 times. Thus, we have obtained 10 000 contaminated samples. The r_ij are analysed by using the median and a threshold value of 3σ or 3σ_med. The results are given in Table 2. To measure the capacity of a method, the mean success rate (MSR) (HEKIMOGLU and KOCH 1999, HEKIMOGLU and KOCH 2000, HEKIMOGLU and ERENOGLU 2007, ERENOGLU and HEKIMOGLU 2010) is used. Considering the algorithm for forming the median equations, i.e. the decision matrix, which shows how the height differences in the median equations are connected, we can detect an outlier when the repeat number k of the flagged outliers is greater than 1. If the median and the median equations are used for outlier detection, the reliability of the method is 67% when the a priori variance is known and the magnitude of the outlier lies between 3σ and 6σ. If the a priori variance is unknown, the reliability of the method decreases to 44.5%. In some cases the method may flag good observations as outliers.
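The simulation loop described above might be sketched as follows. Several choices are assumptions made here for illustration and are not taken from the source: the line directions in `LINES`, the residual convention r_ij = Med_i - estimate, the random choice of the contaminated observation, outlier magnitudes drawn uniformly between 3σ and 6σ with random sign, and a trial counted as a success only when exactly the contaminated observation is detected. The resulting rates are therefore not expected to reproduce the figures quoted in the text.

```python
import random
from collections import Counter
from statistics import median

LINES = {1: (1, 2), 2: (1, 3), 3: (1, 4), 4: (2, 4), 5: (2, 3), 6: (3, 4)}  # hypothetical
POINTS = (1, 2, 3, 4)


def signed_line(a, b):
    """Index and sign of the line that connects point a to point b."""
    for i, (p, q) in LINES.items():
        if (p, q) == (a, b):
            return i, 1.0
        if (p, q) == (b, a):
            return i, -1.0
    raise KeyError((a, b))


def median_equations(err):
    """List of (observations involved, median residual) for all 18 median equations.

    Only the errors are needed, because the error-free differences close exactly.
    """
    rows = []
    for i, (a, b) in LINES.items():
        group = [({i}, err[i])]  # direct observation
        for c in POINTS:
            if c in (a, b):
                continue
            j, sj = signed_line(a, c)
            k, sk = signed_line(c, b)
            group.append(({j, k}, sj * err[j] + sk * err[k]))
        med = median([v for _, v in group])
        rows.extend((members, med - v) for members, v in group)
    return rows


def detect(err, sigma=None):
    """Observations flagged more than once; sigma=None means unknown variance."""
    rows = median_equations(err)
    if sigma is None:
        res = [r for _, r in rows]
        m = median(res)
        sigma = 1.4826 * median([abs(r - m) for r in res])
    counts = Counter()
    for members, r in rows:
        if abs(r) > 3.0 * sigma:
            counts.update(members)
    return {i for i, k in counts.items() if k > 1}


def mean_success_rate(n_samples=100, n_contaminations=100, known=True, seed=1):
    """Share of contaminated samples in which exactly the bad observation is detected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        base = {i: rng.gauss(0.0, 1.0) for i in LINES}
        for _ in range(n_contaminations):
            err = dict(base)
            i = rng.choice(list(LINES))
            err[i] = rng.choice((-1.0, 1.0)) * rng.uniform(3.0, 6.0)
            if detect(err, 1.0 if known else None) == {i}:
                hits += 1
    return hits / (n_samples * n_contaminations)


print("known variance:  ", mean_success_rate(known=True))
print("unknown variance:", mean_success_rate(known=False))
```

A stricter or looser success criterion (for example, accepting additional false flags) would change the rates, so the criterion should be fixed before results from different methods are compared.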

CONCLUSION
In this study, it is investigated whether the median (with 3σ or 3σ_med) may be used as an estimator on the median residuals r_ij to detect outliers by

Table 1 - The decision matrix.