Geostatistical simulations with heterotopic hard and soft data without modeling the linear model of coregionalization

Most mining decisions are based on models estimated/simulated given the information obtained from samples. During the exploration stage, samples are commonly taken using diamond drill holes which are accurate and precise. These samples are considered hard data. In the production stage, new samples are added. These last are cheaper and more abundant than the drill hole samples, but imprecise and are here named as soft data. Usually hard and soft data are not sampled at the same locations, they form a heterotopic dataset. This article proposes a framework for geostatistical simulation with completely heterotopic soft data. The simulation proceeds in two steps. First, the variable of interest at the locations where soft data are available is simulated. The local conditional distributions built at these locations consider both hard and soft data and are obtained using simple cokriging with the intrinsic coregionalization model. Second, the variable of interest in the entire simulation grid using the original and previously simulated values at soft data locations is simulated. The results show that the information from soft data improved both the accuracy and precision of the simulated models. The proposed framework is illustrated by a case study with data obtained from an underground copper mine.


Mining Mineração
Decisions in mining operations are generally based on estimated/simulated grade models. The information used to build these models comes from samples obtained from various sources. At every stage of a mining project, more samples are collected and used to update the models and reduce their uncertainty.
During the exploration stage, the samples are mostly diamond drill cores, which are accurate and precise, but expensive and scarce. These samples are considered error-free and are denoted hard data. At the production phase, additional samples are obtained from blast hole drilling fragments (dust) or channel samples. Compared to the diamond drill samples, these samples are cheaper and more abundant, but they have a nonnegligible sampling error and are denoted soft data. Usually, soft and hard data are not sampled at the same locations.
Classical approaches to integrate different data sources in geostatistical simulations are generally known as cosimulation. The sequential Gaussian co-simulation (Verly 1993) is one of the first workflows and is an expansion of the original sequential Gaussian simulation (Isaaks 1990) to consider multiple variables and their correlations. The sequential Gaussian co-simulation uses simple cokriging to integrate different data. The relationships between the different types of data are described by the linear model of coregionalization (LMC) (Journel and Huijbregts 1978;Goovaerts 1997;Wackernagel 2003). Oliveira et al. (2020) proposed a methodology for geostatistical co-simulation that does not require the LMC. The integration of data of different types is based on covariance tables (direct and cross), which describe the spatial continuity of the data. The main limitation of this technique is that the secondary data must be densely sampled.
The problem of sequential Gaussian co-simulation is that fitting the LMC is often a tedious and challenging task. This task is even more complicated when the hard and soft data are heterotopic and have different and irregular sampling spacing. In this case, the experimental cross-variograms cannot be calculated, so cross-covariograms or correlograms must be used. In addition, these experimental cross-correlograms may be very sensitive to their calculation parameters, such as lag spacing and tolerances.
Another problem is that the sequential Gaussian co-simulation requires solving the cokriging system at all the nodes of the simulation grid whose search neighborhood includes secondary data. The resolution of the cokriging system may slow the simulation process, if the search neighborhood includes many samples. Using many samples in the search neighborhood is recommended for geostatistical simulations, since it aids in reducing the uncertainty.
One approach for simulation of multiple correlated variables without modeling the LMC is to simulate the variables hierarchically (Journel and Almeida 1994) using collocated cokriging with a Markov model of coregionalization (Almeida and Journel 1994;Journel 1999). In the Markov models, the crossvariogram models are proportional to the variogram models of either the hard or soft data (Shmaryan and Journel 1999). This approach requires that the soft data be available at all locations of the simulation grid. Otherwise, the soft data may be simulated first to populate the simulation grid. Then, the hard data or primary variable is simulated using the collocated secondary/soft datum with collocated cokriging.
The problem of simulating the variables hierarchically with collocated cokriging is that the variance of the primary simulated variable is too high (Journel and Deutsch 1998). In this context, Babak and Deutsch (2009) proposed a method that solves the vari-ance inflation of collocated cokriging. The method is called intrinsic collocated cokriging (ICCK) and consists of simulating with cokriging using the secondary data at the node being simulated and at the locations of the primary data. The authors used the intrinsic coregionalization model (ICM) to define the direct and cross-variogram models. The intrinsic coregionalization model (ICM) is a particular case of the LMC, where all the direct and cross-variograms are proportional to one basic variogram model (Goovaerts 1997). For instance, all direct and cross-variograms are obtained by multiplying a basic variogram model by a rescaling factor. The basic variogram model used by Babak and Deutsch (2009) for the ICM was the variogram model of the primary variable. Thus, ICCK requires fitting only the variogram model of the primary variable.
In this study, we adapted the ICCK (Babak and Deutsch 2009) to the situation of complete heterotopy between the hard and soft data. These authors showed a case study with exhaustive secondary data. In this case, the secondary data are available at the locations of the primary data, and the use of ICCK is straightforward. This article considers the situation where the secondary data are not at the locations of the primary data.
Intrinsic collocated cokriging requires the coefficient of correlation between the variables, whose calculation is not direct when the data are completely heterotopic. To obtain the coefficient of correlation between the variables, we extrapolated the experimental crosscorrelogram values until the distance of zero (Minnitt and Deutsch 2014). There are other ways to obtain the relationship between different variables in the case of heterotopic data. Myers (1982) and Cressie (1991) proposed one generalization of the variogram to compute the cross variance between variables called as pseudo cross variogram. Some authors discuss the limitations of the pseudo cross variogram when the variables are measured at different support and units (Papritz et al. 1994;Papritz and Fluhler 1993). Cressie and Wikle (1997) have a different standpoint. They believe that in cokriging estimates with variancebased cross-variograms, the variables can be measure with a different mean and scale. Foehn et al. (2018) used the pseudo cross variogram to compute the linear model corregionalization to de-velop a methodology called regression cokriging in heterotopic data applied in meteorological data When the dataset is heterotopic, Barnett and Deutsch (2015) and Silva and Deutsch (2018) proposed multiple imputation (MI) to integrate secondary data into geostatistical simulations. The main idea of MI is to perform the geostatistical simulations in two steps. The first step is to simulate the primary variables at the locations of the secondary data. The result is a series of different datasets. The second step consists of performing geostatistical simulation in the simulation grid using the different datasets generated at the first step as conditioning information. Each generated dataset conditions one realization. This simulation at the second step is performed using simple kriging. This approach is efficient since it requires solving the cokriging system only at the locations of the secondary data. The main drawback of this approach is that only the collocated secondary datum is used.
Similar to Barnett and Deutsch (2015) and Silva and Deutsch (2018) 1996;Neufeld and Deutsch 2006) to build the local probability distribution at each location where a soft datum is available and sequential Gaussian simulation to complete the remaining grid nodes. The main drawback of these methods is that only one collocated soft datum combined with hard data is used to condition the simulations.
Cuba et al. (2012) proposed a method to combine data with different errors. The method builds a conditional distribution at the location of the soft data using only the neighborhood samples removing the sample with error. The Gibbs sampler algorithm is used to This article proposes a methodology to integrate heterotopic soft data into geostatistical models. Soft data are referred to herein as values of the same variable of hard data but with attached uncertainty (imprecise). The methodology is presented in Figure 1 and listed below: i. Build the Conditional Cumulative Distribution Function (CCDF) at soft data locations using MultiGaussian cokriging. For this, local simple cokriging estimation was performed at the soft data locations using all hard and soft data within the search neighborhood. The local CCDF has a Gaussian distribution with mean and variance equal to the simple cokriging mean and variance, respectively. The Intrinsic Coregionalization Model (ICM) was used, so all direct and cross-variograms are proportional to the variogram model of hard data.
ii. Generate a spatially correlated grid of random numbers in the interval [0, 1] via unconditional simulation. This is the main idea of p-field simulation (Srivastava 1992;Froidevaux 1993). These values were generated by unconditional geostatistical simulation using the variogram model of hard data. N scenarios were generated iii. Draw a value from the CCDF built-in (i) using the random number obtained in (ii) at this soft data location.
iv. Repeat (ii) and (iii) at all soft data locations. sample from this conditional distribution. The simulated values are kept if they satisfy statistical requirements (variogram and histogram). Maleki Tehrani et al. (2013) showed a method to consider both hard and soft data in geostatistical simulation. The original soft data consisted of geological information (rock type description), while the hard data consisted of copper grades. The first step was to build a local Cumulative Distribution Function (CDF) of grades in these soft data locations. The local CDFs followed a Gaussian distribution, whose means and variances correspond to the cokriging estimates and variances. The conditioning information for cokriging consisted of the original hard data and the soft geological data. The second step consisted of simulating a set of values at these local CDFs using the matrix decomposition method (Alabert 1987). Finally, these simulated values were combined with the original hard data to condition the simulations that occur over the entire study area. Some approaches based on Kalman filters, to integrate soft data to update short-term mining plans, improved production control in mineral resource extraction (Benndorf and Jansen, 2017;Wambeke and Benndorf, 2017).
Different from the other proposals presented herein, we present a method that uses all the hard and soft data available in the local vicinity of a soft datum to condition the simulations. To the authors' knowledge, the method has not been applied before. Similar to other approaches (Cuba, 2012; Barnett and Deutsch (2015) 2019)), we start by generating a series of datasets at the location of the soft datum. The generation of the datasets is performed by p-field simulation (Srivastava 1992). The conditional cumulative distribution functions required for p-field simulation were built with MultiGaussian cokriging using all the hard and soft data inside the search neighborhood. P-field simulation is used due to its simplicity and flexibility, since the local distribution may be obtained from any source of information.
The direct and cross-covariances required for cokriging were obtained by the intrinsic coregionalization model (ICM), so that all the variograms are proportional to the variogram of the primary variable. The ICM requires the coefficient of cor-relation, which was obtained by extrapolating the experimental cross-correlogram (Minnitt and Deutsch 2014). This cokriging is similar to the ICCK used by Babak and Deutsch (2009).
The result of the p-field simulation is a set of datasets that are used to condition the simulations in the simulation grid. The simulations at the remaining grid nodes were performed afterwards with turning bands simulation (Journel and Huijbregts 1978). Simple kriging was used to condition the turning band simulations. Although the p-field simulation may generate some artifacts (Pyrcz and Deutsch 2001;Ortiz 2003), the number of locations simulated with pfield simulation is negligible compared to the number of locations simulated with turning bands. The result is that the final geostatistical models do not present the artifacts of p-field simulation.
The advantages of the workflow proposed are the following: (i) only the spatial continuity model of the hard data is required; the uncertainty of soft data is accounted for in the geostatistical models, (ii) it is computationally efficient, and (iii) all the available soft data are considered. An underground mine case study illustrates the methodology proposed.

Methodology
This study uses data from an underground copper mine. The geological domain considered was sampled by two different types of data (hard and soft data). Figure 2 shows a data location map. The hard data ( Figure 2a) were collected using diamond drill holes, and are considered accurate and precise, but have a large sample spacing. There are 470 samples with an average sampling spacing of 10 × 10 m. The soft data ( Figure 2b) comprise 1473 values sampled by channels. The soft data are imprecise and inaccurate because errors are incorporated during the sampling process. The average space between soft data values is 3 × 3 m. Table 1 shows the global statistics for the declustered dataset, using cell declustering (Deustch and Journel,1998) which will be combined to build a grade model: hard data (Cu_DDH) with good quality and sparse and soft data (Cu_Channels) with poor quality and abundant. The mean and standard deviation are respectively 73.25% and 89.68 %, higher than that of the hard data. These large differences indicate that soft data was collected in the vein area (Figure 2a and 2b), where the high grades are located. To check for bias and imprecision in the soft data, both the hard and soft data must be investigated at the same domain. Table 2 shows the main statistics of the declustered dataset of the samples that are a maximum of 3 m apart (quasi-collocated). This was obtained by assigning the soft da-tum value to the nearest hard datum location whose distance was less than 3 m. We assume that the hard data (Cu_DDH) are unbiased and precise. The mean and standard devia-tion of the soft (Cu_Channels) data are respectively 5.25% and 8.50% lower than that of the hard data. These differences indicate that soft data is biased and imprecise. Figure 3 shows the univariate Gaussian histograms of the hard (Figure 3a) and soft (Figure 3b) data. The data were transformed d by the normal transformation (Deutsch and Journel, 1998). The data are transformed as the simulation techniques used assume that the data follow a multivariate Gaussian distribution. The histograms show that the data after transformation are standard normal (zero mean, unit variance) for both variables.

Data presentation (a) (b)
v. Combine original hard data with a realization at all soft data locations and proceed with simulation of the remaining grid. v i. Rep e at (i i) to (v) for a new realization.   In order to compute correlation in a completely heterotopic dataset (the hard and soft data are never at the same location), the experimental cross-correlogram plot ( Figure  4) was used to project the cross-correlogram to zero h and infer the correlation among data (Minnitt and Deutsch 2014). The results (blue line) indicate a considerable correlation of approximately 0.90.
MultiGaussian cokriging was used to build the local CDFs which are used afterwards to draw local realizations using p-field simulation. The local CCDF is Gaussian where the mean is the cokriging estimate and the variance is the cokriging variance.
The cross-covariances for the cokriging were obtained by the intrinsic coregionalization model. Figure 5 shows the variance map of The spatial continuity model of the hard data used in this case was defined using standardized variogram models.  (1) Figure 5 -Variance location map of simulate data locations generate in step i) of the proposed methodology.
To validate the datasets generated, we checked the histogram and variograms of the datasets built with p-field simulation. Bias effect in the histograms was controlled by using cell declustering (Deustch and Journel,1998), leading to a weighted average Cu (%) content of 2.28 (%). Figure  7 shows the cumulative histogram of the simulated datasets (black lines) and declustered hard data (red line). The mean and standard deviation of the simulated data are close to the hard data histogram. The results indicate that the distributions are similar.

Validating simulated data
local CCDF. The variance is zero (light blue dots) at the hard data locations, as expected. At the soft data locations, the variance depends on the amount and correlation of data in the vicinity. Figure 6 shows the location map for two realizations of the simulated data with p-field simulation. When we compare the two, the simulated values (dots symbols) are different at the soft data locations but are the same at the hard data locations (cross symbols). It means, for each new realization, there will be different simulated values comprising the data set. Fifty realizations were generated.    The remaining grid nodes were simulated with the Turning Bands simulation (Matheron, 1973 andJournel, 1974). These simulations were conditioned to the datasets generated by p-field simulation. For each realization, a different dataset was used, which was comprised of that original hard data and simulated data set values at soft data locations. To optimize final model results with the recommendations suggested by Emery and Lantuéjoul (2006) were used. Simple kriging was used to build and obtain the mean and variance for the local CCDF. A total of one thou-sand lines were used for the turning bands algorithm. Fifty realizations were created with the same spatial continuity model used in p-field simulation data describe in equation 2. Finally, the simulated models were back-transformed to the space of the original units.
In Figure 9, the red line represents the cumulative histogram of the declustered hard data (Cu_DDH), and the black lines are the histograms of the realizations. The plot shows good statistical reproduction of the histogram. The relative difference between mean values is 12% for the proposed methodology. The bias can occur in situations when the histogram of Cu (%) has a small amount of data, mostly in low frequency classes. Therefore, as in the proposed workflow, soft data were used with their error incorporated; the bias and imprecision of the data were not transferred to realizations.
In order to verify whether the results obtained by the proposed framework improved the accuracy and precision of the simulations, the values of a few hard data were excluded from the dataset, and simulations were run at these locations of the excluded data (omitted for the sake of comparison using jackknife (Efron 1982). To perform the test, eight diamond drill holes (comprising 55 hard data samples, or the equivalent of 12% of the total hard data values) were excluded from the dataset. (Figure 11). for the generated dataset. The models are similar; meaning that the inferred data (pseudo-hard) have the same spatial continuity of the hard data.

Evaluating the accuracy and precision models
The artifacts of p-field simulation are more pronounced when there are hard data to condition the simulations. In this case, the p-field was performed only at the locations of the soft data. The result is that the experimental variograms of the dataset did not show the artifacts of the p-field simulation.   This article presents an approach to incorporate hard data and completely heterotopic soft data into geostatistical simulations. The simulated data generated to infer the uncertainty associated with soft data reproduced the spatial continuity and cumulative distribution (histogram) of the hard data.
The simulated data inferred by p-field simulation reproduce the distribution and spatial continuity of the hard data. The quality of the models was assessed using the accuracy plot.
The results indicate the methodology proposed incorporates the uncertainty of the soft data into the final model. For future studies, the authors suggest using the methodology in the original data space with direct sequential simulation and applying it in other case studies with variables that have different coefficients of correlation. This workflow may be an easy alternative to incorporate soft data in the simulation, since it needs few steps and the modeling of the spatial continuity only for the hard data. Figure 11 -Copper underground mine data locations along a ZY plane view (blue dots: hard data; red dots: soft data; green dots: hard data excluded from the dataset; and the gray area corresponds to the geological domain). In order to check the model of uncertainty, accuracy plots were used. On this plot, the probability of a real value falling inside an interval given by simulations is checked against the number of times the real value actually fell into this interval. If the results are accurate and precise, the points of the plot are close to the 45° line. More details about accuracy plots can be found in Goovaerts (2001). Figure 12 shows the accuracy plot for the methodology investigated. The results obtained by Pfield/Tbsim show the data points closer to the 45° line, indicating that the models obtained are accurate and precise.