Data imputation of water quality parameters through feed-forward neural networks

ABSTRACT The constant monitoring of water quality is fundamental for the understanding of the aquatic environment, yet it demands great financial investments and is susceptible to inconsistencies and missing values. Using a database composed of 59 sampling campaigns, performed for 12 years, on 10 monitoring stations along the Iguassu River Basin (Southern Brazil), this study presents a model, based on feed-forward neural networks, which imputed 1,370 values for 11 traditional water quality parameters, as well as 3 contaminants of emerging concern (caffeine, estradiol and ethinylestradiol). The model validation errors varied from 0.978 mg L-1 and 0.017 mg L-1 for the traditional parameters, for caffeine the validation error was of 0.212 µg L-1 and for the hormones, the errors were of 0.04 µg L-1 (E1) and 0.044 µg L-1 (EE1). The models underwent two techniques to understand the operations performed within the model (isolation and nullification), which were consistent to those explained by natural processes. The results point to the validity of modeling water quality parameters (especially the concentrations of caffeine) through neural networks, which could lead to better resource allocation in environmental monitoring, as well as improving available datasets and valuing previous monitoring efforts.


INTRODUCTION
Water is essential for human life but increasing urbanization and industrialization of modern society, as well as the ever-increasing agricultural activities demanded to support the populational rise, has put pressures onto water resources systems during the past century.Therefore, the continuous and systematic monitoring of water quality is a fundamental part of the understanding and preservation of economic, environmental, and human health.
Monitoring water quality of a large water body during extended periods of time demands large financial investments in laboratorial infrastructure, qualified personnel and sampling logistics (Woodhouse & Muller, 2017).Yet, these investments do not guarantee a flawless and complete dataset, as human error, instrumental malfunction, differences in methodology, budgetary constraints, and/or implementation of different monitoring approaches allow for missing or incongruent data collection within a studied period.These problems are more evident when dealing with concentration measurements for contaminants of emerging concern (Muthukrishnan et al., 2020).
Contaminants of emerging concern are natural or artificial compounds which are originated from human activity and are present in water resources in extremely low concentrations (in the order of ng to ug L -1 ), such as pharmaceuticals, female sex hormones, industrial preservatives, and additives, as well as agricultural pesticides, and personal care products (Berrou et al., 2021).The chronic exposure to such compounds, even at low concentrations, though not completely understood, have been linked to mutagenicity, carcinogenicity, endocrine disruption, deleterious alterations in reproductive patterns and environmental imbalance (Giri, 1993;Kidd et al., 2007;Oaks et al., 2004;Isidori et al., 2005;Katipoglu-Yazan et al., 2013;Galus et al., 2013).
The greatest hurdle that must be dealt with when dealing with water resources variables is the high variability and intertwined variance between the variables present.The relationships which govern the behavior of traditional parameters and contaminants of emerging concern within an aquatic environment (as well as interposed to the socioeconomical picture in which this system is enclosed) are extremely complex.One of the modern alternatives to understanding such complex databases and interrelationships is artificial intelligence, as computational modeling has experienced great advances on the past few decades.Machine learning methods, for example, are computational algorithms developed aiming at interpreting and reproducing intrinsic patterns in highlydimensional datasets.
There are very few studies that have analyzed the application of machine learning to model the concentration of contaminants of emerging concern (Kiesling et al., 2019;Krishnaraj & Deka, 2020), such as caffeine and female sex hormones.Also, very few studies have been performed on the practicality of data imputation for highly irregular, not evenly spaced temporally, water quality databases (Park et al., 2021).
Thus, this study aims at applying artificial intelligence models, based on feed-forward neural networks, to predict missing data within a water quality monitoring dataset produced for the Iguassu River from 2005 and 2017.The parameters that will be modeled are the concentrations of 11 traditional water quality parameters: biological oxygen, particulate organic carbon, dissolved organic carbon, ammonia nitrogen, nitrate, nitrite, total nitrogen, total phosphorus, and orthophosphate, as well as four contaminants of emerging concern: caffeine, estradiol, ethinylestradiol, and estrone.This study also aims at understanding the accuracy that may be expected in such an endeavor, as well as the intrinsic functions and relationships that each model was able to extract from the dataset for each pair of input-output variables.Thus, the models created to impute the missing data of the set might be used to predict values for faulty future sampling campaigns, as well as be able to better understand the natural system it is applied, which could lead to more efficient resource allocation for future environmental monitoring.

Study area
The database of water quality parameters used for this study was collected from the Iguassu River, in the State of Paraná in Southern Brazil.It runs for 1.311 km, and its watershed spreads over 70,800 km 2 , 97% of it is located within Brazilian borders, while the remaining 3% is in Argentina.The study area does not cover the entirety of the Iguassu basin, being restricted to 39% of the basin area and 36% of the river's course.
This study analyzed data collected from 10 different sampling sites across the first half of the Iguassu River.Table 1 shows the coordinates and distance from the spring for each of the sampling Though the Iguassu River is formed by the confluence of the Irai and Atuba River (around IG2D and IG2E), the Irai River's River's spring, located 22 kilometers upstream from IG1 (located in a mountainous Atlantic rainforest biome), is commonly addressed as the Iguassu River's spring.For this study, this will also be assumed.
The IG1 sampling site is in the municipality of Pinhais, within the Curitiba Metropolitan Area (CMA), which is the largest and most populous urban agglomeration of the State of Paraná and the 9 th most populous of Brazil.This confers urban characteristics to the majority of the first 100 kilometers of the Iguassu River's course.
The sites IG2D and IG2E are located within the same transversal section of the river but represents different aspects since are located 50 meters downstream from the confluence of the Atuba and Irai rivers, while their waters have yet to completely mix.The characteristics of these water bodies are contrastingly different, as the Irai river is part of the water supply system of the region, it tends to present less degraded water quality (IG2E) than that present on the Atuba River (IG2D), which receives the discharge of one of the largest wastewater treatment plants of the CMA, usually degrading the water quality of the river.
The subbasins for IG3, IG4 and IG5 are encompassed entirely within an urban setting.The IG6 subbasin is situated in a semi-urban scenario.The first sampling site located in a mostly rural environment is IG7, situated in the municipality of Porto Amazonas.The sub-basins of IG8 and IG9 sites are larger than those that came before them.This difference between the size and setting of each of the sampling sites and their respective zones of hydrological influence confer unique characteristics for each one.

Database
All the data used in this study was collected through the Project Integra II, a multi-institution, multidisciplinary water quality monitoring scientific effort performed on the Iguassu River, in southern Brazil, from 2005 through 2017.During this period, 415 surface water samples were collected in non-regular time-steps, from 10 unevenly spaced sampling sites along the first half of the course of the Iguassu River.
The dataset is composed of the observations of the concentrations of 11 traditional water quality parameters, as well as the concentrations of 4 contaminants of emerging concern.The parameters which were monitored were: biological oxygen demand (BOD), particulate and dissolved organic carbon (POC and DOC, respectively), ammonia (NH4), organic and total nitrogen (NOrg and TN, respectively), nitrate and nitrite (NO2 and NO3, respectively), orthophosphate (PO4), total phosphorus (TP) and dissolved oxygen (DO).The 4 contaminants of emerging are: caffeine (CAF), estradiol (E1), ethinylestradiol (EE1) and estrone (E2).The latter was ultimately dropped from the analysis for lacking informational entropy, resulting in an analysis that encompassed only 3 contaminants of emerging concern.
The database has varying amounts of missing values for each of the parameters.Table 2, next, shows the number of observations, missing values, average and standard deviation for each of the studied parameters.
The missing values of POC, PO4, CAF, E1 and EE1 were mostly caused by alterations in the monitoring strategy during the study period.DOC monitoring started only in 2009, CAF and the hormones were measured starting in 2012, and POC has only been observed from 2012 to 2014.The other missing values were caused by machine malfunction and are spaced throughout the study period.Besides the considerable number of missing values within the dataset, the samples showed high variability, as their mean values for all parameters are smaller than their standard deviations.
Due to the lack of informational entropy within the estrone example dataset (all examples for this hormone's concentration  RBRH, Porto Alegre, v. 28, e14, 2023 4/12 Data imputation of water quality parameters through feed-forward neural networks were zero or below the limit of quantification) the analysis of imputation of new data for this parameter was dropped, as any attempt to generate an artificial intelligence method would only allow for the replication of null results.

Network architecture and data imputation process
An artificial neural network (ANN) is a computational model which has the objective of understanding and predicting an intrinsic complex non-linear dependency relationship between a set of input variables (predictors) and a set of output variables (targets).This net is based on mathematical neurons, which are disposed into interconnected layers that allow computing high level models from simple neural models (Khalil et al., 2011).
The system receives the input data and uses many parameters to compute the solution, which need to be estimated.So, a neural network is an iterative method that focuses on minimizing the error in prediction (the performance metric for this study was the MSE -mean squared error).This is achieved by exposing the model to data examples with previously known values for inputs and outputs, this process is called the training of the model (Jiang et al., 2013;Wang et al., 2019;Bansal & Geetha, 2020).
The input variables are inserted into the input layer of the network.Each node of this layer receives the input signals and computes a result.As each node is connected to each of the nodes of the next (now called hidden) layer, its output is passed to a subsequent hidden layer (feed forward) or, finally, the output layer.Each connection between neurons has a weight assigned to it, which will be updated, by an iterative process, to minimize the difference between the observed (real) and modeled results.This iterative correction is called back-propagation, as it spreads from the output nodes towards the input (Ruben et al., 2018;Mitrovic et al., 2019).This kind of net is called a feed forward artificial neural net (FFANN, as the input information has a directional path towards the output nodes), trained with the propagation algorithm, or multilayer perceptronic neural network (MLPNN), the nomenclature is not consistent within the literature, therefore, for this study it will be used FFANN to identify this sort of ANN that is used here.
The operations that occur inside a hidden artificial neuron are usually arranged into two compartments.First, the Transfer Function section is responsible for summing the information of each weighted input.It then sends the value of this summation to the Activation Function compartment, which applies a userdefined function to modulate the output before passing it to another neuron within the net (Khalil et al., 2011).An illustration of an artificial neuron is presented on Figure 2. Note that in this case a ReLU (rectified linear unit) activation function is applied.
As previously stated, networks are arranged in interconnected layers of neurons.This allows representing and understanding more complex underlying patterns within a dataset.
In this study, it was verified if it would be possible to fill the missed values of a given water quality parameter, computing a value from other variables using a neural network.That means that it was evaluated the viability of predicting one of the variables using the others as information source, assuming that there is a certain dependency between the variables and that the neural network was able to detect this relationship.For this purpose, it is important that enough samples with several parameters are available.For example, in this study case, for the 372 samples in which the DOC concentration was measured, 348 had also an associated measurement of BOD, while only 77 of these samples had an associated concentration for POC.Note that as there were only 110 observations for the contaminants of emerging concern, they were not considered as possible input parameters.By taking these values into consideration, the inputs that would be fed into each of the models are presented on Table 3.
To verify this hypothesis, neural networks were used varying the architecture.A script was written to measure the performance of different neural networks architectures using the same input data.The script trained 1641 different configurations for each of the 14 parameters, varying the number of nodes for the first, second, and third layers.The number of nodes for the first layer varied from 1 to 32, the second varied from 0 to 16, and the third varied from 0 to 8. When the value of layer reached zero within the iteration, it was taken away from the overall model, if the second layer became zero, so would the third.This preliminary sieve for the definition of the specific network architecture experienced 100 epochs.
After obtaining the performance of each configuration, the 5 settings which presented the best performance were re-trained with different epochs to circumvent possible under or overfitting caused by a non-optimal epoch number.The experiments started with 150 and finished with 1,000 epochs, with an iterative step of 50.The results for the network architectures which presented the best performance for each of the variables that would be used to generate the new imputed values are presented on Table 4.
These neural network models were trained by being exposed to examples which had all input data observed, not missing a single value within the input vector, as well as having all the input parameters 0-1 standardized.To determine the accuracy of the models, approximately 10% of the examples (aside from the 20% which were separate inside the algorithm to produce the testing accuracy values) were set aside for validation purposes.

5/12
The parameters which would be used for validation were selected randomly for each different training run, because if these parameters were locked from the start, the models' measured accuracy would choose the ones which would better predict solely those 10% of examples previously set.The accuracy of each model's validation capability was measured through the root mean square error.
The overall imputation process was iterative.The chosen FFANN models were then applied to situations in which the output parameter was missing, but the input parameters were filled.Following stages would allow for the imputation of new values, as some input vectors which had missing values were completed by the previous imputation stage.Therefore, more than one imputation stage was performed for filling the dataset.
The performance for each of the chosen models are presented on Table 6.
The overall performance of each of the developed models, when compared to those present on literature (Table 7), used as benchmarks, were acceptable, as the values for the validation errors did not differ greatly from previous studies, and were within the range of the lowest errors presented.But, as the RMSE values are intrisically dependent on the scale of the measurements, the true extent of the performances of different models are difficult to assess without the application of the model on a standardized dataset.
Though the RMSE metric has been widely used for the analysis of the error of a predictive artificial intelligence model, there are no set reference margins which allow for the general evaluation of a model -though several studies have been performed to estimate levels of goodness of fit for different environmental variables (e.g. the recommendation of the Wisconsin Department of Natural Resources Bureau (2007) which classifies models as Excellent if the error lies within a margin of 0.3 mg L -1 , for the prediction of DO concentrations, such is the case for the proposed model), no consensus has been achieved (Kouadri et al., 2021;Boursalie et al., 2022).Therefore, the analysis of overall performance of a model is dependent on several factors, such as the scale of the input and output variable, the scope of the model/ project, and the complexity of the system.For the scope of this study, due to the high variability of the input variables and the complexity of the system being represented, the errors presented by Table 6 (which vary from 0.04 mg L-1 to 0.978 mg L-1) might be defined as a good representation of the starting dataset.
No studies have been performed to analyze the viability of predicting/modeling the concentrations of CAF, E1 or EE1 (Tiyasha et al., 2020).For the values obtained in this study, as the imputation method applied was iterative, the amount of predicted data for each parameter was different for each iteration, as shown in Table 8.None of the parameters had a new example opened up, by filling its input vector, by the third iteration.
Due to the different starting stages of each of the parameters input list, as well as the amount of missing data within each one, the final number of examples still had missing values, as no new vectors were opened up by the completion of new input vectors that would allow for new prediction of missing values.Table 9 presents the number of final values for each of the parameters.
The percentage of prediction between the different parameters were not equal.As expected, those parameters which presented the least amount of filled examples were the ones which had the largest number of imputed values.Therefore, the values for caffeine, estradiol and ethinylestradiol were the ones which presented the highest rate of filled in values.A total of 1.370 values were imputed overall, 61.57% of which are represented by the imputation of values for the concentrations of the contaminants of emerging concern.
The distribution for the imputed values for the contaminants of emerging concern are presented on Figure 3 (Caffeine), Figure 4 (estradiol) and Figure 5 (ethinylestradiol).
For the hormones, the peaks of the distribution of modeled values were consistently smaller than those measured.,which might be explained by the presence of a high percentage of true zeros examples.Since E1 had 57.2% of its training examples as true zeros, and EE1 presented a true zero percentage of 54.5%, which contrasts to the low percentage of true zero caffeine examples of 17.3%.This could have overwhelmed the model to tend to lower concentrations, yet different behaviors shown by the input parameters during the period in which the contaminants of emerging concern were not measured could have led to a model representation that results in lower concentrations overall.
As another product of model generation, the underlying function between the output parameters and each of its input Peixoto et al.

7/12
parameters may be explored by fixing the values for all nonscreening parameters and iterating the values for the target input.For example, for the relationship found between BOD and CAF, every other input had its value set to zero, and the model was asked to predict the probable value of caffeine (CAF) while BOD varied from 0 to 200 mg L -1 .The limits of variation for the x-axis were determined as a rounding of the double of the highest observed concentration for each of the parameters used as input for each specific output (the contaminants of emerging concern).Figure 6 presents the behavior presented by each of the input variables for the contaminants of emerging concern.
The behavior of the relations between the input variables and their output varied for each of the contaminants of emerging concern.The predicted concentrations for caffeine and estradiol presented a linear rise in relation to the concentrations of biological oxygen demand and ammonia nitrogen, which were expected, as higher concentrations of these parameters have been previously linked to an increase in water quality degradation (Mizukawa et al., 2019).This behavior was the opposite of the one presented by the dissolved oxygen (for caffeine and ethinylestradiol), as the fall in concentration of this parameter is usually linked to the discharge of wastewater, for the modeling of the estradiol concentration it did not present any significant relationship, this could be explained by the sensibility of the parameter, as variations on this parameter depend not only on chemical and biological variations but also physical ones, as higher temperatures diminish the maximum saturation of oxygen on the water (Li et al., 2020b).
The relationship for some of the input variables and ethinylestradiol, BOD, NH4, and total phosphorus, presented an initial peak of 0.17 µg L -1 for a null concentration of the parameters, followed by close to zero contribution as the concentration of said input parameter rose.This behavior is expected when the model cancels out an input variable, and could be understood as    RBRH, Porto Alegre, v. 28, e14, 2023 8/12 Data imputation of water quality parameters through feed-forward neural networks the lack of informational entropy offered by the variable when modeling the output.
The last step for understanding the relationship between the variables used as input and the output was to stablish the improvement (or decay) in performance when an input parameter was fixed to its mean for all examples.This operation allows for the determination of the relevance of the fixed parameter on the result produced by the entire model.As a parameter is fixed on its mean, the overall performance of the model is altered; the more the performance goes down, the more relevant that parameter was to the output.Therefore, it is possible to assume that the runs in which the performance was stable (or even increased) by the fixation of a parameter, that parameter is irrelevant to the overall result of the model.
Table 10, next, presents the results for the effects of fixing each input parameter onto each of the output models.
Each of the output models presented a different relationship between its variables.The organic matter measures (biochemical oxygen demand, particulate and dissolved organic carbon) did not present specific characteristics about the parameters in their inputs with two exceptions, both regarding DOC: The presence of ammonia nitrogen in the modeling was highly correlated to the performance of the model, while the removal of the total phosphorus from the inputs produced results which were 29.4% more accurate.
For the nitrogenated parameters (ammonia nitrogen, organic nitrogen, nitrate, nitrite and total nitrogen) their interrelationship was shown to be most relevant.Yet, while ammonia nitrogen and total nitrogen had their performance dependent of BOD, the modelling for organic nitrogen presented the highest improvement in performance while the BOD levels were fixed at its mean.The phosphorated compounds models were also dependent on the presence of total nitrogen within its input parameters.
For the contaminants of emerging concern, the distribution of biochemical oxygen demand, total nitrogen and dissolved oxygen were the ones which presented the highest relevance towards the performance of the three models.For the caffeine model, nitrate did not present such relevance as the one pointed out by the hormones' models.Also, the parameters which had no effects on the modelling were the distribution of ammonia nitrogen on the ethinylestradiol output, nitrite for the caffeine output and nitrate for the estradiol output.
The dynamics of micropollutants in aquatic environments are difficult to completely understand, due to these compounds' intrinsic characteristics and their relations to anthropic activities.Thus, the response sensibility and the synergy of one compound to  another is not completely known as well as their toxicological risks or the biological response to chronic exposure to these compounds, when exposed in a natural water setting (Routledge et al., 1998;Montagner et al., 2019;Santos et al., 2020).Yet, it is well stablished these compounds' great potential for endocrine disruption and deleterious health effects, especially on aquatic animals (Rocha et al., 2013;Luo et al., 2014).Aside from the compounds' inherent characteristics, there are environmental aspects that should be taken into account, such as the compounds' environmental halflife and recalcitrance, dilution caused by increased precipitation and other anthropic influxes, seasonal variance and changes in land cover and land use (Yoon et al., 2010;Wang et al., 2018;Yuan et al., 2020).As for the dilution and the seasonal variances, these might be the reason for the high rate of non-detection, or the concentrations being below the limit of quantification for the contaminants of emerging concern.Generally, higher concentrations of contaminants of emerging concern are linked to water quality degradation, which leads to the decrease in dissolved oxygen, and consequent increase of the concentrations of biochemical oxygen demand, ammonia nitrogen and phosphorus within the aquatic environment (Mizukawa et al., 2019).
The physical and chemical water quality parameters are indirect indications of anthropic activities which influence the water resources systems, while the concentrations of contaminants of emerging concern reflect a more accurate picture of the human impact on the aquatic environment, especially the influence caused by the discharge of wastewater.Therefore, it is necessary to consider the local and seasonal scenario which produced the changes in the physical and chemical parameters, as variations in the concentrations of OD, BOD, nitrogen and phosphorus may represent the natural configuration of the ecosystem being studied, or being a response to inefficient wastewater treatment (Yang et al., 2013;Berger et al., 2017).This is one of the great challenges of dealing with complex systems, which need to be mindfully studied.Therefore, advancements in technology and techniques to increase the quality of data collection and overall environmental monitoring are necessary to maintain up-to-date knowledge of the environment in which the researchers and water resource managers are included into.

CONCLUSION
This study presented a modeling strategy to impute missing data of 11 traditional water quality parameters and 3 contaminants of emerging concern through the implementation of specific neural network models for each of the parameters analyzed.The results presented point to the viability of using feed-forward neural networks to predict missing data values in past sampling campaigns for the Iguassu River.One of the main concerns about the implementation of artificial intelligence models is their black-box characteristics, in which the input transformations within the model are not visible to the user, this problem was addressed by this study twice, aiming at exploring the intrinsic relationship between the input and output variables.Both methods of evaluating the importance of each variable in the prediction pointed to the chemical and physical validity of the relations explored by the models.
In this study, the results point to promising model performances for the data imputation of caffeine concentrations, using as input the concentrations of biochemical oxygen demand, ammonia nitrogen, total nitrogen and phosphorus and dissolved oxygen.The overall performance of model expected errors of a little more of 10% of the measured concentrations for this compound.Though this dataset was a product of one of the largest and most comprehensive water quality monitoring efforts in Brazil, it was still not possible to gather data with enough informational entropy to allow for a better imputation model for estradiol and ethinylestradiol (as the majority of the training examples were composed of true zeros).Also, as in this study the networks were trained on examples for all sampling sites, and each of the sites presents specific characteristics, other studies might benefit from performing a similar analysis on a per-site basis, or by grouping sites with similar characteristics.
The efforts to monitor and understand the aquatic environments are dependent on large monetary investments, as well as being susceptible to inconsistencies in sampling, which could possibly generate gaps in the dataset.Artificial intelligence methods have been shown to be a viable modeling alternative for understanding, and consequently predicting, complex patterns in high dimensional datasets, and the results obtained in this study point to the viability of modeling aquatic environments by feedforward neural networks.Further studies should be performed to better understand the extent which the implementation of machine learning methods could improve the understanding of the environmental settings researchers, water resource managers and decision-makers are included in, as well as improve resource allocation in environmental monitoring efforts while potentially diminishing its financial burden.

Figure 1 .
Figure 1.Map of the study area within the Iguassu River Basin.

Figure 2 .
Figure 2. Function of a generic neuron in a feed-forward neural network with a ReLU activation function.

Figure 3 .
Figure 3. Distribution of observed and modeled examples of caffeine through the study period.

Figure 4 .
Figure 4. Distribution of observed and modeled examples of estradiol through the study period.

Figure 5 .
Figure 5. Distribution of observed and modeled examples of ethinylestradiol through the study period.

Figure 6 .
Figure 6.Behavior of the concentrations of the contaminants of emerging concern as each of their input behavior was the only varying aspect in the model.

Table 1 .
Coordinates of the ten sampling sites on the Iguassu River.

Table 2 .
Status of the dataset and measures of central tendency and dispersion.

Table 5 ,
next, presents the number of valid complete examples present on the database for each parameter, and the number of random examples which were excluded from the training dataset for validation.

Table 3 .
Set of input parameters used for training the models for each of the output parameters.

Table 4 .
Network architecture (number of nodes per layer) for the models of each of the output parameters.

Table 5 .
Number of valid examples and their separation in training and validation sets for each of the output parameters.

Table 6 .
Overall performance for each of the models implemented.

Table 7 .
Performance of past studies which predicted the concentrations for the output parameters.

Table 8 .
Number of filled values in each iteration for the output parameters.

Table 9 .
Total number of filled values by output parameter.

Table 10 .
Performance for each of the output models when each input parameter is fixed in its mean value.