Modeling Bayesian Networks from a conceptual framework for occupational risk analysis

Occupational risk is the possibility that some element of a particular work environment can harm someone's health. Risk is therefore understood as the product of probability and consequences. In this sense, probabilistic techniques such as Bayesian networks (BN) are an important tool for analyzing occupational risks. This article aims to show how BNs are being used in the field of occupational risk analysis and to develop a conceptual framework for the construction of BNs. To that end, a systematic review analogous to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement was performed. This made it possible to evaluate BN learning methods and model construction, and to propose a conceptual framework for implementing BNs in the analysis of occupational risks.

Taking into account the relationship between risks and their predictor variables, one method that has emerged as an option for the analysis of occupational risks is the Bayesian network (BN). This is a multivariate probabilistic inference method used in many areas of knowledge, mainly because it supports the resolution of complex problems with a robust technique, allows probability values to be calculated from the causal relationships between safety factors, and allows information to be extracted from data sets (Abdat et al., 2014; Arlinghaus et al., 2012; Tighe et al., 2013; Valdés et al., 2011; Wang et al., 2012; Weber et al., 2012).
The assessment of occupational risk therefore depends on a prior understanding of the risk scenario, an understanding that is uncertain and open to revision in the light of new evidence. This idea is closely aligned with the Bayesian perspective. Thus, as a theoretical essay, this article aims to show how BNs are being used in the analysis of occupational risks and to develop a conceptual framework for the construction of BNs in that field.

Stage I: building the theoretical framework
First, a systematic review of the literature was performed following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement. This method consists of a practical guide containing a checklist of items considered essential for structuring systematic reviews and meta-analyses (Liberati et al., 2009). The literature review was conducted in 2015 in the digital databases Web of Science, Scopus and Science Direct, using as keywords the terms Bayesian Network, Risk Analysis, Occupational risk and Workplace. Based on this protocol, the procedures for selecting the studies and articles that apply BNs to occupational risk analysis are presented in Figure 1.

Stage II: the construction of the BNs
Following the protocol and the flowchart of Stage I, the structure of a Bayesian network was characterized, observing the possible causes that arise from the relationships among the factors involved in occupational risks.

Stage III: structural proposal of application of BNs
After Stages I and II were completed, the particular features of occupational risk analysis through BNs were extracted from the selected studies. The BN learning methods were then evaluated and, drawing on BN concepts and the structure of the directed acyclic graph (DAG), a methodological model was built for applying BNs to the evaluation of occupational risks. It comprises the following procedures: Identification of the Model's Purpose (IMP); Ordination of Data (OD); Model Construction (MC); and Validation of the Model (VM).

Stage I: building the theoretical framework
Initially 647 articles were identified, of which 24 were selected after applying the inclusion criteria (Table 1). Most studies applying BNs to occupational risk analysis were in the industrial and transportation sectors. The most relevant journal for this study was "Safety Science", which contributed six articles, as shown in Table 2. Table 3 classifies the articles according to the productive sector in which the BN is applied and the features of model construction, considering information on occupational risk factors and data-processing procedures.

Stage II: the construction of the BNs
The relationships between random variables guide the use of BNs in identifying occupational risk. Here the variables are risk factors, and their relationships depend on how they occur in the workplace. The Bayesian technique begins by formalizing a prior distribution that reflects the researcher's beliefs about the subject. Combined, via inference in the BN, with the information actually obtained about the domain studied, it yields a posterior probability distribution from which conclusions can be drawn. A BN therefore has two basic components: a qualitative component, represented by a directed acyclic graph (DAG), formulated from prior knowledge of risk exposure, in which the dependency relationships between variables are established; and a quantitative component, which specifies the conditional probability distributions (Guerrero-Barbosa & Amarís-Castro, 2014).
Thus, a network is defined as BN = <G, Θ>, where G is the DAG whose nodes $X_1, X_2, \ldots, X_n$ represent random variables, and Θ denotes the set of quantitative parameters of the network. The BN thereby defines a unique joint probability distribution over the variables, given by Equation 1, where $\pi_i$ is the set of variables on which $X_i$ is conditionally dependent, that is, its parents in G (Aneziris et al., 2010; Friedman et al., 1997; Leu & Chang, 2013; Pearl, 1988; Schenekenberg et al., 2011).

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \pi_i) \tag{1}$$

Production, 27, e20162239, 2017 | DOI: 10.1590/0103-6513.223916
The structure of a DAG is defined by two elements: nodes and arrows. Each node in the graph represents a random variable, while the arrows between nodes represent the probabilistic dependence between the corresponding variables.
An arrow from a node $X_i$ to a node $X_j$ represents a statistical dependence between the variables. These connections can be simple, when only one path connects one variable to the other, or multiple, when several paths exist. The arrow indicates that $X_j$ depends on the value of $X_i$; $X_i$ is then referred to as the parent node and $X_j$ as the child node.
Because the graph is acyclic, no node can be its own parent or child, either directly or through a chain of arrows. Such a condition is essential to the reliability of the data. Consider, for example, the DAG shown in Figure 2, containing the variables $X_1$, $X_2$ and $X_3$, which may represent elements of the working environment that pose a danger to the worker. The next stage is to specify the prior probability distribution of each variable based on hypotheses, considering how risk propagates from its parent nodes. In this case, the distributions are $P(X_1)$, $P(X_2 \mid X_1)$ and $P(X_3 \mid X_1, X_2)$ (Griffiths & Yuille, 2006; Jordan, 1999; Pearl & Russel, 2001).
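The factorization implied by the three distributions above can be sketched in a few lines of Python. This is only an illustration: the variables are assumed to be binary, and every probability value below is hypothetical rather than taken from the reviewed studies.

```python
from itertools import product

# Hypothetical binary risk factors (True = hazard present).
# All numbers are illustrative assumptions, not data from the article.
p_x1 = {True: 0.2, False: 0.8}            # P(X1)
p_x2_given_x1 = {True: 0.7, False: 0.1}   # P(X2 = True | X1)
p_x3_given_x1_x2 = {                      # P(X3 = True | X1, X2)
    (True, True): 0.9,
    (True, False): 0.5,
    (False, True): 0.4,
    (False, False): 0.05,
}


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(x1, x2, x3):
    """P(X1, X2, X3) = P(X1) * P(X2 | X1) * P(X3 | X1, X2)."""
    return (p_x1[x1]
            * bernoulli(p_x2_given_x1[x1], x2)
            * bernoulli(p_x3_given_x1_x2[(x1, x2)], x3))


# A valid factorization sums to 1 over all 2**3 configurations.
total = sum(joint(*cfg) for cfg in product([True, False], repeat=3))
```

Checking that `total` equals 1 is a quick sanity test that the conditional tables were entered consistently.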
The Markov property governs the structure of the DAG: it encodes independence assumptions under which every variable $X_i$ is conditionally independent of its non-descendants given its parent nodes in the graph. This is used to reduce the number of parameters required to specify the joint probability distribution.
This reduction benefits inference, learning (parameter estimation) and computation, and the resulting model is more robust for determining effects, trend and variance (Aneziris et al., 2010; Friedman et al., 1997; Leu & Chang, 2013; Pearl, 1988; Schenekenberg et al., 2011). With the structure of the BN defined and the network parameters estimated, one can obtain the posterior probability distribution and, from it, draw inferences about the occupational risk (Delcroix et al., 2013; Lee et al., 2013).
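As a minimal sketch of how a posterior is obtained once structure and parameters are fixed, consider a hypothetical two-node network in which a training deficiency (T) influences the occurrence of an accident (A). The node names and all probabilities are assumptions made for illustration.

```python
# Hypothetical network: Training deficiency (T) -> Accident (A).
p_t = 0.3                                  # prior P(T = True), assumed
p_a_given_t = {True: 0.20, False: 0.05}    # P(A = True | T), assumed


def posterior_t_given_a(p_t, p_a_given_t):
    """P(T = True | A = True) by Bayes' theorem."""
    numerator = p_a_given_t[True] * p_t
    evidence = numerator + p_a_given_t[False] * (1.0 - p_t)
    return numerator / evidence


post = posterior_t_given_a(p_t, p_a_given_t)
```

Under these assumed values, observing an accident raises the probability of a training deficiency above its prior of 0.3, which is the kind of update the posterior distribution expresses.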

Stage III: structural proposal of application of the BNs
Occupational risk analysis involves a complex, multidimensional system in a setting where there is regularly uncertainty about, or missing information on, some of the risk factors involved. Because BNs model a problem through probability distributions under uncertainty, they are appropriate for modeling occupational risks, as they allow inference from hypotheses and from the relationships between risk factors. Moreover, risk changes over time (among other aspects), and modeling through BNs accommodates this, as it simulates stochastic processes (Englehardt et al., 2003; Hsu et al., 2014; Puncher et al., 2013; Xing et al., 2013).
Once the structure of the network has been identified, probabilistic and computational calculations infer the quantitative parameters of the model, possibly incorporating the experts' theoretical knowledge of the risk scenario for the specific problem. In this way one can reach a consistent estimate of exposure to occupational risk, with possible biases attenuated (Lee et al., 2013; Englehardt et al., 2003). Hypotheses and assumptions about the interactions of risk factors therefore underlie the BN approach. If the data are not treated methodically, the validity of the model can become questionable and contaminated by personal beliefs, although the use of software mitigates this.
Thus, building a BN model requires prior subjective knowledge. Together with the fact that the algebra of Bayesian analysis is more complex than that of classical analysis, especially in multidimensional problems, this has led to BNs being little used for occupational risk analysis (Aguilera et al., 2011). Based on the results of Stages I and II, a methodological flow was built for applying BNs to the evaluation of occupational risks, with the following procedures: IMP; OD; MC; and VM.

Identification of the model's purpose (IMP)
Initially, the construction of a BN requires characterizing the risk scenario in terms of the relationships between variables and the nature of those relationships. To describe the risk scenario, it is necessary to understand the sequence of events that leads to failure, where failure is understood as the undesired effect that leads to an accident or to harm to health.
This makes it possible to understand how a harmful event propagates, to recognize the trigger event (the event that starts the chain of harmful events), and to identify the hypotheses and assumptions governing the relationships between occupational risk variables (García-Herrero et al., 2012a; Hänninen et al., 2014; Wang et al., 2011). The assumptions express prior knowledge of the domain studied and can be formulated from previous studies, from expert knowledge or through predictive probability distributions (Hamra et al., 2014).
Knowing how the work is performed allows the identification of the variables involved in the sequence of events that leads to failure, which become nodes in the DAG of the BN. Most of the studies built their models from organizational characteristics, especially in the industrial and transport sectors, followed by the variables "work demands" and "ergonomics". Risk factors adopted as nodes in the DAG include the worker's experience, task duration, type of training, knowledge of safety regulations, risk perception, seat belt use and incorrect posture, among others (Martín et al., 2009).
The characteristics of the research determine how data are acquired at this stage. The most widely used method has been collection through questionnaires, favored because it is simple and fast and captures large amounts of information in a short period of time (Chatterjee, 2014; Rivas et al., 2011; Taylor et al., 2014).

Ordination of data (OD)
The complexity of the model grows exponentially with the number of variables and of relationships between them, so more robust models require more accurate and effective data analysis techniques. Attention should therefore be paid to selecting the number of variables and identifying the types of relationship between them. BNs can handle continuous or hybrid data, but this requires specific algorithms or else discretization of the data, with a consequent loss and generalization of information (Aguilera et al., 2011; Chatterjee, 2014).
Each risk factor, represented by a node, has a finite number of mutually exclusive states (García-Herrero et al., 2013b). States are the possible conditions that each risk factor may assume (Wang et al., 2011). Often these states are binary, mainly because the greater the number of states, the greater the model complexity, as the probability distribution of each variable must be specified over its n states (Hänninen et al., 2014). Latent variables that influence the risk cannot be measured directly; structural equation modeling (SEM) is indicated to measure them indirectly (Chatterjee, 2014).
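The effect of the number of states on complexity can be made concrete by counting the entries of a node's conditional probability table. The sketch below assumes discrete nodes; the function name and the example sizes are illustrative only.

```python
from math import prod


def cpt_entries(n_states, parent_states):
    """Number of probabilities in the conditional probability table of a node.

    n_states      -- number of states of the node itself
    parent_states -- list with the number of states of each parent
    """
    return n_states * prod(parent_states)


# A binary node with three binary parents.
binary_case = cpt_entries(2, [2, 2, 2])       # 16 entries
# The same node and parents with five states each.
five_state_case = cpt_entries(5, [5, 5, 5])   # 625 entries
# A root node has no parents, so only its own states count.
root_case = cpt_entries(2, [])                # 2 entries
```

Moving from two to five states per variable multiplies the table size roughly forty-fold in this example, which helps explain the preference for binary states.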
If the volume of data is large and the BN becomes too complex, the network can be modularized: the activity is broken into sub-operations, each forming a subnet, and the causes and probabilities are identified for each part of the operation. The total risk is then the sum of the risks of each sub-operation (Khakzad et al., 2013b), which provides a more structured and accurate model (Carvalho & Chiann, 2013). These procedures underpin the formulation of the prior probability distribution of exposure to occupational risk, which is put to the test in subsequent phases.

Model construction (MC)
Once the risk factors (represented by the nodes) and the possible states of each node have been defined, one can establish a DAG that facilitates the estimation of the quantitative parameters and then obtain the posterior probability distributions (Aguilera et al., 2011; Wang et al., 2011). Many DAGs can be constructed, and many of them are equivalent, that is, they result in the same probability distributions and differ only in how complex the estimation of the quantitative parameters is.
Thus, the order in which the variables of the DAG are chosen may affect the shape of the network. To avoid errors that would make the network more complex, it is suggested to start with the risk factors that can be root nodes of the network (represented by trigger events), followed by the independent nodes, and to end with the leaf node, which represents the outcome (García-Herrero et al., 2012a).
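The ordering from root nodes to the leaf node corresponds to a topological order of the DAG. A minimal sketch using Python's standard `graphlib` module (Python 3.9 or later) follows; the node names and arrows are hypothetical and serve only to illustrate the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each key maps a node to the set of its parent nodes.
parents = {
    "training": set(),
    "experience": set(),
    "risk_perception": {"training", "experience"},
    "unsafe_act": {"risk_perception"},
    "accident": {"unsafe_act"},
}

# static_order() yields every node after all of its parents;
# it raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(parents).static_order())

# Root nodes have no parents; leaf nodes are nobody's parent.
roots = [node for node, ps in parents.items() if not ps]
all_parents = {p for ps in parents.values() for p in ps}
leaves = [node for node in parents if node not in all_parents]
```

Because the sorter rejects cyclic graphs, the same call also doubles as a check that the proposed structure is a valid DAG.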
In practice, the structure of the DAG or the quantitative parameters of the BN are often unknown and must be identified through learning methods. In artificial intelligence, learning is the assimilation of experience in the presence of uncertainty or variability, that is, the estimation of qualitative or quantitative parameters that remain open to updating. It may be carried out by an expert or, for very complex problems, by automatic learning methods applied to the available data on the problem (Liao, 2012; Martins & Maturana, 2013).
Expert learning can address the structure and/or the numerical parameters of the model. Structure learning is the most common in the literature, with examples such as Wang et al. (2013), Leu & Chang (2013), Wang et al. (2011) and Chen & Pollino (2012); in these, the expert evaluates and establishes the relational structure among variables. In parameter learning, the expert assesses the conditional probabilities of each variable, one by one, either directly or by weighing and adjusting results produced by computer programs (Hänninen et al., 2014).
When learning relies on experts, methods that help resolve subjectivity are recommended. One such method is the Dempster-Shafer (D-S) evidence theory, which makes it possible to combine evidence from different sources in order to reach a consensus that provides a strong measure of certainty. To identify the DAG that best supports the estimation of the model's quantitative parameters, a conditional independence test should be applied; if the previously stipulated DAG is not the most efficient, the network structure is modified using the mutual information between nodes (Zhao et al., 2012).
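To illustrate how D-S evidence theory merges expert opinions, the following sketch implements Dempster's rule of combination for two mass functions. The two experts, the hypotheses "high" and "low", and the masses assigned are hypothetical; the reviewed studies may apply the theory differently.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to its belief mass.
    Products of masses whose sets do not intersect count as conflict and
    are normalized away.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict")
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


HIGH, LOW = frozenset({"high"}), frozenset({"low"})
EITHER = HIGH | LOW  # mass left on "cannot tell"

# Hypothetical opinions of two experts about the risk level of a task.
expert_1 = {HIGH: 0.6, EITHER: 0.4}
expert_2 = {HIGH: 0.5, LOW: 0.3, EITHER: 0.2}
fused = combine(expert_1, expert_2)
```

In this example both experts lean towards "high", so the fused mass on "high" exceeds what either expert assigned individually, while the combined masses still sum to 1.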
On the other hand, if a node has several parents, the construction of the DAG, and hence the posterior probability distribution, becomes complex and imprecise. The decomposition method simplifies the identification of conditional independence and compensates for uncertainty: the probability associated with each parent node is calculated separately, and these values are then combined to obtain the conditional probability table. The decomposition method has been used to resolve uncertainties and to reveal nodes that could be hidden in the network (Wang et al., 2011).
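One common way to assemble a conditional probability table from probabilities assessed for each parent separately is the noisy-OR model. It is shown here as an assumed stand-in: the text does not state that the decomposition method of Wang et al. (2011) takes this exact form, and the parent names and probabilities are hypothetical.

```python
from itertools import product


def noisy_or_cpt(link_probs, leak=0.0):
    """Build P(child = True | parent states) from one probability per parent.

    link_probs[name] is the assumed probability that this parent alone
    triggers the child; `leak` covers causes not modelled in the network.
    Returns a dict keyed by tuples of parent states (False/True).
    """
    names = list(link_probs)
    cpt = {}
    for states in product([False, True], repeat=len(names)):
        p_child_absent = 1.0 - leak
        for name, active in zip(names, states):
            if active:
                p_child_absent *= 1.0 - link_probs[name]
        cpt[states] = 1.0 - p_child_absent
    return cpt


# Three hypothetical parents of an "accident" node.
cpt = noisy_or_cpt({"fatigue": 0.3, "no_training": 0.5, "poor_lighting": 0.2})
```

With three binary parents the expert supplies three numbers instead of eight, at the cost of assuming that the parents act independently on the child.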
Other techniques can be used to visualize hierarchies and the interconnection of variables. The human factors model can be applied to identify hierarchies in the genesis of accidents (Akhtar & Utne, 2014; Wang et al., 2011), and the Bow-Tie model is also indicated for identifying hierarchies, mainly because it adds data on the effect of safety and containment barriers on the accident (Khakzad et al., 2013a, b). The choice of automatic learning method for the quantitative parameters, however, varies according to the knowledge obtained from the data, which means knowing the initial event and its consequences and discovering the dependency relationships between the variables.
The structure may be known or unknown, and the data may be complete or incomplete. When all variables are observed, the data are said to be complete; otherwise they are incomplete. Four cases of learning in BNs are therefore considered, for which different learning methods are proposed (García-Herrero et al., 2013a). There are specialized algorithms for this purpose, although they are rarely explained in scientific communications; in many cases the algorithm used is not known, because it is part of a commercial statistical software package.
Most scientific communications do not state the algorithm used, so this aspect was not investigated in this article. There are countless possible algorithms for complex problems, but the literature shows no consensus on a standard algorithm for each case (Aguilera et al., 2011). Some options, based on the knowledge acquired from the data and on the network structure, are shown in Table 4 (Ben-Gal et al., 2005).
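For the simplest of the four cases, a known structure with complete data, parameter estimation can reduce to counting relative frequencies. The sketch below assumes a hypothetical two-node structure (training_ok -> accident) and invented records; it is not drawn from Table 4 or from any of the reviewed studies.

```python
from collections import Counter

# Hypothetical complete records: (training_ok, accident) for each worker.
records = [
    (True, False), (True, False), (True, False), (True, True),
    (False, True), (False, True), (False, False), (False, False),
]


def estimate_cpt(records):
    """Relative-frequency estimate of P(accident = True | training_ok)."""
    totals = Counter(parent for parent, _ in records)
    hits = Counter(parent for parent, child in records if child)
    return {parent: hits[parent] / totals[parent] for parent in totals}


cpt = estimate_cpt(records)
```

With incomplete data or an unknown structure, counting is no longer sufficient and iterative or search-based algorithms are needed instead.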
The calculation of conditional probabilities is not simple, but software makes the modeling more accessible and viable. Two software packages have been used most: the Netica Bayesian network package (Norsys Software Corporation, 2007) and Hugin Researcher (Hugin Expert, 2014). Nevertheless, even with powerful computational tools and learning through computer algorithms, human judgment remains essential for a successful application of BNs in the analysis of occupational risk (Chen et al., 2009; García-Herrero et al., 2012a; Khakzad et al., 2013b).
Once the network structure and the model parameters have been determined, conclusions can be drawn. Some results obtained by applying BNs to occupational risk analysis are: (1) in the transport sector, human factors were the direct factors most responsible for accidents, and among the indirect factors the major influence was road conditions (Zhao et al., 2012); (2) in a study covering industry, services and civil construction, accidents at work had a strong direct relationship with the overall set of hygiene and ergonomic factors, indicating that poor working conditions have a significant effect on health and safety in companies (García-Herrero et al., 2012a); (3) in a study of occupational exposure to polycyclic aromatic hydrocarbons in the industrial sector, prolonged exposure resulted in a 74.4% chance of exceeding the acceptable exposure risk (Hsu et al., 2014).

Validation of the model (VM)
This step consists of checking the representativeness and robustness of the model and the accuracy of the resulting inference. When the resulting model passes a validation that demonstrates its reliability, the subjectivity inherent in the BN building process is reduced. If the model is found to be invalid, the data or the structure must be revised so that it can be tested again (Aguilera et al., 2011; Akhtar & Utne, 2014).
There are several validation methods, including comparison with similar publications and the application of expert knowledge. Cross-validation is used when the BN has a target variable, and sensitivity analysis identifies which variables have the greatest influence on the risk. At this stage, reliable methods are needed. In validation by expert belief, however, there is subjectivity, and the size and composition of the group influence the results (Aguilera et al., 2011; Akhtar & Utne, 2014; Wang et al., 2011).
Sensitivity analysis is indicated for validating the theoretical construction of the model and for estimating its accuracy, making it possible to determine the mutual relations between the variables. It refers to how sensitive a model is to changes in the input parameters, and can be performed by changing parameter values and monitoring the effects of these changes on the posterior probabilities (García-Herrero et al., 2012b; Ren et al., 2008; Wang et al., 2011).
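A one-parameter sensitivity analysis of the kind described can be sketched as follows. The network (Fatigue -> Accident), its probabilities and the finite-difference step are all assumptions chosen for illustration.

```python
def p_accident(p_fatigue, p_acc_given_fatigue=0.30, p_acc_given_rested=0.05):
    """Marginal P(accident) in a hypothetical Fatigue -> Accident network."""
    return (p_fatigue * p_acc_given_fatigue
            + (1.0 - p_fatigue) * p_acc_given_rested)


def sensitivity(model, base, delta=0.01):
    """Central finite-difference estimate of d(model) / d(parameter) at `base`."""
    return (model(base + delta) - model(base - delta)) / (2.0 * delta)


baseline = p_accident(0.2)                   # output at the assumed prior
slope = sensitivity(p_accident, base=0.2)    # change in output per unit change
```

Repeating the calculation for each input parameter and ranking the slopes indicates which risk factor the output is most sensitive to.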
For cross-validation, two data sets are required, one for constructing the model and the other for validating it (Akhtar & Utne, 2014). One option for applying sensitivity analysis with cross-validation is the ROC (Receiver Operating Characteristic) curve, a graphical performance criterion (García-Herrero et al., 2012a, 2013a). The ROC curve gives an indication of the reliability and error rate of the results obtained from the BN (García-Herrero et al., 2012a).
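The ROC criterion can be computed directly from the probabilities that a fitted BN assigns to a held-out validation set. The sketch below uses hypothetical scores and outcomes and a plain trapezoidal rule for the area under the curve; it assumes both outcome classes are present in the validation set.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    One point is produced per distinct score used as a threshold,
    starting from the origin (nothing predicted positive).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical validation set: predicted accident probability and outcome.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

An area close to 1 indicates that the model ranks observed accidents above non-accidents; an area near 0.5 indicates performance no better than chance.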
Some of the publications that apply BN modeling to the analysis of occupational risks do not state whether the model was validated. Among those in which it was, the technique usually adopted was sensitivity analysis, since this type of validation can indicate the most prevalent risk factor and thus support decision making (Ren et al., 2008; Wang et al., 2011). On this basis, a conceptual framework for inferring occupational risk with BNs as a statistical tool is presented in Figure 3.
Thus, for each phase, a methodological framework was constructed according to the IMP, OD, MC and VM. The interconnection between these phases results in a logical structure for evaluating occupational risks through BNs. For this research, BNs are appropriate because they can generate predictions or decisions even when no a priori knowledge is available. When such knowledge exists, these networks, in conjunction with Bayesian techniques, facilitate the combination of knowledge and data, encoding the strength of causal relationships as probabilities. Consequently, a priori knowledge and data are combined through Bayesian techniques and statistics.
Moreover, occupational risk analysis involves a complex and multidimensional system in a scenario where there is regularly uncertainty about, or missing information on, some of the risk factors involved. The model proposed in this research, built on the probability distributions under uncertainty that BNs provide, allows occupational risks to be modeled, since it enables inference from hypotheses and from the relationships between risk factors.

Conclusion
Occupational risks are inherent in the organization of work. Identifying and quantifying them provides information for more effective action to prevent accidents and/or minimize their effects. In this respect, probabilistic networks have become distinctive tools for determining risk, owing to their ability to provide data for monitoring as new information is added to or removed from their structure, and to incorporate various hypotheses. One of the major benefits of the Bayesian approach is thus the ability to incorporate prior information in order to quantify uncertainties and verify the legitimacy of propositions.
Furthermore, risk analysis through BNs is an important tool for analyzing occupational risks because risk changes over time (among other aspects), and modeling through BNs accommodates this, as it simulates stochastic processes.
Finally, constructing a Bayesian network model to analyze occupational risks according to the proposal presented in Figure 3 may help to (1) model data, bridging the gap between different types of knowledge and unifying all available knowledge in a single representation; (2) assimilate qualitative knowledge about accident factors through the structural connections of the BN; and (3) assimilate quantitative knowledge about the frequency of accidents, using the parameters obtained by the BN, which allows recurring scenarios to be extracted.

Figure 1. Flowchart of the procedures for the selection of publications.

Figure 2. The directed acyclic graph of a BN.

Figure 3. Model for the analysis of occupational risks with BNs.

Table 1. Number of articles after filtering.

Table 2. Number of articles by sector of application and journal.

Table 3. Number of articles by methodological characteristics of the BNs.

Table 4. Proposed learning methods according to the data.