Data mining in occupational safety and health: a systematic mapping and roadmap

Paper aims: This research presents a literature overview of data mining and machine learning applications in the area of occupational health and safety. Originality: A summary of the main insights obtained from the analysis of the systematic mapping is presented at the end, as well as a roadmap with recommendations for directing future research on the topic. Research method: This article carries out thorough descriptive research of the scientific literature on the topic through a systematic mapping covering the period between 2008 and 2019 and 12 scientific databases, which results in 68 selected records. Main findings: Around 84% of the selected records were of total significance for the research, with the majority of them being classified in the areas of civil construction and the steel industry. Implications for theory and practice: Through this study it is possible to understand how research on this theme has developed, as well as to point to guidelines for future studies. Another contribution is the indication of studies on the OSH 4.0 concept, based on full-time monitoring of workers.


Introduction
Market developments arising from the industrial revolutions have brought numerous changes inside companies in general, as well as in labor environments, conditions and legislation. At the beginning of industrialization, companies were not required to commit to the safety and health of workers. Over time, however, this issue came to be considered both an obligation of employers and a competitive strategy for organizations (Ciarapica & Giacchetta, 2009). Therefore, it is essential to study and understand the concepts related to occupational health and safety, presenting an effective and simplified approach for industrial applications (Sanni-Anibire et al., 2020).
The term occupational safety and health (OSH) refers to the mitigation and prevention of accidents and diseases that affect individuals as a result of the work they perform. This area requires attention since, according to the International Labour Organization (ILO), about 10 workers have an accident every second and 2.34 million employees die around the world each year due to accidents and occupational or professional diseases. Thus, health and safety are directly related to the social development of organizations and countries (Chen et al., 2020).

There are several approaches to discussing OSH-related factors. Yoon et al. (2013) link the emergence of the OSH management system to the United Kingdom in 1991, when a guide was developed to assist employees in improving health and safety in organizations. Hicks et al. (2016) present a study on the influence that occupational stress has on the safety environment of organizations, and Sanni-Anibire et al. (2020) assess risks in civil construction to improve workers' safety performance. Kang & Ryu (2019) apply data mining (DM) to accident data to predict and classify these episodes, associating accidents with climatic conditions.
Given the volume of data generated from accidents, diseases, deaths and occurrences related to workers' health and safety, DM and machine learning (ML) are fundamental resources for taking action in this area. DM is described as part of the KDD (Knowledge Discovery in Databases) process and is responsible for extracting patterns from the data (Fayyad et al., 1996). This definition is directly linked to ML, and the two terms are sometimes confused. ML, however, is usually more concerned with how the algorithm learns, which occurs from the data used for its training (Buczak & Guven, 2016).
Every day more data related to health and safety at work are generated, and the literature contains studies that combine these data with mining and machine learning techniques. Kakhki et al. (2019) compared the performance of four DM techniques on labor claims; Baghdadi (2018) used sensors to collect kinematic data from workers and evaluated them with DM techniques; and Siddula et al. (2016) described a safety analysis of construction sites using images of the place.
The application of mining methods to OSH data is also a tool to assist the managers of organizations. The monitoring of employees by a coordinator is one possibility for reducing accidents (Antwi-Afari et al., 2018; Yanar et al., 2019). Furthermore, evaluating the results and correlations produced by data mining supports management's strategic decision making (Bevilacqua et al., 2008; Del Pozo-Antúnez et al., 2018) and can be used to establish new policies aimed at workers' health (Comberti et al., 2018; Liao & Perng, 2008).
This study seeks to answer the following question: "How does data mining support decision-making in OSH?" The article aims to provide an overview of the literature through systematic mapping, highlighting primary and quality studies involving the combination of DM and OSH themes, and defining future directions of research based on the gaps found.

Materials and methods
Systematic Literature Mapping aims to classify and analyze the literature on a topic, providing a general view of the studies and their respective results. In this way, Systematic Literature Mapping guides the researcher toward a holistic view instead of just answering a question in detail (Kitchenham et al., 2011).
In terms of structure, Systematic Literature Mapping is arranged in three phases: Planning (Input), Execution (Processing) and Discussion of Results (Output). The Planning stage included the elaboration of the research protocol, which covered the research questions, the search string and search sources, the selection strategy, the extraction strategy and the quality assessment. The Execution stage consisted of performing the searches in the databases, followed by classification, ordering and quality assessment, while the Discussion stage presented the synthesis of the results obtained.
The research protocol was defined based on the guidelines presented by Dybå et al. (2007), Paternoster et al. (2014) and Petersen et al. (2015). Moreover, the protocol was evaluated by three experts, who suggested adjustments to the search string and the inclusion of additional databases. Microsoft Excel and Mendeley were used as support tools to conduct the systematic mapping.
Five questions related to the objective were defined to guide the extraction of information from each selected article: (i) What kind of OSH data are explored? (ii) What types of DM tasks, techniques and tools are used? (iii) What industrial activity sector is explored in the research? (iv) Which OSH database was used? (v) Does the study use OSH data in a way related to other information?
The search strategy involves the definition of the sources to be used. Twelve sources were listed for the searches, as shown in Table 1, with their respective results after the first search. As the study addresses a multidisciplinary theme, the data sources chosen are linked to different areas, some focusing on health and others more related to research in computing and information technology. Three groups of words were fixed: one related to health and safety, another related to work, and the last one related to DM. From the combination of the key terms, the developed string is: (injury* OR health* OR safety* OR accident*) AND (work* OR labour* OR labor* OR occupational) AND ("data mining"* OR "machine learning"*).
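The Boolean structure of this search string (an AND of three OR groups) can be illustrated with a minimal Python sketch. It assumes that titles or abstracts are available as plain strings and approximates the database wildcard `*` with a regular-expression prefix match; the function name `matches_search_string` and the helper names are hypothetical, and each real database applies its own query syntax.

```python
import re

# The three term groups, written as in the search string ("*" = prefix wildcard).
TERM_GROUPS = [
    ["injury*", "health*", "safety*", "accident*"],
    ["work*", "labour*", "labor*", "occupational"],
    ["data mining*", "machine learning*"],
]


def _term_to_regex(term):
    """Translate one search term into a regex fragment."""
    if term.endswith("*"):
        # Wildcard: the term may be followed by any word characters.
        return re.escape(term[:-1]) + r"\w*"
    # No wildcard: the term must end at a word boundary.
    return re.escape(term) + r"\b"


def _group_pattern(terms):
    """Compile a case-insensitive pattern that matches any term of one group."""
    alternatives = "|".join(_term_to_regex(term) for term in terms)
    return re.compile(r"\b(?:" + alternatives + ")", re.IGNORECASE)


GROUP_PATTERNS = [_group_pattern(group) for group in TERM_GROUPS]


def matches_search_string(text):
    """Return True when the text contains at least one term from every group."""
    return all(pattern.search(text) is not None for pattern in GROUP_PATTERNS)
```

Under these assumptions, a record is retained only when all three groups are represented, which mirrors the AND between the health and safety, work, and DM word groups.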
The inclusion criteria adopted were: published from 2008 onward; English-language publications; and evidence of OSH in the DM application. The exclusion criteria were: not peer reviewed; no evidence of OSH in the DM application; does not respond satisfactorily to the research questions; and, in the case of duplicate articles, only the most complete one is kept. Figure 1 highlights the results obtained in conducting the systematic mapping. According to Kitchenham (2004), evaluating the quality of the chosen records makes it possible to assess the importance of each article and facilitates the interpretation of its results. Therefore, to guide the quality assessment in this research, 11 questions were answered on a binary scale ("yes" or "no") (Dybå et al., 2007).
Each question is related to one of two categories used to classify the articles: rigor and relevance. The rigor category, represented by eight questions, is related to the research methods used, answering whether the approach was complete and covered all important aspects of the research. Relevance, represented by two questions, evaluates whether the results are clearly described and significant for the research. The last question also considers whether the research is important in the academic and industrial scenario (Dybå et al., 2007).
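The binary quality assessment can be sketched in a few lines of Python. This is only an illustration: it assumes that the first eight answers belong to rigor and that the final question is counted together with the two relevance questions, which would be consistent with the maximum marks of 8 and 3 reported in the results. The function name `score_record` and the sample answers are hypothetical.

```python
RIGOR_QUESTIONS = 8
# Assumption: two relevance questions plus the final question on
# academic and industrial importance.
RELEVANCE_QUESTIONS = 3


def score_record(answers):
    """Return (rigor, relevance) for a list of 11 yes/no answers.

    `answers` holds booleans in question order: rigor questions first,
    then the relevance questions.
    """
    expected = RIGOR_QUESTIONS + RELEVANCE_QUESTIONS
    if len(answers) != expected:
        raise ValueError(f"expected {expected} answers, got {len(answers)}")
    rigor = sum(bool(a) for a in answers[:RIGOR_QUESTIONS])
    relevance = sum(bool(a) for a in answers[RIGOR_QUESTIONS:])
    return rigor, relevance


# Hypothetical record: seven "yes" answers in rigor, two in relevance.
example_answers = [True] * 7 + [False] + [True, True, False]
print(score_record(example_answers))
```

Each record then reduces to a pair of marks that can be placed on a rigor versus relevance grid.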

Results
Based on the 68 articles selected for the research, it is possible to carry out a descriptive analysis of the records. All the selected articles are detailed in Appendix A. The first analysis, represented by Figure 2, is related to the year in which the articles on this topic were published.
It is possible to observe an increase in the number of studies published over the years. The number of publications, which was stable until 2014, grew in the following years, and in 2018 it was more than twice that of the previous years. This may be associated with the growth of research in DM. The last two years are therefore representative and responsible for approximately 40% of the total sample of selected records, demonstrating the topicality, importance and relevance of this theme. Accidents are the main focus of the studies chosen, except for the years 2012 and 2013, when diseases represented the main interest.

Information extraction by the research questions
Regarding the data used in the chosen research studies, addressed by the first question, most of them represent typical accidents that occurred in industry, with emphasis on specific accidents such as slipping, stumbling and falling (SSF) (Nenonen, 2013; Sarkar et al., 2019b), diseases (Krishna et al., 2015) and occupational injuries (Ciarapica & Giacchetta, 2009), or studies that focus on cases of death due to occupational accidents (Ruso & Stojanović, 2012; Shin et al., 2018; Shirali et al., 2018).
The search resulted in 1424 articles, from which 474 duplicate papers were excluded, reducing the sample to approximately 67% of the initial amount. After the first filtering, 236 records remained; with the second filtering, the number dropped to 105; and after the final selection there were 68 remaining articles. These selected articles are related to their original search sources in Table 1. In cases where the same article appeared in different sources, only one occurrence was considered.
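The selection funnel can be checked with a short calculation. The sketch below uses only the counts stated above; the stage labels and the helper name `retention` are illustrative.

```python
# Record counts reported for each stage of the systematic mapping.
INITIAL = 1424
DUPLICATES = 474

stages = [
    ("initial search", INITIAL),
    ("after duplicate removal", INITIAL - DUPLICATES),
    ("after first filtering", 236),
    ("after second filtering", 105),
    ("final selection", 68),
]


def retention(stages):
    """Return (stage, count, percentage of the initial count) for each stage."""
    initial = stages[0][1]
    return [(name, count, 100 * count / initial) for name, count in stages]


for name, count, share in retention(stages):
    print(f"{name:<26}{count:>6}{share:>8.1f}%")
```

Removing the 474 duplicates leaves 950 records, which is the roughly 67% of the initial amount mentioned in the text.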
To classify the research, four aspects were created, each one with its respective categories, as shown in Table 2, based on the study by Paternoster et al. (2014). Other studies use benefit claim data sets to analyze absences from work (Bertke et al., 2012; Kakhki et al., 2019). Some of them use interviews with employees (Del Pozo-Antúnez et al., 2018; Zhao et al., 2019) or ergonomic tests to analyze the activities performed (Baghdadi, 2018; Zhao et al., 2019). In the machine learning area, there are articles that use sensors to assess the worker (Xie & Chang, 2018), as well as photos and videos to detect the use of protective equipment and work postures (Rubaiyat et al., 2016; Shein et al., 2015; Siddula et al., 2016).
Regarding the second question, the tasks, techniques and tools used in the selected records were analyzed to characterize DM in OSH. To list the tasks used in DM, four types of tasks were considered: three associated with supervised learning (association, classification and regression) and one (clustering) linked to unsupervised learning.
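Because a single study may apply more than one of these four tasks, a tally has to count both individual tasks and combinations of tasks. A minimal sketch of such a tally follows; the study identifiers and their task sets are hypothetical and are not taken from the mapped records.

```python
from collections import Counter
from itertools import combinations

TASKS = {"association", "classification", "regression", "clustering"}


def tally_tasks(studies):
    """Count individual tasks and pairs of tasks across studies.

    `studies` maps a study identifier to the set of DM tasks it uses.
    """
    single = Counter()
    pairs = Counter()
    for study_id, tasks in studies.items():
        unknown = set(tasks) - TASKS
        if unknown:
            raise ValueError(f"{study_id}: unknown task(s) {sorted(unknown)}")
        single.update(tasks)
        # Sorting makes each pair order-independent.
        pairs.update(combinations(sorted(tasks), 2))
    return single, pairs


# Hypothetical records used only to exercise the function.
example_studies = {
    "S1": {"classification"},
    "S2": {"classification", "clustering"},
    "S3": {"classification", "regression"},
}
single_counts, pair_counts = tally_tasks(example_studies)
print(single_counts)
print(pair_counts)
```

Counting pairs separately keeps the per-task totals comparable while still showing which tasks tend to be combined in the same research.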
Some studies describe more than one task in their scope; these are classified according to the individual occurrences of each task and also according to the combination of two tasks in the same research. Overall, most studies encompass the classification task, which appears in 78% of the chosen records. Clustering and association tasks are described in 15 and 8 studies, respectively. The least used task is regression, found in only five works, which represents about 7% of the selected sample; a single study uses it as the only task. Figure 3 shows the distribution over time of the tasks used.
As for the techniques, those used in the highest number of studies were decision tree, Support Vector Machine (SVM), naive Bayes and neural networks. Algorithms associated with decision tree, the most used technique, were observed in 20 studies, followed by SVM, applied in 17 articles. Algorithms associated with naive Bayes and neural networks were each observed in 14 studies. Other studies, in turn, presented the use of more
Considering the tools, their use was not clearly described in 31 of the articles studied. Only eight tools were used in more than one study (MATLAB, Weka, Clementine, Statistica Data Miner, Python, R, SAS and TextMiner), whereas the others were used in only one study. Thus, for most of the records chosen for the systematic mapping, the software was either undefined or used only once.
In relation to the third question, on industrial sectors, most of the selected studies were in the areas of civil construction and the steel industry. Civil construction is the most researched sector, present in 19 articles, and the steel sector is explored in eight articles. Also represented are the healthcare sector (five studies), mining and petrochemicals (four), the administrative and timber sectors (three) and the agriculture category (two). In 14 studies the sector could not be categorized, since it was not specified or the data were described as a general data set.
Regarding the databases (fourth question), the analysis of the selected records showed that some data sets are publicly available and that the most frequently used source was the Occupational Safety and Health Administration (OSHA), present in five studies, with information from the United States and South Korea.
The Council of Labor Affairs (Executive Yuan) of Taiwan and the Istituto Nazionale Assicurazione Infortuni sul Lavoro (INAIL), in Italy, come in second place, each present in three studies. The data sets of the Spanish Ministry of Employment and Social Security and of the Occupational Health Center of Presidente Prudente are also noteworthy, as each was used in two studies. The other data sets are considered in only one study. Half of the studies rely on internal company data sets, tests carried out specifically for the article, or data whose origin was not informed or specified.
To answer the last question of the research protocol, which considers the relationship between the selected studies and other information, some articles presenting other aspects related to OSH can be highlighted. Some of them associate the occurrence of accidents with climatic conditions in the workplace (Bohanec & Delibašić, 2015), mainly for events in the construction industry (Kang & Ryu, 2019; Liao & Perng, 2008).
The costs that companies or governments incur with workers' health and safety are addressed in five of the selected articles (Cheng et al., 2012; Kakhki et al., 2019; Meyers et al., 2018; Olsen et al., 2009; Shin et al., 2018). The difficulty in finding research that addresses OSH costs represents a gap in the literature, which means that more studies need to be developed.

Record classification and quality evaluation
The selected articles were stratified according to the aspects and categories defined in section 2, and the results of the classification are shown in Figure 4. Rigor represents the precision of the study in its research. This figure shows a concentration of studies in its upper right corner, which indicates that they are characterized by high rigor and relevance/credibility. The higher results are represented by twenty-six studies that scored 7 in rigor and 2 in relevance, and eleven studies that scored 7-3. Low results are represented by one study with marks 4-3, one with 5-1 and three with 5-2. Finally, eight studies obtained the highest marks in both aspects (8-3).
The eight studies that got the maximum marks, thus considered the most rigorous with the most relevant themes, are represented by Abad et
The data used may have different characterizations, as some correspond to claims for benefits after an accident or disease at work (Bertke et al., 2012; Kakhki et al., 2019; Gross et al., 2013; Meyers et al., 2018). Other studies focus on accident data sets in general (Bevilacqua et al., 2008; Sarkar et al., 2019a), on specific accidents such as slipping, stumbling and falling (SSF) (Nenonen, 2013; Sarkar et al., 2019b) and roof fall (Mistikoglu et al., 2015), or on major accidents, which injure at least three people or cause one or more deaths (Cheng et al., 2013).
Regarding cases of death, some data sets contain both fatal and non-fatal accident cases (Jocelyn et al., 2018; Mistikoglu et al., 2015; Shin et al., 2018; Shirali et al., 2018), but there are also cases in which the set contains only fatal accidents (Ruso & Stojanović, 2012). There are also studies that show the severity levels of occurrences (Shirali et al., 2018), considering more and less serious accidents, or only less serious cases of occupational diseases (Krishna et al., 2015). Another analysis that can be performed concerns the industrial sector of the workers affected by accidents or occupational diseases. Most research studies are concentrated in the construction industry and at different levels of this scenario, such as landfill preparation steps before the work starts (Gerassis et al., 2017), construction dockyards (Kang & Ryu, 2019) or construction sites (Shin et al., 2018).
Considering the DM techniques used, some studies analyze similar data using different techniques. For instance, two studies used data sets of accidents in civil construction, but with different techniques: decision tree and association rules, respectively. Related to this, a large number of articles present the decision tree technique and its specific algorithms, such as C4.5 (Gross et al., 2013; Sanmiquel et al., 2015; Shein et al., 2015) and C5.0 (Hajakbari & Minaei-Bidgoli, 2014; Mistikoglu et al., 2015; Sarkar et al., 2018, 2019c).
Concerning the application of techniques, there are many tools that perform and support DM methods. Some tools are easy to handle, with ready-made internal packages and part of their programming already developed, such as Weka (Gerassis et al., 2017; Jocelyn et al., 2018; Pekel et al., 2018; Sanmiquel et al., 2015; Waghmare & Pai, 2013) and (Heo et al., 2019; Sanmiquel et al., 2018). Other studies, in turn, use the Python language to write the DM code (Goh & Ubeynarayana, 2017; Heo et al., 2019; Marucci-Wellman et al., 2017), which is more complex to run. Some software programs have functions dedicated to a specific type of information, such as TextMiner for textual data sets (Nanda et al., 2016; Taylor et al., 2014).
As for the types of data that can go through the mining process, this sample includes research studies that use images (Rubaiyat et al., 2016; Siddula et al., 2016) and videos (Paliyawan et al., 2014; Ueno et al., 2008), which are directly associated with the concept of machine learning and automation in the worker's environment. Other studies use textual data sets in the form of injury reports (Tixier et al., 2017), accident reports (Liao & Perng, 2008; Sarkar et al., 2016) and medical examinations (Bonneterre et al., 2012).
Some research studies also use accident data related to other information to generate associations or enrich the models developed. Some seek to associate accidents with a certain location (Rashid et al., 2017; Valêncio et al., 2011), while others associate the sources of injuries, accidents and deaths with climatic conditions (Bohanec & Delibašić, 2015; Kang & Ryu, 2019; Liao & Perng, 2008). Figure 5 shows the relationship between data, focus and type of learning.
As can be seen, 36 (52.94%) of the 68 articles studied referred to Typical Work Accidents, 20 (29.41%) were classified as Occupational Diseases, and 6 (8.82%) were categorized as Typical Work Accidents and Fatalities. Of the remaining studies, 4 (5.88%) were classified simultaneously as Typical Work Accidents and Occupational Diseases, and 2 (2.94%) as Typical Work Accidents, Occupational Diseases and Fatalities.
Regarding the Focus of the 36 studies classified as Typical Work Accidents, 18 (50.00%) were classified as Variance Analysis of Processes, 11 (30.56%) as Predictive Monitoring, 6 (16.67%) as Compliance Check and 1 (2.78%) as both Variance Analysis of Processes and Predictive Monitoring.

Roadmap for recommending future research
From the systematic mapping of the literature in the context of DM and OSH, it was possible to build a roadmap for future research in the area. Future research is based on data on accidents at work published by the national and international agencies responsible for monitoring the financial, economic and productive impacts of such events. It is also shaped by the type of method used for DM and the type of results that the research will present to society. Based on these paths, researchers and managers in the area can identify the studies that have already been carried out and the possibilities and combinations for new research. The characteristics of the roadmap (data, method and results) are validated based on the results described in the previous sections and are presented, together with their subdivisions, in Figure 6, which allows the studies identified in the systematic mapping to be classified.
The quality, continuity and reliability of the databases selected for future research are critical attributes for research in the area, since they increase the impact, evaluation and improvement of the preventive actions or measures that can be adopted in health and safety policies or programs at work. The data can be separated into public, private and test data. Public data are those made available through organizations such as the Taiwan Labor Affairs Council, INAIL in Italy and the Spanish Ministry of Employment and Social Security.
The use of public data made available by agencies can result in benefits for occupational safety and health. These advances are related to the creation of policies and the implementation of prevention programs, as well as the development of indexes that allow the accident scenario to be monitored and that assist in decision making. Some of these benefits can be found in the results of Cheng et al. (2012) and Marucci-Wellman et al. (2017), through proposed actions, or in Del Pozo-Antúnez et al. (2018), with a suggestion of organizational changes. Another advantage of public data is their applicability in different contexts: private data, in most cases, concern specific industrial sectors, whereas public data are more comprehensive.
Private data are provided by a specific company, such as data from an insurance company (Kakhki et al., 2019), the petrochemical industry (Bevilacqua et al., 2008) or accidents at a ski resort (Bohanec & Delibašić, 2015). Using these data allows industrial sectors with lower rates to be studied in depth and in detail. Besides, private data sets can contain more specific information than public data in general. However, data selection can be a difficult stage for both public and private organizations. Sometimes the data are not updated or take time to be published, or there are obstacles on the part of organizations, which are apprehensive about the exposure of internal and negative information.
Tests are observations made specifically for the study, such as ergonomic simulations (Antwi-Afari et al., 2018; Olsen et al., 2009) and risk situations of workers in their workplace (Rashid et al., 2017). Specific cuts can also be made in these data, such as temporal cuts, specific types of accidents, a prioritized industrial sector or other divisions that the researcher wishes to investigate in order to define preventive actions and specific programs for companies (Paliyawan et al., 2014; Sarkar et al., 2019b; Taylor et al., 2014).
As for the method used to apply data mining, the researcher may choose a single technique to mine the data, as in the research by Cheng et al. (2013), in which only the decision tree technique was used. Alternatively, two or more techniques can be used at different stages of the data mining process, as described in the research by Sarkar et al. (2019c), with the random forest technique for data preprocessing, SVM and artificial neural networks (ANN) in a second step, and decision tree to complete the data mining. Another methodological possibility is to compare two or more techniques, analyzing which one achieves the best result for the chosen data set, as in Kakhki et al. (2019), who compared the performance of five classification algorithms (linear and quadratic SVM, RBF kernels, Boosted Trees and Naïve Bayes).
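The comparison strategy can be sketched with a small, dependency-free Python example. It is not the procedure of any cited study: it compares a majority-class baseline with a categorical naive Bayes classifier using k-fold cross-validation, and the toy accident records, the class names and the function `cross_val_accuracy` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class MajorityClass:
    """Baseline: always predicts the most frequent training label."""

    def fit(self, rows, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.label_


class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with add-one smoothing."""

    def fit(self, rows, labels):
        self.class_counts_ = Counter(labels)
        self.total_ = len(labels)
        self.feature_counts_ = defaultdict(Counter)  # (label, index) -> value counts
        self.values_ = defaultdict(set)              # index -> values seen in training
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.feature_counts_[(label, i)][value] += 1
                self.values_[i].add(value)
        return self

    def predict(self, row):
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts_.items():
            score = math.log(count / self.total_)  # log prior
            for i, value in enumerate(row):
                seen = self.feature_counts_[(label, i)][value]
                score += math.log((seen + 1) / (count + len(self.values_[i])))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def cross_val_accuracy(model_factory, rows, labels, k=3):
    """Mean accuracy over k folds, each built by striding through the data."""
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(rows), k))
        train_idx = [i for i in range(len(rows)) if i not in test_idx]
        model = model_factory().fit(
            [rows[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        hits = sum(model.predict(rows[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Hypothetical toy records: (weather, protective equipment) -> severity.
rows = [("rain", "no_ppe"), ("rain", "ppe"), ("dry", "no_ppe"), ("dry", "ppe")] * 3
labels = ["severe", "minor", "severe", "minor"] * 3

for name, factory in [("majority", MajorityClass), ("naive Bayes", CategoricalNaiveBayes)]:
    print(name, cross_val_accuracy(factory, rows, labels))
```

Evaluating every candidate on the same folds keeps the comparison fair; in practice the candidates would be the algorithms under study rather than these two simple models.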
Selecting different techniques and comparing the results of their application allows the researcher to decide on the best mining strategy based on the performance of the algorithms, which makes the choice more reliable. Using several methods at different stages of the process makes data mining more robust and also more reliable, since the model will be composed of different algorithms with specific focuses. In both cases, the methods allow for a more accurate result, which is reflected in better performance of accident prevention and mitigation systems. On the other hand, because these processes are more complex and involve more steps, they can require more time for planning and execution, demanding more investment and human resources to analyze research problems.
Regarding the results generated by the studies, they can be defined by two complementary approaches. The first is to present a visualization and analysis of the data, and the second is to complement the analytical step with preventive actions or tools proposed on the basis of the results found during the research. The first case is commonly found in studies that present only the analyses carried out from the mining results, for example Ciarapica & Giacchetta (2009), Bohanec & Delibašić (2015) and Zhao et al. (2019).
The presentation and analysis of data mining allows us to understand and evaluate the method used, its results and the research scenario, but it is possible to go further. Some authors, besides presenting the analytical stage, also highlight actions that can be taken and present tools that can be used in this context. This is the case of Cheng et al. (2013), who, based on the associations made by the algorithm, presented actions and areas of attention to reduce accidents in the industrial sector. Hajakbari & Minaei-Bidgoli (2014), in addition to presenting a guide and locations for future inspections based on the data used, also present a scoring system to prioritize the workplaces to be inspected, showing all stages so that other researchers can use this tool in other contexts.

Threats to validity
Threats to validity exist in all empirical studies (Petersen et al., 2015). The construction and definition of the search string is one of the main challenges in defining the protocol. To minimize threats, the search string

Figure 1. Flowchart of the systematic mapping execution process.

Figure 2. Temporal distribution and interest of the selected articles.

Figure 3. Temporal distribution of tasks used in the articles.

Figure 4. Evaluation of the rigor and relevance criteria.

Figure 5. Relationship between Data, Focus and Type of Learning.

Figure 6. Roadmap for data mining in OSH.

Table 1. Relationship between researched data sources and their results.

Table 2. Aspects and categories to classify the studies.