Design of a geospatial model applied to Health management

Rev Bras Enferm [Internet]. 2019;72(2):420-6. http://dx.doi.org/10.1590/0034-7167-2018-0589

ABSTRACT
Objective: To identify geographically the beneficiaries categorized as prone to Type 2 Diabetes Mellitus, using the recognition of patterns in a database of a health plan operator, through data mining.
Method: The study was developed in three steps: the initial step, consisting of the information survey; development, consisting of building the process of extraction, transformation, and loading of the database; and deployment, consisting of presenting the geographical information through a georeferencing tool.
Results: The mapping of Paraná according to its health care network and the concentration of Type 2 Diabetes Mellitus is presented, enabling the identification of cause-and-effect relationships.
Conclusion: The analysis of georeferenced information, linked to health information obtained through data mining, can be an excellent tool for the health management of a health plan operator, contributing to the decision-making process in Health.
Descriptors: Health Care; Data Mining; Geographic Mapping; Supplementary Health; Chronic Disease.


INTRODUCTION
Epidemiology is the study of interrelationships between several determinants of disease frequency and distribution within a population. Such knowledge is fundamental in preventing and treating diseases, offering consistent elements for referrals in the Health field. The identification of patterns in the occurrence of diseases in human populations, in addition to the factors that influence and condition them, defines the study object of Epidemiology (1).
In London, in 1854, the incidence of cholera was regarded as stable and low; nonetheless, at a given time, cholera became a major epidemic, registering more than 500 fatal cases within about 12.5 hectares in a 10-day period. John Snow, considered by many the father of Epidemiology, developed at the time an observational study linking the cases geographically, identifying the central point, a water pump in the neighborhood of Soho, as responsible for the spread of the disease. Through the cause-and-effect relationship, after careful and intensive investigations, he concluded that the other hypotheses about the origin of the disease should be rejected, claiming that the water route was responsible for the transmission of Vibrio cholerae (1).
Another example of the association of geographical positioning with a cause-and-effect relationship is the Li-Fraumeni syndrome, a pathology that makes individuals more vulnerable to certain types of neoplasms, identified in the mutation of gene expression. It was observed that these cases were usually located in the South and Southeast regions of Brazil and, in particular, that the geographical points of the individuals with the syndrome followed the path of a Portuguese immigrant in the 18th century, which is today a proven theory of the genetic origin of the disease (2).
Currently, Brazil is going through a demographic transition due to the decreased infant mortality and increased life expectancy of the population, which went from 50 to 73 years, on average. Other factors also influence this scenario, e.g., the decreased fertility rate among women, which in the 1960s was on average six children and currently is less than two. These factors have a direct impact on the future national scenario, and it is estimated there will be 64 million older people in 2050, representing 29.7% of the Brazilian population (3).
With the increased life expectancy, one can see a change in the epidemiological scenario: the growth of non-communicable chronic diseases in the general population (4). Population aging has a direct impact on health spending, driven by the increasing proportion of older adults in the population and the growing use of assistance resources. The magnitude of the increase in health costs will depend on people's quality of life and on the existence or not of diseases and comorbidities (3). Given these scenarios, the World Bank underlines the importance of organizing the health systems to adapt to the new epidemiological profile, stating that health promotion and disease prevention will continue to be the greatest challenges for the sector.
In the field of supplementary health insurance, health plans face a challenge regarding their sustainability. This task has been hampered by legal constraints posed by the National Agency of Supplementary Health (ANS), which impact directly on their costs: contract adjustments are limited, while medical-hospital inflation, measured by the Variance of Medical-Hospital Costs (VCHM), accumulated 20.4% in the 12 months to December 2016 (5), an index much higher than general inflation measured by the Extended Consumer Price Index (IPCA), which for the same period was 6.29% (6). In addition, every two years new and expensive technologies and procedures are introduced, extending the list of mandatory coverage services for the operators of health plans.
Given this context, it is necessary to use applications that enable quick and intelligent measures to minimize the direct costs of the operation, seeking a balance between customer satisfaction and the service providers since "guaranteeing the minimum solvency conditions of insurers is essential to enable the existence of an insurance market that meets the objective of protecting the interests of insured persons" (7) .
However, health plan operators have difficulty acting in the prevention of chronic diseases. One of the reasons is directly associated with the lack of systematized clinical data on beneficiaries in their databases, which makes it impossible to extract information and knowledge in a more automated way. This is also because information systems within these companies are built only for administrative control, aiming only at the payment of service providers and the management of contracts (8).
In 2007, Resolution No. 1,819 of the Federal Council of Medicine (9) was published, prohibiting the inclusion of the International Classification of Diseases (ICD) code when completing Supplementary Health Information Exchange (TISS) forms in ambulatory care.
The great amount of information generated by health information systems, added to information from external environments that is often fed in real time and made available in different formats (texts, videos, images, messages, gene expressions etc.), defines the term "Big Data" (10).
The discovery of patterns in the "Big Data" environment using traditional methods of analysis, besides demanding a lot of time and resources, does not guarantee the full exploitation of its potential (10). With the use of data mining, a non-trivial process for finding hidden and possibly useful information, this work becomes more efficient and supports decision-making processes (11).
Integrating health information, linked to geographic, environmental, socioeconomic, and demographic data, allows the creation of hypotheses for scientific research on the causes and origins of certain diseases, providing knowledge on the prevalence, incidence, transmission control, and treatment of diseases. The analysis of georeferenced information is also of great importance in the discovery of cause-and-effect relationships, providing a dynamic analysis and enabling the identification of more vulnerable groups and the knowledge on the current health status of the population (12)(13).
There are numerous databases that, when aggregated and enriched with other information linked to a geographical location record, allow the discovery of important characteristics for the identification of patterns in a given region.
Linking pattern recognition methodologies and georeferencing tools to a well-formulated research question can generate useful tools for the discovery of knowledge in the health area, enabling the implementation of important practices to reduce costs and improve the population's quality of life, even considering the difficulties related to the absence of clinical information.
Thus, the question this research intends to address arises from the difficulties and needs identified: what are the techniques and methods of knowledge discovery that enable the creation of geographically referenced indicators for monitoring the health care of the group of beneficiaries of a health plan operator, aimed at health promotion and disease prevention?
Some related works address the recognition of diseases from databases, using associated procedures, algorithms, and methods to discover, in account payment records, patterns of use linked to certain diseases. Among them is a study in the region of Pavia, Italy, which examined administrative and clinical data on diabetes using temporal association rules. The research aimed at analyzing data from the health system of the region. The method also highlighted the frequent temporal associations of interest in diagnoses related to the patient's clinical condition (14).
Another study, also developed at the University of Pavia, Italy, consisted of a method to identify patterns in temporal data, which were extracted as temporal association rules. This research concluded that data mining has much potential for finding temporal association rules, suggesting that these methods can be further exploited since the demand for tools that uncover rules of potential interest to managers is increasing (15).
One can also mention another methodology for identifying beneficiaries of a health plan operator with indications of Type 2 Diabetes Mellitus in the state of Paraná, Brazil. Relevant variables for data generation were selected from the history of use. The selection was submitted to the J48 algorithm for finding rules, which were later evaluated by a group of experts. With this technique, patterns could also be extracted for other chronic diseases, which are now part of applications for identifying and categorizing beneficiaries (8).
From this problematization and the definition of these concepts, the need arises for constructing an environment that aggregates information from diverse sources for the enrichment of the internal databases, integrated with pattern recognition techniques and associated to a geographical reference, allowing the identification of diseases that affect the population to promote a more effective monitoring of chronic non-communicable diseases and to enable the development of regional actions of health promotion programs.

OBJECTIVE
To identify geographically the beneficiaries categorized as prone to Type 2 Diabetes Mellitus (DM), using the recognition of patterns in a database of a health plan operator, through data mining.

METHOD

Ethical aspects
This research was carried out with the consent of the institution, and the project was submitted to the Research Ethics Committee of the Pontifical Catholic University of Paraná (PUCPR).

Study design, location, and period
This was a descriptive, retrospective study with a quantitative approach, using the database of a large operator of health plans in the State of Paraná, in 2017.

Population or sample: inclusion and exclusion criteria
The sample consisted of beneficiaries of the operator's health insurance plan who were active in the period observed and who, given the type and frequency of services used, were categorized as prone to Type 2 DM. The inclusion criterion was being an active health plan beneficiary in 2017 who fell into the category of prone to Type 2 DM; the exclusion criteria were being inactive in 2017 or not being categorized for Type 2 DM. This pathology was chosen because it is a constantly growing disease, stemming from several factors such as obesity, sedentary lifestyle, population aging, and the increased survival rate of patients with the disease (8). In addition, it is estimated that non-communicable chronic diseases caused 38 million deaths in 2012, and Type 2 DM alone was the root cause of 5.3% of them (16).

Study protocol
After the database was structured, the study was divided into three stages: initial, development, and deployment.
In the initial stage, a survey of the possible public databases on health records was carried out with the help of the institution's sectors, aimed at constructing the theoretical framework and at identifying and justifying possible applications and opportunities for the development of the project, with due approval by the managers of the organization. Furthermore, during planning and in conjunction with the managers, the possible databases held by the areas of the organization were listed. These were obtained through a questionnaire, formulated by the authors with the purpose of surveying the databases required to compose the project. The questionnaire was applied on October 3rd, 2016, at the head office of the health plan operator (HPO). Prior to its application, a presentation was held on the main objective of the project and on the public databases already covered. Then, 27 questionnaires were delivered and, of them, ten (41.6%) were returned completed. The target audience was experts who work in the information sectors of the HPO in different cities in the State of Paraná. The questionnaire was structured in three questions: one referring to the field of Health, another to the Market, and the last to the strategic area. Each question asked for a rating of the importance of the area from 0 to 5, with 0 standing for little importance, 3 for average importance, and 5 for very important; moreover, an open question asked respondents to describe other possible information sources to be applied. Still in this stage, the tools to be used in development and deployment were defined, with the premise of using free software.
In the development stage, the software previously listed was installed, and the internal team responsible for the project was trained in the software applications Quantum GIS (17) and Pentaho PDI (18). The environment was then prepared by arranging the data of beneficiaries categorized as prone to Type 2 DM according to city of residence. Finally, the information of interest identified through the questionnaire was integrated into georeferenced layers, and a panel was built for viewing this information geographically.
In the deployment stage, the activity was the validation of the tool, with the presentation of the results obtained and the mapping of the information layers, in which the more intense the color, the greater the number of beneficiaries prone to Type 2 DM.
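The color rule described in the deployment stage (more intense color for more beneficiaries) can be illustrated with a small classification sketch. The article does not state how the map classes were defined, so the equal-interval scheme, the function name `equal_interval_class`, the number of classes, and the city counts below are all assumptions made for illustration.

```python
def equal_interval_class(value, low, high, n_classes=5):
    """Map a value in [low, high] to a class index from 0 (lightest) to n_classes - 1 (darkest)."""
    if high <= low:
        raise ValueError("high must exceed low")
    if not low <= value <= high:
        raise ValueError("value outside the mapped range")
    width = (high - low) / n_classes
    # The maximum value would fall on the upper edge, so clamp it into the last class.
    return min(int((value - low) / width), n_classes - 1)


# Hypothetical counts of beneficiaries prone to Type 2 DM per city (not from the study).
counts = {"City A": 12, "City B": 480, "City C": 1500, "City D": 3000}
low, high = min(counts.values()), max(counts.values())
classes = {city: equal_interval_class(n, low, high) for city, n in counts.items()}
```

With skewed counts such as these, equal intervals leave most cities in the lightest class; a quantile scheme would be a reasonable alternative, which is one reason the choice of classes deserves to be stated alongside any map.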

Analysis of results and statistics
Data were organized into tables for analysis and discussion in light of the literature available on the topic. The georeferencing software Q-GIS was used to develop the georeferenced layers. Data contained in the table were transferred to this software through a join with a statewide layer containing the 399 municipalities of Paraná, obtained from the IBGE website. The beneficiaries categorized with Type 2 DM through pattern recognition using data mining were arranged according to city within the State of Paraná. The prevalence in each municipality was calculated based on the population who had a health plan: the numerator was the number of Type 2 DM cases and the denominator the number of beneficiaries in the period, with the result multiplied by 1,000.
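The rate calculation and the join to the municipality layer described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the municipality names, the counts, and the function names `prevalence_per_1000` and `join_prevalence` are hypothetical.

```python
# Hypothetical beneficiaries per municipality (stand-in for the statewide layer).
beneficiaries = {"Municipality A": 1200, "Municipality B": 300, "Municipality C": 40}
# Hypothetical beneficiaries categorized as prone to Type 2 DM.
prone_to_dm2 = {"Municipality A": 90, "Municipality B": 45}


def prevalence_per_1000(cases, population):
    """Cases divided by beneficiaries, expressed per 1,000 beneficiaries."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 1000 / population


def join_prevalence(population_by_city, cases_by_city):
    """Left join on municipality name: every municipality in the layer keeps a value."""
    return {
        city: prevalence_per_1000(cases_by_city.get(city, 0), total)
        for city, total in population_by_city.items()
    }


rates = join_prevalence(beneficiaries, prone_to_dm2)
```

Keeping municipalities without cases at a rate of zero, rather than dropping them, means the joined layer still covers the whole state when it is drawn.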

Initial stage -Theoretical framework and planning
In the planning activity, through interviews and meetings with the sectors of the organization, some opportunities were identified that could be useful for composing the geographical database, including the following sources: ANS (19), MS (Ministry of Health) (20), SESAPR (State Secretariat of Health of Paraná) (21), PMC (City Hall of Curitiba) (22), Datasus (Department of Informatics of SUS) - Tabnet (Health Information) (23), DW, Simepar (Meteorological System of Paraná) (24), Satisfaction and Use Surveys, among others. These databases were chosen to compose the information relating to epidemiological profile, health care network, health technology assessment, data on climate, health guidelines, and protocols on the quality of care.
The questionnaire yielded the following results: in the field of Health, 100% of the respondents scored 5 (very important). In the area of Market, 80% of the respondents scored 5 (very important), 10% considered it of average importance, and 10% did not respond. In the area of Strategy, 80% of the respondents scored 5 (very important) and 20% did not respond.
The open questions yielded the following results. In the area of Health, the sources cited as relevant to the study were: knowledge on the dwelling location, sanitation, access to goods and services, cellular network, vaccine coverage indicators, research of temporary partial coverage with beneficiaries, health applications, and mortality, morbidity, and birth rates. In the area of Market, the sources cited were: demographic density of municipalities, mapping of industries and companies, the population of the cities, the population of beneficiaries, and commercial associations. In the strategic area, the respondents cited: knowledge on the greater movement of people and the movement of a control group in the State.
As a result of the questionnaire, the project was directed toward care in the Health area, and new sources of information were identified for incorporation into the georeferenced database.

Development stage
Through the method of identifying chronic disease patterns using the data mining technique, specifically the classification task (8), it was possible to categorize the chronic diseases: Type 2 DM, Neoplasms, Lung Diseases, Cerebrovascular Disease, Hypertension, Obesity, Psychiatric Diseases, and Ischemic Heart Disease.
Thus, a set of records of interest for chronic diseases was selected, without individual identification, analyzing only the information on geographic locations. Information on the health care network (Figure 1) was also incorporated. To integrate the network information, an ETL (extraction, transformation, and loading of data) tool was used: Pentaho, which allows these processes to be defined graphically and intuitively, thus enabling the documentation of the entire environment (18).
The health care network layers and the information on the identification of chronic diseases, associated with external information, were integrated as layers of a geographical database for the construction of a viewing tool.
Based on the resulting database, several layers for geographic analysis were created with a geographic information system tool, Quantum GIS (17), a free software distributed under the GNU General Public License.
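The extraction, transformation, and loading flow described in this stage was built graphically in Pentaho. As a rough textual analogue only, the same three steps can be sketched with the Python standard library; the CSV content, the column names, the `care_network` table, and the cleaning rules are hypothetical and do not reproduce the authors' process.

```python
import csv
import io
import sqlite3

# Hypothetical extract: a semicolon-separated export of care-network providers.
RAW = """provider;city;specialty
Clinic One; curitiba ;Endocrinology
Clinic Two;LONDRINA;Cardiology
Clinic Three;;Endocrinology
"""


def extract(text):
    """Read the raw export into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def transform(rows):
    """Normalize city names and drop rows that cannot be located on a map."""
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()
        if city:
            cleaned.append((row["provider"].strip(), city, row["specialty"].strip()))
    return cleaned


def load(rows, conn):
    """Write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS care_network (provider TEXT, city TEXT, specialty TEXT)"
    )
    conn.executemany("INSERT INTO care_network VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
loaded = conn.execute(
    "SELECT city, COUNT(*) FROM care_network GROUP BY city ORDER BY city"
).fetchall()
```

Normalizing the city name in the transform step matters because the later join to the municipality layer depends on the names matching exactly.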

Deployment stage
Through the application of categorization rules for identifying beneficiaries with Type 2 DM, a sample of 18,013 individuals was obtained. The epidemiological profile of these individuals is shown in Table 1. Among the individuals within the sample, 10,495 (58.3%) are female and 7,518 (41.7%) are male. Type 2 DM is predominant in the age group from 60 to 69 years, with 3,910 cases (21.7%), followed by 50 to 59 years, with 3,484 cases (19.3%), 40 to 49 years, with 2,742 cases (15.2%), 30 to 39 years, with 2,553 cases (14.2%), and 70 to 79 years, with 2,494 cases (13.8%). It should be noted that these five age groups accumulate 84.3% of the cases. The remaining age groups analyzed (80 years or older and below 30 years of age) represent 15.7% of the cases, of which 8.5% relates to ages below 30 years, records in the database that possibly correspond to other types of DM. An important risk factor for diabetes is obesity; in the population studied, 9.4% of the cases (1,702) presented attendances with an ICD code for obesity in the health records, which may be E66 - Obesity, E66.0 - Obesity due to excess calories, E66.1 - Drug-induced obesity, E66.2 - Extreme obesity with alveolar hypoventilation, E66.8 - Other obesity, and E66.9 - Obesity, unspecified.
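The shares reported above follow directly from the counts in the text, as the short check below shows. The counts are the ones quoted in this paragraph; the helper name `share` and the rounding to one decimal place are choices made for this illustration.

```python
TOTAL = 18013  # sample size reported in the text

# Cases per age group, as reported in the text.
age_counts = {
    "30-39": 2553,
    "40-49": 2742,
    "50-59": 3484,
    "60-69": 3910,
    "70-79": 2494,
}


def share(count, total=TOTAL):
    """Percentage of the sample, rounded to one decimal place."""
    return round(count / total * 100, 1)


age_shares = {group: share(n) for group, n in age_counts.items()}
combined_share = share(sum(age_counts.values()))  # the five groups together
female_share = share(10495)
obesity_share = share(1702)
```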
The geographical mapping of the beneficiaries analyzed was based on the State of Paraná, the main area of practice of the health plan operator. Individuals in the sample were grouped according to city of residence and compared to the total number, with the purpose of knowing in which municipalities the eligible individuals were. Results are presented in Table 2. Twenty-five cities in the State of Paraná concentrate 80.65% of the beneficiaries categorized as Type 2 DM in 2017. The remaining 374 cities concentrate 19.35% of the individuals with Type 2 DM. In this table, the N of beneficiaries of the HPO decreases because not all of them reside in the State of Paraná.
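A concentration figure of this kind (a small number of cities holding most of the cases) comes from ranking cities by case count and accumulating their shares. The sketch below shows that calculation on hypothetical counts; the function name `cities_to_reach` and the data are illustrative and are not taken from Table 2.

```python
def cities_to_reach(counts, target_share):
    """Return (number of largest cities needed, share reached) for a target share of all cases."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one case")
    running = 0
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for rank, (_city, n) in enumerate(ranked, start=1):
        running += n
        if running / total >= target_share:
            return rank, running / total
    return len(ranked), 1.0


# Hypothetical case counts per city.
counts = {"A": 500, "B": 300, "C": 100, "D": 60, "E": 40}
rank, reached = cities_to_reach(counts, 0.80)
```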
Seeking to meet the study goals and providing different perceptions of visualization, the individuals in the sample were arranged in a georeferenced way, using the software application mentioned.
In Figure 1, one can observe that the Pioneer North, Central North, and Central South regions are the ones with the highest percentage of individuals within the sample. Georeferenced analysis allows the geospatial visualization of cases, showing in which regions of the State one must invest in preventive actions and health promotion.
Figure 2 shows an example of the application of the visualization model, with the mapping of HPO beneficiaries categorized as Type 2 DM, where the more intense the colors, the higher the incidence of the disease per municipality. The rate ranged from 15.38/1,000 to 448.37/1,000 beneficiaries. The cities with the highest incidence were Jacarezinho (448.37/1,000 beneficiaries), Londrina, and Guarapuava (301.57/1,000 beneficiaries). The research was conducted in the operator's own database, with cases categorized as Type 2 DM. The incidence was calculated based on the number of HPO beneficiaries residing in each municipality: the numerator was the number of Type 2 DM cases and the denominator the number of beneficiaries in the period, with the result multiplied by 1,000. Data from municipalities of the state with a sample smaller than 50 were excluded.

Design of a geospatial model applied to Health management
Dallagassa MR, Iachecen F, Carvalho DR, Ioshii SO.

DISCUSSION
Information obtained through the identification of patterns using the data mining technique and presented geographically enhances the discovery of new knowledge in the database and, thus, allows dynamic and agile health management actions.
An example of such practice is the geographic mapping shown in Figure 2, which identifies the need for developing actions related to Type 2 DM in the Pioneer North and North of Paraná, mainly due to its high prevalence. In another study, one could note the concern about the high prevalence of hepatitis B cases in the southwestern region of Paraná. As an action resulting from this finding, together with the health area of the HPO, we sought to identify whether there were beneficiaries residing in that region who had not been immunized against hepatitis B. Through a survey carried out in the database, with subsequent telephone contact with the beneficiaries identified, it was found that a large part of the beneficiaries residing in that region, about 200, were not immunized against hepatitis B. Based on this information, the expert on health management via telemonitoring provided guidance to the beneficiaries regarding the importance of immunization, proactively acting towards health education. Many beneficiaries were unaware of the disease scenario in that region and also that the vaccine is available for all age groups, at no cost, in the health facilities of their cities.
The application of this geographic distribution model, using the data mining method, was efficient for determining the eligibility of beneficiaries who could potentially develop chronic diseases. According to reference (8), 5,953 beneficiaries were indicated as eligible for the diabetic program in 2011, representing 5.7% of the total beneficiaries within the portfolio. This result is compatible with the Surveillance of Risk Factors and Protection for Chronic Diseases by Telephone Inquiry (Vigitel) of 2012, which showed that 5.6% of adults (≥ 18 years) had a medical diagnosis of Diabetes (25).
Preventively addressing epidemiological alerts and identifying potential risks that may be mitigated by proactive actions plays a key role in HPO sustainability, as does collaborating and acting in partnership with public institutions focused on the care of the population.

Study limitations
This study presented some limitations. In the integration of some of the public databases listed with the health managers of the institution, it was observed that some of them had not been updated. Some databases presented data only up to 2012, which reflects on the quality of the databases.

Contributions to the fields of nursing and health
The study contributed to the fields of nursing and public health, as the use of georeferencing tools is a current necessity of health services. It should be highlighted that the research valued the use of a free software tool, without costs for purchase and maintenance, an opportunity that enables access by any health service.

CONCLUSION
The model proposed and described in this study, which is based on a georeferencing tool applied to health, has proved to be efficient and even useful for uncovering patterns at regional levels and, by a cause-and-effect relationship, shall contribute to the formulation and identification of situations involved in the prevention and promotion of health.
The geographical identification of beneficiaries categorized as prone to chronic diseases, using database recognition methodologies and aggregating the service occurrence information, through medical bill records and HPO release requests, will somewhat enable the identification of alert situations, promoting direct actions of prevention and health promotion in advance, which are linked to the main objective of the proposed methodology.
Products derived from this methodology will be useful for the management of HPO services, such as network management, health technology assessment, medical specialty analysis, among others, enabling managers to use agile and timely information for supporting the decision-making process.
In future research, we also intend to record the alerts and situations identified by the environment, to evaluate the results the tool provided regarding cost optimization for the HPO and as a benefit to its customers.

Figure 1 - Example of a data extraction, transformation, and loading process into the geographical database.

Figure 2 - Geographic distribution of cases (incidence) of beneficiaries of the health plan operator categorized as Type 2 Diabetes Mellitus, Paraná, Brazil, 2017, according to municipality of residence.

Table 1 - Epidemiological profile and association with obesity records of the beneficiaries with Type 2 Diabetes Mellitus of the health plan operator, Paraná, Brazil, 2017

Table 2 - Beneficiaries of the health plan operator categorized as prone to Type 2 Diabetes Mellitus, Paraná, Brazil, 2017, according to municipality of residence
Source: Prepared by the authors.