Analysis of influencing factors of severity in acute pancreatitis using big data mining

1. Surgical Intensive Care Unit (SICU), Department of General Surgery, Jinling Hospital, Medical School of Nanjing University, Nanjing, 210002, China. 2. Information and Technology Office, Health Statistics and Information Center of JiangSu Province, Nanjing 210008, China. 3. Department of Surgery, Traditional Chinese Medicine Hospital of Jiangsu Province, Nanjing 210029, China. 4. Department of General Surgery, the 81st Hospital of P.L.A./Bayi Hospital Affiliated Nanjing University of Chinese Medicine, Nanjing 210002, China.


INTRODUCTION
The mortality rate of mild acute pancreatitis (MAP) is lower than 1%; however, 20% of all acute pancreatitis (AP) cases may develop into severe acute pancreatitis (SAP). The mortality rate for SAP remains high, from 10% to 30%, owing to pancreatic necrosis and organ failure. Although many epidemiological studies on the etiology and severity of AP have been carried out, most of them used simple statistical correlation analysis or single-factor analysis, and their sample sizes were relatively small.
Big data is defined as information assets characterized by such high velocity, variety, and volume that specific data mining methods and technology are required for its transformation into value.[4] In some studies, big data mining has provided useful information for healthcare. Big data from healthcare may not only enable the public sector and healthcare providers to assess their healthcare systems and the distribution of resources, but also has excellent potential to improve our understanding of the effectiveness of treatments in the real world, as well as of the incidence, management, and prognosis of various medical conditions. Useful information obtained from big data will allow health professionals to provide better medical care.[7] In our study, we chose the Jiangsu province as the research target and analyzed data on patients hospitalized with AP at multiple centers in the Jiangsu province between January 2014 and December 2016. This was the first and most extensive series of patients with AP used to investigate its incidence, characteristics and severity in China. The aim of this study was to evaluate the epidemiological characteristics of AP and to explore potential relationships between these factors and severity.

METHODOLOGY
The data-set came from the Health Information Platform of the Jiangsu province and was provided by the Health Statistics and Information Center of the Jiangsu province. Twelve prefecture-level cities (Nanjing, Zhenjiang, Wuxi, Changzhou, Suzhou, Nantong, Yancheng, Xuzhou, Lianyungang, Yangzhou, Taizhou and Huai'an) were connected to the Health Statistics and Information Center of the Jiangsu province and authorized it to store, integrate and process data. The data storage and parallel computing architecture is based on open-source big data processing frameworks, including Hadoop and Spark. The architecture is shown in Figure 1.
The data in the current study were structured, highly organized and searchable by a straightforward algorithm. The data-set contained 5,659 consecutive AP patients who were admitted to hospitals from ten prefecture-level cities of the Jiangsu province between January 2014 and December 2016. In our study, a batch SOM algorithm was used for training. The training model was composed of an input layer and a computational layer. Six variables were brought into the input layer: age (eight categories: 0~20y, 21~30y, 31~40y, 41~50y, 51~60y, 61~70y, 71~80y, >80y), gender (two categories: male and female), marital status (four categories: unmarried, married, divorced and widowed), blood type (four categories: A, B, O and AB), etiology (six categories: BAP, HAP, AAP, TAP, IAP and OTAP) and severity (two categories: MAP/moderately severe AP (MSAP) and SAP). First, initialization was performed: each node was assigned random weights, with the same number of weights as the input dimension. Then, each input vector was matched to the most appropriate node. If the input is D-dimensional, that is, $X = \{x_i,\ i = 1, \dots, D\}$, and $w_{ji}$ denotes the weight connecting input $i$ to node $j$, the discriminant function can be represented by the squared Euclidean distance: $d_j(X) = \sum_{i=1}^{D} (x_i - w_{ji})^2$.
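The matching step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Matlab program: the one-hot encoding of the six variables, the function names and the example patient record are all assumptions introduced here, and only the category levels and the 20×20 grid size come from the text.

```python
import random

# Category levels are copied from the text; the one-hot layout is an assumption.
CATEGORIES = {
    "age": ["0~20", "21~30", "31~40", "41~50", "51~60", "61~70", "71~80", ">80"],
    "gender": ["male", "female"],
    "marital_status": ["unmarried", "married", "divorced", "widowed"],
    "blood_type": ["A", "B", "O", "AB"],
    "etiology": ["BAP", "HAP", "AAP", "TAP", "IAP", "OTAP"],
    "severity": ["MAP/MSAP", "SAP"],
}


def encode(patient):
    """One-hot encode one patient record into a flat input vector X."""
    vector = []
    for field, levels in CATEGORIES.items():
        vector.extend(1.0 if patient[field] == level else 0.0 for level in levels)
    return vector


def squared_distance(x, w):
    """Squared Euclidean distance between input x and a node's weights w."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def best_matching_unit(x, weights):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: squared_distance(x, weights[j]))


rng = random.Random(0)
dim = sum(len(levels) for levels in CATEGORIES.values())
# Random initialization: one weight per input dimension for each of 20 x 20 nodes.
weights = [[rng.random() for _ in range(dim)] for _ in range(20 * 20)]

# Hypothetical patient record, for illustration only.
patient = {"age": "41~50", "gender": "male", "marital_status": "married",
           "blood_type": "AB", "etiology": "HAP", "severity": "SAP"}
x = encode(patient)
bmu = best_matching_unit(x, weights)
print(len(x), bmu)
```

With this encoding each patient becomes a 26-dimensional vector containing exactly six ones, so every variable contributes equally to the distance. Other encodings (for example, ordinal coding of age) would change the geometry of the map.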
An adjacent neighborhood of the winning neuron $k$ was set up as $S_k(t)$. The weights of the output neuron and its adjacent neurons were modified by the formula $w_{ji}(t+1) = w_{ji}(t) + h(t)\,[x_i - w_{ji}(t)]$ for $j \in S_k(t)$, in which $w_{ji}$ is the weight connecting input $i$ to neuron $j$ and $h(t)$ is a gain term, $0 < h(t) < 1$, which gradually decreases to zero to facilitate convergence. The output is $O_k = f(\min_j \lVert X - W_j \rVert)$, where $W_j$ is the weight vector of neuron $j$. New normalized learning samples were used to repeat the learning process. The training stopped automatically when the full number of epochs had been reached. The learning rate of the classification phase was 0.8, and the neurons of the computational layer were arranged in a 20×20 hexagonal topology. The graphical diagram of the SOM neural network can be found in supplementary figure 1, and the Matlab code of the program in the appendix. The Davies-Bouldin index (DBI) was used to assess clustering separation. The formula was as follows: $\mathrm{DBI} = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \frac{D(X_i) + D(X_j)}{D(X_i, X_j)}$, in which $D(X_k)$ is the in-class distance of class $k$, $D(X_i, X_j)$ is the between-class distance from class $i$ to class $j$, and $c$ is the number of classes. The results were evaluated according to the following rule: the lower the DBI value, the better the clustering effect.
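A hedged Python sketch of the training loop may help make the update rule concrete. It is not the program from the appendix. It uses the sequential Kohonen update rather than the batch variant named in the text, a square grid with a Chebyshev neighborhood instead of the hexagonal topology, and a linear decay schedule; the function name `train_som`, the parameter names and the toy data are assumptions. Only the starting gain of 0.8 is taken from the text.

```python
import random


def train_som(data, rows, cols, epochs, h0=0.8, radius0=None, seed=0):
    """Train a small SOM with the sequential Kohonen update and return its weights."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initialization: one weight per input dimension for every node.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    if radius0 is None:
        radius0 = max(rows, cols) / 2
    for t in range(epochs):
        decay = 1.0 - t / epochs      # falls linearly from 1 towards 0
        h = h0 * decay                # gain term h(t)
        radius = radius0 * decay      # neighborhood S_k(t) shrinks over time
        for x in data:
            # Winner k: node with the smallest squared Euclidean distance to x.
            k = min(weights, key=lambda n: sum((xi - wi) ** 2
                                               for xi, wi in zip(x, weights[n])))
            for (r, c), w in weights.items():
                # Square-grid (Chebyshev) neighborhood, a simplification.
                if max(abs(r - k[0]), abs(c - k[1])) <= radius:
                    for i in range(dim):
                        w[i] += h * (x[i] - w[i])
    return weights


# Toy data (hypothetical): two groups of 2-D points.
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(data, rows=4, cols=4, epochs=20)
print(len(som))
```

Because each update is a convex combination of the old weight and the input, weights stay inside the range spanned by the data and the initial values, and the shrinking gain and radius make later epochs refine rather than reorganize the map.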
Continuous data were expressed as mean ± standard deviation. Significant differences between groups were determined by chi-squared analysis and the unpaired Student's t-test. A model was developed to cluster the variables related to AP using Matlab 2017 software (MathWorks, USA). AP patients were extracted from the data-sets. The study was approved by the Ethics Review Committee at the Jinling Hospital and the Health Statistics and Information Center of the Jiangsu province. The research was carried out according to the principles of the Declaration of Helsinki.
The diagnostic standard and severity grading of AP were in accordance with the 2012 revision of the Atlanta classification.[8] AP was diagnosed based on the presence of at least two of the following three criteria: (1) an initial serum amylase and/or lipase level at least three-fold above the upper limit of normal; (2) typical abdominal pain consistent with AP; and (3) imaging evidence compatible with AP on CT, MRI or ultrasonography. SAP was diagnosed according to organ function failure and/or local complications (abscess, necrosis, or pseudocyst).[9] Organ function failure was defined as shock (systolic pressure less than 90 mm Hg), renal function failure (serum creatinine more than 2.0 mg/dL after hydration), pulmonary function insufficiency (PaO2 no more than 60 mm Hg), or gastrointestinal hemorrhage (more than 500 mL/24 h). The etiologies of AP were distinguished as follows. If gallstones were found by imaging tests in the gallbladder, the bile duct, or both, the case was diagnosed as biliary acute pancreatitis (BAP).[10] If the plasma triglyceride level was more than 11.3 mmol/L, the case was diagnosed as hyperlipidemic acute pancreatitis (HAP).[11] Alcoholic acute pancreatitis (AAP) was considered when daily alcohol consumption was over 80 g for more than five years, or when there was regular social or weekend alcohol abuse for at least five years.[12] When AP was induced by traffic accidents, falls, abdominal surgery or other kinds of injury, it was defined as traumatic acute pancreatitis (TAP).[13] AP induced by other causes, such as a parathyroid tumor, pancreatic carcinoma, mumps, hypertriglyceridemia, or certain drugs (chlorpromazine, phenformin, azathioprine, etc.), was grouped as other types of AP (OTAP). If no etiology could be found from the clinical history, laboratory tests or imaging tests, the case was defined as idiopathic AP (IAP).
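The decision rules in this paragraph translate directly into small predicates. The sketch below is illustrative only: the function and parameter names are hypothetical, the thresholds are those quoted in the text, and clinical nuances (for example, measuring creatinine after hydration) are reduced to a single input value.

```python
def meets_ap_criteria(enzyme_fold, typical_pain, imaging_positive):
    """AP is diagnosed when at least two of the three criteria are present.

    enzyme_fold: serum amylase and/or lipase as a multiple of the upper
    limit of normal (hypothetical parameter name).
    """
    criteria = [enzyme_fold >= 3.0, typical_pain, imaging_positive]
    return sum(criteria) >= 2


def has_organ_failure(systolic_mmhg, creatinine_mg_dl, pao2_mmhg, gi_bleed_ml_24h):
    """Organ failure: shock, renal failure, pulmonary insufficiency or GI hemorrhage."""
    return (systolic_mmhg < 90           # shock
            or creatinine_mg_dl > 2.0    # renal failure (value after hydration)
            or pao2_mmhg <= 60           # pulmonary insufficiency
            or gi_bleed_ml_24h > 500)    # gastrointestinal hemorrhage


def classify_severity(organ_failure, local_complication):
    """SAP if organ failure and/or a local complication is present."""
    return "SAP" if (organ_failure or local_complication) else "MAP/MSAP"


# Hypothetical patient values, for illustration only.
print(meets_ap_criteria(4.2, True, False))
print(classify_severity(has_organ_failure(85, 1.1, 75, 0), False))
```

Keeping the thresholds in one place like this makes it easy to check that the two-category severity variable fed to the SOM was derived consistently across hospitals.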
A self-organizing map (SOM) is a clustering tool that focuses on the neighborhood structure between classes. A SOM is a predefined network that defines a mapping from the input data space R^n onto a two-dimensional array of nodes. It can convert complex nonlinear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display. Statistical analysis was performed using SPSS 21.0 software (SPSS, Chicago, IL, USA). Dichotomous variables were created from continuous variables by using clinically important cut-off points. In our study, P values <0.05 were considered statistically significant.
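For readers who want to see how a group comparison against the 0.05 threshold works, here is a minimal Python sketch of a Pearson chi-squared test on a 2×2 table. It is not the SPSS procedure used in the study: it applies no continuity correction, the function name is an assumption, and the counts are hypothetical rather than taken from the data-set.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are two patient groups, columns are a yes/no attribute.
chi2, p_value = chi_squared_2x2(10, 20, 20, 10)
print(round(chi2, 3), round(p_value, 4), p_value < 0.05)
```

The closed-form p-value works only for a 2×2 table; larger tables, such as the four blood types, need the general chi-squared distribution from a statistics package.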

RESULTS
Marital status and AP
Most AP patients were married (75.4%). Among MAP/MSAP patients, 6% were unmarried, whereas 16% of SAP patients were unmarried; the difference was significant (P=0.016).

ABO blood type and AP
In terms of ABO blood type, blood type AB accounted for 8.8% of the general population, which was significantly lower than its share among AP cases (13.9%) (P=0.019). In the subgroup of SAP cases, the number of patients with blood type AB reached 150 (18.7%) (P=0.007). A similar pattern was found for blood type B (33.4% of AP cases); however, the difference from the general population (30.1%) was not significant (P=0.094).

Regional distribution of AP
A total of 5,659 patients with AP were included in the data-sets. Based on the heat map of the geographical distribution of AP patients in the Jiangsu province (Supplementary Figure 2), there were many more AP patients in the southern Jiangsu province than in the northern Jiangsu province, especially in Nanjing (1,229; 21.7%), Suzhou (764; 13.5%) and Yangzhou (663; 11.7%). Together, these three areas accounted for nearly half (46.9%) of all AP patients in the Jiangsu province.
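The city shares quoted above can be checked with a few lines of arithmetic. The counts are those reported in the text; the variable names are only for illustration.

```python
# Patient counts reported in the text for the three largest contributors.
counts = {"Nanjing": 1229, "Suzhou": 764, "Yangzhou": 663}
total_patients = 5659

# Percentage share of each city, rounded to one decimal place.
shares = {city: round(100 * n / total_patients, 1) for city, n in counts.items()}
combined_count = sum(counts.values())
combined_share = round(100 * combined_count / total_patients, 1)

print(shares)
print(combined_count, combined_share)
```

The combined share comes to 2,656 of 5,659 patients, a little under half of the provincial total.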
The relationship between the geographical distribution of AP and etiological factors was analyzed. The proportion of AAP in the northern Jiangsu province (Xuzhou 18.4%, Lianyungang 16.9%, Huai'an 13.2%) was much higher than that in the southern Jiangsu province (Suzhou 2.6%, Wuxi 4.4%, Changzhou 5.8%). The incidence of HAP was approximately the same in all regions, but it was lower in Wuxi (1.8%) and Lianyungang (2.5%) than elsewhere in the Jiangsu province. TAP occurred more frequently in Nanjing and Huai'an (Figure 2).

SOM neural network
After training with the SOM network classification algorithm, we found that the DBI was smallest (DBI = 0.89) at 200 epochs, so the study was based on 200 epochs. Then the distance matrix and color matrix of the SOM neural network were calculated, and a cluster distribution feature map was drawn (supplementary figure 3). In the diagram, each hexagon represents a case; the between-class distance increases as the color of the unit changes from black to white, and there was no clear distinction between the cluster results. The class boundaries were set by dark, continuous, adjacent nodes. The whole SOM network was divided into five parts, which defined the number of clustering categories as five. The clustering results of the SOM neural network were evaluated by the DBI. When the number of clusters was five, the DBI reached its smallest value (0.89), which showed that this clustering model performed best.
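The model selection step (choosing the clustering with the lowest DBI) can be illustrated with a short Python sketch. It assumes the common definition in which the in-class distance is the mean distance of points to their centroid and the between-class distance is the distance between centroids; the function names and the toy clusters are hypothetical, and this is not the authors' Matlab code.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    centers = [centroid(c) for c in clusters]
    # In-class distance: mean distance of each point to its cluster centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, centers)]
    k = len(clusters)
    worst_ratios = [
        max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst_ratios) / k


# Hypothetical toy clusterings of 2-D points.
loose = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
tight = [[(0, 0.5), (0, 1.5)], [(10, 0.5), (10, 1.5)]]
candidates = {"loose": davies_bouldin(loose), "tight": davies_bouldin(tight)}
best = min(candidates, key=candidates.get)
print(candidates, best)
```

Lower values mean compact, well-separated clusters, which is why the clustering with the minimum DBI is retained.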
In the attribute classification feature map of the data-sets (supplementary figure 4), the same neuron represented the same patient. In that figure, the clustering pattern of etiological factors, severity of AP and ABO blood type was similar to the general clustering distribution, which illustrates that these attributes played critical roles in the process of clustering.

Explanation of clustering results
The variables of age, gender, etiological factors, marital status, severity of AP and ABO blood type were compared across the categories of AP patients by one-way ANOVA. A significant difference (P<0.05) indicated that the clustering had worked well (Appendix Table). The whole sample was divided into five classes by the SOM neural network (Figure 3).

The characteristics of AP patients in class I could be described as follows: most of the patients were male (65.6%); the main range of onset age was 55~65 years; the percentage of divorced patients was higher than that of the entire sample (15.8% vs 9.6%); blood types B and AB were more frequent than other types; the main etiology was BAP (92.3%); and there were 291 SAP patients in the class (20.1%), a higher proportion than in the whole sample (14.2%).

In class II, men accounted for 75.1% of patients and the average age was about 59.5 years; most patients were married (95.5%); the percentage of patients with blood type O was higher than that of the whole sample (42.9% vs 30.7%); in terms of etiology, AAP accounted for 17.5%, which was more than in the entire sample (7.7%). Furthermore, few of them had SAP (4.1%).

In class III, most patients had the following characteristics: female (68.1%), 40~50 years, married (80.7%), blood type O (42.0%) and BAP (94.6%); the number of SAP patients in the class was only 145 (10.4%).

In class IV, however, most patients were men (92.7%); the main range of onset age was 25~45 years; unmarried and divorced patients were the primary population (42.5% and 30.9%, respectively); blood types B and AB were more frequent than blood types A or O; and most patients suffered from HAP or AAP. Furthermore, the proportion of TAP was much higher than in other classes (4.7% vs 1.2%, 0.4%, 0.6%, 0.9%), and 55.4% of the patients were diagnosed with SAP, far more than in the whole sample (14.2%).

The characteristics of AP patients in class V were similar to those of the whole sample, including gender, age, marital status, blood type, etiology and severity of AP.
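The class descriptions above all follow one pattern: the share of a category within a class is compared with its share in the whole sample. A minimal Python sketch of that comparison is given below. The record layout, the field names and the toy records are assumptions for illustration; the real input would be the patient records together with their SOM class labels.

```python
from collections import defaultdict


def share(records, field, value):
    """Fraction of records whose `field` equals `value`."""
    return sum(1 for r in records if r[field] == value) / len(records)


def cluster_profile(records, field, value):
    """Share of `value` in the whole sample and within each cluster."""
    groups = defaultdict(list)
    for r in records:
        groups[r["cluster"]].append(r)
    per_cluster = {c: share(g, field, value) for c, g in sorted(groups.items())}
    return share(records, field, value), per_cluster


# Hypothetical toy records; not taken from the study data.
records = [
    {"cluster": "I", "severity": "SAP"},
    {"cluster": "I", "severity": "MAP/MSAP"},
    {"cluster": "II", "severity": "MAP/MSAP"},
    {"cluster": "II", "severity": "MAP/MSAP"},
]
overall, per_cluster = cluster_profile(records, "severity", "SAP")
print(overall, per_cluster)
```

Running the same function for each variable and category yields the kind of table that the class-by-class description summarizes in prose.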

DISCUSSION
To the knowledge of the authors, the present study was the first to use big data mining to investigate the characteristics of AP, and it used the most extensive series of patients with AP. The data-sets selected for the current study were nearly representative of the entire Jiangsu province population; therefore, selection bias was not considered a problem.
In the present study, most of the AP patients were married, aged 40~60 years and had BAP. There were relatively more male patients than female patients. SAP patients accounted for 14.2% of all patients, which was similar to the results of other studies in the literature.[18,19] As far as blood type was concerned, blood types AB and B were more frequent in AP patients than in the general population of the Jiangsu province (AB: 13.9% vs 8.8%; B: 33.4% vs 30.1%); the phenomenon was more pronounced in patients with SAP. During the past eight decades, some publications have examined the possible associations between blood type and infection. They reflected uncritical attempts to mathematically link unstratified or random data. The interaction between a pathogen and the erythrocyte membrane may reflect antigenic similarity, adhesion through specific receptors, or modulation of the antibody response. Epithelial cells express ABH and Lewis antigens, which are cell-surface glycoconjugates used by parasites, bacteria, and viruses as receptors for attachment, resulting in different susceptibilities depending on the antigen profile of an individual.[20] By using the same blood group antigens as their host, certain microbial parasites utilize molecular mimicry as a defense against the host's immune system. The chemical signatures of the membranes of many gram-negative organisms, such as Escherichia coli, resemble A and B blood group antigens. In vitro experiments have shown that anti-B antibodies kill E. coli, and anti-A and anti-B antibodies may therefore play a similar role in destroying gram-negative bacteria in vivo.[21] B and AB blood groups were associated with an increased incidence of E. coli, Streptococcus pneumoniae, and Salmonella infections, which are important pathogens in pancreatic infection, necrosis or sepsis.[22]
The regional distribution of AP was also analyzed. The number of AP patients in the northern Jiangsu province was much lower than that in the southern Jiangsu province. There are several reasons for this: in addition to having a larger population, the southern Jiangsu region has a more developed economy and more advanced medical technology than the northern Jiangsu region, so patients from surrounding areas are referred to the southern Jiangsu region, especially to Nanjing. In terms of etiology, AAP occurred more frequently in the northern Jiangsu region, especially in Xuzhou and Lianyungang. The geographic differences in the incidence of AAP were related to alcohol consumption. Men in these regions have a habit of alcohol abuse, especially on holidays.[25][26] Strong support for a close link between alcohol and pancreatitis comes from individual-level studies and large samples. TAP occurred more frequently in Nanjing and Huai'an and was associated with the frequent occurrence of fatal traffic accidents in both areas and a higher number of pancreatic injuries from abdominal surgery in the Nanjing region.
The SOM neural network is one of the most suitable networks for segmentation. It is an unsupervised network based on competitive learning that discovers topological structures hidden in the input data and displays them visually in one- or two-dimensional spaces.[27] Two major advantages of SOM-based segmentation methods are unsupervised training and fast learning. In our study, the data-sets of AP patients were clustered by a SOM neural network.
Class I showed that older, divorced male patients with blood type AB or B who suffered from BAP usually had severe disease. That is because older patients are less sensitive to pain owing to atrophy of the abdominal muscles and degeneration of the peripheral nerves, so the symptoms of peritonitis are not obvious, which leads to misdiagnosis. The organ function of older patients is often poor, which can easily lead to organ failure under stress. Furthermore, blood type AB has been associated with gram-negative bacteria, which are important pathogens in infectious pancreatitis. All of the above would exacerbate AP.
Class II showed that if an AAP patient was older, male, married and had blood type O, the disease was usually not severe. That may be because older men tend to consume less alcohol than younger men, and the living habits of married men are tied to their families.[29][30] These factors contributed to a relatively limited amount of alcohol consumption that did not cause necrosis of the pancreas, which reduced the likelihood of SAP.
In class III, middle-aged unmarried female BAP patients with blood type O usually had MAP or MSAP. These women may be more concerned with their careers and have a more regular life.
Class IV showed that middle-aged, unmarried or divorced male patients with blood type B/AB who suffered from HAP or AAP were likely to have SAP. Owing to social factors, unmarried or divorced middle-aged men often have poorer living habits, including excessive alcohol consumption. Binge drinking and alcohol abuse were much more evident than in other groups. Some studies have shown that blood type is related to infectious diseases. Moreover, the frequency of SAP and organ dysfunction in HAP was significantly higher than in BAP. The combination of these factors significantly increased the severity of AP.[31]
There are a few limitations to our study. Firstly, although the data of the present study contain most of the basic information on AP patients from hospitals in the Jiangsu province, a few hospitals were still not connected to the Health Information Platform of the Jiangsu province, and the Suqian district had not yet established medical networks, all of which resulted in incomplete data. Secondly, there are some disadvantages to clustering methods that use a SOM neural network: increasing the number of neurons in the network does not usually result in better segmentation performance; they need a high-dimensional input space with empirical features for optimal performance;[32] and images with heavy noise cannot be segmented successfully.
In the future, more variables or parameters of AP patients will be mined to acquire more valuable information, and data from consecutive years will be compared to explore trends in AP characteristics and to build a guideline for future interventions.

CONCLUSIONS
The proportion of unmarried patients was lower in MAP/MSAP than in SAP. Blood types AB and B were more frequent in AP, especially in SAP. The differences between southern Jiangsu and northern Jiangsu in the number of AP patients and the proportion of AAP were significant. BAP patients who were male, older, divorced, and had blood type AB or B were more likely to develop SAP; middle-aged, unmarried or divorced male patients with blood type B/AB who suffered from HAP or AAP were also likely to have SAP.

DISCLOSURE OF CONFLICT OF INTERESTS
The authors state that they have no conflicts of interest.