Transcriptomic markers in pediatric septic shock prognosis: an integrative analysis of gene expression profiles

The goal of this study was to identify potential transcriptomic markers in pediatric septic shock prognosis by an integrative analysis of multiple public microarray datasets. Using the R software and bioconductor packages, we performed a statistical analysis to identify differentially expressed (DE) genes in pediatric septic shock non-survivors, and further performed functional interpretation (enrichment analysis and co-expression network construction) and classification quality evaluation of the DE genes identified. Four microarray datasets (3 training datasets and 1 testing dataset, 252 pediatric patients with septic shock in total) were collected for the integrative analysis. A total of 32 DE genes (18 upregulated genes; 14 downregulated genes) were identified in pediatric septic shock non-survivors. Enrichment analysis revealed that those DE genes were strongly associated with acute inflammatory response to antigenic stimulus, response to yeast, and defense response to bacterium. A support vector machine classifier (non-survivors vs survivors) was also trained based on DE genes. In conclusion, the DE genes identified in this study are suggested as candidate transcriptomic markers for pediatric septic shock prognosis and provide novel insights into the progression of pediatric septic shock.


Introduction
Sepsis is the world's leading cause of death of children, and it represents a complex disease with dysregulated inflammatory responses and a high mortality rate (1). An immune response initiated through an invading pathogen cannot be controlled to restore homeostasis in sepsis, which generates a pathological syndrome with sustained excessive inflammation as well as immune suppression (2). Toll-like receptors, nuclear factor kB, and cytokines including tumor necrosis factor a, interleukins 1, 2, 6, 8, and neutrophils all take part in the generation of sepsis (3). Although there are common points between adult and pediatric sepsis, crucial differences do exist in pathophysiology, clinical presentation, as well as therapeutic approaches (4).
High-throughput transcriptomic data grow rapidly, enabling gene expression profiling and identification of prognostic targets in disease. Over the last decade, several studies have explored the prognosis of pediatric septic shock by transcriptional profiling using microarrays (5)(6)(7)(8). Integrative analysis based on multiple transcriptomic datasets could help to discover robust prognostic candidates. Therefore, in this study, the gene expression patterns of pediatric septic shock patients (of survivors and non-survivors) were investigated using public microarray datasets. The differently expressed (DE) genes identified were then further interpreted by enrichment analysis, construction of co-expression networks, and receiver operating characteristic (ROC) curve analysis.
In this study, data pre-processing, identification of DE genes, ROC analysis, and support vector machine (SVM) model training were performed using R software (http://www.r-project.org/) and bioconductor packages (9). Enrichment analysis and construction of co-expression networks were also performed using Database for Annotation, Visualization and Integrated Discovery (DAVID) (10,11) and Cytoscape (12) software, respectively.

Microarray datasets search and selection
In this study, public microarray datasets were searched from the earliest dataset until Dec 3, 2018 using the keywords ''sepsis''/''septic shock'' in the Gene Expression Omnibus (GEO) database (http://www.ncbi. nlm.nih.gov/geo/) (13). Further selection of the datasets was performed for subsequent analyses. The selection criteria were: i) dataset using whole blood for gene expression analysis; ii) dataset with detailed gene expression data for both survivors and non-survivors; and iii) dataset with sample size not fewer than 30. Animal studies and adult studies were excluded.
Data collection from each eligible dataset was performed by two investigators independently. The data consisted of the sample size, GEO accession, sample source, raw gene expression data, and platform used. Final data collection was determined by checking between the two investigators.

Data analysis methods
Four Affymetrix datasets were included in this study (3 datasets for training and 1 dataset for testing). Raw data (CEL files) of the 3 training datasets were downloaded from the GEO database, merged, and preprocessed using the R package ''affy'', based on the Robust Multichip Average (RMA) method (14). Hybridization probes were then mapped to genes (Entrez IDs) based on the platform table. Probes not mapping to genes and probes mapping to multiple genes were excluded. If two or more probes mapped to the same gene, we calculated the arithmetic mean of the probe values to represent the gene expression. Using the R package ''limma'' (15,16), DE genes in septic shock non-survivors compared with survivors were identified based on the following criteria: i) absolute log2 fold change (LFC) greater than 0.8; and ii) false discovery rate (FDR)adjusted P-value less than 0.05.
We further performed functional interpretation (both gene ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis) of the DE genes identified by DAVID 6.8 (10,11). The P-value threshold was set at 0.05 in the GO analysis (17). Hypergeometric testing (P-value threshold: 0.05) was used in the pathway analysis (18). Then, calculation of Pearson's correlation coefficients was performed between the DE genes based on their expression levels. A correlation threshold of 0.8 was used, and highly correlated gene pairs were chosen to construct the co-expression network using Cytoscape software, version 3.4.0 (12). Furthermore, using the R package ''pROC'' (19), we evaluated the classification performance of each DE gene by ROC curve plotting and calculation of area under the ROC curve (AUC) values. Then, we performed recursive feature selection of DE genes using the R package ''caret'' (20) and trained an SVM model for pediatric septic shock prognosis according to the selected features using the R package ''kernlab'' (21) (Gaussian RBF kernel; 10-fold cross-validation). External model validation was also performed using the testing dataset.

DE genes in pediatric septic shock non-survivors
A total of 8 studies were obtained through the original search, among which 4 adult studies were excluded. The remaining 4 microarray datasets (3 training datasets and 1 testing dataset all from the Affymetrix microarrays) (5)(6)(7)(8) were used for subsequent analysis. Table 1 presents detailed information about the 4 datasets.
For the 3 training datasets, pre-processing resulted in expression data of 20,464 genes in 35 pediatric septic shock non-survivors and 175 survivors. A total of 32 genes were found to be differentially expressed between non-survivors and survivors across the microarray datasets based on the criteria (absolute LFC 40.8, FDR-adjusted P-value o0.05) (Supplementary Table S1).   Table 2 presents the top 10 most significantly upregulated and top 10 most significantly downregulated genes. The top upregulated gene was olfactomedin 4 (OLFM4) (LFC=1.72). OLFM4 is a glycoprotein that functions in innate immunity, inflammation, and cancer (22). It was suggested to be a neutrophil subset marker in patients suffering from septic shock (23). The most downregulated gene was membrane metalloendopeptidase (MME) (LFC=-1.11). MME is widely present in organs and tissues, contributing to the inactivation of many biological peptides (24,25). Decreased MME expression capacity was detected in the neutrophils from patients with septic shock (25).

Enrichment analysis and co-expression network construction
For further functional investigation, we performed advanced analyses (GO and pathway analysis) of the DE genes. Figure 2A and Supplementary Table S2 show the results of GO biological process analysis, which were similar to those reported in the original microarray studies (5-8). The DE genes were significantly enriched in 6 GO terms, and the top 3 were ''acute inflammatory response to antigenic stimulus'' (P=0.013), ''response to yeast'' (P=0.021), and ''defense response to bacterium'' (P=0.024). In the pathway analysis, when DE genes were mapped to the KEGG database, no significant pathway was identified.
We also constructed a co-expression network using significantly correlated DE gene pairs (correlation coefficients greater than 0.8). The co-expression network contained 16 nodes and 23 edges ( Figure 2B). We further extracted sub-networks for the GO enrichment analysis. One GO term (''chemotaxis'') was significantly enriched in the sub-network composed of ENPP2, LOC100134 822, CREB5, CXCR2, and FCGR3B. In the sub-network containing ELANE, MPO, and PRTN34, 4 significant GO terms (''response to yeast'', ''negative regulation of growth of symbiont in host'', ''defense response to bacterium'', and ''response to lipopolysaccharide'') were identified.

Classification performance evaluation and validation
ROC curves and AUC values suggested that the top performing DE genes were DDIT4, PRG2, ARH GEF40, MME, CD24, CEP55, CD69, UHRF1, and KIF11 ( Figure 3), with the best single gene's AUC as great as 0.829. The results of recursive feature selection showed that 11 genes (Supplementary Table S3) were sufficient to achieve close to 91% prediction accuracy. Based on the selected features, we trained an SVM classifier (pediatric septic shock non-survivors vs survivors) and evaluated its performance by 10-fold cross-validation (cross-validation error: 0.133). The SVM classifier performed well in the training dataset (AUC=0.986, Figure 4A). The pre-processed testing dataset was also used for independent validation, and the SVM classifier was constructed according to the training datasets, showing modest performance in the testing dataset (AUC=0.722, Figure 4B).

Discussion
Many genes are reported to be associated with septic shock mortality (23,26,27). It is a challenge currently to identify the most important candidate genes and pathways in septic shock prognosis. Growth of high-throughput transcriptomic data has enabled the integrative analysis of multiple datasets to discover robust candidates for prognosis and treatment. Therefore, to identify potential transcriptomic markers in pediatric septic shock prognosis, we performed an integrative analysis of multiple public microarray datasets.
In this study, 32 DE genes were identified in pediatric septic shock non-survivors, among which 18 genes were upregulated and 14 were downregulated. The most upregulated gene was OLFM4. The most downregulated gene was MME. Among the 32 DE genes identified, OLFM4 was reported to be present in only 20-25% of peripheral blood neutrophils and identified a subset of human neutrophils (28). MME is widely present in organs and tissues, contributing to the inactivation of many biological peptides (24,25). Decreased MME expression capacity was detected in the neutrophils from patients with septic shock (25). CXCR2 is a key stimulant to immune cell migration as well as recruitment, especially for neutrophils. The CXCR2 signaling pathway is suggested to be a potential target for modification of neutrophil dynamics in inflammatory disorders (29). CEACAM8 belongs to CEA family, which is known to function as intercellular adhesion molecules. CEACAM8 is suggested to be involved in the regulation of the neutrophil adhesion (30). ELANE encodes neutrophil elastase, which is a cytotoxic serine protease stored in azurophil granules and released after neutrophil activation. ELANE also functions in neutrophil extracellular traps (31). Neutrophil dysfunction is known to promote sepsis, and the alterations of chemokines, cytokines, as well as other mediators lead to neutrophil dysfunction in sepsis. Thus, the DE genes identified in this study, such as OLFM4, MME, CXCR2, CEACAM8, and ELANE, probably take part in pediatric sepsis development by regulation of neutrophil function (32).
Furthermore, a series of DE genes identified in this study have not yet been investigated in pediatric septic shock prognosis, and the exact contributions of these genes to pediatric septic shock are not yet clear. Since these DE genes could serve as potential transcriptomic markers for pediatric septic shock, further research is required.
In the enrichment analysis of the 32 DE genes, ''response to yeast'' (P=0.021) was the second most enriched GO term. This is interesting because yeast infection has been reported to cause sepsis (33).
According to the literature, machine learning methods, including SVM, usually have high prediction accuracy (34). Hence, in this study, we trained an SVM classifier (pediatric septic shock in non-survivors vs survivors) using gene expression data to improve the performance of prognostic prediction. The SVM classifier performed well in the training dataset with an AUC value of 0.986 ( Figure 4A), compared with Pediatric Sepsis Biomarker Risk Model-II (35). However, it showed modest performance in the testing dataset (AUC=0.722, Figure 4B), suggesting that there could be an over-fitting problem. To the best of our knowledge, no gene expression-based SVM model has yet been reported for pediatric septic shock prognostic prediction. We hope the predictive model trained in this study facilitates better prognosis and treatment of pediatric septic shock.
Because of the existence of several limitations, the results of our study should be considered cautiously. First, the sample size used in our study was insufficient. Second, stratified analyses according to potential influential factors (age, sex, and platform usage) could not be performed in this study due to the lack of corresponding data. Third, both the biological knowledge base and the pathway data are currently incomplete. In consideration of the reported impact of age and sex on pediatric sepsis patients (36), stratified analyses, based on factors including age, sex, and platform usage, are required in the future. Further analysis based on larger sample sizes is also necessary. Moreover, to explore the exact roles of candidate DE genes in pediatric septic shock, experimental verification and functional studies of these DE genes must also be performed in the future.
In conclusion, consistently DE genes in pediatric septic shock non-survivors were identified in this study, and they could be potential transcriptomic markers. GO analysis suggested that these candidates were strongly correlated with acute inflammatory response to antigenic stimuli, response to yeast, and defense responses to bacteria. We also trained an SVM classifier (non-survivors vs survivors) using candidate transcriptomic markers. These results provided novel insights into the prognosis of pediatric septic shock and promoted the generation of prognostic gene sets.

Supplementary Material
Click to view [pdf].