Elementary screening of lymph node metastatic-related genes in gastric cancer based on the co-expression network of messenger RNA, microRNA and long non-coding RNA

Gastric cancer (GC) is the fifth most common cancer and the third leading cause of cancer-related deaths worldwide. The high mortality might be attributed to delay in detection and is closely related to lymph node metastasis. Therefore, it is of great importance to explore the mechanism of lymph node metastasis and find strategies to block GC metastasis. Messenger RNA (mRNA), microRNA (miRNA) and long non-coding RNA (lncRNA) expression data and clinical data were downloaded from The Cancer Genome Atlas (TCGA) database. A total of 908 differentially expressed factors with variance >0.5 including 542 genes, 42 miRNA, and 324 lncRNA were screened using significant analysis microarray algorithm, and interaction networks were constructed using these differentially expressed factors. Furthermore, we conducted functional modules analysis in the network, and found that yellow and turquoise modules could separate samples efficiently. The groups classified in the yellow and turquoise modules had a significant difference in survival time, which was verified in another independent GC mRNA dataset (GSE62254). The results suggested that differentially expressed factors in the yellow and turquoise modules may participate in lymph node metastasis of GC and could be applied as potential biomarkers or therapeutic targets for GC.


Introduction
Gastric cancer (GC) is the fifth most common cancer and the third leading cause of cancer-related deaths worldwide (1). The high mortality of GC might be attributed to delay in detection and is closely related to metastasis and recurrence. Lymph node metastasis occurs in 70% of patients with advanced GC (2,3). It is an early event in GC metastasis and an independent prognostic factor, and can significantly affect the prognosis of patients (4). Therefore, predicting, diagnosing and investigating lymph node metastasis in GC is very important for the prognosis and treatment of patients.
The molecular mechanism of lymph node metastasis has been preliminarily clarified, and mainly includes cell migration and degradation of extracellular matrix, tumor cell apoptosis and immune escape, formation of new lymphatic vessels, and other aspects (5,6). A series of growth factors, cytokines, chemokines, miRNA and long non-coding RNA (lncRNA) associated with lymph node metastasis have been discovered, which were found to interact with each other to form a complex regulatory network and are involved in various processes of lymph node metastasis in GC (7)(8)(9)(10)(11)(12)(13). For example, miRNA-375 is downregulated in gastric carcinomas and regulates cell survival by targeting PDK1 and 14-3-3z (14), miRNA-7 functions as an anti-metastatic miRNA in GC by targeting insulin-like growth factor-1 receptor (15), and miR-148a contributes to the maintenance of homeostasis in normal stomach tissue and plays an important role in GC invasion by regulating MMP7 expression. (16). As for lncRNAs, the HOTAIR functions as a competing endogenous RNA to regulate HER2 expression by sponging MiR-331-3p (12), HMlincRNA717 may play crucial roles during cancer occurrence and progression (8), and ATB plays an important role in epithelial-mesenchymal transition to promote invasion and metastasis through the TGFb/miR-200s/ZEB axis, resulting in a poor prognosis in GC (10). Although researchers have explored the factors that affect lymph node metastasis, most of the studies focus on one or several factors, and the molecular mechanism of lymph node metastasis is still unclear.
Recent developments in bioinformatics and statistical genomics provide biological systems approaches to better understand the organization of the transcriptome and transcriptional regulation. Among all of the systematic biology approaches, gene network analysis is a powerful approach that considers gene interactions. It has been widely applied in gene expression studies of humans and model organisms (17)(18)(19).
In the present study, we used large quantities of messenger RNA (mRNA)-seq and microRNA (miRNA)seq data in GC patients in The Cancer Genome Atlas (TCGA) database to screen mRNA, miRNA, and lncRNA with differential expression between the samples with and without lymph node metastasis, and then construct the co-expression networks based on the differential expression of various factors. The network module and Cox regression model were combined to screen survivalrelated genes. Moreover, the correlation between different expression levels of these genes and prognosis of GC patients was verified in another independent dataset.

Data sources
Samples of gastric cancer were selected from the TCGA database (http://cancergenome.nih.gov/), in which mRNAs data and miRNAs data were profiled from the Illumina platform. The lncRNAs data were annotated from mRNA transcriptomic database through matching to the HGNC database. In total, 396 samples with mRNA-miRNA-lncRNA paired samples were obtained, and 356 samples remained after removing 40 samples with unclear status in lymph node metastasis. A total of 210 lymph node metastasis samples and 146 non-lymph node metastasis samples were included.

Data preprocessing
After download from TCGA, the expression profiles in level 3 were merged and prepared. The mRNA, miRNAs and lncRNAs with X20% missing values were removed, while those with o20% missing values were replaced by mean values. The values of lncRNA and mRNA were assessed by PRKM, and miRNA values were assessed by RPM. Next, these values were transformed by log2 logarithms to obtain Gaussian distribution.

Identification of differentially expressed genes and functional enrichment analysis
Significant analysis microarray (SAM) (20) is an algorithm used to screen the differentially expressed genes (DEGs). When differential gene expression was simply checked by t-test or variance analysis, high rates of false positive results were produced under repeated tests. However, SAM is effective at correcting the false positive rates through controlling the false discovery rate (FDR) in multiple tests to filter DEGs with significant differences. An absolute value of log2 ratio X1.5 and FDR p0.001 were set as the threshold for determining the significance of gene expression difference.
The identified DEGs were analyzed in terms of gene ontology (GO) function and pathway enrichment analysis using the DAVID (Database for Annotation, Visualization, and Integrated Discovery) hypergeometric test (21).

Construction of co-expression networks and excavation of network modules
Based on information of disease-related gene expression profiles under GC, co-expression networks were constructed by calculating adjacency matrix A of a gene pair using the WGCNA package (ohttps://cran.r-project.org/ web/packages/WGCNA/index.html4) (22). To calculate the adjacency matrix, an intermediate quantity called the co-expression similarity s ij was first defined. The default method defines the co-expression similarity s ij as the absolute value of the correlation coefficient between the profiles of nodes i and j: where x i and x j are the vector expression values of gene i and j, respectively, and Cor is used to evaluate Pearson correlation coefficient of the two vector values. A weighted network adjacency was defined by raising the co-expression similarity to a power: With bX1, the adjacency function calculates the adjacency matrix from expression data. The adjacency in Equation 2 implies that the weighted adjacency a ij between two genes is proportional to their similarity on a logarithmic scale, log(a ij ) = b Â log(s ij ). Pearson correlation coefficient S ij is exponentially transformed into connection coefficient a ij , to achieve a reliable network. Network topology property is taken into account to excavate modules of co-expression networks in WGCNA. This algorithm analyses not only the relationship of two conjoint node genes, but other genes correlated to the two nodes. Connection coefficient a ij in co-expression network was turned into weight coefficient W ij by the following formula: W ij considers overlapping between neighbors of the two conjoint genes i and j. Network modules were excavated after hierarchical cluster analysis of W gene weighted matrix.

Survival analysis
Gastric carcinoma related genes were sorted out on the background of expression profile data, and then were constructed into the co-expression network according to their expression levels, which was next divided into various modules. Samples were hierarchically clustered based on module genes and the significant differences of survival time between samples were analyzed through K-M simple clustering. Finally, analysis of COX single variable regression (23) was carried out between survival time and factors in modules.

Preprocessing of expression profile data
There were 560 miRNAs and 12,474 mRNAs left after preprocessing. Totally 1,165 lncRNAs were obtained from transcripts of mRNA expression profiles matched to HGNC database (http://www.genenames.org/). The expression data of mRNA, lncRNA and miRNA after preprocessing are displayed in Figure 1. After standardization, the distribution values among samples were relatively uniform.

Screening of differentially expressed factors and functional enrichment analysis
The differentially expressed mRNA, miRNA and lncRNA were screened by SAM algorithm. The result showed that 908 differentially expressed factors with variance 40.5 were screened, including 324 differentially expressed lncRNAs, 42 differentially expressed miRNAs and 542 differential expressed genes (see Supplementary Table S1). These differentially expressed factors divided the samples into two classes as shown in Figure 2.
In all, 542 DEGs were functionally enriched by DAVID analysis. There were 15 major enriched GO terms assigned into the biological process categories, including immune response ( Table 1, also see Supplementary Table S2).

Construction of co-expression networks and mining of modules
Based on the above mentioned 908 differential factors (including differentially expressed genes, miRNA and lncRNA), co-expression networks were constructed by using WGCNA package in R language. The gene degrees in the co-expression network were submitted to the power distribution law (24), which showed in line with the free scale characteristic in biological networks ( Figure 3).
Furthermore, we carried out the module mining of genes in the co-expression networks. As shown in Figure 4, these differentially expressed factors in the co-expression networks were divided into four module groups presented by blue, brown, turquoise and yellow color, which included 73, 39, 89, and 35 differentially expressed factors, respectively. In addition, there were other 672 factors that could not be modular, and were presented in grey.

Classification validation of factors in four modules
The following analysis was the classification of 356 samples based on the module factors. It was found that yellow and turquoise modules could separate samples efficiently (group 3 was excluded when classified by yellow module because there were rare death samples in the group), and there were significant differences between the survival curves of isolated samples (Po0.01; Figure 5). By contrast, the genes in blue and brown modules could not separate samples effectively, and therefore, these genes were given up in the following survival time analysis.

Classification validation of factors in yellow and turquoise modules
GSE62254, the expression profile dataset of gastric cancer, was obtained from GEO database (http://www.ncbi. nlm.nih.gov/geo/query/acc.cgi?acc=GSE62254), including survival time information of 300 samples, which generated from the GPL570 platform. In the pretreated samples, 20,692 genes were found and their box plot distribution is shown in Figure 6.
Thirty-four genes within yellow module were gained from the GSE62254 database. Using these module genes, samples were well grouped into two classes. Moreover, the survival time in Figure 7A showed significant differences between the two groups of samples (P=0.0144). The similar results were received via 81 genes of turquoise module contained in GSE62254 database: samples were divided into three groups, and their survival times exhibited significant differences (P=0.00128; Figure 7B). These analyses demonstrated that selected genes of yellow and turquoise modules have significant correlation with survival time of GC samples.

Discussion
Lymph node metastasis and recurrence are the main factors that affect the prognosis of GC, and lymph nodes without metastasis have important immune monitoring functions. Therefore, it is very important to accurately determine the extent and degree of lymph node metastasis, and to carry out rational lymph node dissection. As an early and complicated event in GC metastasis, lymph node metastasis involves a series of functional and regulating genes (4), and therefore it is more meaningful to mine the co-expression network of various cancer related factors. In the present study, we downloaded large quantities of mRNA-seq and miRNA-seq data from the TCGA database to screen the mRNA, miRNA and lncRNA related to GC lymph node metastasis, and then constructed co-expression networks based on the differential expression of these factors. Compared with previous studies, which only focused on one or two factors of coding genes, miRNAs and lncRNAs (7)(8)(9)(10)15,(25)(26)(27), the present study  Fisher's exact test was used for statistical analyses, and the P value was adjusted by Bonferroni correction.
performed a systematic analysis of the three factors for the first time. There were 542 DEGs which were functionally enriched into 15 major GO terms in the biological process category, most of which were related to cancer, such as immune response, cell proliferation, cell migration, cell differentiation and cell adhesion.
Thanks to the rapid development in bioinformatics and statistical genomics, gene network analysis has become a powerful approach that can explore the interactions between genes and has been widely applied in gene expression studies of humans and model organisms (17)(18)(19). Meanwhile, genes that are highly interconnected within the network are usually involved in the same biological modules or pathways, and therefore, modular analysis also plays an important role in the analysis of gene co-expression network (28,29). Several studies have demonstrated the value of analyzing networks based on TCGA database. In the present study, 908 differentially expressed factors including 542 genes, 42 miRNAs and 324 lncRNAs were included in the network analysis, and furthermore, genes in the co-expression networks were used for the modular mining. Four module groups were coded in blue, brown, turquoise and yellow color, which included 73, 39, 89, and 35 differentially expressed factors. Except genes in the group 3 of the yellow module lacking sufficient death number, the other genes in yellow and turquoise modules could separate  samples efficiently, and there were significant differences between the survival curves of isolated samples.
Therefore, the remaining seven identified factors including three genes in yellow module (RTN1, GAPT, GRAP2) and five factors in turquoise module (CILP, SCRG1, TNFAIP8L3, MIR345 and HCG18) could be new factors related to survival of GC. In fact, there is evidence that these genes may be associated with several diseases and even cancer. For example, RTN1, a neuroendocrine cell specific protein, localized in endoplasmic reticulum, might be involved in the activation of the expression of androgenresponsive genes and related to prostate cancer (37). As an adapter protein, GRB2 has been identified as a major mediator in Ras-mitogen-activated protein kinase (MAPK) activation, which is essential for growth factor-induced cell proliferation and differentiation and plays a central role in embryo development and malignant transformation. Therefore, we believe that GAPT and GRAP2 are involved in the activation of MAPK and growth factor-induced cell proliferation and differentiation, which are often associated with the development of cancer (38). CILP, an extracellular matrix protein abundant in cartilaginous tissues, is implicated in common musculoskeletal disorders, including osteoarthritis and lumbar disc disease (39). It is worth  noting that the TNFAIP8 family members are usually associated with immune homeostasis and inflammatory cancer diseases. For example, TNFAIP8 itself usually functions as an oncogenic molecule and it is also associated with enhanced cell survival and inhibition of apoptosis, and TNFAIP8-like 2 (TIPE2) governs immune homeostasis in both the innate and adaptive immune system and prevents hyper-responsiveness (40). However, the function of TNFAIP8L3 remains unclear. Therefore, our study provides some insight into the emergent properties of prognostic genes, and further investigation of the functional roles of these newly identified factors is urgent for a functional validation system in GC.
In conclusion, our data provides a comprehensive bioinformatics analysis of genes and pathways, which may be involved in the lymph node metastasis of GC.  We found a total of 542 genes, 42 miRNAs and 324 lncRNAs, and then constructed the interaction networks of these differentially expressed factors. Furthermore, we conducted functional modules analysis in the network, and found that except for the genes in group 3 of the yellow module, the other genes in the yellow and turquoise modules could separate samples, and therefore these genes could be potential prognostic biomarkers. However, further analyses are still required to reveal their mechanism in the process of tumor genesis and development in GC.