Combination of meta-analysis and graph clustering to identify prognostic markers of ESCC

Esophageal squamous cell carcinoma (ESCC) is one of the most malignant gastrointestinal cancers and occurs at a high frequency rate in China and other Asian countries. Recently, several molecular markers were identified for predicting ESCC. Notwithstanding, additional prognostic markers, with a clear understanding of their underlying roles, are still required. Through bioinformatics, a graph-clustering method by DPClus was used to detect co-expressed modules. The aim was to identify a set of discriminating genes that could be used for predicting ESCC through graph-clustering and GO-term analysis. The results showed that CXCL12, CYP2C9, TGM3, MAL, S100A9, EMP-1 and SPRR3 were highly associated with ESCC development. In our study, all their predicted roles were in line with previous reports, whereby the assumption that a combination of meta-analysis, graph-clustering and GO-term analysis is effective for both identifying differentially expressed genes, and reflecting on their functions in ESCC.


Introduction
Esophageal squamous cell carcinoma (ESCC) is one of the six most common cancers worldwide, especially in China and other Asian countries (Lehrbach et al., 2003;Zhi et al., 2003). Despite improvements in detection, surgical techniques and chemoradiotherapy, the five-year survival rate remains low (Kato et al., 2002). Prediction is usually according to tumor, node and metastasis system (TNM) classification. However, TNM classification merely reflects the status of cancer progression at the time of diagnosis. In contrast, molecular biological analysis clarifies biological behavior during cancer progression. Thus, the combination of the two could be more accurate in reflecting the clinical outcome (Takeno et al., 2007).
Overexpression of steroid receptor coactivatior-3(SRC-3) was more frequently observed in primary ESCC in the late T stages (T3/T4) than in the earlier T1/T2 (Xu et al., 2007). Although over-expression of cysteine-rich 61 (Cyr61) was related to less overall survival of patients in stage I/II, there was no affect on the overall survival of patients in stage III/IV (Xie et al., 2011). Low claudin-4 expression was found to be significantly associated with histological differentiation, invasion depth, and lymph node metastasis. Low claudin-4 expression revealed an unfavorable influence on disease-free and overall survival (Sung et al., 2011). PRL-1 protein expression significantly correlated with the stage of ESCC, 79.4% of the cases (27/34) in stage III ESCC, and 33.3% of the cases (1/3) in stage 1 ESCC (Yuqiong et al., 2011).
However, additional molecular prognostic markers are essential as aids in developing more effective therapeutic strategies for better prognosis. In this study, the aim was to identify more differentially expressed genes in ESCC, and predict their underlying functions. Meta-analysis provides a powerful tool for analyzing microarray experiments by combining data from multiple studies, besides presenting unique computational challenges. The Bioconductor package RankProd provides a novel and intuitive tool for this, by detecting differentially expressed genes by means of the non-parametric rank product method (Hong et al., 2006). The graphclustering approach was used for identify-ing gene expression profiles that distinguish ESCC from normal samples. Furthermore, the relevant pathways in the cluster were also analyzed by GO term analysis, to so explain potential mechanisms in response to ESCC.

Data and Methods
Meta-analysis for expression profile and differentially expressed gene (DEG) analysis Two ESCC related expression profiles, GSE23400 and GSE20347, were obtained separately from a public functional genomics data repository GEO, based on the Affymetrix Human Genome U133A Array and Affymetrix Human Genome U133A 2.0 Array, respectively.
In the GSE23400 dataset, 53 ESCC and 53 matchednormal samples were analyzed. Contributors chose not to include clinical phenotypes in their GEO submission (Su et al., 2011). In the GSE20347 dataset, 17 ESCC and 17 matched-normal samples were approved by the Institutional Review Boards of the Shanxi Cancer Hospital, and the US National Cancer Institute (NCI). Cases diagnosed with ESCC between 1998 and 2001 in the Shanxi Cancer Hospital in Taiyuan, Shanxi Province, PR China, and considered candidates for curative surgical resection, were identified and recruited for participation in the study. None of the cases had undergone prior therapy. Shanxi was the ancestral home of all of them. After obtaining informed consent, cases were interviewed, as a means of obtaining information on demographics, cancer risk factors (e.g., smoking, alcohol consumption, and a detailed family history of cancer), and clinical information (Yoo et al., 2008;Hu et al., 2010).

Statistical analysis
DEGs for the GSE23400 and GSE20347 dataset were first independently identified with the limma method, whereupon the RankProd package was then applied to overcome heterogeneity. Only those with a percentage of false-positives (PFP) (Hong et al., 2006) = 1% were considered differentially expressed between treatments and controls.
The Spearman rank correlation (r) was used for assaying comparative target-gene correlations, to thus demonstrate the potential connection between DEGs. This coefficient (Cureton 1965), which is conceptually similar to the Pearson correlation, measures the strength of the associations between two variables. The significance level was set at r >0.9, which is a more stringent threshold than the empirical value (Fukushima et al., 2011). Detection of this level was with DPClus. All statistical tests were carried out with the R program. A detailed workflow is shown in Figure 1.

Co-expression Network analysis and graph-clustering
DPClus (Altaf-Ul- Amin et al., 2006), a graph-clustering algorithm that can extract densely connected nodes as a cluster, was used to identify co-expressed groups. It is based on the density-and-periphery tracking of clusters. In this study, the overlapping-mode with DPClus settings was used. The parameter settings of cluster property cp; density values were set to 0.5 (Fukushima et al., 2011).

GO Term enrichment analysis
The use of Gene Ontology (GO) terms by collaborating databases facilitates uniform queries. The controlled vocabularies are structured for queries at different levels, thereby also facilitating the assignment of properties to genes or gene products, also at different levels, depending on the depth of knowledge involved.
DAVID (Huang da et al., 2009) was used to identify which GO terms were significantly over-represented in the biological process. The terms with p-value <0.05 and count numbers >2 were considered as significant (Boyle et al., 2004;Guo et al., 2006).

Differently expressed gene selection and correlation network construction
The publicly available microarray data sets, GSE23400 and GSE20347, were obtained from GEO. Dif- Gao et al. 531 Figure 1 -Workflow of our study. The RankProd package was used for merging the GSE23400 and GSE20347 datasets, as was the Spearman Rank Correlation for constructing a co-expression network based on their expression profiles. The Graph-clustering approach was applied for identifying enriched clusters. Finally, the function annotation of each cluster was found through GO-Term enrichment analysis.
ferentially expressed genes (DEGs) with fold change >2 and p-value <0.05 were selected by microarray analysis. Following limma method analysis, 519 genes from GSE23400 and 1360 from GSE20347 were selected as DEGs. On applying RankProd packages for meta-analysis, 9 up-regulated genes and 1876 down-regulated ones, with a percentage of false-positives (PFP) 1% and fold change value >2, were considered differentially expressed. 1885 DEGs were finally collected after meta-analysis ( Figure 2). To obtain the relationships among DEGs, the coexpressed value r = 0.9 and corrected p-value = 0.01 were considered as threshold. Finally, a correlation network was constructed from 724 relationships among 202 DEGs (Figure 3).

Graph-clustering identifies modules significantly enriched in biochemical pathways
At r >0.9, DPClus (Altaf- Ul-Amin et al., 2006) was used for identifying 6 clusters, ranging in size from 8 to 22 genes, in the correlation network of ESCC ( Figure 3). Clusters 1, 4, 5 and 6, in particular, have mutual connections since they share the same genes. The more genes shared, the greater mutual connectivity (corresponding to thicker 532 Identify prognostic markers of ESCC  lines). Graph-clustering results are presented in Figure 4. The over-represented GO terms in the clusters were used to assess their significance. The results of graph-clustering by GO term enrichment analysis appear in Table 1.

Discussion
In this study, 1885 DEGs were first identified through meta-analysis. Among these, a correlation network was constructed with 202 DEGs producing 724 relationships. By applying graph-clustering, these 202 DEGs were then clustered Gao et al. 533 into six clusters. Clusters 1, 4, 5 and 6 seemed to be mutually connected. Details of these connections were confirmed by GO-term enrichment analysis, whereby it was shown that the genes of cluster 1 may be involved in the fucose catabolic process, whereas those in clusters 4 and 5 were mainly associated with epidermis development and differentiation. Al-though there was no connection between cluster 2 and the others, it was also significantly effective in ESCC invasion and metastasis. The genes in this cluster were related to cytoskeleton and extracellular matrix organization. Therefore, the proposal is to speculate on certain genes in these clusters, identifiably involved in ESCC.

534
Identify prognostic markers of ESCC CXCL12, also known as stromal cell-derived factor-1, was first discovered among the chemokines secreted by mouse bone-marrow stromal cells, with CXCR4 as its specific receptor. The two interact to form a coupled molecular pair, which plays a prominent role in regulating directional migration and proliferation in ESCC, where both are positively located within the membrane and cytoplasm (Wang et al., 2009). Expression of the two is significantly correlated with lymph node metastasis, in the tumor stage, gender and lymphatic invasion. The overall and diseasefree survival rate is significantly lower in patients with positive CXCL12 expression than in those with negative (Sasaki et al., 2009). Furthermore, the CXCL12-CXCR4 combination possibly induces up-regulated expression of matrix metalloproteinases, whereby their further involvement in extracellular matrix modeling and the mediation of metastasis in ESCC, (Zhang et al., 2005a;Bartolomé et al., 2006;Lu et al., 2010). Apparently, MMP-7, MMP-9, and MT1-MMP are also closely associated with invasion depth and venous invasion in ESCC (Samantaray et al., 2004). This predicted mechanism was in accordance with our GO-term analysis.

Cluster 3: CYP2C9
In ESCC, higher CYP2C9 expression levels occur in the early tumor stages (pT1/pT2), compared to more advanced local tumors (pT3/pT4). Its selective inhibition decreases tumor-cell proliferation and G0/G1 phase cell-cycle arrest (Schmelzle et al., 2010). Although, even with GO-term analysis, the role of these genes in cluster 3 remained unknown, the inference is their possible involvement in the tumor-cell cycle, in accordance with their function.
Cluster 4: TGM3 and MAL genes TGM3 is a member of a family of Ca 2+ -dependent enzymes, thought to be critically involved in the cross-linking of structural proteins and formation of the cell envelope (CE), thereby contributing to the rigid structures that play vital roles in shape determination and barrier functions. The down-regulation of TGM3 genes in human ESCC tissues may lead to incapacity to form CEs and sustain toxic material (Chen et al. 2000;Luo et al. 2004). Although TGM3 expression is significantly inversely correlated with the histological grade of esophageal carcinoma, there is no obvious correlation with lymph node metastasis and depth of invasion. So, there is every indication that this gene may be an important adhesion molecule expressed by epithelial cells, and thus regarded as an invasion suppressor molecule in ESCC (Liu et al., 2006). Although details of its role in ESCC were not clear, it is proposed that it may be involved in esophageal epithelial cell differentiation and keratinocyte differentiation, as previous described (Zhang et al., 2005b). The MAL gene, a T-cell differentiation antigen, was found to be down-regulated in 10 ESCC-patients bearing tumors in different stages of development (Kazemi-Noureini et al., 2004). It has already been shown to express in four alternatively spliced forms of transcripts during the intermediate and late stages of T-cell differentiation, and in that of epithelial cells (Alonso and Weissman, 1987). Its ectopic expression in carcinoma TE3 cells could lead to the repressed formation of tumors in nude mice, the inhibition of cell motility, and the production of apoptosis by the Fas pathway (Mimori et al., 2003), whereby the proposal that Mal may be a tumor suppressor gene in ESCC development.
Cluster 5: S100A9 gene S100A9, a calcium-binding protein belonging to the S100 family, was detected as significantly down-regulated in ESCC . Moreover, S100A9 staining is decreased in poorly and moderately differentiated ESCCs, when compared with those well-differentiated, hence the inferrence that loss of S100A9 expression in ESCC generally occurs along with worsening esophageal epithelial differentiation in histological grades . These genes, together with the other 11 S100 genes and epidermal differentiation complex (EDC) genes, have been mapped to human chromosome 1q21, which is a region of structural and numerical aberration, involved in esophageal carcinogenesis and progression (Luo et al., 2004). In brief, down-regulation of S100A9 is a common event in esophageal carcinogenesis and progression through affecting epithelial cell differentiation, proliferation and apoptosis, as well as the expression of genes encoding epidermal structural proteins in a calcium-dependent manner (Luo et al. 2003;Zhi et al., 2003;Zhou et al., 2005).

Common genes in clusters 4 and 5: EMP-1 and SPRR3
Clusters 4 and 5 were closely connected by several genes (Figure 4), such as EMP-1 and SPRR3. The expression of EMP-1, a member of the PMP22 family, is found to be down-regulated in esophageal cancers (Zinovyeva et al., 2010). Over-expression of this gene in the EC9706 cell-line leads to inhibited cell proliferation, S-phase arrest, and G1-phase prolongation, thus indicating its participation in the cell-cycle. In addition, EMP-1 transfection could induce different forms of gene expression, such as integrin beta 7 (ITGB7), integrin beta 8 (ITGB8) and cadherin 5 (CDH5), involved in cell signaling, cell adhesion and cellcell communication. Retinoic acid receptor beta (RAR-b) is another up-regulated gene induced by EMP-1. Retinoids can modulate epithelial cell-growth, differentiation, and apoptosis in vitro and in vivo by binding to specific nuclear retinoid receptors, such as RAR-b. They can also prevent abnormal squamous cell differentiation in non-keratinizing tissues. Therefore, the EMP-1 and RAR-b interaction might also play important roles in epithelial cell and keratinocyte differentiation . SPRR3, one of the substrates for TGM3, is a small proline-rich protein, abundantly expressed in oral and esophageal epithelia. Its expression, although less concomitant with TGM3 loss in ESCC (Luo et al. 2004), is highly induced during human epidermal keratinocyte differentiation, thus being considered a differentiation marker of squamous epithelium (Abraham et al., 1996). Recent reviews imply that SPRR3 is frequently down-regulated in ESCC, when compared to adjacent paired mucosa (Chen et al., 2000a;de A Sim o et al., 2011). Further studies have shown that exogenous expression of SPRR3 significantly suppresses ESCC cell formation by inducing CDK11p46 protein expression and apoptosis (Zhang et al., 2008). Therefore, SPRR3 might play a crucial role in the maintenance of normal esophageal epithelial homeostasis.
Finally, the combination meta-analysis, graph-clustering and GO term analysis is presumedly effective for identifying differentially expressed genes, and speculating on their respective roles in ESCC. Hence, other unidentified genes will be the focus of our future studies.