Identification of candidate genes for lung cancer somatic mutation test kits

Over the past three decades, mortality from lung cancer has increased sharply and continuously in China, making it the leading cause of cancer death. The ability to identify the actual sequence of gene mutations may help doctors determine which mutations lead to precancerous lesions and which produce invasive carcinomas, especially using next-generation sequencing (NGS) technology. In this study, we analyzed the latest lung cancer data in the COSMIC database in order to find genomic "hotspots" that are frequently mutated in human lung cancer genomes. The results revealed that the most frequently mutated lung cancer genes are EGFR, KRAS and TP53. In recent years, EGFR and KRAS lung cancer test kits have been used to test lung cancer patients, but they have many disadvantages: low sensitivity, and labor-intensive and time-consuming procedures. In this study, we constructed a more complete catalogue of lung cancer mutation events including 145 mutated genes. With the genes on this list it may be feasible to develop an NGS kit for lung cancer mutation detection.


Introduction
Lung cancer is the most common cancer in terms of incidence and mortality throughout the world, accounting for 13% of all cases and for 18% of deaths in 2008 (Jemal et al., 2011). In China, lung cancer rates are increasing because smoking prevalence continues to either rise or show signs of stability (Youlden et al., 2008; Jemal et al., 2010). Lung cancer is most often diagnosed at late stages, when it has already presented local invasion and distal metastases (Perez-Morales et al., 2011). Therefore, the identification of early molecular events inherent to lung tumorigenesis is an urgent need, so as to provide a basis for intervention in carcinogenesis.
All cancers arise as a result of the acquisition of a series of fixed DNA sequence abnormalities (mutations), many of which ultimately confer a growth advantage on the cells in which they have occurred. Several mutated genes related to tumor growth, invasion or metastasis have been identified in lung cancer, and new agents that inhibit the activities of these genes have been developed with the aim of improving the outcome of lung cancer treatment (Dy and Adjei, 2002). Among these genes, EGFR (epidermal growth factor receptor) is frequently overexpressed in non-small-cell lung cancer (NSCLC) (Rosell et al., 2009). EGFR tyrosine kinase inhibitors (e.g. gefitinib and erlotinib) have been tested in trials for treating NSCLC (Fukuoka et al., 2003; Kris et al., 2003; Giaccone et al., 2004; Spigel et al., 2011; Liu et al., 2012). Furthermore, KRAS and TP53 gene mutations have been found in up to 30% of lung cancer cases and have been considered predictive factors of poor prognosis (Huncharek et al., 1999; Pao et al., 2005; Mogi and Kuwano, 2011). These frequently mutated genes can be used to design kits for early detection of carcinogenesis. For example, a kit from Life Technologies Corporation (Ion AmpliSeq™) was designed to detect 739 COSMIC mutations in 604 loci from 46 oncogenes and tumor suppressor genes, with emphasis on deep coverage of the genes KRAS, BRAF and EGFR for the detection of somatic mutations in archived cancer samples.
In this study, we analyzed the latest data on lung cancer, aiming to identify frequently mutated genomic "hotspot" regions in human lung cancer genes. The results are significant and promising, because the ability to identify the actual sequence of mutations may help determine which mutations lead to precancerous lesions and which produce invasive carcinomas. Thus, our study may contribute to improving lung cancer diagnosis and to designing better prognosis kits.

details and contains information on human cancers. The current release (v64) describes over 913,166 coding mutations of 24,394 genes from almost 847,698 tumor samples. To construct a complete dataset of cancer mutation information, we had to start by finding a complete catalogue of gene mutations in lung cancer patients. Therefore, we downloaded somatic mutation data from the COSMIC database. All genes selected for the COSMIC database came from studies in the literature and are somatically mutated in human cancer (Bamford et al., 2004). Based on this authoritative resource, we constructed a complete dataset of cancer mutation information for the analysis described below.

Lung cancer mutation extraction
As our aim was to collect data on lung cancer, we searched for mutation information with the web tool BioMart Central Portal. BioMart offers a one-stop solution for accessing a wide array of biological databases, including the major biomolecular sequence, pathway and annotation databases such as Ensembl, UniProt, Reactome, HGNC, WormBase and PRIDE (Haider et al., 2009). We used the Cancer BioMart web interface with the following criteria: 1. Primary site = "lung"; 2. Mutation ID is not empty. The first criterion ensures that the mutation occurs in lung tissue, and the second excludes samples without a mutation in a specific gene. In this way we obtained the list of mutations in lung cancer.
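The two selection criteria can be illustrated on a small in-memory table. This is only a sketch: the field names (`sample_id`, `gene`, `primary_site`, `mutation_id`) and the records are hypothetical and do not reproduce the actual BioMart export format.

```python
# Hypothetical records; field names and values are illustrative only and
# are not the schema of the BioMart/COSMIC export.
records = [
    {"sample_id": "S1", "gene": "EGFR", "primary_site": "lung", "mutation_id": "M1"},
    {"sample_id": "S2", "gene": "KRAS", "primary_site": "lung", "mutation_id": ""},
    {"sample_id": "S3", "gene": "TP53", "primary_site": "breast", "mutation_id": "M2"},
    {"sample_id": "S4", "gene": "KRAS", "primary_site": "lung", "mutation_id": "M3"},
]


def select_lung_mutations(rows):
    """Keep rows with primary site 'lung' (criterion 1) and a non-empty mutation ID (criterion 2)."""
    return [
        row for row in rows
        if row["primary_site"].strip().lower() == "lung"
        and row["mutation_id"].strip()
    ]


lung_mutations = select_lung_mutations(records)
print([row["sample_id"] for row in lung_mutations])  # ['S1', 'S4']
```

Sample S2 is dropped because it carries no mutation, and S3 because its primary site is not lung.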

Mutation frequency calculation
In order to identify the most important mutated genes in lung cancer, we calculated the mutation frequency for each mutated gene. In this calculation, we treated the same sample used in different experiments as different samples. For example, if an AKT1 mutation was found in two different experiments, AKT1 was assigned a mutation frequency of 2, even if both experiments were performed with samples from the same tissue of the same patient. Frequencies are sometimes presented as percentages. In this study, however, we did not divide the frequency by the total number of samples, because we focused only on how common the mutation is and how many of these mutations were identified. For example, if the mutation percentage was 100% but the number of samples with the mutation was only 3, the gene was not accepted for our diagnostic kit.
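The counting rule described above can be sketched as follows. The tuple layout and the sample and experiment labels are assumptions made for illustration; the point is that every record counts as one event and repeated samples are not collapsed.

```python
from collections import Counter

# Hypothetical (gene, sample, experiment) records; labels are illustrative.
mutation_records = [
    ("AKT1", "patient1_tissueA", "experiment1"),
    ("AKT1", "patient1_tissueA", "experiment2"),  # same sample, second experiment
    ("EGFR", "patient2_tissueB", "experiment1"),
]


def mutation_frequency(records):
    """Count one mutation event per record, without de-duplicating samples."""
    return Counter(gene for gene, _sample, _experiment in records)


freq = mutation_frequency(mutation_records)
print(freq["AKT1"])  # 2
```

The result is an absolute count rather than a percentage, so a gene mutated in 3 of 3 samples still has a frequency of only 3.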

Protein-Protein Interaction (PPI) network
The number of mutation events in the list of lung cancer mutations is very high, but some of these mutations are not specific to lung cancer. To find the key genes in this list, we therefore analyzed the relationships between the genes. We initially intended to use KEGG to explore these relationships. However, KEGG assigns only putative genes to specific biological pathways, and many genes cannot be placed accurately in any pathway. In the past few years, PPI databases have become a major tool for exploring biological relationships. The large body of protein-protein interaction data offers a way to infer a gene's function from its interacting partners. If an interacting gene is involved in a lung-related regulatory mechanism, the anchor gene is expected to show a similar function. Consequently, if the genes submitted to the PPI analysis have similar functions, they should form a regulatory network.
As there are many public PPI databases, each with its own features, we combined the following databases, as described in an earlier paper (Mathivanan et al., 2006): HPRD, IntAct, MIPS, BIND, DIP, MINT, PDZBase and Reactome. Genes from the mutation list were mapped to these PPI databases and a PPI network was constructed. We found that some genes were isolated from the main network and could therefore be excluded from our list of candidate genes for lung cancer. With this combined database, we were able to narrow down our lung cancer candidate gene list as much as possible.
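A minimal sketch of this mapping step is given below, under two assumptions: each database is reduced to a list of gene pairs, and "isolated" is read as having no interaction partner among the mutated genes. The source names and gene names are placeholders, not real interactions.

```python
# Toy interaction lists keyed by hypothetical source names; the gene names
# are placeholders and do not represent real interactions.
interaction_sources = {
    "source_1": [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")],
    "source_2": [("GENE_B", "GENE_A"), ("GENE_C", "GENE_Z")],
}
mutated_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]


def build_network(genes, sources):
    """Merge all sources into one undirected network restricted to the given genes."""
    neighbors = {gene: set() for gene in genes}
    for pairs in sources.values():
        for a, b in pairs:
            # Skip self-interactions and partners outside the mutation list.
            if a != b and a in neighbors and b in neighbors:
                neighbors[a].add(b)
                neighbors[b].add(a)
    return neighbors


network = build_network(mutated_genes, interaction_sources)
isolated = sorted(gene for gene, partners in network.items() if not partners)
print(isolated)  # ['GENE_D']
```

Storing partners in sets means an interaction reported by several databases, or in both orientations, is counted once.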

Results
The most complete catalogue of lung cancer mutation data
Using the methods described above, we obtained a complete list of lung cancer mutations (data not shown) comprising a total of 21,135 mutation events. To the best of our knowledge, this is the most complete and detailed catalogue of mutation events associated with lung cancer. Almost all of the 21,135 listed events are somatic mutations, with only two exceptions: mutation c.1334_1335ins17 in gene FLCN is a confirmed germline mutation, and mutation c.1579_1580GG > CT in gene SF3B1 is a nonspecified type of mutation. To obtain a profile of the mutation type distribution in lung cancer, we calculated the frequency of each mutation type (Figure 1). There are many mutation subtypes, such as missense, nonsense, deletions and insertions. Among them, missense mutations accounted for the largest proportion (61%).

456 Chen et al.

Figure 1 - Mutation types in the lung cancer genome. Mutation types included three major types: substitution, deletion and insertion. Each of the major mutation types was categorized as frameshift or in-frame. The latter, although not causing a shift in the triplet reading frame, can nevertheless lead to the encoding of abnormal protein products.
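The share of each mutation type is simply its count divided by the total number of events. The sketch below uses a toy list of event types; the values are illustrative and are not the study's data.

```python
from collections import Counter

# Toy list with one mutation type per event (illustrative values only).
event_types = ["missense", "missense", "missense", "nonsense", "deletion"]


def type_distribution(types):
    """Return each mutation type's share of all events, as a percentage."""
    counts = Counter(types)
    total = sum(counts.values())
    return {name: 100.0 * count / total for name, count in counts.items()}


shares = type_distribution(event_types)
print(shares["missense"])  # 60.0
```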

Calculation of mutation frequency in lung cancer
The gene mutation list contains 21,135 mutation events related to 20,906 unique samples. In order to screen for the most important mutated genes, we calculated the mutation frequency of each gene in the list. Figure 2 shows the 23 most frequently mutated genes in lung cancer; the top three are EGFR, KRAS and TP53, with mutation frequencies of 10957, 3106 and 2034, respectively. Next, the mutation events in each gene were sorted (Figure 3), showing that the mutation types of each gene vary dramatically, even among the top 23 mutated genes. As shown in Figure 3, TP53 had the largest number of mutation types, more than ten times the number for KRAS, although the mutation frequency of KRAS was higher than that of TP53.

Construction of the PPI network
By mapping the mutated genes to the PPI databases, we constructed a PPI network, shown in Figure 4. To mine this network in depth, we calculated the interaction weight (number of neighbors) of each core node and visualized the relationship between weight and number of mutation events for each gene (Figure 5). Table 1 and Figure 5 show that genes with high mutation frequencies also had higher interaction weights. For example, the top three mutated genes, EGFR, KRAS and TP53, had interaction weights of 32, 37 and 41, respectively. On the other hand, some genes with relatively low mutation frequencies were core nodes in the PPI network. For example, AKT1 has a high PPI weight (41) but a low mutation frequency (6).
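Because the interaction weight is defined as the number of neighbors, it equals the degree of a node in the network. The sketch below computes it from a toy adjacency map and flags hub genes with few mutation events; the gene names, counts and thresholds are hypothetical and chosen only for illustration.

```python
# Toy adjacency map and mutation counts; gene names and values are placeholders.
network = {
    "GENE_A": {"GENE_B", "GENE_C", "GENE_D"},
    "GENE_B": {"GENE_A"},
    "GENE_C": {"GENE_A"},
    "GENE_D": {"GENE_A"},
}
frequency = {"GENE_A": 4, "GENE_B": 250, "GENE_C": 12, "GENE_D": 1}


def interaction_weights(adjacency):
    """Interaction weight of a gene = number of distinct neighbors (node degree)."""
    return {gene: len(partners) for gene, partners in adjacency.items()}


def hubs_with_few_mutations(adjacency, freq, min_weight, max_frequency):
    """Genes that are well connected in the network but rarely mutated."""
    weights = interaction_weights(adjacency)
    return sorted(
        gene for gene, weight in weights.items()
        if weight >= min_weight and freq.get(gene, 0) <= max_frequency
    )


weights = interaction_weights(network)
print(hubs_with_few_mutations(network, frequency, min_weight=3, max_frequency=10))  # ['GENE_A']
```

In the study's data, AKT1 (weight 41, frequency 6) is the example of such a gene.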

Candidate genes for sequencing kits
After mining the COSMIC database and analyzing the lung cancer PPI network, we screened the most important mutated genes in lung cancer based on one of the following criteria: PPI weight > 7 and mutation frequency > 5. After selection, 145 genes meeting the cutoff criteria were screened out (Table 2). We consider that these mutated genes could be used to design sequencing kits for diagnostic purposes.
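The screening step can be sketched as a threshold filter. The weights and frequencies for EGFR, KRAS, TP53 and AKT1 are those quoted in the text; `GENE_X` and `GENE_Y` are hypothetical entries added to show rejected cases. Because the wording leaves open whether a gene must meet both cutoffs or only one, the sketch exposes this as the `require_both` parameter, which is an assumption of this example and not part of the original method.

```python
gene_stats = {
    "EGFR": {"frequency": 10957, "weight": 32},
    "KRAS": {"frequency": 3106, "weight": 37},
    "TP53": {"frequency": 2034, "weight": 41},
    "AKT1": {"frequency": 6, "weight": 41},
    "GENE_X": {"frequency": 3, "weight": 2},   # hypothetical: fails both cutoffs
    "GENE_Y": {"frequency": 20, "weight": 1},  # hypothetical: passes frequency only
}


def select_candidates(stats, min_weight=7, min_frequency=5, require_both=True):
    """Select genes with PPI weight > min_weight and/or frequency > min_frequency."""
    combine = all if require_both else any
    return sorted(
        gene for gene, s in stats.items()
        if combine((s["weight"] > min_weight, s["frequency"] > min_frequency))
    )


print(select_candidates(gene_stats))  # ['AKT1', 'EGFR', 'KRAS', 'TP53']
print(select_candidates(gene_stats, require_both=False))
```

With `require_both=False`, `GENE_Y` is also selected, which shows how much the candidate list depends on how the two cutoffs are combined.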

Discussion
Many researchers have attempted to find a complete mutation profile of each cancer. In this study, we obtained a list of lung cancer mutations totaling 21,135 mutation events. We believe that to this date this list is the most complete and detailed catalogue of lung cancer mutation events available. Mutations from Stage I to Stage II, from cell line to biopsy, from small cell carcinoma to NSCLC, were almost all included in this list.
As expected, calculating the mutation frequency for each gene in this list showed EGFR, KRAS and TP53 to be the three most frequently mutated genes in lung cancer. In addition, these three genes were hub nodes in the PPI network. EGFR and KRAS have been recognized as lung cancer oncogenes for years. A 2004 investigation of the effect of gefitinib therapy found somatic mutations of EGFR in 15 of 58 unselected tumors from Japan and in one of 61 from the United States (Paez et al., 2004). EGFR has since been accepted as a target for lung cancer therapy, and EGFR mutations may predict sensitivity to gefitinib. In recent years, developing EGFR mutations into a diagnostic target has been a research hotspot. Maheswaran et al. (2008) used molecular characterization of circulating tumor cells as a strategy for noninvasive serial monitoring of tumor genotypes during treatment. It is known that most lung adenocarcinoma-associated EGFR mutations confer sensitivity to specific EGFR tyrosine kinase inhibitors. Politi and Lynch (2012) found that EGFR exon 19 insertion mutations are also sensitive to this class of drugs. All these findings suggest that lung cancer patients should be tested for EGFR mutations.

Figure 5 - PPI core genes showing number of neighbor genes vs. somatic mutation frequency. Each dot represents a lung cancer gene; genes with more than 10 neighbors or with more than 10 COSMIC somatic mutation events are shown.
After EGFR, the second most important gene in the development of lung cancer is KRAS. As early as 2001, Johnson et al. (2001) found that mice carrying KRAS mutations were highly predisposed to a range of tumor types, predominantly early-onset lung cancer. Furthermore, mutations of KRAS and EGFR can be combined to predict prognosis. For example, Massarelli et al. (2007) found that patients with both EGFR mutation and increased EGFR copy number had a > 99.7% chance of objective response to EGFR-TKI therapy, whereas patients with KRAS mutation, with or without increased EGFR copy number, had a > 96.5% chance of disease progression. They concluded that KRAS mutation should be included as an indicator of resistance in the panel of markers used to predict response to EGFR-TKI lung cancer therapy. Because these core genes in the PPI network are strongly related to lung cancer, we believe that this PPI network contains the most important genes related to lung cancer.
Many companies test for lung cancer using somatic mutations in only four genes (EGFR, KRAS, BRAF and PI3K). As expected, these genes are all included in our list (Table 2; mutation frequency of BRAF = 130, weight = 25; mutation frequency of PIK3CA = 93, weight = 48). BRAF encodes a RAS-regulated kinase that mediates cell growth and the activation of the malignant transformation kinase pathway (Sithanandam et al., 1990). Brose et al. (2002) found that BRAF mutations in human lung cancers may identify a subset of tumors sensitive to targeted therapy. Furthermore, an in vivo study tested an inhibitor of the last of the four genes, PI3K, in lung cancer treatment (Engelman et al., 2008) and concluded that inhibitors of the PI3K-mTOR pathway may be active in cancers with PIK3CA mutations and, when combined with MEK inhibitors, may effectively treat KRAS-mutated lung cancers.
As EGFR and KRAS kits are widely used, we listed our EGFR and KRAS mutation events in Tables 3 and 4. In these tables, we sorted the mutations in EGFR and KRAS by frequency, with "Y" marking a typical mutation covered by the detection kits supplied by many companies, and "-" marking a mutation whose genomic location is similar to that of some of the other detected mutations. Above all, "-" signals that many different kinds of mutation occur in the same region. Traditional methods such as PCR are unable to detect such complicated mutations. This is the first advantage that a Next-Generation Se-

It is urgent to develop an NGS kit for detecting lung cancer mutations. Our gene list can serve as the basis for a sequencing kit for somatic mutation detection. The 145-gene set includes all of the genes currently used for somatic mutation detection (EGFR, KRAS, BRAF and PIK3CA; Saal et al., 2005) and may provide a feasible choice for an NGS kit. With progress in sequencing technology, mutations in lung cancer patients can be detected in one day or less. Applied to cancer genome sequencing, this technology can speed up cancer research, and a kit for diagnosis or recurrence evaluation should be introduced into clinical care as soon as possible, in order to offer patients a better chance of reduced suffering and longer survival.