Whole-exome sequencing of oral epithelial dysplasia samples reveals an association with new genes

Abstract The genetic basis of oral epithelial (OED) is unknown, and there is no reliable method for evaluating the risk of malignant transformation. Somatic mutations are responsible for the transformation of dysplastic mucosa to invasive cancer. In addition, these genomic variations could represent objective markers of the potential for malignant transformation. We performed whole-exome sequencing of 10 OED samples from Brazilian and Chilean patients. Using public genetic repositories, we identified 41 deleterious variants that could produce high-impact changes in the amino acid structures of 38 genes. In addition, the variants were filtered according to normal skin and Native American genome profiles. Finally, 13 genes harboring 15 variants were found to be exclusively related to OED. High-grade epithelial dysplasia samples showed a tendency to accumulate highly deleterious variants. We observed that 62% of 13 OED genes identified in our study were also found in head and neck squamous cell carcinoma. Among the shared genes, eight were not identified in oral squamous cell carcinoma. To our knowledge, we have described for the first time 13 genes that are found in OED in a Latin American population, of which five genes have already been observed in oral squamous cell carcinoma. Through this study, we identified genes that may be related to basal biological functions in OED.


Introduction
The transition of normal epithelium to oral epithelial dysplasia (OED) and oral squamous cell carcinoma (OSCC) is the result of the accumulation of genetic and epigenetic alterations. 1 This complex relationship has not yet been clarified at the molecular level, which may explain treatment failures related to these diseases. Molecular stratification is an excellent tool for diagnosing benign and malignant tumors. This characterization uses last-generation technology based on sequencing and identification of typical mutations repeatedly found in the same lesions. 2 There are many extensive studies on different neoplasms in advanced stages, but very few studies have comprehensively described the genomic changes found in precancerous lesions. [3][4][5] However, the correct characterization of molecular alterations in potentially malignant oral disorders (PMODs) and the corresponding changes in the microenvironment associated with progression can help contribute to the development of biomarkers for early detection and risk stratification and of preventive interventions to reverse or delay the development of cancer. Considering the complexity and diversity of the changes to be determined, comprehensive methods such as next-generation sequencing (NGS) are necessary. 6 Whole-genome sequencing of OED lesions was first performed in a study 7 in 2009, where genomic imbalances were demonstrated in the lesions with a high risk of malignant transformation; the study also showed that the genomic profile of low-grade OED lesions that progressed to OSCC more closely resembled that of high-grade OED than that of lesions with the same histopathological diagnosis that did not progress to OSCC. 7 Another study suggested that most genomic alterations that lead to oral cancer occur in prior stages of the condition and are the result of gradual accumulations of random alterations rather than a single event. 8 In view of the paucity of studies on the application of large-scale sequencing in PMODs and the lack of the use of this technology in the Latin American context, the aim of the present study was to identify genomic alterations in 10 low-and high-grade OED samples obtained from Brazilian and Chilean patients with a clinical diagnosis of leukoplakia using whole-exome sequencing. An understanding of DNA variations in these samples may reveal new genes associated with malignant transformation and new therapeutic targets for OED lesions.

Methodology Patient samples
The study was approved by the Ethics and Biosafety Committees of the School of Dentistry, University of Chile (Approval nº . 2014/29) and was conducted in full accordance with local ethical guidelines and the principles of the Declaration of Helsinki. The patients were not directly involved in this study. Ten OED samples, six classified as low-grade dysplasia (LGD) and four as high-grade dysplasia (HGD), were selected from the databases of the Pathological Anatomy Service of the University of Chile, Federal University of Pelotas, and Federal University of Bahia. Histopathological diagnosis of OED was confirmed by a specialist using the binary system described by Kujan et al. 9 The selected samples were clinically diagnosed with leukoplakia according to the criteria of Van der Waal. 10 Oral medicine specialists performed the clinical and biopsy assessments.

Genomic DNA extraction
Genomic DNA was extracted from paraffinembedded t issue samples accordi ng to the manufacturer's instructions (Puregene ® DNA Purification Tissue Kit; Gentra Systems, Inc., Minneapolis, USA). DNA yield ranged from 0.2 to 2.0 μg. After processing each sample, 20 μL of the solution was obtained, and a 1.0-µL aliquot complemented with 99 µL Milli-Q ® water was analyzed using a spectrophotometer (DU-640, Beckman, Palo Alto, USA) to verify the quantity and purity of each DNA sample. The genomic DNA was stored at -80 °C.

Library preparation and sequencing
Whole-exome sequencing of the 10 samples was performed by Macrogen Inc. (Seoul, South Korea). The SureSelect XT Library Prep Kit (Agilent Technologies, Santa Clara, USA) was used for library preparation and exome sequencing of the DNA samples. The sequencing library was prepared by random fragmentation of each DNA sample, followed by 5′ and 3′ adapter ligations. The adapter-ligated fragments were amplified using polymerase chain reaction and purified on a gel. A post-capture classification protocol, SureSelectXT Target Enrichment System for Illumina Version B.2, was used to ensure high efficiency and coverage. The exome libraries were sequenced using 101bp paired-end reads in a Hiseq-2500 sequencer (Illumina ® ), with a target sequencing depth of at least 100x. The total number of bases, reads, GC (%), Q20 (%), and Q30 (%) were calculated for 10 samples. The GC content was 48.78%, and Q30 was 95.59%.

Data analysis
Data were analyzed in collaboration with the BioinfoGP group (Spanish National Biotechnology Centre, CNB-CSIC, Madrid, Spain). The GATK workflow was used for variant calling, 11 followed by quantification of the variants detected and comparative statistical analysis by group, based on the presence or absence of each variant, functional annotation, variant filtering, and format of the final data, to generate readable information for the end user.
The quality of the raw sequences was analyzed using the FastQC software. 12 The raw sequences were then aligned with BWA-MEM 6 against the human reference genome (Ensembl release GRCh38.91 13 ) usi ng default parameters. MarkDuplicates, BaseRecalibrator, and ApplyBQSR routines from GATK were applied to detect read duplicates and recalibrate alignment qualities. Recalibration was based on the 1000 Genomes Gold Standard provided by GATK.
The HaplotypeCaller and GenotypeGVCF GATK functions were used for SNP/indel calling and genotyping. Annotations for recalibration and variant filtering were also added. Recalibration with the VariantRecalibrator and ApplyVQSR GATK modules was based on HapMap, 14 1000-genome high-confidence omni SNPs, 15 and dbSNP. 16 Variants from each sample were combined into a single file for comparative analysis. The case-control routine included in SnpSift 17 was used to detect variants with differential occurrence in high-and lowrisk samples. P-values were obtained for different genetic models.
In addition, the effects of the sequence variants were evaluated using the following computer programs: PANTHER, STITCH, and PMut. Specific variant calling of HPV DNA sequences was performed for more than 170 HPV subtypes, including high-risk HPV strains 31 . Data were uploaded to the European Nucleotide Archive (https://www. ebi. ac. uk/ena) under accession number PRJEB42475.
The relationship between the number of variants per sample and per group was analyzed using the chi-square (X 2 ) and Fisher's exact tests. The correlation between the total number of variants and the degree of dysplasia was determined using Spearman's test. All statistical calculations were performed using GraphPad Prism 6.03 (San Diego, USA), and significance was set at p < 0.05.

Results
Clinical and histopathological diagnostic data of the samples are presented in Table and Figure 1. Patients had no history of head and neck tumors or genetic diseases. None of the 10 samples was identified by NGS as presenting any of the 170 HPV strains. A total of 3,055,651 variants were identified and analyzed in the 10 OED samples; of these, 90.4% (2,761,210) were single-nucleotide variants (SNVs). Further, 1069 variants showed   the highest probability of causing changes with a low, moderate, or high impact on amino acid structures. These variants were responsible for changes in 773 genes, with the impact on amino acid structures being low in 416, moderate in 319, and high in 38.
Among the 773 altered genes in the OED samples, the molecular functions that clustered with the largest number of genes evaluated were binding and catalytic activities, with 196 (25%) and 171 (22%) genes, respectively. These genes participate in more than 60 signaling pathways; however, the pathways that clustered a larger number of genes were the Wnt and integrin signaling pathways, involving 15 and 13 genes, respectively ( Figure 2). Forty-one variants had a high impact on the amino acid structure of 38 genes. Table 2 summarizes all high-impact variants for each sample. The mean number of variants was higher in the HGD group than in the LGD group (p > 0.05). In order to exclude other putative germline variants, the normal skin series (HIPSCI) database was accessed 29 , remaining 22 genes harboring 24 variants. In addition to the HIPSCI database, the PRJEB24629 series 30 was also analyzed remaining 13 genes harboring 15 variants exclusive to our samples of OED (Table 3).
Of the 15 variants observed in our samples, highlighting six variants that have not been described, 8 (53%) were SNVs and 7 (47%) included a frameshift variant as the calculated functional consequence. Among the 13 genes identified, 23% were detected exclusively in LGD samples, 54%, in HGD samples, and 23%, in both samples.
Finally, considering the mutated genes in head and neck malignant tumors, Table 4 shows the mutated genes shared between OED in the present

Discussion
Considering the paucity of studies applying NGS to OED samples in Latin America, the present study contributes to the description of genomic alterations in the exomes of 10 samples from Chilean and Brazilian patients with leukoplakia associated with low-and high-grade OED. Despite the difficulties in obtaining samples with sufficient quality and quantity to perform NGS, the present study included samples from patients who were clinically diagnosed with different types of leukoplakia that were compatible with low-and high-grade OED. The advantage of the present study is the representation of the correlation between genomic data of OED and clinical features.
Most variants identified in the present study were SNVs, in agreement with the literature, which highlights this as the most common alteration in whole-genome or -exome analysis. 32 Although indel variants are less frequent than SNVs, the former are, in general, extremely important in NGS because they are implicated in many constitutional and oncological diseases. 33 SNVs and indel data for each sample were filtered using databases of previously described human genetic variants to remove all known germinal variants. The highest and significant number of average SNVs was observed in the HGD group, in line with the results of studies showing that precancerous lesions are characterized by progressive changes in the DNA sequence, gene expression, and protein structures as well as by microscopic rearrangements. 3,4 In addition, a previous study reported a smaller number of mutations in LGD samples than in HGD and OSCC 8 samples.
Most genes identified in the present study, with a high impact on amino acid structures, are related to metabolic functions such as binding and catalytic activities and participate in the Wnt and integrin signaling pathways. Functional dysregulation of the Wnt signaling pathway has been shown to promote the development and progression of oral cancer. Therefore, it is an interesting target for treatment strategies for this cancer. 34 Integrins, the main components of cell adhesion, have been implicated in almost all stages of cancer progression, from development of the primary tumor to metastasis. 35 We identified six new variants, including three SNVs with functional consequences at the splice acceptor and splice donor sites and three deletions that lead to changes in the reading frame. Harboring these variants, the GAREM1, GIPC1, and LRRC37A2 genes are associated with mutations described in HNSCC and OSCC. 36   Similar to the HGD samples that showed the accumulation of a large number of total variants, the same trend was observed with LGD samples when only high-impact variants were considered; however, this association was not significant. This finding is in agreement with that of a previous study that showed a smaller number of mutations in LGD samples than in HGD samples 8 . Regarding clinical characteristics of patients, it was not possible to establish a relationship with the variants found because the sample size was too small for this type of correlation. The correlation between clinical characteristics and genomic variation has not yet been established.
Regarding the CELA1 gene, it is important to note that this study detected three variants with a high impact on this gene, which were identified in the same samples with the same type of inheritance. CELA1, also known as ELA1, encodes elastase-1 and is localized on chromosome 12q13, near the locus for diffuse non-epidermolytic palmoplantar keratoderma.
Expression of this gene has been observed in cultured human primary keratinocytes. 37 It is well known that the distribution of mutations in the genome is not completely random. In the present study, the observation of variants that affected CELA1 in the same group of samples may be explained by mutation showers, which are not yet fully understood. This phenomenon is characterized by the simultaneous presence of multiple mutations in the same gene or small regions of the chromosomes; 38 these alterations have not yet been explained or associated with cancer; however, an analysis of available mutation catalogs revealed clustered mutagenesis in multiple myeloma and prostate and head and neck tumors. 38 On comparison of the most severely affected genes identified here with the mutated genes reported by Wood et al., 8 , although they are different variants, a match was found with the mutated WNK1 gene only in OED samples, with the mutated MCF2L gene in OED and OSCC samples, and with the mutated LAMA5, FARP1, and SHANK2 genes exclusively in OSCC samples. 8 Observation of this match in only two mutated genes in OED might be explained by the fact that, contrary to the present study, which used clinically and histopathologically representative samples, Wood et al. 8 extracted OED areas from OSCC samples, which may have increased the probability of molecular differences. Although there were fewer mutations in OED samples than in OSCC samples, in that study, most mutations detected in OED samples were also observed in OSCC samples. 8 In the present study, no normal paired controls were available for Whole-exome sequencing; however, we used bioinformatics methods to remove false-positive variants. To address similarities and differences with normal epithelial tissue, the HIPSCI genome database 29 was analyzed, and the variants were found to be germlines that were not filtered within other public genetic repositories. It is difficult to explain how these variants found in normal skin were not previously found after filtering using tools such as 1000 genomes, cosmic, and dbSNP. O´Huallachain et al. 39 confirmed the presence of a large number of variations in somatic tissues. This can be partly explained by the fact that Table 4. Similarities of genes with variants identified in OED samples from the present study (n = 15 genes), with the groups of genes mutated in samples of HNC (n = 16807 genes), HNSCC ( n = 16099 genes), and OSCC (n = 2656 genes).  choosing the relevant tissue for the comparison of genomic profiles might influence the data analysis. It is also important to understand the history and diversification of human populations at the southern tip of the Americas. The South American population has a unique genetic conformation composed of pre-Columbian and post-colonization genetic signatures. This heterogeneity could play an important role in explaining the number of variants found in this study after variant calling based on international databases, including HIPSCI normal skin genomes. 29 To address this point, we used only the available genome profiles of native Americans representing the pre-Columbian southerners. Interestingly, 38% of the variants not described in either public genetic repositories or the normal skin database were found in the native American genomes, and these could be considered the germline. Of note, these "southern variants" were localized within the same mutated genes referred to in OED previous studies, such as Wood et al. 8 In addition, some genes are considered to harbor a high malignant potential. 8,40 It is also important to evaluate the roles of these 13 OED genes in malignancy. Similar to The Cancer Genome Atlas (TCGA), which was established for consultations on the genomic diversity of various types of cancer, the Pre-Cancer Genome Atlas project was started in 2016. However, mutated genes for any of the lesions that precede different types of cancers, including OED, are not yet available on these platforms 3,4 . Based on TCGA, 36 62% of the 13 OED genes identified in our study were also found in HNSCC. Among the shared genes, eight were not identified in OSCC.
It is important to mention that nine of the 11 altered genes identified in more than half of the samples in this study were also mutated in HNSCC and six were mutated in OSCC. In the study of Wood et al., 8 SHANK2 and FARP1 had mutated in OSCC, and MFC2L, in OSCC and OED, which were also altered in our samples and in those of other studies on HNSCC 36 and OSCC. 36 Similarities in the genomic profile of OED and cancer have been described for the intestinal, breast, brain, kidney, lung, and skin epithelium, showing that the mutational process can cause clonal evolution from normal to neoplastic cells. 3,4 However, Wood et al. 40 observed subclonal heterogeneity of OSCC in five OED samples and suggested that mutational changes in stages prior to cancer do not predict the onset of invasion. 40 In 2009, a study demonstrated completely different genomic profiles of OED that progressed to OSCC compared to some other OED that did not progress to cancer despite histological similarities. 7 Despite this observation, 10 years after these discoveries, h istopat holog ica l diag nosis, i ncludi ng t he identification of different stages of OED, continues to be the standard complementary test for outlining the risk of progression and treatment decisions for PMODs. However, this method remains subjective, and diagnostic agreement between pathologists is low. In addition, regardless of their degree, not all OEDs progress to OSCC, and this information cannot be obtained through histopathological analysis. Given this current scenario, our study describes 13 genes harboring 15 variants, providing relevant information for the genomic characterization of OED. Despite the small sample size, the use of a sample comprising a heterogeneous population and an in-depth genomic evaluation method, which currently has an extremely low error rate, allowed us to obtain highly reliable results. However, it is important to complement the main results of prospective multicenter studies using studies with large sample sizes, including validations and healthy controls.

Acknowledgments
All procedures performed in this study were in accordance with the ethical standards of the Scholl Dentistry of the University of Chile (Approval N o. 2014/29) committee and with the 1964 Declaration of Helsinki and its later amendments or comparable