Identification of SNPs in RNA-seq data of two cultivars of Glycine max (soybean) differing in drought resistance

Vidal, Ramon Oliveira; Nascimento, Leandro Costa do; Maurício Costa Mondego, Jorge; Amarante Guimarães Pereira, Gonçalo; Falsarella Carazzolle, Marcelo

doi:10.1590/S1415-47572012000200014

Abstract

The legume Glycine max (soybean) plays an important economic role in the international commodities market, with a world production of almost 260 million tons for the 2009/2010 harvest. The increase in drought events in the last decade has caused production losses in recent harvests. This fact compels us to understand the drought tolerance mechanisms in soybean, taking into account its variability among commercial and developing cultivars. In order to identify single nucleotide polymorphisms (SNPs) in genes up-regulated during drought stress, we evaluated suppression subtractive libraries (SSH) from two contrasting cultivars upon water deprivation: sensitive (BR 16) and tolerant (Embrapa 48). A total of 2,222 soybean genes were up-regulated in both cultivars. Our method identified more than 6,000 SNPs in tolerant and sensitive Brazilian cultivars in those drought stress related genes. Among these SNPs, 165 (in 127 genes) are positioned at soybean chromosome ends, including transcription factors (MYB, WRKY) related to tolerance to abiotic stress.

single nucleotide polymorphisms; deep sequencing; drought resistance

RESEARCH ARTICLE

Identification of SNPs in RNA-seq data of two cultivars of Glycine max (soybean) differing in drought resistance

Ramon Oliveira Vidal^I,II; Leandro Costa do Nascimento^I; Jorge Maurício Costa Mondego^III; Gonçalo Amarante Guimarães Pereira^I; Marcelo Falsarella Carazzolle^I,IV

^ILaboratório de Genômica e Expressão, Universidade Estadual de Campinas, Campinas, SP, Brazil

^IILaboratório Nacional de Biociências/Associação Brasileira de Tecnologia de Luz Síncrotron,Campinas, SP, Brazil

^IIICentro de Pesquisa e Desenvolvimento em Recursos Genéticos Vegetais, Instituto Agronômico de Campinas, Campinas, SP, Brazil

^IVCentro Nacional de Processamento de Alto Desempenho, Universidade Estadual de Campinas, Campinas, SP, Brazil

^{Send correspondence to} Send correspondence to: Marcelo Falsarella Carazzolle Laboratório de Genômica e Expressão, Universidade Estadual de Campinas Cidade Universitária Zeferino Vaz 13083-970 Campinas, SP, Brazil E-mail: mcarazzo@lge.ibi.unicamp.br

ABSTRACT

The legume Glycine max (soybean) plays an important economic role in the international commodities market, with a world production of almost 260 million tons for the 2009/2010 harvest. The increase in drought events in the last decade has caused production losses in recent harvests. This fact compels us to understand the drought tolerance mechanisms in soybean, taking into account its variability among commercial and developing cultivars. In order to identify single nucleotide polymorphisms (SNPs) in genes up-regulated during drought stress, we evaluated suppression subtractive libraries (SSH) from two contrasting cultivars upon water deprivation: sensitive (BR 16) and tolerant (Embrapa 48). A total of 2,222 soybean genes were up-regulated in both cultivars. Our method identified more than 6,000 SNPs in tolerant and sensitive Brazilian cultivars in those drought stress related genes. Among these SNPs, 165 (in 127 genes) are positioned at soybean chromosome ends, including transcription factors (MYB, WRKY) related to tolerance to abiotic stress.

Key words: single nucleotide polymorphisms, deep sequencing, drought resistance.

Introduction

Soybean (Glycine max) is a legume crop that plays an important economic role in the international market, with a world production of almost 260 million tons for the 2009/2010 harvest. Brazil ranks as the world's second largest producer and exporter, with about 25% of the production.

Soybean production is influenced by weather oscillations, especially long periods of drought. In the Brazilian soybean culture, the frequency of drought events has increased in recent decades, probably associated with the weather changes in the world (Stokstad, 2004; Schiermeier, 2006). For example, in the states of the south of Brazil, responsible for 40% of domestic production, losses have been as much as 25% of production in recent harvests. During this time some drought tolerant and sensitive cultivars were isolated from these regions. An understanding of the molecular mechanisms governing such contrasting phenotypes upon water deprivation could provide insights for the creation of new cultivars and help in assisted selection.

Single nucleotide polymorphisms (SNPs) are single base differences between DNA sequences of individuals or lines. They can be assayed and exploited as highthroughput molecular markers. SNP markers have the potential for use in association genetics approaches (Cardon and Bell, 2001; Flint-Garcia et al., 2003). The recent availability of high-throughput DNA sequencing data has enabled studies based on highly informative SNPs. The evaluation of SNPs in large EST sequence data sets from agricultural crops has been employed to generate highdensity genetic maps and identify variable genomic regions (Du et al., 2003; Choi et al., 2007; Novaes et al., 2009; Pindo et al., 2008; Duran et al., 2009). The scalability and availability of SNPs in highly automated genotyping assays has made this molecular marker the first choice in genetic linkage and association studies in a variety of species.

High quality reference genome sequences and resources used to perform low coverage resequencing by novel sequencing technologies such as 454 Life Sciences (Barbazuk et al., 2007), Illumina Solexa (Van Tassell et al., 2008), and SOLiD (Melum et al., 2010) on different individuals, cultivars or even species are prerequisites for the traditional method of whole genome SNP discovery. In this context, genomic sequences of different individuals are aligned to a reference genome and nucleotide variation is detected.

This study used high-throughput mRNA sequencing (RNA-seq) reads derived from suppression subtractive libraries (SSH) to identify SNPs in up-regulated genes from tolerant (Embrapa 48) and sensitive (BR 16) cultivars submitted to drought stress. Such SNPs can be useful for assisted selection of soybean varieties with higher drought tolerance.

Material and Methods

Construction of cDNA libraries and sequencing

Soybean genotype Embrapa 48 (tolerant) and BR 16 (sensitive) were analyzed under two experimental conditions: drought stress and normal irrigation (for further details see Rodrigues et al., 2012, this issue). Leaves and roots from stressed and control plants were collected at three different times (25-50, 75-100, 125-150 min). RNA was isolated from tissue samples and used to construct suppression subtractive hybridization (SSH) cDNA libraries (Rodrigues et al., 2012). These three libraries, enriched in genes up-regulated during drought stress, were sequenced using Illumina/Solexa technology. The reads corresponding to genes enriched in such libraries were used in SNP mapping (see below).

Gene identification from RNA-seq data

The reference transcriptome assembly was constructed using 1,276,813 soybean ESTs available at NCBI. The bdtrimmer software (Baudet and Dias, 2007) was used to exclude ribosomal, vector, low quality and short (less than 100 bp) sequences (using default parameters). The remaining sequences were assembled with the CAP3 program (Huang and Madan, 1999) using default parameters, generating 60,747 unigenes (30,809 contigs and 29,938 singlets) (Nascimento et al., 2012, this issue).

The RNA-seq reads from SSH libraries from tolerant and sensitive cultivars were submitted to quality filtration considering bases greater than Q20 and merged in one single file. The information about the cultivars was included on the read ID to facilitate SNP genotyping. Bowtie software (Langmead et al., 2009), considering default parameters (maximum of 2 mismatches), was used to map the reads against the reference transcriptome and the output file was saved in SAM format (Figure 1).

SNP detection

SNP detection was performed with the SAMtools pileup program (Li et al., 2009), which found the variations in the SAM file, followed by VarScan (Koboldt et al., 2009), which identifies and filters variants based on read counts, base quality and allele frequency (Figure 1).

We developed a Perl script to compare the filtered SNP lists generated by the pipeline described above for the two datasets. Putative SNPs were tagged if the reads involved were mapped unambiguously on the reference transcriptome and the minor allele appeared at least 10 times. The SNPs were discarded if the depth was less than 20 and the frequency of one allele was more than 80%. For each candidate SNP, the algorithm accessed the reads over that position in the SAM file, and observed if all variations occurred only between the two cultivars.

The unigenes identified through mapping of RNAseq reads that presented polymorphic sites between both cultivars were annotated and grouped into Gene Ontology classification (GO terms) using blast2go software (Conesa and Götz, 2008). In order to identify the chromosome position of polymorphic genes, the unigenes were mapped to the soybean genome assembly (Schmutz et al., 2010) using Exonerate software (Slater and Birney, 2005).

Results and Discussion

Alignment of RNA-seq reads to reference transcriptome

The soybean data available for bioinformatics analysis comprise the reference genome (Willians 82 cultivar; Schmutz et al., 2010), gene models and the EST assembly described in Material and Methods. We chose this assembly as reference to the mapping in order to avoid (1) splicing alignment problems and (2) the absence of the untranslated regions.

After evaluating the SSH libraries, a total of 12,285,871 reads of 45 bp and 30,326,963 reads of 76 bp were obtained for sensitive and tolerant cultivars, respectively. For the sensitive cultivar, 6,317,010 reads were aligned to 7,039 contigs and 2,659 singlets, providing an average depth of coverage of 651 reads by reference sequence. For the tolerant cultivar, 6,120,258 reads were aligned to 15,667 contigs and 6,284 singlets, providing an average depth of coverage of 279 reads by reference sequence. We searched for SNPs in contigs that mapped into both cultivars. After mapping, 7,897 contigs have reads assigned to both cultivars, and were used in SNP identification.

Polymorphism detection

The identification of sequence polymorphisms was performed using SAMtools pileup and VarScan software. We identified a total of 44,510 variations (in 11,000 unigenes) that come from SNPs within each cultivar, SNPs between the cultivars and reference assembly, and SNPs between the cultivars.

To identify inter-cultivar SNPs in the soybean reference sequences and to identify the SNPs between both cultivars (i.e. non-allelic SNPs), we used an in-house SNP filter developed to identify the positions of robust candidate sequence polymorphisms relative to the aligned Solexa reads from each cultivar. We applied another filter requiring at least 10 reads on each cultivar and a maximum of 80% difference between the major and minor allele.

The 6,698 putative polymorphisms were identified in over 2,222 transcripts in tolerant and sensitive cultivars (~3 SNPs/gene). The majority of these polymorphisms represent allelic SNPs (intra-cultivar SNPs) and are not useful as molecular markers for soybean breeding. Nevertheless, we found 165 putative SNPs between tolerant and sensitive cultivars in a total of 127 unigenes (Supplementary Material Table S1). Figure 2 summarizes the GO annotation. As expected, the GO analysis of these 127 genes revealed that many are related to stress response and other ontologies possibly related to stress. Among such genes are a series of signal transducers (calmodulin, ankyrin, GTP binding proteins) and transcription factors (i.e., WRKY, MYB, Zinc Finger, Homeodomain-ZIP), all of which were cataloged in a set of soybean transcription factors (TFs) database focusing on abiotic stress responsive transcription factors (Mochida et al., 2009). Among these soybean TFs, the WRKY and MYB genes were experimentally evaluated. Zhou et al. (2008) verified that soybean WRKY genes provide tolerance to abiotic stresses in transgenic Arabidopsis plants. Soybean MYB genes were considered part of the stress tolerance apparatus based on salt and freezing stress assays in transgenic plants (Liao et al., 2008). The presence of variability in genes that likely coordinate the first steps of stress signaling, thus controlling a series of protective proteins against drought effects, denote such genes as possible markers for assisted selection. We consider all of these polymorphisms as strong candidates for molecular markers for selective breeding.

Thumbnail

Mapping SNPs in soybean chromosomes

Putative SNPs detected in uniquely mapped reference sequence unigenes were plotted along the soybean chromosomes. Alignment with the soybean genome showed that the genes with these identified putative SNPs were not distributed uniformly across the genome. Instead, they are more prominent at the chromosome ends (Figure 3). Wu et al. (2010) detected that many SNPs were clustered in gene-rich, high-recombination euchromatic regions in soybean chromosomes. This positional trait of SNPs may be a consequence of intense recombination/mutation events that are crucial for increased variability in autogamous species such as soybean.

Conclusion

The identification of SNPs in contrasting cultivars for drought stress may be useful to breeders in Marker Assisted Selection (MAS) or even in Genome Wide Selection (GWS). They can be added to the upcoming markers derived from high-throughput gene sequencing. We believe that the presence of SNPS in transcription factors is outstanding. Such proteins could be controlling a series of genes responsive to stress, making them good candidates for transgenesis and as starting points to understand drought tolerance mechanisms in soybean.

Acknowledgments

The authors would like to thank all researchers of the Brazilian Soybean Genome Consortium (GENOSOJA) involved in the generation of data used in this paper, and CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil) for financial support.

Internet Resources

Soybean transcription factors database, http://soybeantfdb.psc.riken.jp (October 10, 2011).

License information: This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary Material

The following online material is available for this article:

Table S1 - All genes with intra-cultivar SNP annotations.

This material is available as part of the online article from http://www.scielo.br/gmb.

Table S1 - Click to enlarge

Barbazuk WB, Emrich SJ, Chen HD, Li L and Schnable PS (2007) SNP discovery via 454 transcriptome sequencing. Plant J 51:910-918.
Baudet C and Dias Z (2007) New EST trimming procedure applied to SUCEST sequences. In: Proceedings of the Second Brazilian Conference on Advances in Bioinformatics and Computational Biology. Springer-Verlag, Berlin, pp 57-68.
Cardon LR and Bell JI (2001) Association study designs for complex diseases. Nat Rev Genet 2:91-99.
Choi IY, Hyten DL, Matukumalli LK, Song Q, Chaky JM, Quigley CV, Chase K, Lark KG, Reiter RS, Yoon MS et al. (2007) A soybean transcript map: Gene distribution, haplotype and single-nucleotide polymorphism analysis. Genetics 176:685-96.
Conesa A and Götz S. (2008) Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int J Plant Genomics 2008:e619832.
Du CF, Liu HM, Li RZ, Li PB and Ren ZQ (2003) Application of single nucleotide polymorphism in crop genetics and improvement. Yi Chuan 25:735-739.
Duran C, Appleby N, Clark T, Wood D, Imelfort M, Batley J and Edwards D (2009) AutoSNPdb: An annotated single nucleotide polymorphism database for crop plants. Nucleic Acids Res 37:D951-D953.
Flint-Garcia SA, Thornsberry JM and Buckler ES (2003) Structure of linkage disequilibria in plants. Annu Rev Plant Biol 54:357-374.
Huang X and Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res 9:868-877.
Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK and Ding L (2009) VarScan: Variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25:2283-2285.
Langmead B, Trapnel C, Pop M and Salzberg S (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25:2078-2079.
Liao Y, Zou HF, Wang HW, Zhang WK, Ma B, Zhang JS and Chen SY (2008) Soybean GmMYB76, GmMYB92, and GmMYB177 genes confer stress tolerance in transgenic Arabidopsis plants. Cell Res 18:1047-1060.
Melum E, May S, Schilhabel MB, Thomsen I, Karlsen TH, Rosenstiel P, Schreiber S and Franke A (2010) SNP discovery performance of two second-generation sequencing platforms in the NOD2 gene region. Hum Mutat 31:875-885.
Mochida K, Yoshida T, Sakurai T, Yamaguchi-Shinozaki K, Shinozaki K and Tran LS (2009) In silico analysis of transcription factor repertoire and prediction of stress responsive transcription factors in soybean. DNA Res 16:353-369.
Nascimento LC, Costa GGL, Binneck E, Pereira GAG and Carazzolle MF (2012) A web-based bioinformatics interface applied to Genosoja Project: Databases and pipelines. Genet Mol Biol 35(suppl 1):203-211.
Novaes E, Drost DR, Farmerie WG, Pappas Jr GJ, Grattapaglia D, Sederoff RR and Kirst M (2009) High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics 9:e312.
Pindo M, Vezzulli S, Coppola G, Cartwright DA, Zharkikh A, Velasco R and Troggio M (2008) SNP high-throughput screening in grapevine using the SNPlex genotyping system. BMC Plant Biol 8:e12.
Schiermeier Q (2006) The costs of global warming. Nature 439:374-375.
Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature 463:178-183.
Slater G and Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6:e31.
Stokstad E (2004) Global survey documents puzzling decline of amphibians. Science 306:391.
Van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC and Sonstegard TS (2008) SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Meth 5:247-252.
Wu X, Ren C, Joshi T, Vuong T, Xu D and Nguyen HT (2010) SNP discovery by high-throughput sequencing in soybean. BMC Genomics 11:469.
Zhou QY, Tian AG, Zou HF, Xie ZM, Lei G, Huang J, Wang CM, Wang HW, Zhang JS and Chen SY (2008) Soybean WRKY-type transcription factor genes, GmWRKY13, GmWRKY21, and GmWRKY54 confer differential tolerance to abiotic stresses in transgenic Arabidopsis plants. Plant Biotechnol J 6:486-503.

Send correspondence to:

Marcelo Falsarella Carazzolle

Laboratório de Genômica e Expressão, Universidade Estadual de Campinas

Cidade Universitária Zeferino Vaz

13083-970 Campinas, SP, Brazil

E-mail:

mcarazzo@lge.ibi.unicamp.br

Publication Dates

Publication in this collection
01 June 2012
Date of issue
2012

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

[1] Barbazuk WB, Emrich SJ, Chen HD, Li L and Schnable PS (2007) SNP discovery via 454 transcriptome sequencing. Plant J 51:910-918.

[2] Baudet C and Dias Z (2007) New EST trimming procedure applied to SUCEST sequences. In: Proceedings of the Second Brazilian Conference on Advances in Bioinformatics and Computational Biology. Springer-Verlag, Berlin, pp 57-68.

[3] Cardon LR and Bell JI (2001) Association study designs for complex diseases. Nat Rev Genet 2:91-99.

[4] Choi IY, Hyten DL, Matukumalli LK, Song Q, Chaky JM, Quigley CV, Chase K, Lark KG, Reiter RS, Yoon MS et al. (2007) A soybean transcript map: Gene distribution, haplotype and single-nucleotide polymorphism analysis. Genetics 176:685-96.

[5] Conesa A and Götz S. (2008) Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int J Plant Genomics 2008:e619832.

[6] Du CF, Liu HM, Li RZ, Li PB and Ren ZQ (2003) Application of single nucleotide polymorphism in crop genetics and improvement. Yi Chuan 25:735-739.

[7] Duran C, Appleby N, Clark T, Wood D, Imelfort M, Batley J and Edwards D (2009) AutoSNPdb: An annotated single nucleotide polymorphism database for crop plants. Nucleic Acids Res 37:D951-D953.

[8] Flint-Garcia SA, Thornsberry JM and Buckler ES (2003) Structure of linkage disequilibria in plants. Annu Rev Plant Biol 54:357-374.

[9] Huang X and Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res 9:868-877.

[10] Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK and Ding L (2009) VarScan: Variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25:2283-2285.

[11] Langmead B, Trapnel C, Pop M and Salzberg S (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.

[12] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25:2078-2079.

[13] Liao Y, Zou HF, Wang HW, Zhang WK, Ma B, Zhang JS and Chen SY (2008) Soybean GmMYB76, GmMYB92, and GmMYB177 genes confer stress tolerance in transgenic Arabidopsis plants. Cell Res 18:1047-1060.

[14] Melum E, May S, Schilhabel MB, Thomsen I, Karlsen TH, Rosenstiel P, Schreiber S and Franke A (2010) SNP discovery performance of two second-generation sequencing platforms in the NOD2 gene region. Hum Mutat 31:875-885.

[15] Mochida K, Yoshida T, Sakurai T, Yamaguchi-Shinozaki K, Shinozaki K and Tran LS (2009) In silico analysis of transcription factor repertoire and prediction of stress responsive transcription factors in soybean. DNA Res 16:353-369.

[16] Nascimento LC, Costa GGL, Binneck E, Pereira GAG and Carazzolle MF (2012) A web-based bioinformatics interface applied to Genosoja Project: Databases and pipelines. Genet Mol Biol 35(suppl 1):203-211.

[17] Novaes E, Drost DR, Farmerie WG, Pappas Jr GJ, Grattapaglia D, Sederoff RR and Kirst M (2009) High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics 9:e312.

[18] Pindo M, Vezzulli S, Coppola G, Cartwright DA, Zharkikh A, Velasco R and Troggio M (2008) SNP high-throughput screening in grapevine using the SNPlex genotyping system. BMC Plant Biol 8:e12.

[19] Schiermeier Q (2006) The costs of global warming. Nature 439:374-375.

[20] Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature 463:178-183.

[21] Slater G and Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6:e31.

[22] Stokstad E (2004) Global survey documents puzzling decline of amphibians. Science 306:391.

[23] Van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC and Sonstegard TS (2008) SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Meth 5:247-252.

[24] Wu X, Ren C, Joshi T, Vuong T, Xu D and Nguyen HT (2010) SNP discovery by high-throughput sequencing in soybean. BMC Genomics 11:469.

[25] Zhou QY, Tian AG, Zou HF, Xie ZM, Lei G, Huang J, Wang CM, Wang HW, Zhang JS and Chen SY (2008) Soybean WRKY-type transcription factor genes, GmWRKY13, GmWRKY21, and GmWRKY54 confer differential tolerance to abiotic stresses in transgenic Arabidopsis plants. Plant Biotechnol J 6:486-503.