Genome of Leptospira borgpetersenii strain 4E, a highly virulent isolate obtained from Mus musculus in southern Brazil

A previous study by our group reported the isolation and characterisation of Leptospira borgpetersenii serogroup Ballum strain 4E. This strain is of particular interest because it is highly virulent in the hamster model. In this study, we performed whole-genome shotgun sequencing of the strain using the SOLiD sequencing platform. By assembling and analysing the new genome, we identified novel features that had been overlooked in genome annotations of other strains belonging to the same species.

We performed a whole-genome shotgun analysis of the L. borgpetersenii serovar Ballum strain 4E to develop a more comprehensive characterisation of this isolate.
Bacterial culture and DNA extraction were performed in accordance with previously described methods (Kremer et al. 2016b). Whole-genome shotgun sequencing was performed using the ABI SOLiD v. 4 sequencing platform with a 50 base-pair (bp) single-end library.
Two assembly approaches were evaluated for the L. borgpetersenii strain 4E genome: de novo assembly and reference-guided assembly. De novo assembly was performed using Velvet with a range of values for k-mer length, expected coverage and coverage cutoff, and the assembly metrics were assessed using QUAST (Gurevich et al. 2013). Reference-guided assembly was performed by mapping the reads to the genome of L. borgpetersenii serovar Ballum strain 56604 (GenBank: CP012029.1, CP012030.1) using SMALT (www.sanger.ac.uk/science/tools/smalt-0). The resulting SAM file was then converted to BAM format and sorted using Samtools before a consensus sequence was extracted using Samtools, BCFtools, VCFutils.pl (Li et al. 2009) and GATK (McKenna et al. 2010). Genome annotation was performed using Genix (Kremer et al. 2016a) and manually reviewed and curated using Artemis (Rutherford et al. 2000).
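As a rough illustration of what extracting a consensus from mapped reads involves, the Python sketch below takes a per-position majority vote over reads that have already been placed on the reference and writes N where no read covers a position. It is a toy model under stated assumptions (ungapped placements, no base or mapping qualities, hypothetical function name and inputs) and is not the algorithm implemented by Samtools, BCFtools or GATK.

```python
from collections import Counter


def majority_consensus(reference_length, placed_reads, min_depth=1):
    """Toy consensus caller (hypothetical helper, not a Samtools function).

    placed_reads is a list of (start, sequence) pairs, where start is the
    0-based reference position of the first base of an ungapped read.
    Positions covered by fewer than min_depth reads are reported as 'N'.
    """
    columns = [Counter() for _ in range(reference_length)]
    for start, read in placed_reads:
        for offset, base in enumerate(read):
            position = start + offset
            if 0 <= position < reference_length:
                columns[position][base] += 1

    consensus = []
    for column in columns:
        depth = sum(column.values())
        if depth < min_depth:
            consensus.append("N")  # uncovered position becomes part of a gap
        else:
            consensus.append(column.most_common(1)[0][0])
    return "".join(consensus)


# Made-up reads for illustration only.
example_reads = [(0, "ACGT"), (2, "GTTA"), (2, "GTTA")]
print(majority_consensus(10, example_reads))
```

Positions without read support come out as runs of N, which is how uncovered regions such as repeated mobile elements appear as gaps in a reference-guided consensus.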
Variant calling was performed on the BAM file generated from the aligned reads using Samtools, BCFtools and VCFutils.pl to identify single nucleotide polymorphisms (SNPs) and insertions and deletions (INDELs). The effect of each variant was inferred from the annotation of L. borgpetersenii serovar Ballum strain 56604 using SnpEff (Reumers 2004).
The reference-guided assembly covered > 99.99% of the reference sequence, with a mean coverage of ~ 400x. Coverage was lacking at five assembly gaps, which were associated with mobile elements such as transposons. Because these elements can change position in the genome and are often present in multiple copies, they usually produce gaps in reference-guided assemblies or collapse into a single contig in de novo assemblies from short reads. The de novo assemblies generated by Velvet were highly fragmented, with more than 5,000 contigs and a very low N50 (53), making them unsuitable for any downstream analysis.
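For reference, N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal Python sketch of the calculation follows; the function name and the example lengths are illustrative and are not taken from the Velvet or QUAST output of this study.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Made-up contig lengths for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

An assembly made of thousands of very short contigs has an N50 close to the read length, which is why a low N50 signals a highly fragmented assembly.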
An overview of the features identified in the genome of L. borgpetersenii serovar Ballum strain 4E is shown in Table I. We identified a total of 3469 coding DNA sequences (CDSs), 37 transfer RNAs (tRNAs), 4 ribosomal RNAs (rRNAs), one transfer-messenger RNA (tmRNA) and five riboswitch loci. Although the protein-coding genes found were almost the same as those identified in the genome of strain 56604, our annotation pipeline allowed us to identify new non-coding features that were overlooked in the reference annotation: a tmRNA gene and riboswitches. TmRNAs act as tRNAs and contain a small open reading frame (ORF) in their structure that encodes a peptide responsible for many regulation processes, including targeting proteins for degradation (Hayes & Keiler 2010). Riboswitches are non-coding motifs present in the untranslated regions (UTRs) of some messenger RNAs (mRNAs) that act as cis-regulatory elements and bind specific metabolites to inhibit gene expression. Riboswitches are typically found in genes associated with vitamin metabolism, e.g., cobalamin (Garst et al. 2011, Serganov & Nudler 2013). Previous studies have demonstrated that riboswitch-regulated cobalamin (B12) autotrophy is a virulence factor in the Leptospira genus (Fouts et al. 2016). Therefore, a deeper annotation of the non-coding features may provide a better description of the resulting transcriptome.
The genes that presented missense mutations in the variant calling analysis are displayed in Table II, and their locations in the genome of L. borgpetersenii strain 4E are illustrated in the Figure. A total of 41 genes were predicted to be affected by missense mutations, although 33 of them had only one mutation. One of the genes, LB4E_3373, which encodes a protein from the PF07598 family, presented 27 missense SNPs relative to the genome of strain 56604. Orthologous genes from the PF07598 family have already been associated with adaptation to the host in L. interrogans and with regulation of gene expression during the life cycle and infection (Lehmann et al. 2013).
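A per-gene summary of this kind can be derived from annotated variant calls by counting missense variants for each gene. The Python sketch below shows one way to do so, assuming the annotation has already been reduced to (gene, effect) pairs; the gene names, the effect labels and the function name are placeholders and do not reproduce the SnpEff output format or the data of this study.

```python
from collections import Counter


def missense_per_gene(annotated_variants, missense_label="missense"):
    """Count missense variants per gene from (gene, effect) pairs."""
    return Counter(
        gene for gene, effect in annotated_variants if effect == missense_label
    )


# Hypothetical annotated variants for illustration only.
variants = [
    ("gene_a", "missense"),
    ("gene_a", "missense"),
    ("gene_a", "synonymous"),
    ("gene_b", "missense"),
    ("gene_c", "synonymous"),
]

counts = missense_per_gene(variants)
affected_genes = len(counts)
single_hit_genes = sum(1 for n in counts.values() if n == 1)
print(counts.most_common(), affected_genes, single_hit_genes)
```

Sorting the counts in descending order brings genes with an unusually high number of missense variants to the top of the list.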
Another highly mutated gene, LB4E_1801, contains 10 single-nucleotide polymorphisms, but its function remains unclear: no BLAST hit in UniProt (Apweiler et al. 2004) allowed a deeper annotation or provided any clue regarding its molecular function. We also identified five mutations in a gene that encodes an M23 peptidase (LB4E_1800), which has already been associated with fibronectin binding in Leptospira and other closely related genera, such as Treponema, and may contribute to the pathogenesis process.
Although de novo assembly is usually preferred for microbial organisms, it is associated with many drawbacks in obtaining a finished genome (Miller et al. 2010). Therefore, reference-guided assembly, based on an already-finished genome, may be a more reasonable approach when a closely related reference is available. In our case, the 4E and 56604 strains belong to the same species and serovar, so a de novo assembly was not required. In fact, the SOLiD sequencing platform offers high throughput, short read length (50 bp) and high accuracy (Liu et al. 2012); as such, it is more suitable for re-sequencing and reference-guided assembly than for de novo assembly.
The SOLiD sequencing process requires two hybridisation reactions to identify each base, so the probability of an erroneous identification or an artificial insertion/deletion tends to be much smaller than with other platforms, such as Illumina and IonTorrent. In fact, in cases of sequencing artefacts, decoding the colour-space data (csFASTA) to nucleotide-space format (FASTA), which is based on nucleotide transitions, would generate an apparently random sequence after the erroneous base position. Such a sequence would probably not align to the reference genome in the read mapping process (during a variant calling study) or be used in the assembly of a contig (in a de novo assembly). The reliability of this platform has already been demonstrated by previous studies, such as the benchmarking study performed by Ratan et al. (2013), which compared the accuracy of three different NGS platforms (ABI SOLiD, Illumina HiSeq and Roche 454 FLX) in the identification of SNPs in a human sample. In that study, the number of SNPs identified by SOLiD that were validated by mass spectrometry was higher than that observed for the other platforms. Therefore, although SOLiD is not a first option for microbial genomics, for which benchtop platforms are usually preferred, it may still be a valuable tool when aiming for a more accurate identification of mutations.
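The effect of a single colour-call error on decoding can be illustrated with the standard two-base colour code, in which each colour is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The Python sketch below uses an arbitrary example sequence rather than a read from this study, and the function names are illustrative.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def encode(sequence):
    """Return (first_base, colours) for a nucleotide sequence."""
    codes = [BASES.index(base) for base in sequence]
    return sequence[0], [a ^ b for a, b in zip(codes, codes[1:])]


def decode(first_base, colours):
    """Decode colours back to nucleotides, starting from the known first base."""
    current = BASES.index(first_base)
    decoded = [first_base]
    for colour in colours:
        current ^= colour  # each colour encodes a transition between bases
        decoded.append(BASES[current])
    return "".join(decoded)


sequence = "ATGGCATTCA"  # arbitrary example sequence
first, colours = encode(sequence)

# Simulate a single erroneous colour call at colour index 3.
corrupted = list(colours)
corrupted[3] = (corrupted[3] + 1) % 4

print(decode(first, colours))    # round trip of the uncorrupted colours
print(decode(first, corrupted))  # every base after the error is altered
```

Because each base is decoded from the previous one, a single wrong colour changes every downstream base of the read, whereas a true SNP changes two adjacent colours and leaves the rest of the read intact.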
Finally, a de novo assembly using SOLiD data results in a more fragmented draft genome than with other sequencing technologies, because the short read length creates many difficulties for the assembly algorithms owing to repeated regions along the genome that may be collapsed by the de Bruijn graphs (Alkan et al. 2010); as such, this method would not be appropriate in this case.
Table footnotes: a: includes other families of non-coding RNAs predicted by Genix which are neither tRNAs nor rRNAs; b: runs of "Ns" with a length of five nucleotides or more were considered assembly gaps, and those shorter than this threshold were considered INDELs or base-calling errors.
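The gap criterion stated in the table footnote (runs of Ns of five or more nucleotides count as assembly gaps, shorter runs do not) can be expressed as a short scan of the consensus sequence. The Python sketch below is one possible implementation; the function name and the example sequence are hypothetical.

```python
import re


def classify_n_runs(sequence, min_gap=5):
    """Split runs of 'N' into assembly gaps and shorter runs.

    Returns two lists of 0-based, end-exclusive (start, end) spans:
    runs with at least min_gap Ns, and runs shorter than min_gap.
    """
    gaps, short_runs = [], []
    for match in re.finditer(r"N+", sequence.upper()):
        span = (match.start(), match.end())
        if match.end() - match.start() >= min_gap:
            gaps.append(span)
        else:
            short_runs.append(span)
    return gaps, short_runs


# Made-up sequence for illustration only.
gaps, short_runs = classify_n_runs("ACGTNNNNNACGNNAC")
print(gaps, short_runs)
```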
In the context of Leptospira research, genomic data from highly virulent strains might provide useful information for the development of new vaccines and diagnostic methods and improve the understanding of bacterial pathogenesis and pathogen-host interactions. The high number of mutations in a gene that encodes a protein from the PF07598 family, which previous studies have suggested is related to pathogenesis, may be one of the reasons for the greater virulence observed in this strain, although further studies are necessary to validate this relationship. Additionally, the availability of a genomic characterisation of this strain might be useful for future epidemiological surveillance studies in southern Brazil.