Hidden Markov models applied to a subsequence of the Xylella fastidiosa genome

Dependencies in DNA sequences are frequently modeled using Markov models. However, Markov chains cannot account for heterogeneity that may be present in different regions of the same DNA sequence. Hidden Markov models are more realistic than Markov models since they allow for the identification of heterogeneous regions of a DNA sequence. In this study we present an application of hidden Markov models to a subsequence of the Xylella fastidiosa DNA data. We found that a three-state model provides a good description for the data considered.


Introduction
The rate of sequence data generation in recent years has provided abundant opportunities not only for the development of new approaches to problems in computational biology but also for the exploration of the already known techniques on data that have never been analysed before.
The starting point in most data analysis consists of the use of well established methodology.As the analysis progress, data particularities may require the development of specific tools that are more suitable to better describe and model the data.Creation of new methods requires deep understanding of the current ones, especially when these methods are incredibly powerful and are not as known as they should be due to their mathematical and computational complexity.We consider that hidden Markov Models (HMM) exemplify this notion very well since although these models are not new, we believe that molecular biologists are not aware of the possibilities that these models provide.
Our aim in this study is to discuss dependencies and heterogeneity in DNA data and how they can be appropriately accounted for by the use of HMM.We applied this sort of model to a subsequence of the Xylella fastidiosa (Xf) genome as a way to suggest possible analysis for the whole genome.
According to Lambais et al. (2000), Xylella fastidiosa is a bacteria associated with diseases that cause tremendous losses in many economically important plants, including citrus.Xylella fastidiosa is the causal agent of Citrus Variegated Chlorosis (CVC), a disease that affects all commercial sweet orange varieties and that represents a major concern to the Brazilian citrus industry.The plant pathogen attacks citrus fruits resulting in juiceless fruits of no commercial value.Xylella fastidiosa is the first plant pathogen to have its genome (the total genetic information stored in the chromosomes of an organism) completely sequenced.In addition, it is probably the least previously studied of any organism for which the complete genome sequence is available.
Data sets generated by sequencing the entire Xylella fastidiosa genome pose new challenges since now biologists need quantitative tools and statistical methods to help them to analyze sequences.Some recent publications about Xylella fastidiosa signal the need not only for the application of current statistical methods available to analyse its sequenced data but also for statistical research to attack its particularities.Chen et al. (2000) analysed sequenced data from 16 strains of Xylella fastidiosa originating from nine different hosts.They studied aspects such as sequence heterogeneity in the classification of X. fastidiosa at the subspecies level.The studies by Qin et al. (2000) and Mehta et al. (2001) are concerned with the evaluation of Xylella fastidiosa genetic diversity isolated from diseased citrus and coffee in Brazil.
Due to the huge size of the datasets, statistical analyses for the whole genome of many organisms demand the use of high power state of the art computers.That may represent a major problem since we do not have enough available for this purpose.
In this study we fit hidden Markov models to a dataset of the bacteria Xylella fastidiosa genome.Model selection is performed using the Bayesian Information Criterion (BIC) and Akaike's Information Criteria (AIC).In section 2 we talk about dependencies in DNA data.In section 3 we discuss heterogeneity in DNA sequences.Hidden Markov Models are introduced in section 4. In section 5 we briefly introduce AIC and BIC for model selection.Phage lambda and Xylella fastidiosa datasets are analysed in section 6.

Dependencies in DNA Data
An obvious first summary of a DNA sequence is just the distribution of the four base types.Although it would be convenient for mathematical modeling if the four bases were equally frequent, almost all empirical studies show an unequal distribution.That means that a simple independence model for DNA sequences have their uses, but only go a little way.
We need to take into account in a model the fact that neighboring bases in DNA sequences are not independent.According to Tavaré and Giddings (1989), associations between adjacent bases will lead to associations between more distant bases and an estimate of how far the relations extend may be found from Markov chain theory.
According to Weir (1996), Markov chain analyses are of use at the genome level, rather than at the level of an individual gene, since the last may involve very short sequences that are not sufficient to demonstrate the presence of higher order chains.The same author observes that it is unlikely that the same Markov chain can describe the whole genome, and if a Markov chain has been fitted to a genome, no biological mechanism is implied, but useful questions can be answered.For example, the frequency of particular subsequences (words) can be predicted.
According to the website http://www.accessexcellence.org/AE/AEC/, in genetic engineering it is common to use the many enzymes that are able to modify or join existing DNA molecules, or to aid in the synthesis of new DNA molecules.For example, the enzyme DNA polymerase makes possible the attachment of two or more DNA molecules to one another.The enzyme DNA ligase breaks DNA molecules into fragments, while the so called restriction endonuclease enzyme (REE) functions by "scanning" the length of a DNA molecule.Once the REE encounters its particular specific recognition sequence (word), it will bond to the DNA molecule and cut it in a predictable and reproducible way.It is important to use Markov chains to help a biologist to estimate the expected number of fragments produced when a specific restriction enzyme is applied to the genome.
Markov chains might describe DNA sequences in terms of their nucleotide composition, i.e., as a string of letters from a four-letter alphabet, {A, C, G, T}.Let us denote each one of the four base types as states.We are going to in-troduce some terminology and notation useful for Markov chains.
Generally speaking, for a given subject, let X t denote the response on a categorical variable at time t, t = 0, 1, ..., T. The sequence (X 0 , X 1 , X 2 , ...) is an example of a stochastic process, an indexed family of random variables.In this paper X t indicates the nucleotide at position t in the sequence.
Without invoking any biological mechanism, a Markov chain of order r implies that the base present at certain position in a sequence depends only on the bases present at the previous r positions.In more formal grounds, a stochastic process is a rth-order Markov chain if, for all t, the conditional distribution of X t + 1 , given X 0 , ..., X t , is identical to the conditional distribution of X t + 1 , given X t , ..., X t -r + 1 .Given the states at the previous r times, the future behavior of the chain is independent of the past behavior before those r times.For a first-order Markov chain with I possible states, the conditional probabilities with i, j = 1, ..., I are called transition probabilities.The extension for higher orders is immediate.If η ij (t) does not depend on t, the Markov chain is called homogeneous.Statistical inference for Markov chain uses standard methods of categorical data analysis, such as log-linear models.Some useful references are Anderson and Goodman (1957), Birch (1963), Bishop et al. (1975), McCullagh and Nelder (1989), Agresti (1990), and Avery et al. (1999).

Heterogeneity in DNA Sequences
Markov chains and log-linear models are important tools to help us describe local properties of DNA sequences.However, Markov chains cannot account for the heterogeneity that may be present in different regions of the same DNA sequence.The basic assumption of this kind of model is that the chain is homogeneous, meaning that the same transition probability matrix is assumed true for the whole sequence being analysed.However, biologists know that coding and non-coding regions of DNA present different nucleotide frequencies.Thus a Markov model would predict some behavior that is not observed in the data.Therefore, this kind of model may be of little practical use in a variety of problems.
An example of heterogeneous DNA is presented by Bernardi and Bernardi (1986).Working with biochemical aspects of DNA, they explain that the nuclear genome of warm-blooded vertebrates exhibits a compositional compartmentalization, in that it consists mainly of a mosaic of very long DNA segments, the isochores.According to the authors, isochores are characterized by fairly homogeneous regions in C + G content, and distinct isochores present distinct proportions of C + G.The authors also state that genome does not present very many isochores and that het-erogeneity within an isochore is very low but is high between isochores.Heterogeneity may be due to differences in patterns of base composition and dependence between neighboring bases, and it might reflect functional and structural differences between regions.
It is possible to describe those heterogeneous unobserved regions of the genome of a given organism by using statistical tools instead of biochemical ones that would then be used more parsimoniously.The referred tools are statistical models that can account for heterogeneity that is present in the sequences.This is the subject of our next discussion.

A Hidden Markov Model for DNA Sequences
In this section we are going to present some hidden Markov models developed by Churchill (1989).These models are still very popular (see Boys et al., 2000).We will make a brief description restating some aspects of section 4 in Churchill (1989).For major details about this issue the referred paper should be consulted.
While the bases A, C, G, T represent observed outcomes and for short will be denoted outcomes, the homogenous unobserved regions we are looking for will be called hidden states and for brevity will be denoted states.Our job is to estimate how many hidden states there are and to present a map describing where they are located.The number of states is considered to be finite and fixed and corresponds to the different regions of the DNA.We introduce now some notation and definitions needed for describing hidden Markov models for DNA sequences.
Consider a sequence of random variable {Y i : i = 1, ..., n} with distribution determined by a corresponding sequence of unobserved states {s i }.Denote the sequence of observed outcomes and states up to time t by, respectively, y t = {y 1 , ..., y t } and s t = {s 1 , ..., s t }.
Admitting a fixed number of states and multinomial outcomes, let y t = (y t,0 , ..., y t,m-1 ) be a vector whose components are all zero except for one equal to unity, indicating which of the m possible outcomes is observed.Each observation is associated with one of r states indicated by the vector s t = (s t,0 , ..., s t,r-1 ).There is a vector π 0 of initial probabilities associated to s 1 , such that Σ i π 0i = 1.Thus, for the π 0i 's there are r -1 parameters to estimate.
The distribution of y t given that the state at time t is k is multinomial, that is, y t | s t,k Multinomial(1, p 0,k , ..., p m-1,k ).The parameter p i,k is the probability of observing outcome i when the current state is k, subject to the constraint Therefore, for the p ij 's there are r x (m-1) parameters to estimate.
The observation equations, considering independence between the outcomes are P(y s p It is also possible to allow for Markov dependence between the observed outcomes.In the case of first-order dependence, the probability of observing outcome j given that the previous outcome was i and the current state is k is p ij,k , where Therefore for the p ij,k 's there are r x m x (m-1) parameters to estimate.
The observation equations allowing for first-order dependence are P(y y , s p The state process is a Markov chain on the r states.Denote the r x r matrix of state transition probabilities by Λ = (λ ij ).Thus, for the λ ij 's there are r x (r -1) parameters to estimate.
The system equations can be written as: The observation and system equations are assumed to be completely specified.The marginal posterior distribution of the state at time t, Pr(s t | y n ) is called the smoothed estimate of s t .Graphic displays of the underlying state process are produced by plotting the smoothed estimates against the sequence index t.That represents the mentioned map that describes where the homogeneous regions of the genome are located.
A recursive updating algorithm can be applied as follows to reconstruct the underlying state process.The general filtering and smoothing equations needed in the algorithm are Filter: to begin, suppose that Pr(s t-1 | y t-1 ) is known.A prediction of the state at time t (predictive equations) can be computed using the system equation and the filtered densities are: P(s y P(y s P(s y P(y t, j t-1 t t ,j t ,j t-1 i, j i r 1 ) ) Smoother: the joint distribution of adjacent states is: ) ) The recursive updating algorithm requires that the parameters of the observation and system equations be specified.The parameter vector Θ = {π 0 , p, Λ} has to be estimated from the data using the EM-algorithm (Dempster, Laird and Rubin, 1977) where the missing information is the state at each time {s t , t = 1, ..., n}.
The likelihood of the incomplete data is P(y ) P(y s y P(s y ) and the likelihood of the complete data is P(y ,s ) P(y s y (s s ) The closed-form solutions for the likelihood of the complete data are: When first order Markov dependence between outcomes is present The initial probability vector is estimated by The EM-algorithm is implemented as follows: 1. Start with an initial guess Θ (0) of the parameter vector; 2. E-step.run the recursive updating algorithm with current parameter estimate Θ (p) M-step.treat the estimated states as data, solve the complete-data likelihood equations to obtain an updated estimate Θ (p + 1) .
The recursive updating algorithm is then updated in step (2) above until convergence.

BIC and AIC for Hypothesis Testing
Due to the large size of the Xylella subsequence we are working with, traditional Chi-square tests for comparing competing models systematically reject simpler models (the ones with fewer parameters to estimate) in favor of larger ones.That means we need a different methodology to perform our tests.Such methodology was developed by Schwarz (1978) and Sakamato et al. (1986), and applied in our tests in section 6.
Sakamoto et al. developed Akaike's Information Criterion (AIC) that has the form of a penalized maximum likelihood function where the size of the penalty depends on the number of units required to encode the parameters.Schwarz (1978) developed the Bayesian Information Criterion (BIC) (also known as Schwarz's Bayesian Criterion (SBC)) which is based on Bayesian theory.Raftery (1986), presents a very good description of the BIC.
Let θ be a vector of parameters, λ be the loglikelihood function in study, K as the number of parameters in the model (degrees of freedom), and n be the sample size.Then and When comparing fitted objects, the smaller the AIC (and BIC) the better the fit is.AIC and BIC values have no intrinsic meaning, except in relation to other models based upon the same dataset.

Examples: Bacteriophage Lambda and Xylella
In this section we illustrate the application of the hidden Markov models discussed above using a subsequence of the Xylella genome and the data for the entire genome of the virus Bacteriophage lambda which has been studied by Churchill (1989).Our codes have been written in FORTRAN and we used the Bacteriophage lambda data to illustrate the methodology and also to check whether our results match Churchill's.
Bacteriophage lambda.The DNA sequence for the Bacteriophage lambda has been acquired from the Genbank website http://www.ncbi.nlm.nih.gov/Genbank/.According to the website http://latin.arizona.edu/~plpweb/lecture/lect39, bacteriophages are basically viruses that specialise in the infection of bacteria.Bacteriophage uses the bacteria Escherichia coli as a host.The bacteriophage lambda ge-530 da-Silva nome size is 48,514 bases and is by far the most completely studied bacteriophage known.A total of 46 genes have been identified on the circular lambda map.The bacteriophage's G + C content has been studied by Skalka et al. (1967).Using chemical analyses, they concluded that the genome is composed of six segments with different G + C content.Churchill (1989) found a three-state first-order dependent model as the one that best fits the data.Following Churchill (1989), in Figure 1 we show the smoothed estimates of local A + T composition based on a four-state independent outcome model.Our map is largely in agreement with the one produced by Churchill (1989).
(b) Xylella subsequence.According to information obtained from the website http://aeg.lbi.ic.unicamp.br/xf/,the main chromosome of the Xylella fastidiosa (cataloged by the National Center for Biotechnology Information (NCBI) by code number AE003849) is composed of 2,838 genes.This chromosome has a total number of 2,679,305 nucleotide bases (A, C, G, T).Among them, 1,411,300 (52.67%) are C or G.The website http://aeg.lbi.ic.unicamp.br/xf/makes available the main chromossome gene map which lists adjacent genes.This gene map has the advantage of being presented in the form of colored horizontal bars, where each color represents the predominant gene function.As we mentioned in the introductory section, we worked with only a subsequence of the genome of Xylella fastidiosa.We chose Xylella main chromossome subsequence composed of genes XF1141 and XF1196, a total of 38,730 bases.This subsequence is mainly formed by genes that describe three well-defined and contiguous regions.Region 1 starts at base 1 and goes as far as approximately base 7,401.It contains genes related to energy metabolism.Region 2 is located between bases 7,402 and 21,723 and it is basically composed of RNA processing genes.After a very heterogeneous region including several small genes, each with a particular major function, we find region 3 located between bases 26,767 and 32,514.Region 3 is basically composed of macromolecule metabolism genes.Therefore we are aware of the existence of three functionally speaking different regions (energy metabolism, RNA processing and macromolecule metabolism).That may have an impact on the dependence structure between nearby bases.If that is so, hidden Markov models can help us to locate them.
We fitted independent or first-order dependent outcomes and up to four latent states.Table 1 summarizes the results.BIC and AIC (see Schwarz, 1978 andSakamoto et al., 1986) pointed out a three-state independent outcome model as the one providing the best fit.
Figure 2 shows the smoothed estimates of local C + G composition based on a three-state independent outcome model .The local composition was estimated as: In (19) y t,1 implies we are dealing with local proportion of C + G -there are only two outcomes: 1 (for C or G) or 2 (for A or T).
In Figure 2 we notice that there are four regions.The first and the last one present comparable C + G smoothed proportions.
The second region is the largest and presents the largest C + G concentration.Region three is narrow and spiked.Approximately first region ranges from base 1 to 7,000, the second region goes up to base 21,000, and the third goes from base 22,000 to 24,000.Regions one and two are in very good agreement with our data description, matching energy metabolism and RNA processing genes.Region three incorporates, functionally speaking, very heterogeneous data, so it is not possible to characterize it in those Hidden Markov models applied to Xylella fastidiosa genome 531 terms.Region four is well defined with a regular low expression of C + G.It does not contradict the data description in the sense that approximately from bases 26,000 to 32,000 the smoothed proportions show homogeneity which matches the location of macromolecule metabolism genes.These results confirm that hidden Markov models are useful tools to reveal homogeneous regions of DNA data.
We notice that there are non negligible differences in scale of the maps obtained for the studied organisms.We can observe that the homogeneous regions are separated more clearly in the case of the phage.This might be due to the weaker kind of dependencies in the Xylella data we worked with compared to the phage data.Despite the fact that the best model was found to be a three state one, model selection procedures did not support order dependencies among Xylella outcomes in contrast with the phage outcomes.Therefore, it is reasonable that the phage map presents a better discrimination of homogeneous regions than the Xylella map.
Finally, even though we could tell the major function of the genes included in the Xylella dataset we used by inspecting the chromossome map (website http://aeg.lbi.ic.unicamp.br/xf/),this is no longer feasible when dealing with the whole genome of the Xylella genome.Thus, computational methods are needed to extract and summarize major underlying features that help the analyst to understand DNA structure and function.Hidden Markov models seem to be a good option.

Discussion
Hidden Markov models are useful for describing and revealing some special features of temporal biological series.The main advantage over the regular Markov models is the possibility that HMMs have of accounting for heterogeneity that may be present in the data.As a result, more sensible models and better data descriptions might be available to the analyst.
In this work we applied HMM methods to a subsequence of the Xylella fastidiosa genome.In order to describe possible variations in the expression of C + G, we worked with the C + G sequence data instead of with the original DNA sequence.The HMM that better describes the data was able to correctly discriminate regions in the data corresponding to distinct biological functions.In the future, this kind of model may be used in the study of larger subsequences or even the whole Xylella genome.
There are other already known uses of HMM in computational biology.Those may be tried on the Xylella data.For example, HMMs are used for obtaining multiple aligned sequences and also for gene detection.
According to Hugley and Krogh (1996), HMM are a highly effective means of modeling a family of unaligned sequences or a common motif within a set of unaligned sequences.HMMs are particularly useful in the study of protein molecules since they make possible the automatic discrimination of evolutionarily close proteins.Proteins are built from an alphabet of twenty smaller molecules known as amino acids.According to http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html,when a cell reproduces, a protein inside the cell is most of the time exactly duplicated in the daughter cell.However, over long periods of time, errors occur in the copy process.When this happens, a protein in the daughter cell is slightly different from its counterpart in the parent.The three most common errors are "substitution" of an amino acid in a given position, "insertion" of one or more new amino acids, and "deletion" of one or more amino acids.As a result of these errors, proteins which share a common ancestor are not exactly alike.However, they inherit many similarities in primary structure from their ancestor.This is known as "conservation" of primary structure in a protein family.In an HMM for multiple alignment the states are S (substitution), I (insertion), D (deletion) and M (matching of amino acids).According to Krogh et al. (1994), an HMM used for multiple alignment identifies a set of positions that describes the conserved primary structure in the sequences from a given family of proteins, i.e., the model identifies the core elements of homologous proteins.In the case of the Xylella it may be important to know how evolutionarily close is this organism to another, since similar methods used in the combat of the latter could be tried.
HMMs are also a very useful tool in gene prediction.According to Stormo (2000), the states correspond to exons, introns, and any other class of sequences desired (such as 5' and 3' UTRs, promoters regions, intergenic regions, repetitive DNA, etc).These genetic entities signal the points in the DNA where a new gene starts or ends.The probability of changing from an intron to an exon depends on the local sequence such that it is high only at plausible splice junctions.Stormo (2000) explains that the "hidden" in these HMMs denotes that fact that we see only the DNA sequence directly, and the state that generated the sequence (exon, intron, etc) is not visible.These methods are important because genes can be located using computational biology tools.

Figure 2 -
Figure 2 -Plot of the local proportions of C+G based on a three-state model fit by maximum likelihood.

Table 1 -
Summary statistics used for selecting the best HMM model for the Xylella dataset.Plot of the local proportions of A+T based on a four-state model fit by maximum likelihood.