
Note that every row of the first order Markov model above sums to 1. Since the probability of G following T (row 3, column 4) is zero, no sequence generated by this model will contain the subsequence "TG". We can use the model to generate a sequence of length L, using a small variation of the code snippet shown above.

i = 1; S[i] = a base chosen uniformly at random from {A,C,T,G}.
for (i = 1; i < L; i++) do
    S[i+1] = a base chosen from the discrete distribution in the row of the Markov model corresponding to base S[i]
end
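The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation. Because the model referenced in the text is not reproduced in this section, the sketch assumes the CpG island table given later on this page as the transition matrix. The names `generate_sequence` and `TRANSITIONS` are hypothetical.

```python
import random

BASES = "ACGT"

# P(next base | current base), one row per current base, columns in BASES order.
# Assumption: the CpG island table from this page stands in for the model
# referenced in the text.
TRANSITIONS = {
    "A": [0.18, 0.27, 0.43, 0.12],
    "C": [0.17, 0.37, 0.28, 0.19],
    "G": [0.16, 0.34, 0.38, 0.13],
    "T": [0.08, 0.36, 0.38, 0.18],
}

def generate_sequence(length, transitions=TRANSITIONS, rng=None):
    """Sample a sequence of `length` bases from a first order Markov model."""
    rng = rng or random.Random()
    if length <= 0:
        return ""
    seq = [rng.choice(BASES)]  # first base: uniform over the alphabet
    while len(seq) < length:
        weights = transitions[seq[-1]]  # row for the most recent base
        # random.choices treats weights as relative, so rows that sum to
        # 1.01 because of rounding are still usable.
        seq.append(rng.choices(BASES, weights=weights, k=1)[0])
    return "".join(seq)

print(generate_sequence(20, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes a run reproducible. A transition with weight zero is never sampled, which is why a zero entry in the matrix rules out the corresponding two-base subsequence.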

We can also use the model to evaluate the likelihood of a given sequence, that is, the probability that the model generates that sequence.

P(s1...sn) = P(s1) * P(s2|s1) * P(s3|s2) * ... * P(sn|s{n-1})

We can factor the probability of the whole sequence into the probability of the first base times the probabilities of observing each subsequent transition. The computation simplifies because of the first order Markov condition: the probability of a base at a given position depends only on the base immediately before it in the sequence. Thus, the probability of observing the sequence ACGT under the first order Markov model shown above is:

P(ACGT) = P(A) * P(C|A) * P(G|C) * P(T|G) = 0.25 * 0.2 * 0.8 * 0.3 = 0.012

We can use the model to compare the likelihoods of two sequences. Given the sequences "ACGT" and "TGAC", we can calculate their individual likelihoods as shown above. P(TGAC) = P(T) * P(G|T) * P(A|G) * P(C|A) = 0.25 * 0 * 0.2 * 0.2 = 0. The sequence "TGAC" can never be generated by this model, because it contains the transition from T to G, which has probability zero.
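The two likelihood calculations above can be reproduced with a short Python sketch. The full transition matrix is not shown in this section, so the dictionary below contains only the entries quoted in the two worked examples. The names `sequence_probability`, `INITIAL`, and `KNOWN_TRANSITIONS` are hypothetical, and the uniform initial distribution follows the value P(A) = P(T) = 0.25 used in the text.

```python
def sequence_probability(seq, initial, transitions):
    """P(seq) = P(s1) * product over i of P(s_i | s_{i-1})."""
    if not seq:
        return 1.0
    prob = initial[seq[0]]
    # Walk over consecutive (previous, current) pairs of bases.
    for prev, cur in zip(seq, seq[1:]):
        prob *= transitions[(prev, cur)]
    return prob

# Uniform distribution over the first base, as in the worked examples.
INITIAL = {base: 0.25 for base in "ACGT"}

# Keys are (previous base, next base). Only the entries quoted in the
# text are filled in; the rest of the matrix is not reproduced here.
KNOWN_TRANSITIONS = {
    ("A", "C"): 0.2,
    ("C", "G"): 0.8,
    ("G", "T"): 0.3,
    ("T", "G"): 0.0,
    ("G", "A"): 0.2,
}

print(sequence_probability("ACGT", INITIAL, KNOWN_TRANSITIONS))  # about 0.012
print(sequence_probability("TGAC", INITIAL, KNOWN_TRANSITIONS))  # 0.0
```

For long sequences the product of many small probabilities underflows, so in practice one would usually sum logarithms of the factors instead of multiplying them.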

If we have a first order Markov model of CpG islands, we can calculate the probability of a DNA sequence with respect to that model. If that probability exceeds a given threshold (say, 0.8), then we will assert that the given DNA sequence is in fact from a CpG island. How can we acquire a first order Markov model of CpG islands? There are databases of CpG islands available from the NCBI, including the CpG islands for human chromosomes.

Given a set of CpG island sequences, we can estimate the probability P(a|b) in the model, for a, b in {A,C,G,T}, by counting how often the subsequence "ba" occurs in those sequences and dividing by the number of times b is followed by any base. In this way we can estimate all sixteen entries in the first order Markov model over the nucleotides. These models are extremely easy to acquire, requiring just counting operations. A first order Markov model learned from CpG islands, and another learned from a data set of non-CpG islands, are shown below.
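The counting procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `estimate_transitions` is hypothetical, the training sequences are made-up toy strings rather than real CpG island data, and no smoothing is applied.

```python
from collections import Counter

def estimate_transitions(sequences, alphabet="ACGT"):
    """Estimate P(next | current) from pair counts in training sequences."""
    pair_counts = Counter()
    for seq in sequences:
        # Count every adjacent (current, next) pair of bases.
        pair_counts.update(zip(seq, seq[1:]))
    model = {}
    for cur in alphabet:
        # Number of times `cur` is followed by any base in the alphabet.
        row_total = sum(pair_counts[(cur, nxt)] for nxt in alphabet)
        model[cur] = {
            nxt: (pair_counts[(cur, nxt)] / row_total if row_total else 0.0)
            for nxt in alphabet
        }
    return model

# Toy training data (hypothetical, for illustration only).
toy_model = estimate_transitions(["ACGCG", "CGCGA"])
print(toy_model["G"])
```

In the toy data G is followed by C twice and by A once, so the estimate for P(C|G) is 2/3 and for P(A|G) is 1/3. A pair that never occurs in the training data gets probability zero; adding a small pseudocount to every cell is one common way to avoid that.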

A first order Markov model for CpG islands (rows: current base; columns: next base)
     A     C     G     T
A    0.18  0.27  0.43  0.12
C    0.17  0.37  0.28  0.19
G    0.16  0.34  0.38  0.13
T    0.08  0.36  0.38  0.18

A first order Markov model for non-CpG islands (rows: current base; columns: next base)
     A     C     G     T
A    0.30  0.21  0.29  0.21
C    0.32  0.30  0.08  0.30
G    0.25  0.25  0.30  0.20
T    0.18  0.24  0.30  0.30

You can see that P(G|C) = 0.28 in the CpG island model, while P(G|C) = 0.08 in the non-CpG island model. That is, a C is followed by a G far more often inside CpG islands than outside them.

Using generative models to classify sequences

Given first order models of CpG and non-CpG islands and a DNA sequence s, we can determine whether the sequence comes from a CpG island by computing its log-odds ratio with respect to the two models. If P(s|CpG) > P(s|non-CpG), then s is classified as a CpG island. The decision rule can equivalently be written as P(s|CpG)/P(s|non-CpG) > 1 or, taking logarithms on both sides, as log[P(s|CpG)/P(s|non-CpG)] > 0. The logarithm of the ratio of the two probabilities is called the log-odds ratio. If the log-odds ratio is greater than 0, then s is classified as part of a CpG island.
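The decision rule can be sketched in Python using the two tables given on this page. Under a first order model the log-odds ratio is a sum of per-transition log ratios. The sketch assumes that both models assign the same probability to the first base, so that term cancels. The names `log_odds` and `is_cpg_island` are hypothetical.

```python
import math

BASES = "ACGT"

# Rows: current base; columns: next base, in BASES order (tables from this page).
CPG = {
    "A": [0.18, 0.27, 0.43, 0.12],
    "C": [0.17, 0.37, 0.28, 0.19],
    "G": [0.16, 0.34, 0.38, 0.13],
    "T": [0.08, 0.36, 0.38, 0.18],
}
NON_CPG = {
    "A": [0.30, 0.21, 0.29, 0.21],
    "C": [0.32, 0.30, 0.08, 0.30],
    "G": [0.25, 0.25, 0.30, 0.20],
    "T": [0.18, 0.24, 0.30, 0.30],
}

def log_odds(seq, pos=CPG, neg=NON_CPG):
    """Sum of log[P(next|cur, CpG) / P(next|cur, non-CpG)] over transitions.

    Assumption: the first base is equally likely under both models,
    so its contribution to the ratio cancels.
    """
    score = 0.0
    for cur, nxt in zip(seq, seq[1:]):
        j = BASES.index(nxt)
        score += math.log(pos[cur][j] / neg[cur][j])
    return score

def is_cpg_island(seq):
    """Classify seq as a CpG island when its log-odds ratio is positive."""
    return log_odds(seq) > 0

for s in ("CGCGCG", "ATATAT"):
    print(s, round(log_odds(s), 3), is_cpg_island(s))
```

Every transition in "CGCGCG" is more likely under the CpG model, so its score is positive, while every transition in "ATATAT" is more likely under the non-CpG model, so its score is negative. Summing logarithms rather than multiplying probabilities also avoids numerical underflow on long sequences.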

Figure: Histogram of log-odds ratios





Source:  OpenStax, Statistical machine learning for computational biology. OpenStax CNX. Oct 14, 2007 Download for free at http://cnx.org/content/col10455/1.2