A brief description of how genefinding is computationally modeled.

Our task is to find coding regions in eukaryotic genomes. We have already studied the complexity of the structure of eukaryotic genes. If you need a refresher, check out this self-paced tutorial from the University of Glasgow.

Early approaches to genefinding

One fundamental approach to finding genes is to detect functional sites in genomic DNA. Fixed-length sites such as splice sites, start and stop codons, polyA sites, ribosomal binding sites, and transcription factor binding sites are called signals, and algorithms that detect them are called signal sensors. Variable-length regions such as exons and introns in eukaryotic DNA are recognized by another family of methods called content sensors.

A consensus sequence

Shown above is a sequence motif of size 6. The letters in each position are drawn in proportion to the probability of having that letter in that position. This probability information is summarized in the weight matrix shown below. Weight matrices are the simplest form of signal sensor.

Weight matrix
Position A C G T
1 0.028 0.034 0.026 0.912
2 0.805 0.031 0.123 0.041
3 0.046 0.158 0.022 0.774
4 0.669 0.019 0.253 0.059
5 0.024 0.044 0.028 0.904
6 0.962 0.012 0.014 0.012
The weight matrix is an example of a probabilistic sequence model. We treat each sequence as a string over the alphabet {A,C,G,T}. The entry in row i and column j of the matrix is the probability of observing base j at position i of the string. Given a DNA sequence or string s of length 6, we can evaluate its likelihood with respect to this model. That is, we can calculate the probability that the sequence is generated by the weight matrix above by multiplying, position by position, the probability of the base that s has at that position.
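Written out in general form, the calculation looks like the sketch below. The notation is introduced here for illustration and is not taken from the original module: s_i is the base at position i of s, and p_i(b) is the probability in row i of the table for base b.

```latex
P(s) \;=\; \prod_{i=1}^{6} p_i(s_i)
\qquad\text{for example}\qquad
P(\mathrm{TATATA}) \;=\; p_1(\mathrm{T})\,p_2(\mathrm{A})\,p_3(\mathrm{T})\,p_4(\mathrm{A})\,p_5(\mathrm{T})\,p_6(\mathrm{A})
```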

P(TATATA) = 0.912 * 0.805 * 0.774 * 0.669 * 0.904 * 0.962 = 0.33

P(ATATAT) = 0.028 * 0.041 * 0.046 * 0.059 * 0.024 * 0.012 = 8.97 * 10^(-10)

We can see that, with respect to the sequence model described by the weight matrix, the string TATATA is more likely than the string ATATAT by more than eight orders of magnitude. The weight matrix model assumes that the bases at each position are independent of each other. We will study more sophisticated models, called Markov models, which take dependencies between bases at different positions into account. The advantage of weight matrix models is their simplicity, which allows them to be estimated from very little data. The disadvantage is that the models are quite rigid and do not accommodate the kind of variability seen in real biological sequences.
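The likelihood calculation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original module: the names `WEIGHT_MATRIX`, `likelihood`, and `scan` are hypothetical, and the sliding-window `scan` is one simple way a weight matrix might be used as a signal sensor on a longer sequence.

```python
# Weight matrix from the table above: one dict per position (rows 1-6),
# mapping each base to its probability at that position.
WEIGHT_MATRIX = [
    {"A": 0.028, "C": 0.034, "G": 0.026, "T": 0.912},
    {"A": 0.805, "C": 0.031, "G": 0.123, "T": 0.041},
    {"A": 0.046, "C": 0.158, "G": 0.022, "T": 0.774},
    {"A": 0.669, "C": 0.019, "G": 0.253, "T": 0.059},
    {"A": 0.024, "C": 0.044, "G": 0.028, "T": 0.904},
    {"A": 0.962, "C": 0.012, "G": 0.014, "T": 0.012},
]


def likelihood(seq, matrix=WEIGHT_MATRIX):
    """Probability of seq under the weight matrix, assuming independent positions."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match the number of matrix rows")
    p = 1.0
    for position, base in enumerate(seq):
        p *= matrix[position][base]
    return p


def scan(dna, matrix=WEIGHT_MATRIX):
    """Slide the matrix along dna; return (offset, likelihood) for every window."""
    width = len(matrix)
    return [
        (offset, likelihood(dna[offset:offset + width], matrix))
        for offset in range(len(dna) - width + 1)
    ]


if __name__ == "__main__":
    print(likelihood("TATATA"))  # about 0.33
    print(likelihood("ATATAT"))  # about 9e-10
    # Offset of the highest-scoring window in a made-up test sequence.
    print(max(scan("GGTATATAGG"), key=lambda hit: hit[1]))
```

For longer motifs, products of many small probabilities can underflow, so practical implementations usually sum log-probabilities instead of multiplying.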

Content sensors include detectors of CpG islands, an example we will consider in detail later in this module. Exon and intron detectors are among the most widely studied content sensors in the literature. The GRAIL system detects exons, polyA sites, and CpG islands. To try out a content sensor program, you can submit a DNA sequence on the form linked above.
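To give a feel for what a content sensor computes, here is a rough sketch of one commonly used CpG statistic: the ratio of observed CG dinucleotides to the number expected from the window's C and G counts. This is an assumption chosen for illustration. It is not the method developed later in this module, nor the one used by GRAIL, and the function names and window parameters are hypothetical.

```python
def cpg_observed_expected(window):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    window = window.upper()
    c_count = window.count("C")
    g_count = window.count("G")
    cg_count = window.count("CG")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CG dinucleotide is possible in this window
    return cg_count * len(window) / (c_count * g_count)


def sliding_scores(dna, width, step):
    """Score each window of the given width, moving by step bases at a time."""
    return [
        (start, cpg_observed_expected(dna[start:start + width]))
        for start in range(0, len(dna) - width + 1, step)
    ]


if __name__ == "__main__":
    # Made-up sequence: an AT-rich stretch followed by a CG-rich stretch.
    dna = "ATATATATATAT" + "CGCGCGCGCGCG"
    for start, score in sliding_scores(dna, width=12, step=6):
        print(start, round(score, 2))
```

Unlike a weight matrix, the statistic does not depend on the window having a fixed length, which is what makes it suitable for variable-length regions.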

Signal and content sensors alone cannot solve the genefinding problem. The statistical signals they are trying to recognize are too weak, and there are dependencies between signals and content that they cannot capture. Since the late 1990s, attempts have been made to develop probabilistic systems that combine signal and content sensors to identify complete gene structures. One of the best known of these systems is Genscan, developed by Chris Burge and his advisor Samuel Karlin at Stanford University in 1997. Genscan is based on hidden Markov models.






Source: OpenStax, Statistical machine learning for computational biology. OpenStax CNX. Oct 14, 2007. Download for free at http://cnx.org/content/col10455/1.2