2.3 Modeling the genefinding problem

A brief description of how genefinding is computationally modeled.

Our task is to find coding regions in eukaryotic genomes. We have already studied the complex structure of eukaryotic genes. If you need a refresher, check out this self-paced tutorial from the University of Glasgow.

Early approaches to genefinding

One fundamental approach to finding genes is to detect functional sites in genomic DNA. Fixed-length sites such as splice sites, start and stop codons, polyA sites, ribosomal binding sites, and transcription factor binding sites are called signals, and algorithms that detect them are called signal sensors. Variable-length regions such as exons and introns in eukaryotic DNA are recognized by another family of methods called content sensors.

A consensus sequence

Shown above is a sequence motif of length 6. The letters in each position are drawn in proportion to the probability of having that letter in that position. This probability information is summarized in the weight matrix shown below. Weight matrices are the simplest form of signal sensors.

Weight matrix
Position	A	C	G	T
1	0.028	0.034	0.026	0.912
2	0.805	0.031	0.123	0.041
3	0.046	0.158	0.022	0.774
4	0.669	0.019	0.253	0.059
5	0.024	0.044	0.028	0.904
6	0.962	0.012	0.014	0.012

The weight matrix is an example of a probabilistic sequence model. We treat each sequence as a string over the alphabet {A,C,G,T}. The entry in row i and column j of the matrix is the probability of observing base j at position i of the string. Given a DNA sequence, or string, s of length 6, we can evaluate its likelihood with respect to this model. That is, we can calculate the probability that the weight matrix above generates s. This likelihood is the product, over the six positions, of the probability of the base observed at each position.

P(TATATA) = 0.912 * 0.805 * 0.774 * 0.669 * 0.904 * 0.962 = 0.33

P(ATATAT) = 0.028 * 0.041 * 0.046 * 0.059 * 0.024 * 0.012 = 8.97 * 10^(-10)
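The two calculations above can be reproduced with a short script. This is a minimal sketch, not part of the original module: the names `WEIGHT_MATRIX` and `likelihood` are chosen here for illustration, and the matrix is stored as one dictionary per position, copied from the table.

```python
import math

# Weight matrix from the table: one dict per position (1 to 6),
# mapping each base to its probability at that position.
WEIGHT_MATRIX = [
    {"A": 0.028, "C": 0.034, "G": 0.026, "T": 0.912},
    {"A": 0.805, "C": 0.031, "G": 0.123, "T": 0.041},
    {"A": 0.046, "C": 0.158, "G": 0.022, "T": 0.774},
    {"A": 0.669, "C": 0.019, "G": 0.253, "T": 0.059},
    {"A": 0.024, "C": 0.044, "G": 0.028, "T": 0.904},
    {"A": 0.962, "C": 0.012, "G": 0.014, "T": 0.012},
]


def likelihood(seq, matrix=WEIGHT_MATRIX):
    """Probability of seq under the weight matrix (positions multiplied together)."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must equal the number of matrix rows")
    # Look up the observed base in each position's row and multiply the results.
    return math.prod(row[base] for row, base in zip(matrix, seq))


print(f"P(TATATA) = {likelihood('TATATA'):.2f}")
print(f"P(ATATAT) = {likelihood('ATATAT'):.2e}")
```

Running the script should give values that agree with the hand calculations to the precision shown. It requires Python 3.8 or later for `math.prod`.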

With respect to the sequence model described by the weight matrix, the string TATATA is overwhelmingly more likely than the string ATATAT. The weight matrix model assumes that the bases at each position are independent of each other. We will study more sophisticated models, called Markov models, which take dependencies between bases at different positions into account. The advantage of weight matrix models is their simplicity, which allows them to be estimated from very little data. The disadvantage is that the models are quite rigid and do not accommodate the kind of variability seen in real biological sequences.
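To illustrate how a weight matrix might serve as a signal sensor, the sketch below slides it along a longer sequence and scores every window of length 6. Several things here are assumptions for illustration and do not come from the original module: the function names, the made-up example sequence, and the use of log probabilities. Under the independence assumption, the log-likelihood of a window is simply the sum of the per-position log probabilities, which avoids multiplying many small numbers.

```python
import math

# Weight matrix from the table: one dict per position (1 to 6).
WEIGHT_MATRIX = [
    {"A": 0.028, "C": 0.034, "G": 0.026, "T": 0.912},
    {"A": 0.805, "C": 0.031, "G": 0.123, "T": 0.041},
    {"A": 0.046, "C": 0.158, "G": 0.022, "T": 0.774},
    {"A": 0.669, "C": 0.019, "G": 0.253, "T": 0.059},
    {"A": 0.024, "C": 0.044, "G": 0.028, "T": 0.904},
    {"A": 0.962, "C": 0.012, "G": 0.014, "T": 0.012},
]


def log_likelihood(window, matrix=WEIGHT_MATRIX):
    """Sum of per-position log probabilities (independent positions)."""
    return sum(math.log(row[base]) for row, base in zip(matrix, window))


def scan(seq, matrix=WEIGHT_MATRIX):
    """Score every window of matrix width; return (start, window, score) tuples."""
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        hits.append((start, window, log_likelihood(window, matrix)))
    return hits


# Hypothetical example sequence, invented for this sketch.
sequence = "GCGCTATATAGCGC"
best_start, best_window, best_score = max(scan(sequence), key=lambda hit: hit[2])
print(best_start, best_window, round(best_score, 3))
```

A practical sensor would also need a threshold, or a comparison against a background model, to decide which windows count as hits. This sketch only ranks the windows.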

Content sensors include detectors of CpG islands, an example we will consider in detail later in this module. Exon and intron detectors are among the most widely studied in the literature. The GRAIL system detects exons, polyA sites, and CpG islands. You can submit a DNA sequence on the form linked above to try out a content sensor program.

Signal and content sensors alone cannot solve the genefinding problem. The statistical signals they are trying to recognize are too weak, and there are dependencies between signals and content that they cannot capture. Since the late nineties, attempts have been made to develop probabilistic systems that combine signal and content sensors to try to identify complete gene structure. One of the best known of these systems is Genscan, developed by Chris Burge and his advisor Samuel Karlin at Stanford University in 1997. Genscan is based on hidden Markov models.

Source: OpenStax, Statistical machine learning for computational biology. OpenStax CNX. Oct 14, 2007 Download for free at http://cnx.org/content/col10455/1.2