2.3 Modeling the genefinding problem

A brief description of how genefinding is computationally modeled.

Our task is to find coding regions in eukaryotic genomes. We have already studied the complex structure of eukaryotic genes. If you need a refresher, check out this self-paced tutorial from the University of Glasgow.

Early approaches to genefinding

One fundamental approach to finding genes is to detect functional sites in genomic DNA. Fixed-length sites such as splice sites, start and stop codons, polyA sites, ribosomal binding sites, and transcription factor binding sites are called signals, and the algorithms that detect them are called signal sensors. Variable-length regions such as exons and introns in eukaryotic DNA are recognized by another family of methods called content sensors.

A consensus sequence

Shown above is a sequence motif of length 6. The letters at each position are drawn in proportion to the probability of seeing that letter at that position. This probability information is summarized in the weight matrix shown below. Weight matrices are the simplest form of signal sensor.

Weight matrix
Position	A	C	G	T
1	0.028	0.034	0.026	0.912
2	0.805	0.031	0.123	0.041
3	0.046	0.158	0.022	0.774
4	0.669	0.019	0.253	0.059
5	0.024	0.044	0.028	0.904
6	0.962	0.012	0.014	0.012

The weight matrix is an example of a probabilistic sequence model. We treat each sequence as a string over the alphabet {A,C,G,T}. Each entry (i,j) in the matrix gives the probability of seeing base j (the column) at position i (the row) of the string. Given a DNA sequence or string s of length 6, we can evaluate its likelihood with respect to this model. That is, we can calculate the probability that the sequence is generated by the weight matrix above, by multiplying together the matrix entries for the base observed at each position.

P(TATATA) = 0.912 * 0.805 * 0.774 * 0.669 * 0.904 * 0.962 ≈ 0.33

P(ATATAT) = 0.028 * 0.041 * 0.046 * 0.059 * 0.024 * 0.012 ≈ 9.0 * 10^(-10)
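
The two calculations above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original module: the names WEIGHT_MATRIX, likelihood, and scan are illustrative choices, and the longer DNA string passed to scan is a made-up input. The scan function shows one simple way a weight matrix could act as a signal sensor, by scoring every window of a longer sequence.

```python
import math

# Weight matrix from the table above: one row per position, columns A, C, G, T.
WEIGHT_MATRIX = [
    {"A": 0.028, "C": 0.034, "G": 0.026, "T": 0.912},
    {"A": 0.805, "C": 0.031, "G": 0.123, "T": 0.041},
    {"A": 0.046, "C": 0.158, "G": 0.022, "T": 0.774},
    {"A": 0.669, "C": 0.019, "G": 0.253, "T": 0.059},
    {"A": 0.024, "C": 0.044, "G": 0.028, "T": 0.904},
    {"A": 0.962, "C": 0.012, "G": 0.014, "T": 0.012},
]


def likelihood(seq, matrix=WEIGHT_MATRIX):
    """Probability of seq under the matrix, assuming independent positions."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must equal the number of matrix rows")
    # Multiply the entry for the observed base at each position.
    return math.prod(row[base] for row, base in zip(matrix, seq))


def scan(dna, matrix=WEIGHT_MATRIX):
    """Score every window of len(matrix) bases.

    Returns a list of (start, window, probability) tuples.
    """
    width = len(matrix)
    return [
        (i, dna[i:i + width], likelihood(dna[i:i + width], matrix))
        for i in range(len(dna) - width + 1)
    ]


if __name__ == "__main__":
    print(f"P(TATATA) = {likelihood('TATATA'):.2f}")
    print(f"P(ATATAT) = {likelihood('ATATAT'):.2e}")
    # Hypothetical input, used only to illustrate the sliding window.
    best = max(scan("GGCTATATACGG"), key=lambda hit: hit[2])
    print("best window:", best)
```

For long motifs, products of many small probabilities can underflow, so practical implementations often sum log-probabilities instead. The sketch keeps the plain product to match the arithmetic in the text.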

We can see that, with respect to the sequence model described by the weight matrix, the string TATATA is overwhelmingly more likely than the string ATATAT. The weight matrix model assumes that the bases at different positions are independent of each other. We will study more sophisticated models, called Markov models, which take dependencies between bases at different positions into account. The advantage of weight matrix models is their simplicity, which allows them to be estimated from very little data. The disadvantage is that they are quite rigid and do not accommodate the kind of variability seen in real biological sequences.
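
To illustrate why a weight matrix needs so little data, here is a hedged sketch of how one might be estimated from a handful of aligned sites by counting bases in each column. The function name estimate_weight_matrix, the pseudocount parameter, and the example sites are assumptions of this sketch and do not come from the original module. The pseudocount is one common way to avoid zero probabilities for bases that happen not to appear in a small sample.

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_weight_matrix(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length aligned sites.

    The pseudocount (an assumption of this sketch) keeps unseen bases from
    getting probability zero when only a few sites are available.
    """
    if not sites:
        raise ValueError("need at least one site")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for position in range(width):
        # Count how often each base occurs in this column of the alignment.
        counts = Counter(site[position] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix


if __name__ == "__main__":
    # Hypothetical aligned sites, made up for illustration only.
    example_sites = ["TATAAT", "TATATA", "TACATA", "TATGTA"]
    for i, row in enumerate(estimate_weight_matrix(example_sites), start=1):
        print(i, " ".join(f"{b}={p:.3f}" for b, p in row.items()))
```

Each row has only three free parameters (the four probabilities sum to 1), so a motif of length 6 needs just 18 numbers, which is why a small set of sites can already give a usable estimate.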

Content sensors include detectors of CpG islands, an example we will consider in detail later in this module. Exon and intron detectors are some of the most widely studied in the literature. The GRAIL system detects exons, polyA sites, and CpG islands. You can submit a DNA sequence on the form linked above to try out a content sensor program.
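
As a rough illustration of what a content sensor computes, the sketch below slides a window along a sequence and flags windows whose GC fraction and observed/expected CpG ratio are both high. This is a simple heuristic and an assumption of this sketch, not the method developed later in this module and not how GRAIL works. The function names, the window size, and the thresholds (GC fraction above 0.5, ratio above 0.6) are illustrative defaults only.

```python
def gc_fraction(window):
    """Fraction of bases in the window that are G or C."""
    window = window.upper()
    return (window.count("G") + window.count("C")) / len(window)


def cpg_observed_expected(window):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G); 0.0 if undefined."""
    window = window.upper()
    c, g = window.count("C"), window.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    return window.count("CG") * len(window) / (c * g)


def flag_windows(dna, width=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Return (start, gc_fraction, ratio) for windows that pass both thresholds.

    The default width, step, and thresholds are assumptions of this sketch.
    """
    hits = []
    for start in range(0, len(dna) - width + 1, step):
        window = dna[start:start + width]
        gc = gc_fraction(window)
        ratio = cpg_observed_expected(window)
        if gc > min_gc and ratio > min_ratio:
            hits.append((start, gc, ratio))
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence: a CpG-rich stretch followed by an AT-rich one.
    toy = "CGCGCGCGATATATAT"
    print(flag_windows(toy, width=8, step=8))
```

Unlike a weight matrix, this sensor does not look for a fixed-length pattern at a fixed offset. It summarizes the composition of a region, which is what makes it suitable for variable-length features.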

Signal and content sensors alone cannot solve the genefinding problem. The statistical signals they try to recognize are too weak, and there are dependencies between signals and content that they cannot capture. Since the late 1990s, probabilistic systems have been developed that combine signal and content sensors to identify complete gene structures. One of the best known of these systems is Genscan, developed by Chris Burge and his advisor Samuel Karlin at Stanford University in 1997. Genscan is based on hidden Markov models.

Source: OpenStax, Statistical machine learning for computational biology. OpenStax CNX. Oct 14, 2007 Download for free at http://cnx.org/content/col10455/1.2