
is not Spanish (the input file was Cervantes's Don Quijote, also with m=3), and

Seule sontagne trait homarcher de la t au onze le quance matices Maississait passepart penaientla ples les au cherche de je Chamain peut accide bien avaien rie se vent puis il nez pande

is not French (the source was Le Tour du Monde en Quatre Vingts Jours, the French original of Jules Verne's Around the World in Eighty Days).

The input file to the program textsim.m is a Matlab .mat file that has been preprocessed to remove excessive line breaks, spaces, and capitalization using textman.m, which is why there is no punctuation in these examples. A large assortment of text files is available for download at the website of Project Gutenberg (at http://promo.net/pg/).
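As a rough illustration (this is a sketch, not the actual textman.m, whose details may differ; the file names are placeholders), a cleanup of this kind takes only a few lines of Matlab:

% Sketch of the kind of preprocessing textman.m performs: lowercase the
% text, keep only letters and spaces, and collapse runs of whitespace.
raw = fileread('gutenberg_text.txt');        % hypothetical downloaded text
txt = lower(raw);                            % remove capitalization
txt(~ismember(txt, ['a':'z' ' '])) = ' ';    % strip punctuation and digits
txt = regexprep(txt, '\s+', ' ');            % collapse line breaks and spaces
save('textdata.mat', 'txt');                 % .mat file ready for textsim.m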

Text, in a variety of languages, retains some of the character of its language with correlations of 3 to 5 letters (21–35 bits, when coded in ASCII). Thus, messages written in those languages are not independent, except possibly at lengths greater than this. A result from probability theory suggests that if the letters are clustered into blocks that are longer than the correlation, then the blocks may be (nearly) independent. This is one strategy to pursue when designing codes that seek to optimize performance. "Source Coding" will explore some practical ways to attack this problem, but the next two sections establish a measure of performance such that it is possible to know how close to the optimal any given code lies.
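As a rough illustration of the blocking idea (not a routine from the text; the block length and sample string here are invented for the sketch), consecutive letters can be grouped into non-overlapping blocks in Matlab:

% Group cleaned text into non-overlapping blocks longer than the
% correlation length, so that successive blocks are (nearly) independent.
m = 5;                                        % assumed block length in letters
txt = 'the quick brown fox jumps over the lazy dog ';
n = floor(length(txt)/m)*m;                   % trim to a whole number of blocks
blocks = cellstr(reshape(txt(1:n), m, [])');  % one block per cell
disp(blocks(1:3))                             % first few 5-letter blocks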

Entropy

This section extends the concept of information from a single symbol to a sequence of symbols. As defined by Shannon, the information in a symbol is inversely proportional to its probability of occurring. (Actually, Hartley was the first to use this as a measure of information, in his 1928 paper in the Bell System Technical Journal called "Transmission of Information.") Since messages are composed of sequences of symbols, it is important to be able to talk concretely about the average flow of information. This is called the entropy and is formally defined as

$$H(x) = \sum_{i=1}^{N} p(x_i)\, I(x_i) = \sum_{i=1}^{N} p(x_i) \log\!\left(\frac{1}{p(x_i)}\right) = -\sum_{i=1}^{N} p(x_i) \log\!\left(p(x_i)\right),$$

where the symbols are drawn from an alphabet $x_i$, each with probability $p(x_i)$. $H(x)$ sums the information in each symbol, weighted by the probability of that symbol. Those familiar with probability and random variables will recognize this as an expectation. Entropy is measured in bits per symbol, and so gives a measure of the average amount of information transmitted by the symbols of the source. (Warning: though the word is the same, this is not the same as the notion of entropy that is familiar from physics, since the units here are bits per symbol while the units in physics are energy per kelvin.) Sources with different symbol sets and different probabilities have different entropies. When the probabilities are known, the definition is easy to apply.

Consider the $N = 3$ symbol set defined in Example [link]. The entropy is

$$H(x) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5 \text{ bits/symbol}.$$
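As a quick check (a sketch, not code from the text), the same number can be computed directly from the definition in Matlab:

% Entropy of a discrete source with known symbol probabilities,
% using the three-symbol example above.
p = [0.5 0.25 0.25];                   % symbol probabilities (must sum to 1)
H = -sum(p .* log2(p));                % entropy in bits per symbol
fprintf('H = %.2f bits/symbol\n', H)   % prints H = 1.50 bits/symbol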

Reconsider the fair die of [link]. What is its entropy?

Suppose that the message $\{x_1, x_3, x_2, x_1\}$ is received from a source characterized by

Source:  OpenStax, Software receiver design. OpenStax CNX. Aug 13, 2013 Download for free at http://cnx.org/content/col11510/1.3