
The autoregressive model

Speech recognition also utilizes probability theory to understand the effects of the vocal tract system. Since many physiological factors contribute to the buzzing of the glottis, the Central Limit Theorem allows us to model the excitation, x[n], as an i.i.d. process with a Gaussian distribution of zero mean and constant variance:

\mu = 0, \qquad \sigma^2 = K

Knowing that the signal value at any time t is some random variable, we naturally begin to wonder about the autocorrelation of the process, the measure of how correlated signal values at different times are. For independent processes, we know that the autocovariance (which is the same as the autocorrelation in the zero-mean case) equals zero for any pair of distinct times t_1 and t_2, and equals the variance when t_2 = t_1. This is illustrated in the derivation below for the zero-mean case:

C_x(\tau) = R_x(\tau) = E[\, x(n)\, x(n+\tau) \,]
\text{for } \tau \neq 0, \text{ this becomes } E[x(n)]\, E[x(n+\tau)] = 0
\text{for } \tau = 0, \text{ this becomes } E[x^2(n)] = \sigma^2

With this in mind, we know that

R_x[k] = \sigma^2 \delta[k]
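As a quick numerical check (not part of the original derivation), we can estimate the autocorrelation of simulated white Gaussian noise and confirm that it is approximately sigma^2 at lag 0 and near zero at every other lag. The variance K = 2 below is an arbitrary value chosen for the demo.

```python
import numpy as np

# White Gaussian noise with zero mean and constant variance sigma^2 = K.
rng = np.random.default_rng(0)
sigma2 = 2.0                       # K, arbitrary demo value
N = 200_000
x = rng.normal(0.0, np.sqrt(sigma2), N)

def sample_autocorr(x, k):
    """Sample estimate of R_x[k] = E[x[n] x[n+k]]."""
    if k == 0:
        return float(np.mean(x * x))
    return float(np.mean(x[:-k] * x[k:]))

print(sample_autocorr(x, 0))       # close to sigma^2 = 2.0
print(sample_autocorr(x, 5))       # close to 0
```

The estimates converge to the theoretical values sigma^2 * delta[k] as N grows, matching the equation above.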

We now know the autocorrelation of the input, and we can estimate the autocorrelation of the output y[n] directly, since the output is known - it is the speech signal. We define the DTFT of R_x to be the power spectral density of x. That is,

S_x(f) = \sum_{\tau=-\infty}^{\infty} R_x(\tau)\, e^{-j 2 \pi f \tau}

We can show that for LTI/LSI systems, the power spectral density of the output becomes the power spectral density of the input multiplied by the square of the magnitude of the transfer function. Note that this holds only for wide-sense stationary processes - i.e., processes for which R x is only dependent on τ and not on t . Starting with the convolution sum,

y[n] = \sum_{k=-\infty}^{\infty} x[n-k]\, h[k]
R_y[\tau] = E\big( y[n]\, y[n+\tau] \big) = E\Big( \sum_{k=-\infty}^{\infty} x[n-k]\, h[k] \sum_{r=-\infty}^{\infty} x[n+\tau-r]\, h[r] \Big)
= \sum_{k=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} h[r]\, h[k]\, E\big( x[n-k]\, x[n+\tau-r] \big)
= \sum_{k=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} h[r]\, h[k]\, R_x(\tau + k - r)
S_y(f) = \sum_{\tau=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} h[r]\, h[k]\, R_x(\tau + k - r)\, e^{-j 2 \pi f \tau}

Letting u = τ + k - r , this becomes:

\sum_{u=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \sum_{r=-\infty}^{\infty} h[r]\, h[k]\, R_x(u)\, e^{-j 2 \pi f (u - k + r)}
= \underbrace{\sum_{k=-\infty}^{\infty} h[k]\, e^{j 2 \pi f k}}_{H^*(f)} \;\; \underbrace{\sum_{r=-\infty}^{\infty} h[r]\, e^{-j 2 \pi f r}}_{H(f)} \;\; \underbrace{\sum_{u=-\infty}^{\infty} R_x[u]\, e^{-j 2 \pi f u}}_{S_x(f)}

Since R_x[u] = \sigma^2 \delta[u], the last factor is S_x(f) = \sigma^2, so

S_y(f) = \sigma^2 H^*(f)\, H(f) = \sigma^2 |H(f)|^2
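We can verify S_y(f) = sigma^2 |H(f)|^2 numerically by passing white noise through a filter and comparing the averaged periodogram of the output with the theoretical curve. The two-tap FIR filter below is a made-up example chosen only to keep the demo simple; the result holds for any stable LTI system, including the all-pole vocal tract model.

```python
import numpy as np
from numpy.fft import fft

rng = np.random.default_rng(1)
sigma2 = 1.0
h = np.array([1.0, 0.5])           # hypothetical impulse response for the demo
nfft, trials = 256, 2000

# Average many periodograms of filtered white noise to estimate S_y(f).
Sy = np.zeros(nfft)
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(sigma2), nfft)
    y = np.convolve(x, h)[:nfft]   # y[n] = sum_k x[n-k] h[k]
    Sy += np.abs(fft(y))**2 / nfft
Sy /= trials

# Theoretical PSD: sigma^2 |H(f)|^2, with H(f) the DTFT of h at the FFT bins.
f = np.arange(nfft) / nfft
H = h[0] + h[1] * np.exp(-2j * np.pi * f)
S_theory = sigma2 * np.abs(H)**2
print(np.max(np.abs(Sy - S_theory) / S_theory))  # small relative error
```

The averaged periodogram matches sigma^2 |H(f)|^2 across all frequency bins, up to estimation noise that shrinks as the number of trials grows.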

So now we can solve for the magnitude of the transfer function: |H(f)|^2 = S_y(f) / \sigma^2. Note that the value of the variance does not matter, because it scales the magnitude of the spectrum without changing its shape. The system is depicted below.

x[n] (white noise) → H (vocal tract) → y[n] (observed speech)

If we use a linear, time-invariant system with only poles, we can model the output recursively as shown.

y[n] = x[n] + \sum_i \alpha_i\, y[n-i]
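The recursion can be simulated directly. This is a minimal sketch: the AR(2) coefficients below are made-up values, chosen so that the poles lie inside the unit circle (modulus ≈ 0.89), which keeps the system stable and gives it a resonance, loosely analogous to a formant.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = [1.3, -0.8]            # hypothetical AR(2) coefficients (stable poles)
N = 1000

# White-noise excitation driving the all-pole recursion.
x = rng.normal(0.0, 1.0, N)
y = np.zeros(N)
for n in range(N):
    y[n] = x[n]
    for i, a in enumerate(alpha, start=1):
        if n - i >= 0:
            y[n] += a * y[n - i]
```

Because the poles are complex and near the unit circle, the output variance is noticeably larger than the input variance: energy piles up near the resonant frequency, just as a vocal tract resonance shapes the glottal buzz.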

Statisticians often refer to a model of the following form as an autoregressive process. Note that it has the same form as our equation above.

Y_t = \epsilon_t + \sum_i \alpha_i\, Y_{t-i}

There are standard algorithms (for example, solving the Yule-Walker equations) for estimating those constants from data. The constants are the coefficients of the polynomial in the denominator of the transfer function, so this is essentially the autoregressive model we talked about earlier. The numerator of our transfer function is just a constant gain on the input; again, its actual value does not matter, because it only changes the magnitude of the frequency response. Now that we know how to model the transfer function, we can tackle the formants.
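As a sketch of how such an estimation works, the example below fits the coefficients via the Yule-Walker equations, one standard method (the text does not name a specific algorithm). We synthesize an AR(2) signal with known, made-up coefficients and then recover them from the data alone.

```python
import numpy as np

rng = np.random.default_rng(3)
true_alpha = np.array([1.3, -0.8])   # hypothetical "vocal tract" coefficients
N, p = 50_000, 2

# Synthesize y[n] = x[n] + sum_i alpha_i y[n-i] from white noise x.
x = rng.normal(0.0, 1.0, N)
y = np.zeros(N)
for n in range(N):
    y[n] = x[n]
    for i in range(1, p + 1):
        if n - i >= 0:
            y[n] += true_alpha[i - 1] * y[n - i]

# Sample autocorrelation estimates R[0..p] from the observed signal alone.
R = np.array([np.dot(y[:N - k], y[k:]) / N for k in range(p + 1)])

# Yule-Walker: the Toeplitz system  R_mat @ alpha = [R[1], ..., R[p]].
R_mat = np.array([[R[abs(i - j)] for j in range(p)] for i in range(p)])
alpha_hat = np.linalg.solve(R_mat, R[1:])
print(alpha_hat)                     # close to [1.3, -0.8]
```

The recovered coefficients are exactly the denominator coefficients of the transfer function, which is why fitting an autoregressive model to observed speech reveals the vocal tract's resonances.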

Source:  OpenStax, Vowel recognition using formant analysis. OpenStax CNX. Dec 17, 2014 Download for free at http://legacy.cnx.org/content/col11729/1.5