The multivariate normal distribution in $n$ dimensions, parameterized by a mean vector $\mu \in \mathbb{R}^n$ and a covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$, has density

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).$$

In the equation above, $|\Sigma|$ denotes the determinant of the matrix $\Sigma$.
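As a quick sanity check on this formula, here is a minimal NumPy sketch (the helper name gaussian_density is ours) that evaluates the density directly and compares it against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, sigma):
    """Evaluate p(x; mu, sigma) exactly as in the formula above."""
    n = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(sigma) ** 0.5)
    # Solve sigma @ v = diff rather than forming the explicit inverse.
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
x = np.array([0.5, -0.2])

print(gaussian_density(x, mu, sigma))         # direct evaluation
print(multivariate_normal(mu, sigma).pdf(x))  # should print the same value
```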

For a random variable $X$ distributed $\mathcal{N}(\mu, \Sigma)$, the mean is (unsurprisingly) given by $\mu$:

$$E[X] = \int_x x \, p(x; \mu, \Sigma) \, dx = \mu.$$

The covariance of a vector-valued random variable $Z$ is defined as $\mathrm{Cov}(Z) = E[(Z - E[Z])(Z - E[Z])^T]$. This generalizes the notion of the variance of a real-valued random variable. The covariance can also be defined as $\mathrm{Cov}(Z) = E[ZZ^T] - (E[Z])(E[Z])^T$. (You should be able to prove to yourself that these two definitions are equivalent.) If $X \sim \mathcal{N}(\mu, \Sigma)$, then

$$\mathrm{Cov}(X) = \Sigma.$$
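To see both identities numerically, here is a short sampling sketch (the particular $\mu$ and $\Sigma$ below are ours); the sample mean and both forms of the sample covariance should come out close to $\mu$ and $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
sigma = np.array([[1.0, 0.8], [0.8, 1.0]])

samples = rng.multivariate_normal(mu, sigma, size=100_000)

print(samples.mean(axis=0))           # approximately mu
print(np.cov(samples, rowvar=False))  # approximately sigma

# Second definition: Cov(Z) = E[Z Z^T] - E[Z] E[Z]^T.
outer = samples[:, :, None] * samples[:, None, :]  # per-sample Z Z^T
print(outer.mean(axis=0) - np.outer(samples.mean(axis=0), samples.mean(axis=0)))
```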

Here are some examples of what the density of a Gaussian distribution looks like:

[Figures: 3-D surface plots of two-dimensional Gaussian densities, each centered at (0, 0), with peaks of different heights.]

The left-most figure shows a Gaussian with mean zero (that is, the 2×1 zero vector) and covariance matrix $\Sigma = I$ (the 2×2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution. The middle figure shows the density of a Gaussian with zero mean and $\Sigma = 0.6I$; the rightmost figure shows one with $\Sigma = 2I$. We see that as $\Sigma$ becomes larger, the Gaussian becomes more “spread-out,” and as it becomes smaller, the distribution becomes more “compressed.”
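A quick numerical illustration of this (the sampling code is ours, not from the source): with $\Sigma = cI$, each coordinate's standard deviation is $\sqrt{c}$, so the density spreads out as $c$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for c in (0.6, 1.0, 2.0):
    samples = rng.multivariate_normal([0.0, 0.0], c * np.eye(2), size=100_000)
    # Per-coordinate standard deviation should be close to sqrt(c).
    print(c, samples.std(axis=0))
```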

Let's look at some more examples.

[Figures: 3-D surface plots of zero-mean Gaussian densities; from left to right, the mass concentrates increasingly along the line $x_1 = x_2$.]

The figures above show Gaussians with mean 0, and with covariance matrices respectively

$$\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}; \quad \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}; \quad \Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}.$$

The leftmost figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in $\Sigma$, the density becomes more “compressed” towards the 45° line (given by $x_1 = x_2$). We can see this more clearly when we look at the contours of the same three densities:

[Figures: contour plots of the same three densities; circular contours on the left, and increasingly narrow ellipses aligned with $x_1 = x_2$ in the middle and right.]

Here's one last set of examples generated by varying $\Sigma$:

[Figures: contour plots; elliptical contours aligned with $x_1 = -x_2$ in the left and (narrower) middle figures, and a larger tilted ellipse in the right figure.]

The plots above used, respectively,

$$\Sigma = \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}; \quad \Sigma = \begin{bmatrix} 1 & -0.8 \\ -0.8 & 1 \end{bmatrix}; \quad \Sigma = \begin{bmatrix} 3 & 0.8 \\ 0.8 & 1 \end{bmatrix}.$$

From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes “compressed” again, but in the opposite direction. Lastly, as we vary the parameters more generally, the contours will form ellipses (the rightmost figure shows an example).
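The following plotting sketch (grid resolution and figure layout are ours) reproduces the three contour shapes just discussed, using scipy.stats.multivariate_normal and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Evaluate each density on a common grid.
xs = np.linspace(-3, 3, 200)
X1, X2 = np.meshgrid(xs, xs)
grid = np.dstack([X1, X2])

covs = [
    np.array([[1.0, -0.5], [-0.5, 1.0]]),  # ellipses along x1 = -x2
    np.array([[1.0, -0.8], [-0.8, 1.0]]),  # same direction, narrower
    np.array([[3.0, 0.8], [0.8, 1.0]]),    # a more general tilted ellipse
]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, cov in zip(axes, covs):
    density = multivariate_normal([0.0, 0.0], cov).pdf(grid)
    ax.contour(X1, X2, density)
    ax.set_aspect("equal")
plt.show()
```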

As our last set of examples, fixing $\Sigma = I$ and varying $\mu$, we can also move the mean of the density around.

[Figures: 3-D surface plots of identity-covariance Gaussian densities whose peaks sit at three different locations.]

The figures above were generated using $\Sigma = I$, and respectively

$$\mu = \begin{bmatrix} 1 \\ 0 \end{bmatrix}; \quad \mu = \begin{bmatrix} -0.5 \\ 0 \end{bmatrix}; \quad \mu = \begin{bmatrix} -1 \\ -1.5 \end{bmatrix}.$$
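A brief numerical check of this (the sampling code is ours): shifting $\mu$ translates the density without changing its shape, so the sample mean simply tracks $\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
for mu in ([1.0, 0.0], [-0.5, 0.0], [-1.0, -1.5]):
    samples = rng.multivariate_normal(mu, np.eye(2), size=100_000)
    print(mu, samples.mean(axis=0))  # sample mean is close to mu
```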

The Gaussian discriminant analysis model

When we have a classification problem in which the input features $x$ are continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models $p(x \mid y)$ using a multivariate normal distribution. The model is:

$$\begin{aligned} y &\sim \mathrm{Bernoulli}(\phi) \\ x \mid y = 0 &\sim \mathcal{N}(\mu_0, \Sigma) \\ x \mid y = 1 &\sim \mathcal{N}(\mu_1, \Sigma) \end{aligned}$$

Writing out the distributions, this is:

$$\begin{aligned} p(y) &= \phi^y (1 - \phi)^{1-y} \\ p(x \mid y = 0) &= \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right) \\ p(x \mid y = 1) &= \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right) \end{aligned}$$

Here, the parameters of our model are $\phi$, $\Sigma$, $\mu_0$ and $\mu_1$. (Note that while there are two different mean vectors $\mu_0$ and $\mu_1$, this model is usually applied using only one covariance matrix $\Sigma$.) The log-likelihood of the data is given by

$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma) \, p(y^{(i)}; \phi).$$
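As a sketch, this log-likelihood can be evaluated directly for a given parameter setting (the function name gda_log_likelihood is ours; it assumes 0/1 labels):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_log_likelihood(X, y, phi, mu0, mu1, sigma):
    """Sum over examples of log p(x | y; mu0, mu1, sigma) + log p(y; phi)."""
    log_px = np.where(
        y == 1,
        multivariate_normal(mu1, sigma).logpdf(X),
        multivariate_normal(mu0, sigma).logpdf(X),
    )
    log_py = y * np.log(phi) + (1 - y) * np.log(1 - phi)
    return np.sum(log_px + log_py)
```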

By maximizing $\ell$ with respect to the parameters, we find the maximum likelihood estimates of the parameters (see problem set 1) to be:

$$\begin{aligned} \phi &= \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = 1\} \\ \mu_0 &= \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\} \, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}} \\ \mu_1 &= \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\} \, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}} \\ \Sigma &= \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu_{y^{(i)}} \right) \left( x^{(i)} - \mu_{y^{(i)}} \right)^T. \end{aligned}$$
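Here is a minimal implementation of these estimators (the function name fit_gda and the synthetic data below are ours); fitting data generated from the model itself should approximately recover the generating parameters:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates for GDA, following the formulas above.

    X: (m, n) matrix of inputs; y: (m,) array of 0/1 labels.
    """
    m = X.shape[0]
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Shared covariance: residuals are taken from each example's own class mean.
    residuals = X - np.where((y == 1)[:, None], mu1, mu0)
    sigma = residuals.T @ residuals / m
    return phi, mu0, mu1, sigma

# Synthetic data drawn from the model itself.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([-1.0, -1.0], np.eye(2), size=500),
    rng.multivariate_normal([1.0, 1.0], np.eye(2), size=500),
])
y = np.array([0] * 500 + [1] * 500)
print(fit_gda(X, y))  # approximately phi=0.5, mu0=(-1,-1), mu1=(1,1), sigma=I
```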

Source: OpenStax, Machine Learning. OpenStax CNX, Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4