
Principal components analysis

Figure 3. An illustration of PCA. a) A data set given as 3-dimensional points. b) The three orthogonal Principal Components (PCs) for the data, ordered by variance. c) The projection of the data set onto the first two PCs, discarding the third one.

For data points of dimensionality M, the goal of PCA is to compute M so-called Principal Components (PCs), which are M-dimensional vectors that are aligned with the directions of maximum variance (in the mathematical sense) of the data. These PCs have the following properties:

  • The PCs are ordered by data variance. In other words, the first PC is aligned with the direction of maximum variance, the second PC with the direction (perpendicular to the first) that accounts for the most remaining variance, and so on.
  • The PCs form an orthonormal basis, that is, they are all mutually perpendicular and have unit length. As a result, the coordinates of the data along different PCs are uncorrelated.
For example, in figure 3 b), the PCs have been superimposed on the data set (the PCs are drawn with different lengths to illustrate the amount of data variance each one accounts for, but remember that they are actually of unit length). Since the PCs are orthonormal, they form a vector basis in terms of which the data set can be expressed. An alternative, equivalent view is that the PCs become aligned with the canonical axes X, Y, Z, ... if the data set is rotated, or more generally rigidly transformed, so that the directions of maximum variance coincide with the canonical basis. In this simple example, the last direction of maximum variance, the third in this case, accounts for little or no data variability. It is customary, when projecting the original data set onto the principal components, to discard the components that do not contribute significantly to the data variance. In figure 3 c), the third component has been discarded and the projection of the data onto the first two components is shown. Discarding the least important components is how dimensionality reduction is performed.
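The properties listed above can be checked numerically. The following is a minimal sketch, assuming NumPy and the standard construction of PCs as eigenvectors of the covariance matrix of the mean-centered data (this module does not prescribe a particular algorithm). The function name `principal_components` and the synthetic, nearly flat 3-dimensional data set are hypothetical and serve only as an illustration.

```python
import numpy as np

def principal_components(X):
    """Return (pcs, variances, mean) for data X of shape (N, M).

    Row i of pcs is the i-th PC; rows are sorted by decreasing variance.
    Assumption: PCs are taken as eigenvectors of the sample covariance matrix.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                            # center the data
    cov = np.cov(Xc, rowvar=False)           # M x M covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    return eigvecs[:, order].T, eigvals[order], mean

# Hypothetical data: 500 points with large spread in two directions
# and almost none in the third.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.1])

pcs, variances, mean = principal_components(X)

# Orthonormal basis: pcs @ pcs.T should be the identity matrix.
is_orthonormal = np.allclose(pcs @ pcs.T, np.eye(3))

# Coordinates along different PCs are uncorrelated:
# their covariance matrix is diagonal.
scores = (X - mean) @ pcs.T
is_uncorrelated = np.allclose(np.cov(scores, rowvar=False), np.diag(variances))

# Fraction of the total variance accounted for by each PC.
explained = variances / variances.sum()
```

For data of this kind, the last entry of `explained` is very small, which is the numerical counterpart of discarding the third component in figure 3 c).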

For M-dimensional input data, the Principal Components (PCs) are M-dimensional vectors and have two main uses. These are to:

  • Project the input data onto the PCs. Taking the dot product of an input data point with any PC returns the scalar value of the projection of the point onto that PC. Since the PCs have unit length, this projection serves as the coordinate of the input point along the PC in question. In principle, M-dimensional input data can be projected onto all M of its PCs, but typically we are not interested in the lesser ones (the data set would be nicely aligned with the PCs, but we would still be using M coordinates for each point). Using just the first few PCs as a basis and computing the projections onto, say, the first d PCs yields the best d-dimensional representation of each point from a maximum variance point of view (e.g., d=2 in figure 3 c)). This is the actual dimensionality reduction.
  • Interpolate or synthesize new points. The PCs themselves point in the directions of maximum variance, as explained above. For this reason, PC_i can be used as a direction vector along which new points can be synthesized by choosing parameter values a_i and then producing artificial M-dimensional points as the linear combination a_1 PC_1 + a_2 PC_2 + ... . Points synthesized in this way lie approximately on the low-dimensional hyperplane spanned by the original data set. The projections of the original points correspond to particular values of these new "coordinates" a_i. Being able to interpolate other points not in the original data set is a useful property that many other dimensionality reduction methods do not have.
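Both uses can be sketched in a few lines. The example below is an illustration rather than a reference implementation. It assumes NumPy, obtains the PCs from a singular value decomposition of the mean-centered data, and adds the data mean back when synthesizing points (an assumption, since the text above writes only the linear combination of PCs). The helper names `fit_pcs`, `project` and `synthesize`, as well as the synthetic data, are hypothetical.

```python
import numpy as np

def fit_pcs(X, d):
    """Return the data mean and the first d PCs (as rows) of X, shape (N, M)."""
    mean = X.mean(axis=0)
    # Rows of vt are unit-length directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:d]

def project(X, mean, pcs):
    """Coordinates a_i of each point: dot products with each PC."""
    return (X - mean) @ pcs.T

def synthesize(a, mean, pcs):
    """M-dimensional point(s): mean + a_1 PC_1 + a_2 PC_2 + ..."""
    return mean + a @ pcs

# Hypothetical 3-dimensional data lying close to a 2-dimensional plane.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
plane = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0]])
X = latent @ plane + np.array([10.0, -2.0, 4.0]) + 0.01 * rng.normal(size=(200, 3))

mean, pcs = fit_pcs(X, d=2)

# Use 1: dimensionality reduction, 3 coordinates per point become 2.
coords = project(X, mean, pcs)           # shape (200, 2)

# Mapping the coordinates back gives points on the plane spanned by the PCs.
recon = synthesize(coords, mean, pcs)    # shape (200, 3)
max_error = np.abs(recon - X).max()

# Use 2: interpolate a new point halfway between two original points
# by averaging their low-dimensional coordinates.
a_mid = 0.5 * (coords[0] + coords[1])
new_point = synthesize(a_mid, mean, pcs)
```

Because the synthetic data lie almost exactly on a plane, `max_error` is on the order of the added noise, and `new_point` lies on the same plane as the reconstructed points.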

Source:  OpenStax, Geometric methods in structural computational biology. OpenStax CNX. Jun 11, 2007 Download for free at http://cnx.org/content/col10344/1.6