<< Chapter < Page Chapter >> Page >

As a very simple and abstract example of when non-linear dimensionality reduction would be preferred over a linear method such as PCA, consider the data set depicted in figure 8 a). The data is apparently two-dimensional, and naively considering the data variance in this way leads to two "important" principal components, as figure 8 b) shows. However, the data has been sampled from a one-dimensional, but non-linear, process. Figure 8 c) shows the correct interpretation of this data set as non-linear but one-dimensional.

Effects of dimensionality reduction on an inherently non-linear data set. a) The original data given as a two-dimensional set. b) PCA identifies two PCs as contributing significantly to explain the data variance. c) However, the inherent topology (connectivity) of the data helps identify the set as being one-dimensional, but non-linear.

Most interesting molecular processes share this characteristic of being low-dimensional but highly non-linear in nature. For example, in the protein folding example depicted in figure 1, the atom positions follow very complicated, curved paths to achieve the folded shape. However, the process can still be thought of as mainly one-dimensional. Linear methods such as PCA would fail to correctly identify collective modes of motion that do not destroy the protein when followed, much like in the simple example of figure 8.

Several non-linear dimensionality reduction techniques exist, that can be classified as either parametric or non-parametric.

  • Parametric methods need to be given a model to try to fit the data, in the form of a mathematical function called a kernel . Kernel PCA is a variant of PCA that projects the points onto a mathematical hypersurface provided as input together with the points. When using parametric methods, the data is forced to lie on the supplied surface, so in general this family of methods do not work well with molecular motion data, for which the non-linearity is unknown.
  • Non-parametric methods use the data itself in an attempt to infer the non-linearity from it, much like deducing the inherent connectivity in figure 8 c). The most popular methods are Isometric Feature Mapping (Isomap) from Tenenbaum et al. and Locally Linear Embedding (LLE) from Roweis et al. .
The Isomap algorithm is more intuitive and numerically stable than LLE and will be discussed next.

Isometric feature mapping (isomap)

Isomap was introduced by Tenenbaum et al. in 2000. It is based on an improvement over a previous technique known as MultiDimensional Scaling (MDS). Isomap aims at capturing the non-linearity of a data set from the set itself, by computing relationships between neighboring points. Both MDS and the non-linear improvement will be presented first, and then the Isomap algorithm will be detailed.

MDS. Multidimensional Scaling (see Cox et al. ) is a technique that produces a low-dimenisional representation for an input set of n points, where the dimensions are ordered by variance, so it is similar to PCA in this respect. However, MDS does not require coordinates as input, but rather, distances between points. More precisely, MDS requires as input a data set of points, and a distance measure d(i,j) between any pair of points x i and x j . MDS works best when the distance measure is a metric . MDS starts by computing all possible pairwise distances between the input points, and placing them in a matrix D so that D ij d(i,j) . The goal of MDS is to produce, for every point, a set of n Euclidean coordinates such that the Euclidean distance between all pairs match as close as possible the original pairwise distances D ij , as depicted in figure 9.

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, Geometric methods in structural computational biology. OpenStax CNX. Jun 11, 2007 Download for free at http://cnx.org/content/col10344/1.6
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Geometric methods in structural computational biology' conversation and receive update notifications?

Ask