One special and important case of model selection is called feature selection. To motivate this, imagine that you have a supervised learning problem where the number of features n is very large (perhaps much larger than the number of training examples), but you suspect that there is only a small number of features that are “relevant” to the learning task. Even if you use a simple linear classifier (such as the perceptron) over the n input features, the VC dimension of your hypothesis class would still be O(n), and thus overfitting would be a potential problem unless the training set is fairly large.
In such a setting, you can apply a feature selection algorithm to reduce the number of features. Given n features, there are 2^n possible feature subsets (since each of the n features can either be included or excluded from the subset), and thus feature selection can be posed as a model selection problem over 2^n possible models. For large values of n, it's usually too expensive to explicitly enumerate over and compare all 2^n models, and so typically some heuristic search procedure is used to find a good feature subset. The following search procedure is called forward search:

1. Initialize F to the empty feature subset.
2. Repeat:
(a) For each feature i not already in F, let F_i be F with feature i added, and use some version of cross validation to evaluate F_i (i.e., train your learning algorithm using only the features in F_i, and estimate its generalization error).
(b) Set F to be the best feature subset found in step (a).
3. Select and output the best feature subset that was evaluated during the entire search procedure.
The outer loop of the algorithm can be terminated either when F is the set of all features, or when the size of F exceeds some pre-set threshold (corresponding to the maximum number of features that you want the algorithm to consider using).
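To make the procedure concrete, here is a minimal sketch of forward search in Python, assuming scikit-learn is available for the cross-validation step; the synthetic dataset, the logistic-regression base learner, and the max_features threshold are illustrative choices rather than part of the procedure itself.

```python
# A minimal sketch of wrapper-style forward search, assuming scikit-learn is
# available; the dataset, estimator, and threshold below are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

def forward_search(X, y, estimator, max_features=10, cv=5):
    n = X.shape[1]
    selected = []                  # F: the current feature subset
    best_overall = ([], -np.inf)   # best subset seen during the entire search
    while len(selected) < min(max_features, n):
        scores = []
        for i in range(n):
            if i in selected:
                continue
            candidate = selected + [i]   # F_i = F plus feature i
            score = cross_val_score(estimator, X[:, candidate], y, cv=cv).mean()
            scores.append((score, candidate))
        score, candidate = max(scores)   # best single-feature addition
        selected = candidate             # F := best F_i
        if score > best_overall[1]:
            best_overall = (candidate, score)
    return best_overall

features, score = forward_search(X, y, LogisticRegression(max_iter=1000))
print("selected features:", features, "cv accuracy: %.3f" % score)
```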
The algorithm described above is one instantiation of wrapper model feature selection, since it is a procedure that “wraps” around your learning algorithm, and repeatedly makes calls to the learning algorithm to evaluate how well it does using different feature subsets. Aside from forward search, other search procedures can also be used. For example, backward search starts off with F as the set of all features, and repeatedly deletes features one at a time (evaluating single-feature deletions in a similar manner to how forward search evaluates single-feature additions) until F is empty.
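Rather than writing the greedy loop by hand, both directions of this search can also be run with scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24 and later); the sketch below shows backward search, with the estimator and the stopping size n_features_to_select chosen purely for illustration.

```python
# A sketch of backward search via scikit-learn's SequentialFeatureSelector
# (scikit-learn >= 0.24); the estimator and n_features_to_select are
# illustrative choices.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,    # stop once 5 features remain
    direction="backward",      # start from all features, delete one at a time
    cv=5,
)
selector.fit(X, y)
print("kept features:", selector.get_support(indices=True))
```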
Wrapper feature selection algorithms often work quite well, but can be computationally expensive given that they need to make many calls to the learning algorithm. Indeed, complete forward search (terminating when F is the set of all features) would take about O(n^2) calls to the learning algorithm, since the outer loop runs up to n times and each pass evaluates up to n candidate subsets.
Filter feature selection methods give heuristic, but computationally much cheaper, ways of choosing a feature subset. The idea here is to compute some simple score S(i) that measures how informative each feature x_i is about the class labels y. Then, we simply pick the k features with the largest scores S(i).
One possible choice of the score would be to define S(i) to be (the absolute value of) the correlation between x_i and y, as measured on the training data. This would result in our choosing the features that are the most strongly correlated with the class labels. In practice, it is more common (particularly for discrete-valued features x_i) to choose S(i) to be the mutual information MI(x_i, y) between x_i and y:

MI(x_i, y) = Σ_{x_i} Σ_{y} p(x_i, y) log ( p(x_i, y) / ( p(x_i) p(y) ) ),

where the sums are over the possible values of x_i and y, and the probabilities p(x_i, y), p(x_i) and p(y) can be estimated according to their empirical distributions on the training set.
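As a concrete illustration of a filter method, the sketch below estimates the mutual information of each binary feature with the labels from empirical counts on a toy dataset and keeps the k highest-scoring features; the synthetic data, the artificially informative feature, and the value of k are assumptions made only for this example.

```python
# A sketch of filter-style selection using an empirical mutual-information
# score for binary features; the toy data and k are illustrative.
import numpy as np

def mutual_information(xi, y):
    """Empirical MI(x_i, y) for binary-valued xi and y."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((xi == a) & (y == b))   # p(x_i = a, y = b)
            p_x, p_y = np.mean(xi == a), np.mean(y == b)
            if p_xy > 0:                           # treat 0 log 0 as 0
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = rng.integers(0, 2, size=(500, 20))
X[:, 3] = y ^ (rng.random(500) < 0.1)              # make feature 3 informative

scores = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
k = 5
top_k = np.argsort(scores)[::-1][:k]               # k features with largest S(i)
print("top features:", top_k)
```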