Here, ${\nabla}_{\theta}\ell \left(\theta \right)$ is, as usual, the vector of partial derivatives of $\ell \left(\theta \right)$ with respect to the ${\theta}_{i}$ 's; and $H$ is an $n$ -by- $n$ matrix (actually, $(n+1)$ -by- $(n+1)$ , assuming that we include the intercept term) called the Hessian, whose entries are given by

$H_{ij} = \frac{\partial^{2} \ell(\theta)}{\partial \theta_{i}\, \partial \theta_{j}}.$
Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting an $n$ -by- $n$ Hessian; but so long as $n$ is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood function $\ell \left(\theta \right)$ , the resulting method is also called Fisher scoring .
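The update described above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the gradient and Hessian expressions are the standard ones for the logistic regression log likelihood, and the function and variable names are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=10):
    """Maximize the logistic regression log likelihood via Newton's method.

    X is an m-by-(n+1) design matrix (first column of ones for the
    intercept); y is an m-vector of 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)            # h_theta(x^(i)) for each example
        grad = X.T @ (y - h)              # gradient of ell(theta)
        # Hessian of ell(theta): -X^T diag(h*(1-h)) X (negative definite)
        H = -(X.T * (h * (1 - h))) @ X
        # Newton update: theta := theta - H^{-1} grad
        theta -= np.linalg.solve(H, grad)
    return theta
```

Note that, in line with the cost discussion above, each iteration solves an $n$-by-$n$ linear system (the expensive step), but only a handful of iterations are typically needed.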
So far, we've seen a regression example and a classification example. In the regression example, we had $y|x;\theta \sim \mathcal{N}(\mu ,{\sigma}^{2})$ , and in the classification one, $y|x;\theta \sim \mathrm{Bernoulli}(\Phi )$ , for some appropriate definitions of $\mu $ and $\Phi $ as functions of $x$ and $\theta $ . In this section, we will show that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs). We will also show how other models in the GLM family can be derived and applied to other classification and regression problems.
To work our way up to GLMs, we will begin by defining exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

$p(y;\eta ) = b(y)\exp\left({\eta}^{T}T(y) - a(\eta )\right)$
Here, $\eta $ is called the natural parameter (also called the canonical parameter ) of the distribution; $T\left(y\right)$ is the sufficient statistic (for the distributions we consider, it will often be the case that $T\left(y\right)=y$ ); and $a\left(\eta \right)$ is the log partition function . The quantity ${e}^{-a\left(\eta \right)}$ essentially plays the role of a normalization constant, ensuring that the distribution $p(y;\eta )$ sums/integrates over $y$ to 1.
A fixed choice of $T$ , $a$ and $b$ defines a family (or set) of distributions that is parameterized by $\eta $ ; as we vary $\eta $ , we then get different distributions within this family.
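As a concrete numerical illustration of this form (using a parameterization that is standard for this family but assumed here, since the corresponding derivation appears later), the unit-variance Gaussian $\mathcal{N}(\mu, 1)$ fits the definition with $\eta = \mu$, $T(y) = y$, $a(\eta) = \eta^2/2$, and $b(y) = (1/\sqrt{2\pi})e^{-y^2/2}$:

```python
import math

def gaussian_pdf(y, mu):
    """Density of N(mu, 1) in its usual form."""
    return math.exp(-(y - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def exp_family_pdf(y, eta):
    """Same density written as b(y) * exp(eta * T(y) - a(eta)),
    with the assumed choices T(y) = y, a(eta) = eta^2 / 2,
    b(y) = (1/sqrt(2*pi)) * exp(-y^2 / 2)."""
    b = math.exp(-(y ** 2) / 2) / math.sqrt(2 * math.pi)
    return b * math.exp(eta * y - eta ** 2 / 2)

# The two forms agree for every (mu, y) pair tried here.
for mu in (-1.0, 0.0, 2.5):
    for y in (-2.0, 0.3, 1.7):
        assert abs(gaussian_pdf(y, mu) - exp_family_pdf(y, mu)) < 1e-12
```

Varying $\eta$ (here, the mean $\mu$) sweeps out the different members of the family, as described above.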
We now show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions. The Bernoulli distribution with mean $\Phi $ , written $\mathrm{Bernoulli}\left(\Phi \right)$ , specifies a distribution over $y\in \{0,1\}$ , so that $p(y=1;\Phi )=\Phi $ ; $p(y=0;\Phi )=1-\Phi $ . As we vary $\Phi $ , we obtain Bernoulli distributions with different means. We now show that this class of Bernoulli distributions, ones obtained by varying $\Phi $ , is in the exponential family; i.e., that there is a choice of $T$ , $a$ and $b$ so that Equation [link] becomes exactly the class of Bernoulli distributions.
We write the Bernoulli distribution as:

$p(y;\Phi ) = {\Phi}^{y}{(1-\Phi )}^{1-y} = \exp\left(y\log\Phi + (1-y)\log(1-\Phi )\right) = \exp\left(\left(\log\frac{\Phi}{1-\Phi}\right)y + \log(1-\Phi )\right)$
Thus, the natural parameter is given by $\eta =\log(\Phi /(1-\Phi ))$ . Interestingly, if we invert this definition for $\eta $ by solving for $\Phi $ in terms of $\eta $ , we obtain $\Phi =1/(1+{e}^{-\eta})$ . This is the familiar sigmoid function! This will come up again when we derive logistic regression as a GLM. To complete the formulation of the Bernoulli distribution as an exponential family distribution, we also have

$T(y) = y$
$a(\eta ) = -\log(1-\Phi ) = \log(1+{e}^{\eta})$
$b(y) = 1$
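A quick numerical sanity check of this formulation (illustrative only, using $T(y)=y$ and $b(y)=1$ as in the text): the Bernoulli pmf $\Phi^{y}(1-\Phi)^{1-y}$ should coincide with $b(y)\exp(\eta\, T(y) - a(\eta))$ when $\eta = \log(\Phi/(1-\Phi))$ and $a(\eta) = \log(1+e^{\eta})$:

```python
import math

def bernoulli_pmf(y, phi):
    """Bernoulli pmf in its usual form: phi^y * (1-phi)^(1-y)."""
    return phi ** y * (1 - phi) ** (1 - y)

def exp_family_pmf(y, eta):
    """Same pmf in exponential family form, with T(y) = y, b(y) = 1,
    and log partition function a(eta) = log(1 + e^eta)."""
    return math.exp(eta * y - math.log(1 + math.exp(eta)))

for phi in (0.1, 0.5, 0.9):
    eta = math.log(phi / (1 - phi))      # natural parameter
    for y in (0, 1):
        assert abs(bernoulli_pmf(y, phi) - exp_family_pmf(y, eta)) < 1e-12
    # inverting eta recovers phi via the sigmoid function
    assert abs(1 / (1 + math.exp(-eta)) - phi) < 1e-12
```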