
In detail, the algorithm is as follows:

  1. Randomly sample $m$ states $s^{(1)}, s^{(2)}, \ldots, s^{(m)} \in S$.
  2. Initialize $\theta := 0$.
  3. Repeat {
    1. For $i = 1, \ldots, m$ {
      1. For each action $a \in A$ {
        1. Sample $s_1', \ldots, s_k' \sim P_{s^{(i)}a}$ (using a model of the MDP).
        2. Set $q(a) = \frac{1}{k} \sum_{j=1}^{k} R(s^{(i)}) + \gamma V(s_j')$.
        3. // Hence, $q(a)$ is an estimate of $R(s^{(i)}) + \gamma \mathbb{E}_{s' \sim P_{s^{(i)}a}}[V(s')]$.
      2. }
      3. Set $y^{(i)} = \max_a q(a)$.
      4. // Hence, $y^{(i)}$ is an estimate of $R(s^{(i)}) + \gamma \max_a \mathbb{E}_{s' \sim P_{s^{(i)}a}}[V(s')]$.
    2. }
    3. // In the original value iteration algorithm (over discrete states),
    4. // we updated the value function according to $V(s^{(i)}) := y^{(i)}$.
    5. // In this algorithm, we want $V(s^{(i)}) \approx y^{(i)}$, which we'll achieve
    6. // using supervised learning (linear regression).
    7. Set $\theta := \arg\min_\theta \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T \phi(s^{(i)}) - y^{(i)} \right)^2$.
  4. }

Above, we had written out fitted value iteration using linear regression as the algorithm for trying to make $V(s^{(i)})$ close to $y^{(i)}$. That step of the algorithm is completely analogous to a standard supervised learning (regression) problem in which we have a training set $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$ and want to learn a function mapping from $x$ to $y$; the only difference is that here $s$ plays the role of $x$. Even though our description above used linear regression, clearly other regression algorithms (such as locally weighted linear regression) can also be used.
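As a concrete illustration, here is a minimal Python sketch of the algorithm above. The names `sample_next_state`, `R`, `phi`, `states`, and `actions` are hypothetical placeholders for a model of the MDP, the reward function, the feature map, and the sampled states and action set; they are not defined in these notes.

```python
import numpy as np

# Hypothetical ingredients (illustrative names, not defined in the notes):
#   states            -- the m sampled states s^(1), ..., s^(m)
#   actions           -- the finite action set A
#   sample_next_state -- sample_next_state(s, a) draws s' ~ P_{sa} from a model of the MDP
#   R                 -- R(s) gives the reward of state s
#   phi               -- phi(s) returns the feature vector of state s as a 1-D array

def fitted_value_iteration(states, actions, sample_next_state, R, phi,
                           gamma=0.99, k=10, num_iters=50):
    theta = np.zeros(len(phi(states[0])))        # Initialize theta := 0

    def V(s):                                    # V(s) = theta^T phi(s)
        return theta @ phi(s)

    for _ in range(num_iters):                   # Repeat { ... }
        y = np.zeros(len(states))
        for i, s in enumerate(states):
            q = []
            for a in actions:
                # Sample s'_1, ..., s'_k ~ P_{s a}; q(a) estimates
                # R(s) + gamma * E_{s' ~ P_{sa}}[V(s')]
                succ = [sample_next_state(s, a) for _ in range(k)]
                q.append(R(s) + gamma * np.mean([V(sp) for sp in succ]))
            y[i] = max(q)                        # y^(i) = max_a q(a)

        # Supervised-learning step:
        # theta := argmin_theta (1/2) sum_i (theta^T phi(s^(i)) - y^(i))^2
        Phi = np.array([phi(s) for s in states])
        theta = np.linalg.lstsq(Phi, y, rcond=None)[0]

    return theta
```

The final step uses ordinary least squares; as noted above, any other regression algorithm could be substituted for that line.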

Unlike value iteration over a discrete set of states, fitted value iteration cannot be proved to always converge. However, in practice, it often does converge (or approximately converge), and works well for many problems. Note also that if we are using a deterministic simulator/model of the MDP, then fitted value iteration can be simplified by setting $k = 1$ in the algorithm. This is because the expectation in Equation [link] becomes an expectation over a deterministic distribution, and so a single example is sufficient to exactly compute that expectation. Otherwise, in the algorithm above, we had to draw $k$ samples and average in order to approximate that expectation (see the definition of $q(a)$ in the algorithm pseudo-code).

Finally, fitted value iteration outputs $V$, which is an approximation to $V^*$. This implicitly defines our policy. Specifically, when our system is in some state $s$ and we need to choose an action, we would like to choose the action

$$\arg\max_a \mathbb{E}_{s' \sim P_{sa}}\left[ V(s') \right]$$

The process for computing/approximating this is similar to the inner loop of fitted value iteration, where for each action we sample $s_1', \ldots, s_k' \sim P_{sa}$ to approximate the expectation. (And again, if the simulator is deterministic, we can set $k = 1$.)
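A minimal sketch of this action-selection step, reusing the hypothetical `sample_next_state` helper and a value function `V` (e.g., `V(s) = theta @ phi(s)` from the sketch above); these names are illustrative, not from the notes.

```python
def choose_action(s, actions, sample_next_state, V, k=10):
    """Approximate arg max_a E_{s' ~ P_{sa}}[V(s')] by sampling.

    If the simulator is deterministic, k = 1 is enough, since every
    sample returns the same successor state.
    """
    best_a, best_val = None, float("-inf")
    for a in actions:
        # Monte Carlo estimate of E_{s' ~ P_{sa}}[V(s')]
        val = sum(V(sample_next_state(s, a)) for _ in range(k)) / k
        if val > best_val:
            best_a, best_val = a, val
    return best_a
```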

In practice, there are often other ways to approximate this step as well. For example, one very common case is if the simulator is of the form $s_{t+1} = f(s_t, a_t) + \epsilon_t$, where $f$ is some deterministic function of the states (such as $f(s_t, a_t) = A s_t + B a_t$), and $\epsilon_t$ is zero-mean Gaussian noise. In this case, we can pick the action given by

$$\arg\max_a V(f(s, a)).$$

In other words, here we are just setting $\epsilon_t = 0$ (i.e., ignoring the noise in the simulator) and setting $k = 1$. Equivalently, this can be derived from Equation [link] using the approximation

$$\mathbb{E}_{s'}\left[ V(s') \right] \approx V\left( \mathbb{E}_{s'}[s'] \right) = V(f(s, a)),$$

where here the expectation is over the random $s' \sim P_{sa}$. So long as the noise terms $\epsilon_t$ are small, this will usually be a reasonable approximation.
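A sketch of this noise-free shortcut, assuming a known deterministic mean-dynamics function `f(s, a)` (for example, $f(s, a) = A s + B a$); again the names are illustrative.

```python
def choose_action_noise_free(s, actions, f, V):
    # Set eps_t = 0 and k = 1: score each action by V(f(s, a)),
    # i.e. evaluate V at the noise-free successor state.
    return max(actions, key=lambda a: V(f(s, a)))
```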

However, for problems that don't lend themselves to such approximations, having to sample $k \cdot |A|$ states using the model in order to approximate the expectation above can be computationally expensive.

Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4