A second downside of this representation is called the curse of dimensionality. Suppose $S={\mathbb{R}}^{n}$ , and we discretize each of the $n$ dimensions of the state into $k$ values. Then the total number of discrete states we have is ${k}^{n}$ . This grows exponentially in the dimension of the state space $n$ , and thus does not scale well to large problems. For example, with a 10d state, if we discretize each state variable into 100 values, we would have ${100}^{10}={10}^{20}$ discrete states, which is far too many to represent even on a modern desktop computer.
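The exponential blow-up is easy to see numerically. The following short sketch (not from the original text; the function name is ours) just computes ${k}^{n}$ for the two cases discussed above:

```python
# Illustration: counting the discrete states produced by splitting each
# of the n state dimensions into k bins.

def num_discrete_states(n: int, k: int) -> int:
    """Number of cells in a uniform k-values-per-dimension discretization of R^n."""
    return k ** n

# A 2d problem stays manageable:
print(num_discrete_states(2, 100))   # 10,000 states
# A 10d problem explodes: 100^10 = 10^20 states.
print(num_discrete_states(10, 100))
```

Even at one byte per state, the 10d table would require $10^{20}$ bytes, far beyond any desktop machine's memory.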
As a rule of thumb, discretization usually works extremely well for 1d and 2d problems (and has the advantage of being simple and quick to implement). With a little bit of cleverness and some care in choosing the discretization method, it often works well for problems with up to 4d states. If you're extremely clever, and somewhat lucky, you may even get it to work for some 6d problems. But it very rarely works for problems any higher-dimensional than that.
We now describe an alternative method for finding policies in continuous-state MDPs, in which we approximate ${V}^{*}$ directly, without resorting to discretization. This approach, called value function approximation, has been successfully applied to many RL problems.
To develop a value function approximation algorithm, we will assume that we have a model , or simulator , for the MDP. Informally, a simulator is a black box that takes as input any (continuous-valued) state ${s}_{t}$ and action ${a}_{t}$ , and outputs a next state ${s}_{t+1}$ sampled according to the state transition probabilities ${P}_{{s}_{t}{a}_{t}}$ .
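In code, such a simulator is simply a function from a state-action pair to a sampled next state. The sketch below is purely illustrative: the linear dynamics, the noise model, and the name `simulate` are our assumptions, standing in for whatever black box one actually has.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(s, a, noise_std=0.01):
    """Hypothetical black-box simulator: given state s_t and action a_t,
    return one sample s_{t+1} drawn from P_{s_t a_t}.

    The toy dynamics here (drift by 0.1*a plus Gaussian noise) are chosen
    for illustration only; a real simulator would encapsulate the actual
    system's physics.
    """
    return s + 0.1 * np.asarray(a) + rng.normal(scale=noise_std, size=np.shape(s))
```

Note that all a value function approximation algorithm needs is the ability to draw samples this way; the transition probabilities themselves never have to be written down explicitly.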
There are several ways to obtain such a model. One is to use physics simulation. For example, the simulator for the inverted pendulum in PS4 was obtained by using the laws of physics to calculate what position and orientation the cart/pole will be in at time $t+1$ , given the current state at time $t$ and the action $a$ taken, assuming that we know all the parameters of the system such as the length of the pole, the mass of the pole, and so on. Alternatively, one can also use an off-the-shelf physics simulation software package which takes as input a complete physical description of a mechanical system, the current state ${s}_{t}$ and action ${a}_{t}$ , and computes the state ${s}_{t+1}$ of the system a small fraction of a second into the future. Open Dynamics Engine (http://www.ode.com) is one example of a free/open-source physics simulator that can be used to simulate systems like the inverted pendulum, and that has been a reasonably popular choice among RL researchers.
An alternative way to get a model is to learn one from data collected in the MDP. For example, suppose we execute $m$ trials in which we repeatedly take actions in an MDP, each trial for $T$ timesteps. This can be done by picking actions at random, executing some specific policy, or via some other way of choosing actions. We would then observe $m$ state sequences like the following:

${s}_{0}^{(1)}\xrightarrow{{a}_{0}^{(1)}}{s}_{1}^{(1)}\xrightarrow{{a}_{1}^{(1)}}{s}_{2}^{(1)}\xrightarrow{{a}_{2}^{(1)}}\cdots \xrightarrow{{a}_{T-1}^{(1)}}{s}_{T}^{(1)}$

$\vdots$

${s}_{0}^{(m)}\xrightarrow{{a}_{0}^{(m)}}{s}_{1}^{(m)}\xrightarrow{{a}_{1}^{(m)}}{s}_{2}^{(m)}\xrightarrow{{a}_{2}^{(m)}}\cdots \xrightarrow{{a}_{T-1}^{(m)}}{s}_{T}^{(m)}$
We can then apply a learning algorithm to predict ${s}_{t+1}$ as a function of ${s}_{t}$ and ${a}_{t}$ .
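As a concrete sketch, suppose we choose a linear model ${s}_{t+1}=A{s}_{t}+B{a}_{t}$ (one common choice; the function name and array shapes below are our assumptions, not part of the original text). Fitting $A$ and $B$ by least squares over all observed transitions looks like this:

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Fit s_{t+1} ~= A s_t + B a_t by least squares over observed transitions.

    states:      (N, n) array of observed states s_t
    actions:     (N, d) array of actions a_t
    next_states: (N, n) array of observed next states s_{t+1}
    Returns A (n, n) and B (n, d).
    """
    X = np.hstack([states, actions])            # stack inputs: (N, n + d)
    # Solve min_W ||X W - next_states||^2; the rows of W stack [A; B].
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    n = states.shape[1]
    return W[:n].T, W[n:].T                     # A: (n, n), B: (n, d)

# Usage sketch on synthetic data generated from a known (A, B):
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
S = rng.normal(size=(500, 2))
U = rng.normal(size=(500, 1))
S_next = S @ A_true.T + U @ B_true.T            # noiseless transitions
A_hat, B_hat = fit_linear_model(S, U, S_next)
```

On noiseless data the least-squares fit recovers $A$ and $B$ essentially exactly; on real data it returns the best linear approximation to the observed transitions.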