<< Chapter < Page | Chapter >> Page > |
This is clearly not a very good representation, right, and when we talk about regression, you just choose some features of X and run linear regression or something. You get a much better fit to the function. And so the sense that discretization just isn’t a very good source of piecewise constant functions. This just isn’t a very good function for representing many things, and there’s also the sense that there’s no smoothing or there’s no generalization across the different buckets. And in fact, back in regression, I would never have chosen to do regression using this sort of visualization. It’s just really doesn’t make sense.
And so in the same way, instead of X, V(s), instead of X and some hypothesis function of X, if you have the state here and you’re trying to approximate the value function, then you can get discretization to work for many problems but maybe this isn’t the best representation to represent a value function. The other problem with discretization and maybe the more serious problem is what’s often somewhat fancifully called the curse of dimensionality. And just the observation that if the state space is in RN, and if you discretize each variable into K buckets, so if discretize each variable into K discrete values, then you get on the order of K to the power of N discrete states. In other words, the number of discrete states you end up with grows exponentially in the dimension of the problem, and so for a helicopter with 12-dimensional state space, this would be maybe like 100 to the power of 12, just huge, and it’s not feasible. And so discretization doesn’t scale well at all with two problems in high-dimensional state spaces, and this observation actually applies more generally than to just robotics and continuous state problems. For example, another fairly well-known applications of reinforcement learning has been to factory automations. If you imagine that you have 20 machines sitting in the factory and the machines lie in a assembly line and they all do something to a part on the assembly line, then they route the part onto a different machine. You want to use reinforcement learning algorithms, [inaudible] the order in which the different machines operate on your different things that are flowing through your assembly line and maybe different machines can do different things. So if you have N machines and each machine can be in K states, then if you do this sort of discretization, the total number of states would be K to N as well. If you have N machines and if each machine can be in K states, then again, you can get this huge number of states. Other well-known examples would be if you have a board game is another example. You’d want to use reinforcement learning to play chess. Then if you have N pieces on your board game, you have N pieces on the chessboard and if each piece can be in K positions,then this is a game sort of the curse of dimensionality thing where the number of discrete states you end up with goes exponentially with the number of pieces in your board game. So the curse of dimensionality means that discretization scales poorly to high-dimensional state spaces or at least discrete representations scale poorly to high-dimensional state spaces. In practice, discretization will usually, if you have a 2-dimensional problem, discretization will usually work great. If you have a 3-dimensional problem, you can often get discretization to work not too badly without too much trouble. With a 4-dimensional problem, you can still often get to where that they could be challenging and as you go to higher and higher dimensional state spaces, the odds and [inaudible] that you need to figure around to discretization and do things like non-uniform grids, so for example, what I’ve drawn for you is an example of a non-uniform discretization where I’m discretizing S-2 much more finally than S-1. If I think the value function is much more sensitive to the value of state variable S-2 than to S-1, and so as you get into higher dimensional state spaces, you may need to manually fiddle with choices like these with no uniform discretizations and so on. But the folk wisdom seems to be that if you have 2- or 3-dimensional problems, it work fine. With 4-dimensional problems, you can probably get it to work but it’d be just slightly challenging and you can sometimes by fooling around and being clever, you can often push discretization up to let’s say about 6-dimensional problems but with some difficulty and problems higher than 6-dimensional would be extremely difficult to solve with discretization. So that’s just rough folk wisdom order of managing problems you think about using for discretization. But what I want to spend most of today talking about is [inaudible]methods that often work much better than discretization and which we will approximate V* directly without resulting to these sort of discretizations. Before I jump to the specific representation let me just spend a few minutes talking about the problem setup then. For today’s lecture, I’m going to focus on the problem of continuous states and just to keep things sort of very simple in this lecture, I want view of continuous actions, so I’m gonna see discrete actions A. So it turns out actually that is a critical fact also for many problems, it turns out that the state space is much larger than the states of actions. That just seems to have worked out that way for many problems, so for example, for driving a car the state space is 6-dimensional, so if XY T, Xdot, Ydot, Tdot. Whereas, your action has, you still have two actions. You have forward backward motion and steering the car, so you have 6-D states and 2-D actions, and so you can discretize the action much more easily than discretize the states. The only examples down for a helicopter you’ve 12-D states in a 4-dimensional action it turns out, and it’s also often much easier to just discretize a continuous actions into a discrete sum of actions. And for the inverted pendulum, you have a 4-D state and a 1-D action. Whether you accelerate your cart to the left or the right is one D action and so for the rest of today, I’m gonna assume a continuous state but I’ll assume that maybe you’ve already discretized your actions, just because in practice it turns out that not for all problems, with many problems large actions is just less of a difficulty than large state spaces. So I’m going to assume that we have a model or simulator of the MDP, and so this is really an assumption on how the state transition probabilities are represented. I’m gonna assume and I’m going to use the terms “model” and “simulator” pretty much synonymously, so specifically, what I’m going to assume is that we have a black box and a piece of code, so that I can input any state, input an action and it will output S prime, sample from the state transition distribution. Says that this is really my assumption on the representation I have for the state transition probabilities, so I’ll assume I have a box that read take us in for the stated action and output in mixed state. And so since they’re fairly common ways to get models of different MDPs you may work with, one is you might get a model from a physics simulator. So for example, if you’re interested in controlling that inverted pendulum, so your action is A which is the magnitude of the force you exert on the cart to left or right, and your state is Xdot, T, Tdot. I’m just gonna write that in that order. And so I’m gonna write down a bunch of equations just for completeness but everything I’m gonna write below here is most of what I wanna write is a bit gratuitous, but so since I’ll maybe flip open a textbook on physics, a textbook on mechanics, you can work out the equations of motion of a physical device like this, so you find that Sdot. The dot denotes derivative, so the derivative of the state with respect to time is given by Xdot, ?-L(B) cost B over M Tdot, B. And so on where A is the force is the action that you exert on the cart. L is the length of the pole. M is the total mass of the system and so on. So all these equations are good uses, just writing them down for completeness, but by flipping over, open like a physics textbook, you can work out these equations and notions yourself and this then gives you a model which can say that S-2+1. You’re still one time step later will be equal to your previous state plus [inaudible], so in your simulator or in my model what happens to the cart every 10th of a second, so ? T would be within one second and then so plus ? T times that. And so that’d be one way to come up with a model of your MDP. And in this specific example, I’ve actually written down deterministic model because and by deterministic I mean that given an action in a state, the next state is not random, so would be an example of a deterministic model where I can compute the next state exactly as a function of the previous state and the previous action or it’s a deterministic model because all the probability mass is on a single state given the previous stated action. You can also make this a stochastic model. A second way that is often used to attain a model is to learn one. And so again, just concretely what you do is you would imagine that you have a physical inverted pendulum system as you physically own an inverted pendulum robot. What you would do is you would then initialize your inverted pendulum robot to some state and then execute some policy, could be some random policy or some policy that you think is pretty good, or you could even try controlling yourself with a joystick or something. But so you set the system off in some state as zero. Then you take some action. Here’s zero and the game could be chosen by some policy or chosen by you using a joystick tryina control your inverted pendulum or whatever. System would transition to some new state, S-1, and then you take some new action, A-1 and so on. Let’s say you do this for two time steps and sometimes I call this one trajectory and you repeat this M times, so this is the first trial of the first trajectory, and then you do this again. Initialize it in some and so on. So you do this a bunch of times and then you would run the learning algorithm to estimate ST+1 as a function of ST and AT. And for sake of completeness, you should just think of this as inverted pendulum problem, so ST+1 is a 4-dimensional vector. ST, AT will be a 4-dimensional vector and that’ll be a real number, and so you might run linear regression 4 times to predict each of these state variables as a function of each of these 5 real numbers and so on.
Notification Switch
Would you like to follow the 'Machine learning' conversation and receive update notifications?