And so once you’ve found V*, we can use this equation to find the optimal policy π*, and the last piece of this algorithm was Bellman’s equation, where we know that V*(s), the optimal sum of discounted rewards you can get starting from state s, is equal to the immediate reward R(s) you get just for starting off in that state, plus γ times the max over all the actions a you could take of your expected future sum of discounted rewards, your future payoff starting from the state s′, which is where you might transition to after one step. And so this gave us the value iteration algorithm, V.I.
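Written out, the spoken formula above can be reconstructed as the following pair of equations, using the lecture's notation where P_sa(s′) denotes the probability of transitioning to state s′ when taking action a in state s:

```latex
V^*(s) = R(s) + \max_{a}\; \gamma \sum_{s'} P_{sa}(s')\, V^*(s')
\qquad
\pi^*(s) = \arg\max_{a} \sum_{s'} P_{sa}(s')\, V^*(s')
```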
I’m abbreviating value iteration as V.I., so in the value iteration algorithm, in V.I., you just take Bellman’s equation and you repeatedly apply it. So initialize some guess of the value function, say initialize V(s) to zero as the starting guess, and then repeatedly perform this update for all states, and I said last time that if you do this repeatedly, then V(s) will converge to the optimal value function V*(s), and then having found V*(s), you can compute the optimal policy π*.
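The procedure just described can be sketched in a few lines of NumPy. The MDP below (its size, transition probabilities P, and rewards R) is entirely made up for illustration; only the update rule itself comes from the lecture:

```python
import numpy as np

# A made-up MDP purely for illustration: 3 states, 2 actions.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.standard_normal(n_states)   # R(s): immediate reward for state s
gamma = 0.9                         # discount factor

V = np.zeros(n_states)              # initialize V(s) := 0 for all s
for _ in range(1000):
    # Bellman backup: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
    V_new = R + gamma * np.max(P @ V, axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Greedy policy with respect to the converged V approximates pi*.
pi = np.argmax(P @ V, axis=1)
```

Because the backup is a contraction (by the factor gamma), the loop converges regardless of the initial guess.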
Just one final thing I want to recap was the policy iteration algorithm, in which we repeat the following two steps. So let’s see, given a random initial policy, we’ll solve for V^π. We’ll solve for the value function for that specific policy. So this means for every state, compute the expected sum of discounted rewards if you execute the policy π from that state. And then the other step of policy iteration is, having found the value function for your policy, you then update the policy, pretending that what you’ve found is actually the optimal value function V*. And then you repeatedly perform these two steps, where you solve for the value function for your current policy and then pretend that that’s actually the optimal value function and solve for the policy given the value function, and you repeatedly update the value function and then update the policy using that value function. And last time I said that this will also cause the estimated value function V to converge to V* and will cause π to converge to π*, the optimal policy.
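The two alternating steps can likewise be sketched on a toy MDP. Again, the MDP itself (P, R, sizes) is invented for illustration; the evaluation step solves the linear system V = R + γ P_π V exactly, which is one standard way to carry out the "solve for V^π" step:

```python
import numpy as np

# Same kind of made-up MDP as before, purely for illustration.
n_states, n_actions = 3, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.standard_normal(n_states)
gamma = 0.9

pi = np.zeros(n_states, dtype=int)       # arbitrary initial policy
while True:
    # Step 1: policy evaluation. V^pi satisfies V = R + gamma * P_pi V,
    # a linear system in the n_states unknowns V(s); solve it exactly.
    P_pi = P[np.arange(n_states), pi]    # transition matrix under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    # Step 2: policy improvement. Act greedily as if V were V*.
    pi_new = np.argmax(P @ V, axis=1)
    if np.array_equal(pi_new, pi):       # policy stable => converged
        break
    pi = pi_new
```

Since there are finitely many policies and each improvement step is monotone, the loop terminates with a policy that is greedy with respect to its own value function.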
So those were the topics from our last lecture [inaudible] MDPs, which introduced a lot of new notation and symbols, and that just summarizes it all again. What I’m about to do for the rest of today’s lecture is actually build on these two algorithms, so I guess if you have any questions about this piece, ask now before I go on, please. Yeah.
Student: [Inaudible] how those two algorithms are very different?
Instructor (Andrew Ng): I see, right, so yeah, do you see that they’re different? Okay, how it’s different. Let’s see. So here’s one difference. I didn’t say this because we don’t use it today. So value iteration and policy iteration are different algorithms. In policy iteration, in this step, you’re given a fixed policy, and you’re going to solve for the value function for that policy, and so you’re given some fixed policy π, meaning some function mapping from states to actions. So I give you some policy, and whatever. That’s just some policy; it’s not necessarily a great policy. And in that step that I circled, we have to find V^π(s), which means that for every state, you need to compute your expected sum of discounted rewards if you execute this specific policy π, starting off the MDP in that state s.
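For the circled step on its own, evaluating a fixed policy can also be done iteratively, by repeating the backup with the action pinned to π(s) instead of maximizing over actions. The MDP and the policy below are made up for illustration:

```python
import numpy as np

# Made-up MDP for illustration: 3 states, 2 actions.
n_states, n_actions = 3, 2
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.standard_normal(n_states)
gamma = 0.9
pi = rng.integers(n_actions, size=n_states)  # some fixed, not necessarily good, policy

P_pi = P[np.arange(n_states), pi]            # transition matrix under pi
V = np.zeros(n_states)
for _ in range(2000):
    # Note: no max over actions here; the action in each state is fixed by pi.
    V = R + gamma * P_pi @ V
# V now approximates V^pi, the expected discounted reward of following pi.
```

The same fixed point could be obtained by solving the linear system (I - γP_π)V = R directly; the iterative form just mirrors the value iteration update with the policy held fixed.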