
Or, when we are writing rewards as a function of the states only, this becomes

\[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \]

For most of our development, we will use the simpler state rewards $R(s)$, though the generalization to state-action rewards $R(s,a)$ offers no special difficulties.

Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff:

\[ \mathrm{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \right] \]

Note that the reward at timestep $t$ is discounted by a factor of $\gamma^t$. Thus, to make this expectation large, we would like to accrue positive rewards as soon as possible (and postpone negative rewards as long as possible). In economic applications where $R(\cdot)$ is the amount of money made, $\gamma$ also has a natural interpretation in terms of the interest rate (where a dollar today is worth more than a dollar tomorrow).
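As a concrete illustration of the discounting, here is a minimal sketch that computes the discounted return for a made-up reward sequence; the values of $\gamma$ and the rewards below are purely hypothetical.

```python
# Compute the discounted return R(s_0) + gamma*R(s_1) + gamma^2*R(s_2) + ...
# for a made-up, finite reward sequence (illustrative values only).
gamma = 0.9
rewards = [1.0, 0.0, -2.0, 3.0]  # hypothetical R(s_0), R(s_1), R(s_2), R(s_3)

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 1.0 + 0.9*0.0 + 0.81*(-2.0) + 0.729*3.0 = 1.567
```

Note how the same reward contributes less to the total the later it arrives, which is why positive rewards should be accrued early.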

A policy is any function $\pi : S \to A$ mapping from the states to the actions. We say that we are executing some policy $\pi$ if, whenever we are in state $s$, we take action $a = \pi(s)$. We also define the value function for a policy $\pi$ according to

\[ V^\pi(s) = \mathrm{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi \right]. \]

$V^\pi(s)$ is simply the expected sum of discounted rewards upon starting in state $s$, and taking actions according to $\pi$. This notation in which we condition on $\pi$ isn't technically correct because $\pi$ isn't a random variable, but this is quite standard in the literature.
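Since the expectation above is over trajectories of the MDP, one way to make it concrete is to estimate $V^\pi(s)$ by averaging truncated discounted returns over simulated rollouts. The sketch below assumes a small tabular MDP with transition probabilities and state rewards stored as arrays; the names, shapes, and default values are illustrative assumptions, not anything specified in the text.

```python
import numpy as np

def estimate_value(s0, policy, P, R, gamma=0.9, horizon=100, n_rollouts=2000, seed=0):
    """Monte Carlo sketch of V^pi(s0): average the truncated discounted return
    over simulated trajectories that start in s0 and follow `policy`.

    Assumed (hypothetical) layout: P[s][a] is a probability vector over next
    states, R[s] is the reward for being in state s, policy[s] is the action
    taken in state s.
    """
    rng = np.random.default_rng(seed)
    n_states = len(R)
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            ret += discount * R[s]          # accumulate gamma^t * R(s_t)
            s = rng.choice(n_states, p=P[s][policy[s]])  # sample next state
            discount *= gamma
        total += ret
    return total / n_rollouts
```

Truncating at a finite horizon ignores the tail of the series, which contributes at most $\gamma^{\text{horizon}} \max_s |R(s)| / (1 - \gamma)$ to the true value.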

Given a fixed policy $\pi$, its value function $V^\pi$ satisfies the Bellman equations:

\[ V^\pi(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^\pi(s'). \]

This says that the expected sum of discounted rewards $V^\pi(s)$ for starting in $s$ consists of two terms: First, the immediate reward $R(s)$ that we get right away simply for starting in state $s$, and second, the expected sum of future discounted rewards. Examining the second term in more detail, we see that the summation term above can be rewritten $\mathrm{E}_{s' \sim P_{s\pi(s)}}\left[ V^\pi(s') \right]$. This is the expected sum of discounted rewards for starting in state $s'$, where $s'$ is distributed according to $P_{s\pi(s)}$, which is the distribution over where we will end up after taking the first action $\pi(s)$ in the MDP from state $s$. Thus, the second term above gives the expected sum of discounted rewards obtained after the first step in the MDP.

Bellman's equations can be used to efficiently solve for $V^\pi$. Specifically, in a finite-state MDP ($|S| < \infty$), we can write down one such equation for $V^\pi(s)$ for every state $s$. This gives us a set of $|S|$ linear equations in $|S|$ variables (the unknown $V^\pi(s)$'s, one for each state), which can be efficiently solved for the $V^\pi(s)$'s.
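In matrix form the Bellman equations read $(I - \gamma P_\pi) V^\pi = R$, where $P_\pi[s, s'] = P_{s\pi(s)}(s')$ is the transition matrix induced by the policy, so $V^\pi$ can be obtained with a single linear solve. Here is a minimal sketch on a made-up two-state MDP; the numbers are arbitrary.

```python
import numpy as np

# Solve V^pi = R + gamma * P_pi @ V^pi  <=>  (I - gamma * P_pi) V^pi = R.
# P_pi[s, s'] = P_{s, pi(s)}(s'); the two-state MDP below is made up.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
R = np.array([1.0, -1.0])

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
print(V_pi)  # expected discounted return when starting from each state
```

For $\gamma < 1$ the matrix $I - \gamma P_\pi$ is always invertible, so this solve is well defined.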

We also define the optimal value function according to

\[ V^*(s) = \max_\pi V^\pi(s). \]

In other words, this is the best possible expected sum of discounted rewards that can be attained using any policy. There is also a version of Bellman's equations for the optimal value function:

\[ V^*(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^*(s'). \]

The first term above is the immediate reward as before. The second term is the maximum over all actions $a$ of the expected future sum of discounted rewards we'll get after taking action $a$. You should make sure you understand this equation and see why it makes sense.
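This equation characterizes $V^*$ as a fixed point, which suggests computing it by repeatedly applying the right-hand side as an update. That iteration, commonly known as value iteration, is not described in this excerpt, so the sketch below (with an assumed tabular array P of shape (n_states, n_actions, n_states) and state rewards R) is only illustrative.

```python
import numpy as np

def compute_v_star(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Sketch: iterate V <- R + max_a gamma * sum_{s'} P[s, a, s'] * V[s']
    until the values stop changing (a standard value-iteration scheme, not
    taken from this excerpt). P has shape (n_states, n_actions, n_states)."""
    V = np.zeros(len(R))
    for _ in range(max_iters):
        # (P @ V) has shape (n_states, n_actions): expected next-state value per action.
        V_new = R + gamma * np.max(P @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```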

We also define a policy $\pi^* : S \to A$ as follows:

\[ \pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s'). \]
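Given $V^*$ and the transition probabilities, this arg max can be read off directly for every state. A minimal sketch under the same assumed tabular layout as above; the factor $\gamma$ is omitted because it scales every action equally and does not change the arg max.

```python
import numpy as np

def greedy_policy(V_star, P):
    """Sketch: pi*(s) = argmax_a sum_{s'} P[s, a, s'] * V_star[s'].

    P is assumed to have shape (n_states, n_actions, n_states), with P[s, a]
    the distribution over next states after taking action a in state s.
    """
    return np.argmax(P @ V_star, axis=1)  # one action index per state
```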

