
Suppose the observed state is $j$ and the action is $a \in A$. Two results ensue:

  1. A return $r(j, a)$ is realized
  2. The system moves to a new state

Let:

$Y_n =$ state in the $n$th period, $0 \le n \le N-1$

$A_n =$ action taken on the basis of $Y_0, A_0, \ldots, Y_{n-1}, A_{n-1}, Y_n$

[$A_0$ is the initial action, based on the initial state $Y_0$]

A policy $\pi$ is a set of functions $(\pi_0, \pi_1, \ldots, \pi_{N-1})$, such that

$$A_n = \pi_n(Y_0, A_0, \ldots, Y_{n-1}, A_{n-1}, Y_n), \quad 0 \le n \le N-1$$

The expected return under policy $\pi$, when $Y_0 = j_0$, is

$$R(\pi, j_0) = E\left[ \sum_{k=0}^{N-1} r(Y_k, A_k) \right]$$

The goal is to determine $\pi$ to maximize $R(\pi, j_0)$.

Let $Z_k = (Y_k, A_k)$ and $W_n = (Z_0, Z_1, \ldots, Z_n)$. If $\{Y_k : 0 \le k \le N-1\}$ is Markov, then use of (CI9) and (CI11) shows that for any policy the $Z$-process is Markov. Hence

$$E[I_M(Y_{n+1}) \mid W_n] = E[I_M(Y_{n+1}) \mid Z_n] \ \text{a.s.} \quad \forall\, n: 0 \le n \le N-1, \ \text{Borel sets } M$$

We assume time homogeneity in the sense that

$$P(Y_{n+1} = j \mid Y_n = i, A_n = a) = p(j \mid i, a), \ \text{invariant with } n, \quad i, j \in E, \ a \in A$$

We take a dynamic programming approach.

Define recursively $f_N, f_{N-1}, \ldots, f_0$ as follows:

$f_N(j) = 0$, $j \in E$. For $n = N, N-1, \ldots, 2, 1$, set

$$f_{n-1}(j) = \max\Big\{ r(j, a) + \sum_{k \in E} f_n(k)\, p(k \mid j, a) : a \in A \Big\}$$
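As an illustration of the recursion, the following Python sketch computes $f_N, f_{N-1}, \ldots, f_0$ backward from the terminal condition $f_N \equiv 0$, recording the maximizing action at each stage. The state space `E`, action set `A`, return function `r`, transition probabilities `p`, and horizon `N` used here are hypothetical placeholders, not taken from the text.

```python
# A minimal sketch of the backward dynamic-programming recursion
#   f_{n-1}(j) = max_a { r(j, a) + sum_k f_n(k) p(k | j, a) },  f_N(j) = 0.
# The example state space, actions, returns, and transition kernel below are
# illustrative assumptions only.

E = [0, 1]                      # states
A = ["stay", "switch"]          # actions
N = 4                           # horizon

def r(j, a):
    # hypothetical one-period return
    return 1.0 if (j == 1 and a == "stay") else 0.0

def p(k, j, a):
    # hypothetical transition probability p(k | j, a)
    if a == "stay":
        return 0.9 if k == j else 0.1
    return 0.8 if k != j else 0.2

# f[n][j] holds f_n(j); start from the terminal condition f_N = 0
f = [{j: 0.0 for j in E} for _ in range(N + 1)]
best = [{j: None for j in E} for _ in range(N)]   # maximizing action at each stage

for n in range(N, 0, -1):
    for j in E:
        values = {a: r(j, a) + sum(f[n][k] * p(k, j, a) for k in E) for a in A}
        best[n - 1][j] = max(values, key=values.get)
        f[n - 1][j] = values[best[n - 1][j]]

print(f[0])      # f_0(j): maximal expected return starting from state j
print(best[0])   # the maximizing first-stage action
```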

Put

$$X_n = \sum_{k=1}^{n} \big\{ f_k(Y_k) - E[f_k(Y_k) \mid W_{k-1}] \big\}$$

Then, by A4-2, $(X_N, Z_N)$ is a MG, with $E[X_n] = 0$, $0 \le n \le N$, and

$$f_{n-1}(Y_{n-1}) \ge r(Y_{n-1}, A_{n-1}) + \sum_{k \in E} f_n(k)\, p(k \mid Z_{n-1}) = r(Y_{n-1}, A_{n-1}) + E[f_n(Y_n) \mid W_{n-1}]$$


We may therefore assert

$$0 = E[X_N] = E\left[ \sum_{k=1}^{N} \big\{ f_k(Y_k) - E[f_k(Y_k) \mid W_{k-1}] \big\} \right] \ge E\left[ \sum_{k=1}^{N} \big\{ f_k(Y_k) + r(Y_{k-1}, A_{k-1}) - f_{k-1}(Y_{k-1}) \big\} \right]$$
$$= E\left[ \sum_{k=0}^{N-1} r(Y_k, A_k) + f_N(Y_N) - f_0(Y_0) \right] = E\left[ \sum_{k=0}^{N-1} r(Y_k, A_k) \right] - E[f_0(Y_0)]$$
since the sum telescopes and $f_N(Y_N) = 0$.

Hence, $R(\pi, Y_0) \le E[f_0(Y_0)]$. For $Y_0 = j_0$, $R(\pi, j_0) \le f_0(j_0)$. If a policy $\pi^*$ can be found which yields equality, then $\pi^*$ is an optimal policy.

The following procedure leads to such a policy.

  • For each $j \in E$, let $\pi_{n-1}^*(Y_0, A_0, Y_1, A_1, \ldots, A_{n-2}, j) = \pi_{n-1}^*(j)$ be the action which maximizes
    $$r(j, a) + \sum_{k \in E} f_n(k)\, p(k \mid j, a) = r(j, a) + E[f_n(Y_n) \mid Y_{n-1} = j, A_{n-1} = a]$$
    Thus, $A_n^* = \pi_n^*(Y_n)$.
  • Now, $f_{n-1}(Y_{n-1}) = r(Y_{n-1}, A_{n-1}^*) + E[f_n(Y_n) \mid Z_{n-1}^*]$, which yields equality in the argument above. Thus, $R(\pi^*, j) = f_0(j)$ and $\pi^*$ is optimal.

Note that $\pi^*$ is a Markov policy: $A_n^* = \pi_n^*(Y_n)$. The functions $f_n$ depend on the future stages, but once determined, the policy is Markov.
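As a rough check on the claim $R(\pi^*, j_0) = f_0(j_0)$, one can simulate the chain under the Markov policy $A_n^* = \pi_n^*(Y_n)$ and compare the average realized return with $f_0(j_0)$. The sketch below continues the hypothetical example above, reusing `E`, `N`, `r`, `p`, `f`, and `best`.

```python
import random

def simulate(j0, trials=100_000):
    # Monte Carlo estimate of R(pi*, j0) under the Markov policy A_n* = pi_n*(Y_n)
    total = 0.0
    for _ in range(trials):
        y = j0
        for n in range(N):
            a = best[n][y]                       # the stage-n maximizing action
            total += r(y, a)
            u, acc = random.random(), 0.0
            for k in E:                          # sample the next state from p(. | y, a)
                acc += p(k, y, a)
                if u <= acc:
                    y = k
                    break
    return total / trials

print(simulate(0), f[0][0])   # the two numbers should agree up to sampling error
```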

A4-9 Doob's martingale

Let $X$ be an integrable random variable and $Z_N$ an arbitrary sequence of random vectors. For each $n$, let $X_n = E[X \mid W_n]$. Then $(X_N, Z_N)$ is a MG.

$$E[|X_n|] = E\big\{ |E[X \mid W_n]| \big\} \le E\big\{ E[|X| \mid W_n] \big\} = E[|X|] < \infty$$
$$E[X_{n+1} \mid W_n] = E\big\{ E[X \mid W_{n+1}] \mid W_n \big\} = E[X \mid W_n] = X_n \ \text{a.s.}$$
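A small concrete illustration (not from the text): take $X$ to be the number of heads in $T$ fair coin flips and $X_n = E[X \mid W_n]$, where $W_n$ records the first $n$ flips; then $X_n$ equals the heads seen so far plus $(T-n)/2$. The sketch below verifies the martingale property $E[X_{n+1} \mid W_n] = X_n$ by enumerating all histories.

```python
from itertools import product

# Hypothetical example: X = total number of heads in T fair coin flips,
# X_n = E[X | first n flips] = (heads so far) + (T - n)/2.
T = 4

def X_n(flips, n):
    # conditional expectation of the final head count given the first n flips
    return sum(flips[:n]) + (T - n) / 2

for n in range(T):
    for history in product((0, 1), repeat=n):
        # average X_{n+1} over the two equally likely continuations of the history
        cond_mean = sum(X_n(history + (z,), n + 1) for z in (0, 1)) / 2
        assert abs(cond_mean - X_n(history, n)) < 1e-12

print("Martingale property E[X_{n+1} | W_n] = X_n verified for the coin-flip example")
```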

A4-9a best mean-square estimators

If $X \in L^2$, then $X_n = E[X \mid W_n]$ is the best mean-square estimator of $X$, given $W_n = (Z_0, Z_1, \ldots, Z_n)$. $(X_N, Z_N)$ is a MG.

A4-9b futures pricing

Let $X_N$ be a sequence of “spot” prices for a commodity. Let $t_0$ be the present and $t_0 + T$ a fixed future time. The agent can be expected to know the past history $U_{t_0} = (X_0, X_1, \ldots, X_{t_0})$, and will update it as $t$ increases beyond $t_0$. Put $Y_k = E[X_{t_0+T} \mid U_{t_0+k}]$, the expected futures price, given the history up to $t_0 + k$. Then $\{Y_k : 0 \le k \le T\}$ is a Doob MG, with $Y = X_{t_0+T}$, relative to $\{Z_k : 0 \le k \le T\}$, where $Z_0 = U_{t_0}$ and $Z_k = X_{t_0+k}$ for $1 \le k \le T$.
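To see the construction in a simple setting, suppose (as an assumption for illustration, not part of the text) that the spot price follows a random walk with mean-zero increments. Then $Y_k = E[X_{t_0+T} \mid U_{t_0+k}] = X_{t_0+k}$, since the remaining $T-k$ increments have zero conditional mean, and the sketch below records this Doob MG along one simulated price path.

```python
import random

# Hypothetical spot-price model: X_{t+1} = X_t + e_t with E[e_t] = 0.
# Under this assumption Y_k = E[X_{t0+T} | U_{t0+k}] = X_{t0+k}, a Doob MG.
t0, T = 10, 5
spot = [100.0]
for _ in range(t0 + T):
    spot.append(spot[-1] + random.gauss(0.0, 1.0))   # one simulated price path

Y = [spot[t0 + k] for k in range(T + 1)]   # Y_k along this path; Y_T = X_{t0+T}
print(Y)
```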

A4-9c discounted futures

Assume the rate of return is $r$ per unit time. Then $\alpha = 1/(1+r)$ is the discount factor. Let

$$V_k = E[\alpha^{T-k} X_{t_0+T} \mid U_{t_0+k}] = \alpha^{T-k} Y_k$$

Then

$$E[V_{k+1} \mid U_{t_0+k}] = \alpha^{T-k-1} E[Y_{k+1} \mid U_{t_0+k}] = \alpha^{T-k-1} Y_k \ge \alpha^{T-k} Y_k = V_k \ \text{a.s.}$$
since $0 < \alpha < 1$ and the prices are nonnegative.

Thus $\{V_k : 0 \le k \le T\}$ is a SMG relative to $\{Z_k : 0 \le k \le T\}$.
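Continuing the hypothetical random-walk example above, the short sketch below forms $V_k = \alpha^{T-k} Y_k$ and checks the one-step comparison $\alpha^{T-k-1} Y_k \ge \alpha^{T-k} Y_k = V_k$ along the simulated path.

```python
# Continuation of the random-walk example: with Y_k = X_{t0+k} and discount
# factor alpha = 1/(1+r), the discounted sequence is V_k = alpha**(T-k) * Y_k.
# Under the model, E[V_{k+1} | U_{t0+k}] = alpha**(T-k-1) * Y_k >= V_k,
# since 0 < alpha < 1 and the simulated prices are nonnegative.
r_rate = 0.05
alpha = 1.0 / (1.0 + r_rate)

V = [alpha ** (T - k) * Y[k] for k in range(T + 1)]
for k in range(T):
    assert alpha ** (T - k - 1) * Y[k] >= V[k]   # the submartingale comparison, pathwise
print(V)
```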

The implication from martingale theory is that all methods of determining profitable patterns of prediction from past history are doomed to failure.


A4-10 present discounted value of capital

If $\alpha = 1/(1+r)$ is the discount factor, $X_n$ is the dividend at time $n$, and $V_n$ is the present value, at time $n$, of all future returns, then

Source:  OpenStax, Topics in applied probability. OpenStax CNX. Sep 04, 2009 Download for free at http://cnx.org/content/col10964/1.2