# 2.3 Matlab procedures for markov decision processes

We first summarize certain essentials in the analysis of homogeneous Markov chains with costs and rewards associated with states, or with transitions between states. Then we consider three cases: a. Gain associated with a state; b. One-step transition gains; and c. Gains associated with a demand under certain reasonable conditions. Matlab implementations of the results of analysis provide machine solutions to a variety of problems.

In order to provide the background for Matlab procedures for Markov decision processes, we first summarize certain essentials in the analysis of homogeneous Markov chains with costs and rewards associated with states, or with transitions between states. References are to Pfeiffer: PROBABILITY FOR APPLICATIONS.

1. Some cost and reward patterns
Consider a finite, ergodic (i.e., recurrent, positive, aperiodic, irreducible) homogeneous Markov chain X N , with state space $\mathbf{E}=\left\{1,2,\cdots ,M\right\}$ . To facilitate use of matrix computation programs, we number states from one, except in certain examples in which state zero has a natural meaning. Associated with this chain is a cost or reward structure belonging to one of the three general classes described below. The one-step transition probability matrix is $\mathbf{P}=\left[p\left(i,j\right)\right]$ , where $p\left(i,j\right)=P\left({X}_{n+1}=j|{X}_{n}=i\right)$ . The distribution for X n is represented by the row matrix $\pi \left(n\right)=\left[{p}_{1}\left(n\right),{p}_{2}\left(n\right),\cdots ,{p}_{M}\left(n\right)\right]$ , where ${p}_{i}\left(n\right)=P\left({X}_{n}=i\right)$ . The long-run distribution is represented by the row matrix $\pi =\left[{\pi }_{1},{\pi }_{2},\cdots ,{\pi }_{M}\right]$ .
1. Type 1. Gain associated with a state . A reward or gain in the amount $g\left(i\right)$ is realized in the next period if the current state is i . The gain sequence $\left\{{G}_{n}:1\le n\right\}$ of random variables ${G}_{n+1}=g\left({X}_{n}\right)$ is the sequence of gains realized as the chain X N evolves. We represent the gain function g by the column matrix $\mathbf{g}={\left[g\left(1\right)g\left(2\right)\cdots g\left(M\right)\right]}^{T}$ .
2. Type 2. One-step transition gains . A reward or gain in the amount $g\left(i,j\right)={g}_{ij}$ is realized in period $n+1$ if the system goes from state i in period n to state j in period $n+1$ . The corresponding one-step transition probability is $p\left(i,j\right)={p}_{ij}$ . The gain matrix is $\mathbf{g}=\left[g\left(i,j\right)\right]$ . The gain sequence $\left\{{G}_{n}:1\le n\right\}$ of random variables ${G}_{n+1}=g\left({X}_{n},{X}_{n+1}\right)$ is the sequence of gains experienced as the chain X N evolves.
3. Type 3. Gains associated with a demand . In this case, the gain random variables are of the form
${G}_{n+1}=g\left({X}_{n},{D}_{n+1}\right)$
If n represents the present, the random vector ${U}_{n}=\left({X}_{0},{X}_{1},\cdots ,{X}_{n}\right)$ represents the “past” at n of the chain X N . We suppose $\left\{{D}_{n}:1\le n\right\}$ is iid and each pair $\left\{{D}_{n+1},{U}_{n}\right\}$ is independent. Again, the gain sequence $\left\{{G}_{n}:1\le n\right\}$ represents the gains realized as the process evolves. Standard results on Markov chains show that in each case the sequence ${G}_{\mathbf{N}}=\left\{{G}_{n}:1\le n\right\}$ is Markov.
A recurrence relation . We are interested in the expected gains in future stages, given the present state of the process. For any $n>0$ and any $m>1$ , the gain ${G}_{n}^{\left(m\right)}$ in the m periods following period n is given by
${G}_{n}^{\left(m\right)}={G}_{n+1}+{G}_{n+2}+\cdots +{G}_{n+m}$
If the process is in state i , the expected gain ${q}_{i}$ in the next period is
${q}_{i}={v}_{i}^{\left(1\right)}=E\left[{G}_{n+1}|{X}_{n}=i\right]$
and the expected gain in the next m periods is
${v}_{i}^{\left(m\right)}=E\left[{G}_{n}^{\left(m\right)}|{X}_{n}=i\right]$
We utilize a recursion relation that allows us to begin with the transition matrix P and the next-period expected gain matrix $\mathbf{q}={\left[{q}_{1}{q}_{2}\cdots {q}_{M}\right]}^{T}$ and calculate the ${v}_{i}^{\left(m\right)}$ for any $m>1$ . Although the analysis is somewhat different for the various gain structures, the result exhibits a common pattern. In each case
${v}_{i}^{\left(m\right)}=E\left[{G}_{n}^{\left(m\right)}|{X}_{n}=i\right]=E\left[{G}_{n+1}|{X}_{n}=i\right]+\sum _{k=1}^{m-1}E\left[{G}_{n+k+1}|{X}_{n}=i\right]$
$={q}_{i}+\sum _{k=1}^{m-1}\sum _{j\in E}E\left[{G}_{n+k+1}|{X}_{n+1}=j\right]p\left(i,j\right)$
$={q}_{i}+\sum _{j\in E}E\left[\sum _{k=1}^{m-1}{G}_{n+k}|{X}_{n}=j\right]p\left(i,j\right)$
We thus have the fundamental recursion relation
$\left(*\right){v}_{i}^{\left(m\right)}={q}_{i}+\sum _{j\in E}{v}_{j}^{\left(m-1\right)}p\left(i,j\right)$
The recursion relation $\left(*\right)$ shows that the transition matrix $\mathbf{P}=\left[p\left(i,j\right)\right]$ and the vector of next-period expected gains, which we represent by the column matrix $\mathbf{q}={\left[{q}_{1},{q}_{2},\cdots ,{q}_{M}\right]}^{T}$ , determine the ${v}_{i}^{\left(m\right)}$ . The difference between the cases is the manner in which the q i are obtained.
1. Type 1: ${q}_{i}=E\left[g\left({X}_{n}\right)|{X}_{n}=i\right]=g\left(i\right)$
2. Type 2: ${q}_{i}=E\left[g\left({X}_{n},{X}_{n+1}\right)|{X}_{n}=i\right]=E\left[g\left(i,{X}_{n+1}\right)|{X}_{n}=i\right]={\sum }_{j\in E}g\left(i,j\right)p\left(i,j\right)$
3. Type 3: ${q}_{i}=E\left[g\left({X}_{n},{D}_{n+1}\right)|{X}_{n}=i\right]=E\left[g\left(i,{D}_{n+1}\right)\right]={\sum }_{k}g\left(i,k\right)P\left(D=k\right)$

For computational purposes, we utilize these relations in matrix form. If
$\mathbf{v}\left(n\right)={\left[{v}_{1}^{\left(n\right)}{v}_{2}^{\left(n\right)}\cdots {v}_{M}^{\left(n\right)}\right]}^{T}\phantom{\rule{5pt}{0ex}}\text{and}\phantom{\rule{5pt}{0ex}}\mathbf{q}={\left[{q}_{1}{q}_{2}\cdots {q}_{M}\right]}^{T}$
then
$\left(*\right)\mathbf{v}\left(m+1\right)=\mathbf{q}+\mathbf{P}\mathbf{v}\left(m\right)\phantom{\rule{5pt}{0ex}}\text{for}\phantom{\rule{4pt}{0ex}}\text{all}\phantom{\rule{5pt}{0ex}}m>0,\phantom{\rule{5pt}{0ex}}\text{with}\phantom{\rule{5pt}{0ex}}\mathbf{v}\left(0\right)=0$
Examination of the expressions for q i above shows that the next-period expected gain matrix q takes the following forms. In each case, g is the gain matrix. In the third case, p D is the column matrix whose elements are $P\left(D=k\right)$ .

• Type 1: $\mathbf{q}=\mathbf{g}$
• Type 2: $\mathbf{q}=$ the diagonal of $\mathbf{P}{\mathbf{g}}^{T}$
• Type 3: $\mathbf{q}=\mathbf{g}{\mathbf{p}}_{D}$
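These matrix forms can be checked numerically. The following sketch, in Python with plain lists standing in for Matlab matrices, builds q for each of the three gain types and then iterates the recursion $\mathbf{v}\left(m+1\right)=\mathbf{q}+\mathbf{P}\mathbf{v}\left(m\right)$ ; the chain and all gain values are illustrative, not taken from a particular example.

```python
# Illustrative 3-state chain; P and all gain data are hypothetical values.
P = [[0.2, 0.3, 0.5],
     [0.4, 0.1, 0.5],
     [0.6, 0.3, 0.1]]

def matvec(A, x):
    # product of a matrix (list of rows) with a column x
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Type 1: q = g
g1 = [2.0, 5.0, 3.0]
q1 = g1[:]

# Type 2: q = diagonal of P g^T, i.e. q_i = sum_j p(i,j) g(i,j)
G2 = [[1.0, 3.0, 4.0], [2.0, 2.0, 3.0], [2.0, 2.0, 3.0]]
q2 = [sum(p * g for p, g in zip(P[i], G2[i])) for i in range(3)]

# Type 3: q = g pD, with pD the column of demand probabilities P(D = k)
G3 = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [0.0, 2.0, 4.0]]
pD = [0.5, 0.3, 0.2]
q3 = matvec(G3, pD)

def v(q, m):
    # fundamental recursion (*): v(m+1) = q + P v(m), with v(0) = 0
    vm = [0.0] * len(q)
    for _ in range(m):
        vm = [qi + pv for qi, pv in zip(q, matvec(P, vm))]
    return vm

print([round(x, 4) for x in q2])   # [3.1, 2.5, 2.1]
print(v(q1, 1))                    # v(1) = q
```

The first period gives $\mathbf{v}\left(1\right)=\mathbf{q}$ exactly, as the recursion requires.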
2. Some long-run averages
Consider the average expected gain for m periods
$E\left[\frac{1}{m}{G}_{n}^{\left(m\right)}\right]=\frac{1}{m}\sum _{k=1}^{m}E\left[{G}_{n+k}\right]=\frac{1}{m}\sum _{k=1}^{m}E\left\{E\left[{G}_{n+k}|{X}_{n-1}\right]\right\}$
Use of the Markov property and the fact that for an ergodic chain
$\frac{1}{m}\sum _{k=1}^{m}{p}^{k}\left(i,j\right)\to {\pi }_{j}\phantom{\rule{5pt}{0ex}}\text{as}\phantom{\rule{5pt}{0ex}}m\to \infty$
yields the result that
$\underset{m\to \infty }{lim}E\left[\frac{1}{m}{G}_{n}^{\left(m\right)}\right]=\sum _{i}P\left({X}_{n-1}=i\right)\sum _{j}{q}_{j}{\pi }_{j}=\sum _{j}{q}_{j}{\pi }_{j}=g$
and for any state i ,
$\underset{m\to \infty }{lim}E\left[\frac{1}{m}{G}_{n}^{\left(m\right)}|{X}_{n}=i\right]=\underset{m\to \infty }{lim}\frac{1}{m}{v}_{i}^{\left(m\right)}=g$
The expression for g may be put into matrix form. If the long-run probability distribution is represented by the row matrix $\pi =\left[{\pi }_{1}{\pi }_{2}\cdots {\pi }_{M}\right]$ and $\mathbf{q}={\left[{q}_{1}{q}_{2}\cdots {q}_{M}\right]}^{T}$ , then
$g=\pi \mathbf{q}$
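As a check on the limit $g=\pi \mathbf{q}$ , the sketch below (Python with plain lists; the three-state chain is again illustrative) approximates π by powering P and compares $\frac{1}{m}{v}_{i}^{\left(m\right)}$ with g for a large m.

```python
P = [[0.2, 0.3, 0.5],
     [0.4, 0.1, 0.5],
     [0.6, 0.3, 0.1]]   # illustrative ergodic chain
q = [2.0, 5.0, 3.0]     # next-period expected gains

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

# long-run distribution: any row of a high power of P
Pn = P
for _ in range(50):
    Pn = matmul(Pn, P)
pi = Pn[0]
g = sum(pj * qj for pj, qj in zip(pi, q))

# v(m) from the recursion; each v_i^(m) / m approaches g
m = 2000
v = [0.0, 0.0, 0.0]
for _ in range(m):
    v = [qi + sum(p * vj for p, vj in zip(row, v)) for qi, row in zip(q, P)]

print(round(g, 4))                     # 3.1071 for this chain
print([round(vi / m, 2) for vi in v])  # every entry close to g
```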
3. Discounting and potentials
Suppose in a given time interval the value of money increases by a fraction a . This may represent the potential earning if the money were invested. One dollar now is worth $1+a$ dollars at the end of one period. It is worth ${\left(1+a\right)}^{n}$ dollars after n periods. Set $\alpha =1/\left(1+a\right)$ , so that $0<\alpha \le 1$ . It takes α dollars to be worth one dollar after one period. It takes α n dollars at present to be worth one dollar n periods later. Thus the present worth or discounted value of one dollar n periods in the future is α n dollars. This is the amount one would need to invest presently, at interest rate a per unit of time, to have one dollar n periods later.

If f is any function defined on the state space E , we designate by f the column matrix ${\left[f\left(1\right)f\left(2\right)\cdots f\left(M\right)\right]}^{T}$ . We make an exception to this convention in the case of the distributions of the state probabilities $\pi \left(n\right)=\left[{p}_{1}\left(n\right){p}_{2}\left(n\right)\cdots {p}_{M}\left(n\right)\right]$ , their generating function, and the long-run probabilities $\pi =\left[{\pi }_{1}{\pi }_{2}\cdots {\pi }_{M}\right]$ , which we represent as row matrices. It should be clear that much of the following development extends immediately to infinite state space $\mathbf{E}=\mathbf{N}=\left\{0,1,2,\cdots \right\}$ .

We assume one of the gain structures introduced in Sec 1 and the corresponding gain sequence $\left\{{G}_{n}:1\le n\right\}$ . The value of the random variable ${G}_{n+1}$ is the gain or reward realized during the period $n+1$ . We let each transition time be at the end of one period of time (say a month or a quarter). If n corresponds to the present, then ${G}_{n+k}$ is the gain in period $n+k$ . If we do not discount for the period immediately after the present, then ${\alpha }^{k-1}{G}_{n+k}$ is the present value of that gain. Hence
$\sum _{k=1}^{\infty }{\alpha }^{k-1}{G}_{n+k}\phantom{\rule{5pt}{0ex}}\text{is}\phantom{\rule{4pt}{0ex}}\text{the}\phantom{\rule{4pt}{0ex}}\text{total}\phantom{\rule{4pt}{0ex}}\text{discounted}\phantom{\rule{4pt}{0ex}}\text{future}\phantom{\rule{4pt}{0ex}}\text{gain}$
Definition . The α - potential of the gain sequence $\left\{{G}_{n}:1\le n\right\}$ is the function φ defined by
$\phi \left(i\right)=E\left[\sum _{n=0}^{\infty }{\alpha }^{n}{G}_{n+1}|{X}_{0}=i\right]\forall i\in \mathbf{E}$
Remark . $\phi \left(i\right)$ is the expected total discounted gain, given the system starts in state i . We next define a concept which is related to the α -potential in a useful way.

Definition . The α - potential matrix ${\mathbf{R}}^{\alpha }$ for the process X N is the matrix
${\mathbf{R}}^{\alpha }=\sum _{n=0}^{\infty }{\alpha }^{n}{\mathbf{P}}^{n}\phantom{\rule{5pt}{0ex}}\text{with}\phantom{\rule{5pt}{0ex}}{\mathbf{P}}^{0}=\mathbf{I}$

## 3.1

Let X N be an ergodic, homogeneous Markov chain with state space E and gain sequence $\left\{{G}_{n}:1\le n\right\}$ . Let $\mathbf{q}={\left[{q}_{1}{q}_{2}\cdots {q}_{M}\right]}^{T}$ , where ${q}_{i}=E\left[{G}_{n+1}|{X}_{n}=i\right]$ for $i\in \mathbf{E}$ . For $\alpha \in \left(0,1\right)$ , let φ be the α -potential for the gain sequence. That is,

$\phi \left(i\right)=E\left[\sum _{n=0}^{\infty }{\alpha }^{n}{G}_{n+1}|{X}_{0}=i\right]\forall i\in \mathbf{E}$
Then, if ${\mathbf{R}}^{\alpha }$ is the α -potential matrix for X N , we have
$\phi ={\mathbf{R}}^{\alpha }\mathbf{q}$
$\square$

## 3.2

Consider an ergodic, homogeneous Markov chain X N with gain sequence $\left\{{G}_{n}:1\le n\right\}$ and next-period expected gain matrix q . For $\alpha \in \left(0,1\right)$ , the α -potential φ and the α -potential matrix ${\mathbf{R}}^{\alpha }$ satisfy

$\left[\mathbf{I}-\alpha \mathbf{P}\right]{\mathbf{R}}^{\alpha }=\mathbf{I}\phantom{\rule{5pt}{0ex}}\text{and}\phantom{\rule{5pt}{0ex}}\left[\mathbf{I}-\alpha \mathbf{P}\right]\phi =\mathbf{q}$
If the state space E is finite, then
${\mathbf{R}}^{\alpha }={\left[\mathbf{I}-\alpha \mathbf{P}\right]}^{-1}\phantom{\rule{5pt}{0ex}}\text{and}\phantom{\rule{5pt}{0ex}}\phi ={\left[\mathbf{I}-\alpha \mathbf{P}\right]}^{-1}\mathbf{q}={\mathbf{R}}^{\alpha }\mathbf{q}$
$\square$

## A numerical example

Suppose the transition matrix P and the next-period expected gain matrix q are

$\mathbf{P}=\left[\begin{array}{ccc}\hfill 0.2& \hfill 0.3& \hfill 0.5\\ \hfill 0.4& \hfill 0.1& \hfill 0.5\\ \hfill 0.6& \hfill 0.3& \hfill 0.1\end{array}\right]\phantom{\rule{5pt}{0ex}}\text{and}\phantom{\rule{5pt}{0ex}}\mathbf{q}=\left[\begin{array}{c}2\hfill \\ 5\hfill \\ 3\hfill \end{array}\right]$
For $\alpha =0.9$ , we have
${\mathbf{R}}^{0.9}={\left(\mathbf{I}-0.9\mathbf{P}\right)}^{-1}=\left[\begin{array}{ccc}4.4030\hfill & 2.2881\hfill & 3.3088\hfill \\ 3.5556\hfill & 3.1356\hfill & 3.3088\hfill \\ 3.6677\hfill & 2.2881\hfill & 4.0441\hfill \end{array}\right]\phantom{\rule{5pt}{0ex}}\text{and}\phantom{\rule{5pt}{0ex}}\phi ={\mathbf{R}}^{0.9}\mathbf{q}=\left[\begin{array}{c}30.17\hfill \\ 32.72\hfill \\ 30.91\hfill \end{array}\right]$
$\square$
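The inverse need not be formed explicitly: by 3.2, φ is the fixed point of $\phi =\mathbf{q}+\alpha \mathbf{P}\phi$ , so repeated substitution (the partial sums of $\sum {\alpha }^{n}{\mathbf{P}}^{n}\mathbf{q}$ ) converges to it. A Python sketch with plain lists:

```python
P = [[0.2, 0.3, 0.5],
     [0.4, 0.1, 0.5],
     [0.6, 0.3, 0.1]]
q = [2.0, 5.0, 3.0]
alpha = 0.9

# phi <- q + alpha P phi converges to phi = R^alpha q at rate alpha^n
phi = [0.0, 0.0, 0.0]
for _ in range(400):
    phi = [qi + alpha * sum(pj * x for pj, x in zip(row, phi))
           for qi, row in zip(q, P)]

print([round(x, 2) for x in phi])   # [30.17, 32.72, 30.91]
```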

The next example is of type 3 (class c), but the demand random variable is Poisson, which does not have finite range. The simple pattern for calculating the q i must be modified.

## Costs associated with inventory under an $\left(m,M\right)$ order policy

If k units are ordered, the cost in dollars is

$c\left(k\right)=\left\{\begin{array}{cc}0\hfill & k=0\hfill \\ 10+25k\hfill & 0<k\hfill \end{array}\right.$

For each unit of unsatisfied demand, there is a penalty of 50 dollars. We use the $\left(m,\phantom{\rule{0.166667em}{0ex}}M\right)$ inventory policy described in Ex 23.1.3. ${X}_{0}=M$ , and X n is the stock at the end of period n , before restocking. Demand in period n is D n . The cost for restocking at the end of period $n+1$ is

$g\left({X}_{n},{D}_{n+1}\right)=\left\{\begin{array}{cc}10+25\left(M-{X}_{n}\right)+50max\left\{{D}_{n+1}-M,0\right\}\hfill & 0\le {X}_{n}<m\hfill \\ 50{I}_{\left[M,\infty \right)}\left({D}_{n+1}\right)\left({D}_{n+1}-{X}_{n}\right)\hfill & m\le {X}_{n}\le M\hfill \end{array}\right.$
For $m=1,M=3$ , we have
$g\left(0,{D}_{n+1}\right)=85+50{I}_{\left[3,\infty \right)}\left({D}_{n+1}\right)\left({D}_{n+1}-3\right)$
$g\left(i,{D}_{n+1}\right)=50{I}_{\left[3,\infty \right)}\left({D}_{n+1}\right)\left({D}_{n+1}-i\right)\phantom{\rule{5pt}{0ex}}\text{for}\phantom{\rule{5pt}{0ex}}1\le i\le 3$
We take $m=1,M=3$ and assume D is Poisson (1). From Ex 23.1.3 in PA, we obtain

$\mathbf{P}=\left[\begin{array}{cccc}0.0803\hfill & 0.1839\hfill & 0.3679\hfill & 0.3679\hfill \\ 0.6321\hfill & 0.3679\hfill & 0\hfill & 0\hfill \\ 0.2642\hfill & 0.3679\hfill & 0.3679\hfill & 0\hfill \\ 0.0803\hfill & 0.1839\hfill & 0.3679\hfill & 0.3679\hfill \end{array}\right]$

The largest eigenvalue magnitude less than one is $|\lambda |\approx 0.26$ , so $n\ge 10$ should be sufficient for convergence of ${\mathbf{P}}^{n}$ . We use $n=16$ .

Taking any row of ${\mathbf{P}}^{16}$ , we approximate the long-run distribution by

$\pi =\left[0.2858\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}0.2847\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}0.2632\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}0.1663\right]$

We thus have

${q}_{0}=85+50E\left[{I}_{\left[3,\infty \right)}\left({D}_{n+1}\right)\left({D}_{n+1}-3\right)\right]=85+50\sum _{k=4}^{\infty }\left(k-3\right){p}_{k}\left(\phantom{\rule{5pt}{0ex}}\text{the}\phantom{\rule{5pt}{0ex}}\text{term}\phantom{\rule{4pt}{0ex}}\text{for}\phantom{\rule{4pt}{0ex}}k=3\phantom{\rule{4pt}{0ex}}\text{is}\phantom{\rule{4pt}{0ex}}\text{zero}\right)$

For the Poisson distribution $\sum _{k=n}^{\infty }k{p}_{k}=\lambda \sum _{k=n-1}^{\infty }{p}_{k}$ .
Hence

${q}_{0}=85+50\left[\sum _{k=3}^{\infty }{p}_{k}-3\sum _{k=4}^{\infty }{p}_{k}\right]\approx 85+50\left[0.0803-3×0.0190\right]=86.1668$

${q}_{1}=50\sum _{k=3}^{\infty }\left(k-1\right){p}_{k}=50\left[\sum _{k=2}^{\infty }{p}_{k}-\sum _{k=3}^{\infty }{p}_{k}\right]=50{p}_{2}=9.1970$

${q}_{2}=50\sum _{k=3}^{\infty }\left(k-2\right){p}_{k}=50\left[0.2642-2×0.0803\right]=5.1819$

${q}_{3}={q}_{0}-85=1.1668$

so that

$\mathbf{q}=\left[\begin{array}{c}\hfill 86.1668\\ \hfill 9.1970\\ \hfill 5.1819\\ \hfill 1.1668\end{array}\right]$
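These tail-sum evaluations are easy to check with the standard library. In the sketch below the Poisson(1) probabilities are generated directly; note that the third entry evaluates to 5.1819 to four decimals.

```python
import math

lam = 1.0
p = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(40)]

def tail(n):
    """P(D >= n) for the Poisson(1) demand."""
    return 1.0 - sum(p[:n])

# q_0 uses the identity sum_{k>=n} k p_k = lam * sum_{k>=n-1} p_k
q0 = 85 + 50 * (tail(3) - 3 * tail(4))
q1 = 50 * (tail(2) - tail(3))          # = 50 p_2
q2 = 50 * (tail(2) - 2 * tail(3))
q3 = q0 - 85
print([round(x, 4) for x in (q0, q1, q2, q3)])
# [86.1668, 9.197, 5.1819, 1.1668]
```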

Then, for $\alpha =0.8$ we have

${\mathbf{R}}^{0.8}={\left(\mathbf{I}-0.8\mathbf{P}\right)}^{-1}=\left[\begin{array}{cccc}\hfill 1.9608& \hfill 1.0626& \hfill 1.1589& \hfill 0.8178\\ \hfill 1.4051& \hfill 2.1785& \hfill 0.8304& \hfill 0.5860\\ \hfill 1.1733& \hfill 1.2269& \hfill 2.1105& \hfill 0.4893\\ \hfill 0.9608& \hfill 1.0626& \hfill 1.1589& \hfill 1.8178\end{array}\right]\phantom{\rule{5pt}{0ex}}\text{and}\phantom{\rule{5pt}{0ex}}\phi ={\mathbf{R}}^{0.8}\mathbf{q}=\left[\begin{array}{c}185.68\hfill \\ 146.09\hfill \\ 123.89\hfill \\ 100.68\hfill \end{array}\right]$

Recall that we are dealing with expected discounted costs. Since we usually consider starting with a full stock, the amount $\phi \left(M\right)=\phi \left(3\right)=100.68$ is the quantity of interest. Note that this is smaller than the other values of $\phi \left(j\right)$ , which correspond to smaller beginning stock levels. We expect greater restocking costs in such cases.
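The column φ can be reproduced from P and q alone by the same fixed-point iteration $\phi =\mathbf{q}+\alpha \mathbf{P}\phi$ , without forming the inverse. A Python sketch with plain lists; since P and q are the four-decimal values above, the result matches the displayed φ only to about two decimals.

```python
P = [[0.0803, 0.1839, 0.3679, 0.3679],
     [0.6321, 0.3679, 0.0,    0.0   ],
     [0.2642, 0.3679, 0.3679, 0.0   ],
     [0.0803, 0.1839, 0.3679, 0.3679]]
q = [86.1668, 9.1970, 5.1819, 1.1668]
alpha = 0.8

# fixed-point iteration phi <- q + alpha P phi; 0.8^300 is negligible
phi = [0.0] * 4
for _ in range(300):
    phi = [qi + alpha * sum(pj * x for pj, x in zip(row, phi))
           for qi, row in zip(q, P)]

print([round(x, 2) for x in phi])
# compare with the text: [185.68, 146.09, 123.89, 100.68]
```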

4. Evolution of chains with costs and rewards
1. No discounting . A generating function analysis of
$\left(*\right)\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\mathbf{v}\left(m+1\right)=\mathbf{q}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\mathbf{P}\mathbf{v}\left(m\right)\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\text{for}\phantom{\rule{4.pt}{0ex}}\text{all}\phantom{\rule{0.277778em}{0ex}}m\phantom{\rule{3.33333pt}{0ex}}>\phantom{\rule{3.33333pt}{0ex}}0,\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\text{with}\phantom{\rule{0.277778em}{0ex}}\mathbf{v}\left(0\right)=0$
shows
$\mathbf{v}\left(n\right)=ng\mathbf{1}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\mathbf{v}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\text{transients,}\phantom{\rule{4.pt}{0ex}}\phantom{\rule{4.pt}{0ex}}\text{where}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\mathbf{v}={\mathbf{B}}_{0}\mathbf{q}$
Here $g=\pi \mathbf{q}$ , $\mathbf{1}$ is a column matrix of ones, ${\mathbf{P}}_{0}={\mathbf{P}}^{\infty }=\underset{n}{lim}{\mathbf{P}}^{n}$ , and B 0 is a matrix which analysis also shows we may approximate by
${\mathbf{B}}_{0}=\mathbf{B}\left(1\right)\approx \mathbf{I}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\mathbf{P}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}{\mathbf{P}}^{2}\phantom{\rule{3.33333pt}{0ex}}+\cdots +{\mathbf{P}}^{n}-\phantom{\rule{3.33333pt}{0ex}}\left(n\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}1\right){\mathbf{P}}_{0}$
The largest eigenvalue magnitude $|{\lambda }_{i}|<1$ gives an indication of how many terms are needed.

## The inventory problem (continued)

We consider again the inventory problem. We have

$\mathbf{P}=\left[\begin{array}{cccc}0.0803\hfill & 0.1839\hfill & 0.3679\hfill & 0.3679\hfill \\ 0.6321\hfill & 0.3679\hfill & 0\hfill & 0\hfill \\ 0.2642\hfill & 0.3679\hfill & 0.3679\hfill & 0\hfill \\ 0.0803\hfill & 0.1839\hfill & 0.3679\hfill & 0.3679\hfill \end{array}\right]\phantom{\rule{5pt}{0ex}}\text{and}\phantom{\rule{5pt}{0ex}}\mathbf{q}=\left[\begin{array}{c}\hfill 86.1668\\ \hfill 9.1970\\ \hfill 5.1819\\ \hfill 1.1668\end{array}\right]$

The eigenvalues are $1,\phantom{\rule{0.277778em}{0ex}}0.0920+i0.2434,\phantom{\rule{0.277778em}{0ex}}0.0920-i0.2434$ , and 0. Thus, the decay of the transients is controlled by $|{\lambda }^{*}|={\left({0.0920}^{2}+{0.2434}^{2}\right)}^{1/2}=0.2602$ . Since $|{\lambda }^{*}{|}^{4}\approx 0.0046$ , the transients die out rather rapidly. We obtain the following approximations

${\mathbf{P}}_{0}\approx {\mathbf{P}}^{16}\approx \left[\begin{array}{cccc}\hfill 0.2858& \hfill 0.2847& \hfill 0.2632& \hfill 0.1663\\ \hfill 0.2858& \hfill 0.2847& \hfill 0.2632& \hfill 0.1663\\ \hfill 0.2858& \hfill 0.2847& \hfill 0.2632& \hfill 0.1663\\ \hfill 0.2858& \hfill 0.2847& \hfill 0.2632& \hfill 0.1663\end{array}\right]\phantom{\rule{5pt}{0ex}}\text{so}\phantom{\rule{4pt}{0ex}}\text{that}\phantom{\rule{5pt}{0ex}}\pi \approx \left[0.2858\phantom{\rule{0.277778em}{0ex}}0.2847\phantom{\rule{0.277778em}{0ex}}0.2632\phantom{\rule{0.277778em}{0ex}}0.1663\right]$

The approximate value of B 0 is found to be

${\mathbf{B}}_{0}\approx \mathbf{I}+\mathbf{P}+{\mathbf{P}}^{2}+{\mathbf{P}}^{3}+{\mathbf{P}}^{4}-5{\mathbf{P}}_{0}\approx \left[\begin{array}{cccc}\hfill 0.4834& \hfill -0.3766& \hfill -0.1242& \hfill 0.0174\\ \hfill 0.0299& \hfill 0.7537& \hfill -0.5404& \hfill -0.2432\\ \hfill -0.2307& \hfill -0.1684& \hfill 0.7980& \hfill -0.3989\\ \hfill -0.5166& \hfill -0.3766& \hfill -0.1242& \hfill 1.0174\end{array}\right]$

The value $g=\pi \mathbf{q}\approx 28.80$ is found in the earlier treatment. From the values above, we have

$\mathbf{v}={\mathbf{B}}_{0}\mathbf{q}\approx \left[\begin{array}{c}\hfill 37.6\\ \hfill 6.4\\ \hfill -17.8\\ \hfill -47.4\end{array}\right]\phantom{\rule{5pt}{0ex}}\text{so}\phantom{\rule{4pt}{0ex}}\text{that}\phantom{\rule{5pt}{0ex}}\mathbf{v}\left(n\right)\approx \left[\begin{array}{c}28.80n+37.6\hfill \\ 28.80n+6.4\hfill \\ 28.80n-17.8\hfill \\ 28.80n-47.4\hfill \end{array}\right]+\phantom{\rule{5pt}{0ex}}\text{transients}\phantom{\rule{5pt}{0ex}}$

The average gain per period is clearly $g\approx 28.80$ . This soon dominates the constant terms represented by the entries in v .
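The whole no-discounting computation chains together in a few lines. The sketch below (Python, plain lists) powers P to approximate P 0 , forms B 0 from the same five-term sum, and recovers g and v ; small end-figure differences from the displayed values reflect the four-decimal rounding of P and q .

```python
P = [[0.0803, 0.1839, 0.3679, 0.3679],
     [0.6321, 0.3679, 0.0,    0.0   ],
     [0.2642, 0.3679, 0.3679, 0.0   ],
     [0.0803, 0.1839, 0.3679, 0.3679]]
q = [86.1668, 9.1970, 5.1819, 1.1668]
n = 4

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

powers = [[[float(i == j) for j in range(n)] for i in range(n)]]   # P^0 = I
for _ in range(16):
    powers.append(matmul(powers[-1], P))

P0 = powers[16]                      # P^16 approximates the limiting matrix
pi = P0[0]
g = sum(pj * qj for pj, qj in zip(pi, q))

# B0 = I + P + P^2 + P^3 + P^4 - 5 P0
B0 = [[sum(powers[k][i][j] for k in range(5)) - 5 * P0[i][j]
       for j in range(n)] for i in range(n)]
v = [sum(B0[i][j] * q[j] for j in range(n)) for i in range(n)]

print([round(x, 4) for x in pi])   # approximately [0.2858, 0.2847, 0.2632, 0.1663]
print(round(g, 2))                 # approximately 28.80
print([round(x, 1) for x in v])    # approximately [37.6, 6.4, -17.8, -47.4]
```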

2. With discounting . Let the discount factor be $\alpha \in \left(0,\phantom{\rule{0.166667em}{0ex}}1\right)$ . If n represents the present period, then ${G}_{n+1}$ is the reward in the first succeeding period and ${G}_{n+k}$ is the reward in the k th succeeding period. If we do not discount the first period, then
${G}_{n}^{\left(m\right)}={G}_{n+1}+\alpha {G}_{n+2}+{\alpha }^{2}{G}_{n+3}+\cdots +{\alpha }^{m-1}{G}_{n+m}={G}_{n+1}+\alpha {G}_{n+1}^{\left(m-1\right)}$
Thus
${v}_{i}^{\left(m\right)}=E\left[{G}_{n}^{\left(m\right)}|{X}_{n}=i\right]={q}_{i}+\alpha \sum _{j=1}^{M}p\left(i,\phantom{\rule{0.166667em}{0ex}}j\right){v}_{j}^{\left(m-1\right)}$
In matrix form, this is
$\mathbf{v}\left(n\right)=\mathbf{q}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\alpha \mathbf{P}\mathbf{v}\left(n-1\right)$
A generating function analysis shows that
${v}_{i}^{\left(n\right)}={v}_{i}\phantom{\rule{3.33333pt}{0ex}}+\text{transients}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}1\le i\le M$
Hence, the steady state part of
${v}_{i}^{\left(m\right)}={q}_{i}+\alpha \sum _{j=1}^{M}p\left(i,j\right){v}_{j}^{\left(m-1\right)}\phantom{\rule{5pt}{0ex}}\text{is}\phantom{\rule{5pt}{0ex}}{v}_{i}={q}_{i}+\alpha \sum _{j=1}^{M}p\left(i,j\right){v}_{j},\phantom{\rule{5pt}{0ex}}1\le i\le M$
In matrix form,
$\mathbf{v}=\mathbf{q}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\alpha \mathbf{P}\mathbf{v}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\text{which}\phantom{\rule{4.pt}{0ex}}\text{yields}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\mathbf{v}={\left[\mathbf{I}\phantom{\rule{3.33333pt}{0ex}}-\phantom{\rule{3.33333pt}{0ex}}\alpha \mathbf{P}\right]}^{-1}\phantom{\rule{0.166667em}{0ex}}\mathbf{q}$
Since the ${q}_{i}=E\left[{G}_{n+1}|{X}_{n}=i\right]$ are known, we can solve for the v i . Also, since ${v}_{i}^{\left(m\right)}$ is the present value of expected gain m steps into the future, the v i represent the present value for an unlimited future, given that the process starts in state i .
5. Stationary decision policies
We suppose throughout this section that X N is ergodic, with finite state space $\mathbf{E}=\left\{1,\phantom{\rule{0.166667em}{0ex}}2,\phantom{\rule{0.166667em}{0ex}}\cdots ,\phantom{\rule{0.166667em}{0ex}}M\right\}$ . For each state $i\in \mathbf{E}$ , there is a class A i of possible actions which may be taken when the process is in state i . A decision policy is a sequence of decision functions ${d}_{1},\phantom{\rule{0.166667em}{0ex}}{d}_{2},\phantom{\rule{0.166667em}{0ex}}\cdots$ such that
$\text{The}\phantom{\rule{4.pt}{0ex}}\text{action}\phantom{\rule{4.pt}{0ex}}\text{at}\phantom{\rule{4.pt}{0ex}}\text{stage}\phantom{\rule{4.pt}{0ex}}n\phantom{\rule{4.pt}{0ex}}\text{is}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}{d}_{n}\left({X}_{0},\phantom{\rule{0.166667em}{0ex}}{X}_{1},\phantom{\rule{0.166667em}{0ex}}...,\phantom{\rule{0.166667em}{0ex}}{X}_{n}\right).$
The action selected is in A i whenever ${X}_{n}=i$ . We consider a special class of decision policies, the stationary decision policies , for which
${d}_{n}\left({X}_{0},\phantom{\rule{0.166667em}{0ex}}{X}_{1},\phantom{\rule{0.166667em}{0ex}}\cdots ,\phantom{\rule{0.166667em}{0ex}}{X}_{n}\right)=d\left({X}_{n}\right)\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\text{with}\phantom{\rule{4.pt}{0ex}}d\phantom{\rule{4.pt}{0ex}}\text{invariant}\phantom{\rule{4.pt}{0ex}}\text{with}\phantom{\rule{0.277778em}{0ex}}n$
The possible actions depend upon the current state. That is, $d\left({X}_{n}\right)\in {A}_{i}$ whenever ${X}_{n}=i$ . Analysis shows the Markov character of the process is preserved under stationary policies. Also, if E is finite and every stationary policy produces an irreducible chain, then a stationary policy which is optimal among the stationary policies is optimal among all policies. We suppose each policy yields an ergodic chain . Since the transition probabilities from any state depend in part on the action taken when in that state, the long-run probabilities will depend upon the policy. In the no-discounting case, we want to determine a stationary policy that maximizes the gain $g=\pi \mathbf{q}$ for the corresponding long-run probabilities. We say “a stationary policy,” since we do not rule out the possibility that there may be more than one policy which maximizes the gain. In the discounting case, the goal is to maximize the steady-state part v i in the expressions
${v}_{i}^{\left(n\right)}={v}_{i}+\phantom{\rule{0.277778em}{0ex}}\text{transients}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}1\le i\le M$
In simple cases, with a small state space and a small number of possible actions in each state, it may be feasible to obtain the gain or present values for each permissible policy and then select the optimum by direct comparison. However, for most cases, this is not an efficient approach. In the next two sections, we consider iterative procedures for step-by-step optimization which usually greatly reduce the computational burden.
6. Policy iteration method – no discounting
We develop an iterative procedure for finding an optimal stationary policy. The goal is to determine a stationary policy that maximizes the long-run expected gain per period g . In the next section, we extend this procedure to the case where discounting is in effect. We assume that each policy yields an ergodic chain . Suppose we have established that under a given policy

$\left(1\right)\phantom{\rule{5pt}{0ex}}{v}_{i}^{\left(n\right)}={q}_{i}+\sum _{j}p\left(i,\phantom{\rule{0.166667em}{0ex}}j\right){v}_{j}^{\left(n-1\right)},\phantom{\rule{5pt}{0ex}}\text{where}\phantom{\rule{5pt}{0ex}}{q}_{i}=E\left[{G}_{n+1}|{X}_{n}=i\right]$

In this case, the analysis in Sec 4 shows that for large n ,

$\left(2\right)\phantom{\rule{5pt}{0ex}}{v}_{i}^{\left(n\right)}\approx ng+{v}_{i},\phantom{\rule{5pt}{0ex}}\text{where}\phantom{\rule{5pt}{0ex}}g=\sum _{i}{\pi }_{i}{q}_{i}$

We note that v i and g depend upon the entire policy, while q i and $p\left(i,\phantom{\rule{0.166667em}{0ex}}j\right)$ depend on the individual actions a i . Using (1) and (2), we obtain
$ng+{v}_{i}={q}_{i}+\sum _{j}p\left(i,\phantom{\rule{0.166667em}{0ex}}j\right)\left[\left(n-1\right)g+{v}_{j}\right]={q}_{i}+\left(n-1\right)g+\sum _{j}p\left(i,\phantom{\rule{0.166667em}{0ex}}j\right){v}_{j}$
From this we conclude (3) $g+{v}_{i}={q}_{i}+\sum _{j}p\left(i,\phantom{\rule{0.166667em}{0ex}}j\right){v}_{j}$ for all $i\in \mathbf{E}$ . Suppose a policy d has been used. That is, action $d\left(i\right)$ is taken whenever the process is in state i . To simplify writing, we drop the indication of the action and simply write $p\left(i,\phantom{\rule{0.166667em}{0ex}}j\right)$ for ${p}_{ij}\left(d\left(i\right)\right)$ , etc. Associated with this policy, there is a gain g . We should like to determine whether or not this is the maximum possible gain, and if it is not, to find a policy which does give the maximum gain. The following two-phase procedure has been found to be effective. It is essentially the procedure developed originally by Ronald Howard, who pioneered the treatment of these problems. The new feature is the method of carrying out the value-determination step, utilizing the approximation method noted above.
1. Value-determination procedure We calculate $g=\sum _{i}{\pi }_{i}{q}_{i}=\phantom{\rule{3.33333pt}{0ex}}\pi \mathbf{q}$ . As in Sec 4, we calculate
$\mathbf{v}={\mathbf{B}}_{0}\mathbf{q}\approx \left[\mathbf{I}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\mathbf{P}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}{\mathbf{P}}^{2}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\cdots +\phantom{\rule{3.33333pt}{0ex}}{\mathbf{P}}^{s}\phantom{\rule{3.33333pt}{0ex}}-\phantom{\rule{3.33333pt}{0ex}}\left(s+1\right){\mathbf{P}}_{0}\right]\phantom{\rule{0.166667em}{0ex}}\mathbf{q}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}\text{where}\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}{\mathbf{P}}_{0}=\underset{n}{lim}{\mathbf{P}}^{n}$
2. Policy-improvement procedure We suppose policy d has been used through period n . Then, by (3), above,
$g\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}{v}_{i}={q}_{i}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\sum _{j}p\left(i,\phantom{\rule{0.166667em}{0ex}}j\right){v}_{j}$
We seek to improve policy d by selecting policy d * , with ${d}^{*}\left(i\right)={a}_{ik}^{*}$ , to satisfy
${q}_{i}^{*}\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\sum _{j}{p}_{ij}^{*}{v}_{j}=max\left\{{q}_{i}\left({a}_{ik}\right)\phantom{\rule{3.33333pt}{0ex}}+\phantom{\rule{3.33333pt}{0ex}}\sum _{j}{p}_{ij}\left({a}_{ik}\right){v}_{j}\phantom{\rule{0.166667em}{0ex}}:\phantom{\rule{0.277778em}{0ex}}{a}_{ik}\in {A}_{i}\right\},\phantom{\rule{0.277778em}{0ex}}\phantom{\rule{0.277778em}{0ex}}1\le i\le M$
Remarks
• In the procedure for selecting d * , we use the “old” v j .
• It has been established that ${g}^{*}\phantom{\rule{3.33333pt}{0ex}}\ge \phantom{\rule{3.33333pt}{0ex}}g$ , with equality iff g is optimal. Since there is a finite number of policies, the proceduremust converge to an optimum stationary policy in a finite number of steps.
We implement this two-step procedure with Matlab. To demonstrate the procedure, we consider the following

## A numerical example

A Markov decision process has three states: State space $\mathbf{E}=\left\{1,\phantom{\rule{0.166667em}{0ex}}2,\phantom{\rule{0.166667em}{0ex}}3\right\}$ .

Actions: State 1: ${A}_{1}=\left\{1,\phantom{\rule{0.166667em}{0ex}}2,\phantom{\rule{0.166667em}{0ex}}3\right\}$ State 2: ${A}_{2}=\left\{1,\phantom{\rule{0.166667em}{0ex}}2\right\}$ State 3: ${A}_{3}=\left\{1,\phantom{\rule{0.166667em}{0ex}}2\right\}$

Transition probabilities and rewards are:

- State 1: ${p}_{1j}\left(1\right)$ : [1/3 1/3 1/3], ${g}_{1j}\left(1\right)$ : [1 3 4]; ${p}_{1j}\left(2\right)$ : [1/4 3/8 3/8], ${g}_{1j}\left(2\right)$ : [2 2 3]; ${p}_{1j}\left(3\right)$ : [1/3 1/3 1/3], ${g}_{1j}\left(3\right)$ : [2 2 3]
- State 2: ${p}_{2j}\left(1\right)$ : [1/8 3/8 1/2], ${g}_{2j}\left(1\right)$ : [2 1 2]; ${p}_{2j}\left(2\right)$ : [1/2 1/4 1/4], ${g}_{2j}\left(2\right)$ : [1 4 4]
- State 3: ${p}_{3j}\left(1\right)$ : [3/8 1/4 3/8], ${g}_{3j}\left(1\right)$ : [2 3 3]; ${p}_{3j}\left(2\right)$ : [1/8 1/4 5/8], ${g}_{3j}\left(2\right)$ : [3 2 2]
Use the policy iteration method to determine the policy which gives the maximum gain g .

A computational procedure utilizing Matlab . We first put the data in an m-file. Since we have several cases, it is expedient to include the case number. This example belongs to type 2. Data in file md61.m:

```matlab
type = 2
PA = [1/3 1/3 1/3; 1/4 3/8 3/8; 1/3 1/3 1/3; 0 0 0;
      1/8 3/8 1/2; 1/2 1/4 1/4; 0 0 0; 3/8 1/4 3/8; 1/8 1/4 5/8]
GA = [1 3 4; 2 2 3; 2 2 3; 0 0 0;     % Zero rows are separators
      2 1 2; 1 4 4; 0 0 0; 2 3 3; 3 2 2]
A = [1 2 3 0 1 2 0 1 2]
```

```
md61
type = 2
PA =
    0.3333    0.3333    0.3333
    0.2500    0.3750    0.3750
    0.3333    0.3333    0.3333
         0         0         0
    0.1250    0.3750    0.5000
    0.5000    0.2500    0.2500
         0         0         0
    0.3750    0.2500    0.3750
    0.1250    0.2500    0.6250
GA =
     1     3     4
     2     2     3
     2     2     3
     0     0     0
     2     1     2
     1     4     4
     0     0     0
     2     3     3
     3     2     2
A =
     1     2     3     0     1     2     0     1     2
```

Once the data are entered into Matlab by the call for file “md61,” we make preparation for the policy-improvement step. The procedure is in the file newpolprep.m

```matlab
% file newpolprep.m
% version of 3/23/92
disp('Data needed:')
disp(' Matrix PA of transition probabilities for states and actions')
disp(' Matrix GA of gains for states and actions')
disp(' Type number to identify GA matrix types')
disp('   Type 1. Gains associated with a state')
disp('   Type 2. One-step transition gains')
disp('   Type 3. Gains associated with a demand')
disp(' Row vector of actions')
disp(' Value of alpha (= 1 for no discounting)')
c = input('Enter type number to show gain type ');
a = input('Enter value of alpha (= 1 for no discounting) ');
PA = input('Enter matrix PA of transition probabilities ');
GA = input('Enter matrix GA of gains ');
if c == 3
  PD = input('Enter row vector of demand probabilities ');
end
if c == 1
  QA = GA';
elseif c == 2
  QA = diag(PA*GA');            % (qx1)
else
  QA = GA*PD';
end
A = input('Enter row vector A of possible actions ');   % (1xq)
m = length(PA(1,:));
q = length(A);
n = input('Enter n, the power of P to approximate P0 ');
s = input('Enter s, the power of P in the approximation of V ');
QD = [(1:q)' A' QA];   % First col is index; second is A; third is QA
DIS = ['   Index   Action    Value'];
disp(DIS)
disp(QD)
if a == 1
  call = 'Call for newpol';
else
  call = 'Call for newpold';
end
disp(call)
```

```
newpolprep                 % Call for preparatory program in file newpolprep.m
Data needed:
 Matrix PA of transition probabilities for states and actions
 Matrix GA of gains for states and actions
 Type number to identify GA matrix types
   Type 1. Gains associated with a state
   Type 2. One-step transition gains
   Type 3. Gains associated with a demand
 Row vector of actions
 Value of alpha (= 1 for no discounting)
Enter type number to show gain type  2
Enter value of alpha (= 1 for no discounting)  1
Enter matrix PA of transition probabilities  PA
Enter matrix GA of gains  GA
Enter row vector A of possible actions  A
Enter n, the power of P to approximate P0  16
Enter s, the power of P in the approximation of V  6
   Index   Action    Value
   1.0000   1.0000   2.6667
   2.0000   2.0000   2.3750
   3.0000   3.0000   2.3333
   4.0000        0        0
   5.0000   1.0000   1.6250
   6.0000   2.0000   2.5000
   7.0000        0        0
   8.0000   1.0000   2.6250
   9.0000   2.0000   2.1250
Call for newpol
```

The run ends with a prompt to call the procedure newpol (in file newpol.m).

```matlab
% file: newpol.m (without discounting)
% version of 3/23/92
d = input('Enter policy as row matrix of indices ');
D = A(d);               % policy in terms of actions
P = PA(d',:);           % selects probabilities for policy
Q = QA(d',:);           % selects next-period gains for policy
P0 = P^n;               % Display to check convergence
PI = P0(1,:);           % Long-run distribution
G = PI*Q                % Long-run expected gain per period
C = P + eye(size(P));
for j = 2:s
  C = C + P^j;          % C = I + P + P^2 + ... + P^s
end
V = (C - (s+1)*P0)*Q;   % V = B0*Q
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp('Policy in terms of indices')
disp(d)
disp('Policy in terms of actions')
disp(D)
disp('Values for the policy selected')
disp(V)
disp('Long-run expected gain per period G')
disp(G)
T = [(1:q)' A' (QA + PA*V)];   % Test values for determining new policy
DIS = ['   Index   Action   Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpol')
disp('--If not, use policy, values V, and gain G, above')
```

```
newpol
Enter policy as row matrix of indices  [2 6 9]   % A deliberately poor choice

Approximation to P0; rows are long-run dbn
    0.2642    0.2830    0.4528
    0.2642    0.2830    0.4528
    0.2642    0.2830    0.4528
Policy in terms of indices
     2     6     9
Policy in terms of actions
     2     2     2
Long-run expected gain per period G
    2.2972
   Index   Action   Test Value
   1.0000   1.0000   2.7171
   2.0000   2.0000   2.4168
   3.0000   3.0000   2.3838
   4.0000        0        0
   5.0000   1.0000   1.6220
   6.0000   2.0000   2.5677
   7.0000        0        0
   8.0000   1.0000   2.6479
   9.0000   2.0000   2.0583
Check current policy against new test values.
--If new policy needed, call for newpol
--If not, use policy and gain G, above

% New policy is needed
newpol
Enter policy as row matrix of indices  [1 6 8]

Approximation to P0; rows are long-run dbn
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
Policy in terms of indices
     1     6     8
Policy in terms of actions
     1     2     1
Values for the policy selected
    0.0526
   -0.0989
    0.0223
Long-run expected gain per period G
    2.6061
   Index   Action   Test Value
   1.0000   1.0000   2.6587
   2.0000   2.0000   2.3595
   3.0000   3.0000   2.3254
   4.0000        0        0
   5.0000   1.0000   1.6057
   6.0000   2.0000   2.5072
   7.0000        0        0
   8.0000   1.0000   2.6284
   9.0000   2.0000   2.1208
Check current policy against new test values.
--If new policy needed, call for newpol
--If not, use policy, values V, and gain G, above
```

Since the policy selected on this iteration is the same as the previous one, the procedure has converged to an optimal policy. The first of the first three rows is maximum; the second of the next two rows is maximum; and the first of the final two rows is maximum. Thus, we have selected rows 1, 6, and 8, corresponding to the optimal policy ${d}^{*}\sim \left(1\phantom{\rule{0.277778em}{0ex}}2\phantom{\rule{0.277778em}{0ex}}1\right)$ . The expected long-run gain per period is $g=2.6061$ .
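The row-by-row comparison just described can be checked with a short Python sketch. This is an illustration, not the module's code: PA and QA follow the nine-row layout of the data file (zero rows separate states), and v is taken from the converged run above.

```python
# Policy-improvement check: for each state, compare the test values
# q_i(a_ik) + sum_j p_ij(a_ik) v_j over the rows belonging to that state.
# Layout mirrors the md61 data file; zero rows are separators.

PA = [[1/3, 1/3, 1/3], [1/4, 3/8, 3/8], [1/3, 1/3, 1/3],  # state 1, actions 1-3
      [0, 0, 0],
      [1/8, 3/8, 1/2], [1/2, 1/4, 1/4],                   # state 2, actions 1-2
      [0, 0, 0],
      [3/8, 1/4, 3/8], [1/8, 1/4, 5/8]]                   # state 3, actions 1-2
QA = [8/3, 19/8, 7/3, 0, 13/8, 5/2, 0, 21/8, 17/8]        # q = diag(PA*GA')
v  = [0.0526, -0.0989, 0.0223]                            # values from converged run
state_rows = [(0, 1, 2), (4, 5), (7, 8)]                  # 0-based row groups per state

test = [QA[r] + sum(p * vj for p, vj in zip(PA[r], v)) for r in range(9)]
new_policy = [max(rows, key=lambda r: test[r]) + 1 for rows in state_rows]
print(new_policy)   # → [1, 6, 8]
```

The maximizing rows 1, 6, 8 reproduce the current policy, which is the convergence criterion of the iteration.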

The value matrix is

$\mathbf{v}=\left[\begin{array}{c}{v}_{1}\\ {v}_{2}\\ {v}_{3}\end{array}\right]=\left[\begin{array}{c}0.0527\\ -0.0989\\ 0.0223\end{array}\right]$

We made a somewhat arbitrary choice of the powers of P used in the approximation of P 0 and B 0 . As we note in the development of the approximation procedures in Sec 4, the convergence of ${\mathbf{P}}^{n}$ is governed by the magnitude of the largest eigenvalue other than one. We can always get a check on the reliability of the calculations by checking the eigenvalues of P corresponding to the presumed optimal policy. For the choice above, we find the eigenvalues to be 1, -0.1250, 0.0833. Since $0.{125}^{4}\approx 0.0002$ , the choices of exponents should be quite satisfactory. In fact, we could probably use ${\mathbf{P}}^{8}$ as a satisfactory approximation to P 0 , if that were desirable. The margin allows for the possibility that for some policies the eigenvalues may not be so small.— $\square$
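The eigenvalue argument can be confirmed numerically without computing eigenvalues at all: if the subdominant eigenvalues are about 0.125 in magnitude, then P^8 should already agree with P^16 to many decimals, and the rows of P^16 should be essentially identical. A minimal check in plain Python, using the optimal-policy matrix from the example:

```python
# Convergence check for P^n: compare P^8 against P^16 and verify that
# the rows of P^16 have collapsed to a common long-run distribution.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(A, n):
    R = A
    for _ in range(n - 1):
        R = matmul(R, A)
    return R

P = [[1/3, 1/3, 1/3], [1/2, 1/4, 1/4], [3/8, 1/4, 3/8]]
P8, P16 = matpow(P, 8), matpow(P, 16)
gap = max(abs(P8[i][j] - P16[i][j]) for i in range(3) for j in range(3))
print(gap)   # on the order of 0.125^8, i.e. roughly 1e-8
```

This supports the remark that P^8 would already serve as a satisfactory approximation to P 0 for this policy.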

7. Policy iteration with discounting
It turns out that the policy iteration procedure is simpler in the case of discounting, as suggested by the evolution analysis in Sec 4. We have the following two-phase procedure, based on that analysis.
1. Value-determination procedure . Given the q i and transition probabilities $p\left(i,j\right)$ for the current policy, solve $\mathbf{v}=\mathbf{q}+\alpha \mathbf{P}\mathbf{v}$ to get the corresponding v i :
$\mathbf{v}={\left[\mathbf{I}-\alpha \mathbf{P}\right]}^{-1}\mathbf{q}$
2. Policy-improvement procedure . Given $\left\{{v}_{i}:1\le i\le M\right\}$ for the current policy, determine a new policy d * , with each ${d}^{*}\left(i\right)={a}_{i}^{*}$ determined as that action for which
${q}_{i}^{*}+\alpha \sum _{j=1}^{M}{p}^{*}\left(i,j\right){v}_{j}=max\left\{{q}_{i}\left({a}_{ik}\right)+\alpha \sum _{j=1}^{M}{p}_{ij}\left({a}_{ik}\right){v}_{j}:{a}_{ik}\in {A}_{i}\right\}$
We illustrate the Matlab procedure with the same numerical example as in the previous case, using discount factor $\alpha =0.8$ . The data file is the same, so that we call for it, as before.

Assume data in file md61.m is in Matlab; if not, call for md61. We use the procedure newpolprep to prepare for the iterative procedure by making some initial choices.

```
newpolprep
Data needed:
 Matrix PA of transition probabilities for states and actions
 Matrix GA of gains for states and actions
 Type number to identify GA matrix types
   Type 1. Gains associated with a state
   Type 2. One-step transition gains
   Type 3. Gains associated with a demand
 Row vector of actions
 Value of alpha (= 1 for no discounting)
Enter type number to show gain type  2
Enter value of alpha (= 1 for no discounting)  0.8
Enter matrix PA of transition probabilities  PA
Enter matrix GA of gains  GA
Enter row vector A of possible actions  A
Enter n, the power of P to approximate P0  16
Enter s, the power of P in the approximation of V  6
   Index   Action   Test Value
   1.0000   1.0000   2.6667
   2.0000   2.0000   2.3750
   3.0000   3.0000   2.3333
   4.0000        0        0
   5.0000   1.0000   1.6250
   6.0000   2.0000   2.5000
   7.0000        0        0
   8.0000   1.0000   2.6250
   9.0000   2.0000   2.1250
Call for newpold
```

The procedure for policy iteration is in the file newpold.m.

```matlab
% file newpold.m (with discounting)
% version of 3/23/92
d = input('Enter policy as row matrix of indices ');
D = A(d);
P = PA(d',:);                  % transition matrix for policy selected
Q = QA(d',:);                  % average next-period gain for policy selected
P0 = P^n;                      % approximation to P0, as in newpol.m
V = (eye(size(P)) - a*P)\Q;    % Present values for unlimited future
T = [(1:q)' A' (QA + a*PA*V)]; % Test values for determining new policy
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp('Policy in terms of indices')
disp(d)
disp('Policy in terms of actions')
disp(D)
disp('Values for policy')
disp(V)
DIS = ['   Index   Action   Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpold')
disp('--If not, use policy and values above')
```

```
newpold
Enter policy as row matrix of indices  [3 5 9]   % Deliberately poor choice

Approximation to P0; rows are long-run dbn
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
Policy in terms of indices
     3     5     9
Policy in terms of actions
     3     1     2
Values for policy
   10.3778
    9.6167
   10.1722
   Index   Action   Test Value
   1.0000   1.0000   10.7111
   2.0000   2.0000   10.3872
   3.0000   3.0000   10.3778
   4.0000        0        0
   5.0000   1.0000    9.6167
   6.0000   2.0000   10.6089
   7.0000        0        0
   8.0000   1.0000   10.7133
   9.0000   2.0000   10.1722
Check current policy against new test values.
--If new policy needed, call for newpold
--If not, use policy and values above

newpold
Enter policy as row matrix of indices  [1 6 8]

Approximation to P0; rows are long-run dbn
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
Policy in terms of indices
     1     6     8
Policy in terms of actions
     1     2     1
Values for policy
   13.0844
   12.9302
   13.0519
   Index   Action   Test Value
   1.0000   1.0000   13.0844
   2.0000   2.0000   12.7865
   3.0000   3.0000   12.7511
   4.0000        0        0
   5.0000   1.0000   12.0333
   6.0000   2.0000   12.9302
   7.0000        0        0
   8.0000   1.0000   13.0519
   9.0000   2.0000   12.5455
Check current policy against new test values.
--If new policy needed, call for newpold
--If not, use policy and values above
```

Since the policy indicated is the same as the previous policy, we know this is an optimal policy. Rows 1, 6, 8 indicate the optimal policy to be ${d}^{*}\sim \left(1,\phantom{\rule{0.166667em}{0ex}}2,\phantom{\rule{0.166667em}{0ex}}1\right)$ . It turns out in this example that the optimal policies are the same for the discounted and undiscounted cases. That is not always true. The v matrix gives the present values for unlimited futures, for each of the three possible starting states.
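The convergence criterion can be verified directly: feeding the present values for policy (1, 2, 1) back through the discounted test $QA+\alpha \,PA\,\mathbf{v}$ must again select rows 1, 6, 8. A short Python sketch (an illustration, using the same nine-row layout as the data file):

```python
# Discounted policy-improvement check: the values of policy (1, 2, 1)
# should reproduce that policy as the maximizer in every state.

PA = [[1/3, 1/3, 1/3], [1/4, 3/8, 3/8], [1/3, 1/3, 1/3],  # state 1
      [0, 0, 0],
      [1/8, 3/8, 1/2], [1/2, 1/4, 1/4],                   # state 2
      [0, 0, 0],
      [3/8, 1/4, 3/8], [1/8, 1/4, 5/8]]                   # state 3
QA = [8/3, 19/8, 7/3, 0, 13/8, 5/2, 0, 21/8, 17/8]
v  = [13.0844, 12.9302, 13.0519]                          # values from converged run
alpha = 0.8
state_rows = [(0, 1, 2), (4, 5), (7, 8)]

test = [QA[r] + alpha * sum(p * vj for p, vj in zip(PA[r], v)) for r in range(9)]
new_policy = [max(rows, key=lambda r: test[r]) + 1 for rows in state_rows]
print(new_policy)   # → [1, 6, 8]
```

Each maximizing test value equals the corresponding component of v itself, which is exactly the fixed-point property of the optimal discounted values.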
