In dynamic programming an agent finds a policy (an action given their current state) that maximizes their present discounted reward, given

\begin{align*} \beta &\quad\text{discount factor}\\ R_{ik} &\quad\text{reward from transitioning to state $i$ using control $k$} \\ P_{ijk} &\quad\text{probability of transitioning from state $i$ to state $j$ using control $k$} \\ V_{i} &\quad\text{present discounted value of being in state $i$} \end{align*}

Value iteration is a simple way to solve discrete infinite horizon dynamic programs numerically. In value iteration we set the present discounted value of each state to an arbitrary starting value and iterate on the Bellman equation until convergence:

\begin{align*} &\text{do} \\ &\quad n += 1\\ &\quad V^{(n+1)} = \max_k (R + \beta PV^{(n)}) \\ &\text{until}\quad || V^{(n+1)} - V^{(n)} || < \epsilon \end{align*}

Because the Bellman operator is a contraction with modulus $\beta$, this iteration converges from any starting values. Following an example given in Foundations of Machine Learning, suppose we have an infinite horizon dynamic program identical to the one given in the graph shown below:

*Figure: a two-state MDP; state 1 has actions $a$ and $b$, state 2 has actions $c$ and $d$, and each edge is labeled with a transition probability and a reward.*

Notice the edges in the graph are labeled with probabilities and rewards. For instance, if you choose to play $a$ in state 1 there is a 75% chance that you will end up in state 1 (with a reward of 2) and a 25% chance that you will end up in state 2 (also with a reward of 2).
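To make the algorithm concrete, here is a minimal, self-contained sketch of value iteration for this two-state example. The function name `toy_value_iteration` and the tolerance are illustrative, not part of the MDP package used below.

```julia
# hand-coded value iteration for the two-state example;
# each q value is the expected reward plus the discounted continuation value
function toy_value_iteration(beta; eps=1e-8)
    V = [0.0, 0.0]
    while true
        q1a = 2.0 + beta * (0.75 * V[1] + 0.25 * V[2])  # play a in state 1
        q1b = 2.0 + beta * V[2]                         # play b in state 1
        q2c = 2.0 + beta * V[2]                         # play c in state 2
        q2d = 3.0 + beta * V[1]                         # play d in state 2
        Vnew = [max(q1a, q1b), max(q2c, q2d)]
        if max(abs(Vnew[1] - V[1]), abs(Vnew[2] - V[2])) < eps
            return Vnew
        end
        V = Vnew
    end
end

toy_value_iteration(0.5)
```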

In Julia this problem can be represented using arrays

```julia
using MDP
# enumerate the state-action pairs as rows: (1,a), (1,b), (2,c), (2,d)
I = [1,1,2,3,4]
J = [1,2,2,2,1]
# rewards per transition: row = state-action pair, column = successor state
R = sparse(I, J, Float64[2,2,2,2,3])
# transition probabilities: row = state-action pair, column = successor state
P = sparse(I, J, [.75, .25, 1.00, 1.00, 1.00])
# enumeration of the feasible state-action pairs
SA = [
    ((1,),('a',)),
    ((1,),('b',)),
    ((2,),('c',)),
    ((2,),('d',))
]
# collapse to the expected reward for each state-action pair,
# e.g. playing a in state 1 pays 0.75*2 + 0.25*2 = 2
R = Float64[2,2,2,3]
indvec = [2,4]
# indvec describes which choices correspond to each state
# the "2" says that the first to second elements of R correspond to playing "a" or "b" in state 1
# the "4" says that the third to fourth elements of R correspond to playing "c" or "d" in state 2
V0 = Float64[-1; 1]
β = 0.5
mdp = SimpleMDP(R, P, SA, indvec, β, IClock())
valueiteration(mdp; Vstart=V0)
```

```julia
policyiteration(mdp)
```
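Policy iteration takes a different route: it alternates between evaluating the current policy exactly, by solving the linear system $V = R_\pi + \beta P_\pi V$, and improving the policy greedily against that value function; it typically converges in far fewer iterations than value iteration. Below is a rough, self-contained sketch of the generic algorithm on the same example; `toy_policy_iteration` is illustrative and not necessarily how the MDP package implements it.

```julia
# generic policy iteration: exact evaluation followed by greedy improvement
function toy_policy_iteration(R, P, pairs, beta)
    policy = [first(p) for p in pairs]  # arbitrary initial policy: first action in each state
    while true
        # policy evaluation: solve V = R_pi + beta*P_pi*V exactly
        V = (eye(length(pairs)) - beta*P[policy, :]) \ R[policy]
        # policy improvement: act greedily against V
        newpolicy = [pairs[s][indmax(R[pairs[s]] + beta*P[pairs[s], :]*V)] for s = 1:length(pairs)]
        newpolicy == policy && return policy, V
        policy = newpolicy
    end
end

R = Float64[2, 2, 2, 3]                     # expected rewards, one per pair
P = [0.75 0.25; 0.0 1.0; 0.0 1.0; 1.0 0.0]  # transition row for each pair
toy_policy_iteration(R, P, Vector{Int}[[1, 2], [3, 4]], 0.5)
```

With $\beta = 0.5$ this converges in a few iterations to playing $b$ in state 1 and $d$ in state 2.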

Consider a deterministic infinite horizon optimal growth problem used by Judd in Numerical Methods in Economics.

In the problem the agent maximizes their present discounted utility by choosing their consumption. A higher rate of consumption means that less capital is accumulated, which in turn lowers future consumption.

$$ V(k) = \max_c u(c) + \beta V(k^+(k,c)) $$

where $u(c) = \frac{c^{1+\gamma}}{1+\gamma}$, $F(k) = k + f(k) = k + \frac{(1-\beta)}{\alpha\beta}k^\alpha$, and $k^+ = F(k) - c$.

The problem with this formulation is that it is difficult to discretize $c$ in a way that is compatible with the levels of $k$. To circumvent this problem we make next period's capital stock the control variable. This gives

$$ V(k) = \max_{k^+} u(F(k) - k^+) + \beta V(k^+) $$

Assuming $\gamma = -2$, $\alpha = 0.25$ and $\beta = 0.96$, the steady state level of capital is one. Since every positive initial capital stock converges to the steady state, discretizing a region around the steady state ensures that the optimal policy never leaves the discretized region. In the code below I discretize the capital stock into 401 equally spaced points between 0.8 and 1.2.
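To see why, recall that along an interior optimal path the Euler equation $u'(c_t) = \beta\, u'(c_{t+1})\, F'(k_{t+1})$ must hold, which at a steady state (constant $c$) reduces to $\beta F'(k^*) = 1$. Differentiating $F$,

$$ F'(k) = 1 + \frac{1-\beta}{\beta}k^{\alpha-1} \quad\Longrightarrow\quad \beta F'(1) = \beta\left(1 + \frac{1-\beta}{\beta}\right) = 1, $$

so $k^* = 1$ is the steady state for any values of $\gamma$ and $\alpha$.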

```julia
using MDP
# reward function (CRRA utility)
util(c; g=-2) = c^(g+1)/(g+1)
# law of motion for the capital stock
transition(k; alpha=0.25, beta=0.96) = k + ((1-beta)/(alpha*beta))*k^alpha
# consumption must be strictly positive (with g = -2, u(c) = -1/c blows up at zero)
reachable(c) = c > 0
ks = 0.8:0.001:1.2 # grid of capital levels at which the problem is discretized
n = length(ks)
# determine the feasible state-action pairs
R = Float64[] # rewards for all feasible state-action pairs
SA = ((Float64,),(Float64,))[] # enumeration of all feasible state-action pairs
indvec = Int64[]
k = 1
for i = 1:n
    if i > 1 push!(indvec, k-1) end
    for j = 1:n
        c = transition(ks[i]) - ks[j]
        if reachable(c)
            push!(SA, ((ks[i],),(ks[j],)))
            push!(R, util(c))
            k += 1
        end
    end
end
push!(indvec, k-1)
# build the transition matrix: choosing next period's capital moves the
# agent to that state with probability one, so each row has a single unit entry
P = sparse(Int64[], Int64[], Float64[], k-1, n)
P[1:indvec[1], 1:indvec[1]] = eye(indvec[1])
for i = 2:length(indvec)
    lb = indvec[i-1] + 1
    ub = indvec[i]
    P[lb:ub, 1:(ub-lb+1)] = eye(ub-lb+1)
end
β = 0.96
dp = SimpleMDP(R, P, SA, indvec, β, IClock())
```
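Because the law of motion is deterministic, every feasible state-action pair moves the agent to exactly one successor state, so each row of the $P$ built above should contain a single unit entry. A quick illustrative sanity check:

```julia
# each state-action pair leads to exactly one successor state,
# so every row of the transition matrix should sum to one
@assert all(sum(P, 2) .== 1.0)
```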

```julia
v = policyiteration(dp; eps=1e-2)
```

```julia
# simulate the optimal capital path implied by the solved policy,
# starting from the state with index startindx
function computepath(startindx, mdpsol; len=10)
    policy = mdpsol.policy  # chosen state-action pair for each state
    action = copy(policy)
    indvec = mdpsol.indvec
    # convert each pair index into the index of the implied successor state
    for i = 2:length(policy)
        action[i] -= indvec[i-1]
    end
    SA = mdpsol.SA
    path = Array(Float64, len+1)
    indx = startindx
    path[1] = SA[policy[indx]][2][1]  # capital chosen in the starting state
    for i = 1:len
        indx = action[indx]           # move to next period's state
        path[i+1] = SA[policy[indx]][2][1]
    end
    path
end
```

```julia
using DataFrames
# capital paths starting from the lowest (k = 0.8) and highest (k = 1.2) grid points
df = DataFrame(low=computepath(1, v; len=99), high=computepath(401, v; len=99), time=1:100)
```

```julia
using Gadfly
plot(df, layer(x="time", y="low", Geom.line), layer(x="time", y="high", Geom.line),
     Guide.xlabel("time (t)"), Guide.ylabel("Capital Stock (k)"))
```

*Figure: simulated capital stock paths over time, starting from k = 0.8 ("low") and k = 1.2 ("high").*
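Both simulated paths should converge toward the steady state capital level $k = 1$, one from below and one from above, consistent with the stability of the steady state discussed earlier.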