# Markov Decision Process (MDP)

## MDP

Posted by xuepro on May 31, 2018

A stochastic process is an indexed collection of random variables $$\{X_t\}$$, e.g., the time series of weekly demands for a product.

A stochastic process $$X_t$$ is said to be Markovian if and only if

$P(X_{t+1}=j\vert X_t=i,X_{t-1}=k_{t-1},X_{t-2}=k_{t-2},...,X_1=k_1,X_0=k_0) = P(X_{t+1}=j\vert X_t=i)$

“The future is independent of the past given the present”

A Markov process (or Markov chain) is a memoryless stochastic process, i.e., a sequence of random states $$s_1, s_2, \dots$$ with the Markov property.

A Markov process is a tuple $$(S, P, \mu)$$

• S is a (finite) set of states
• P is a state transition probability matrix, $$P_{ss'} = P(s'\vert s)$$
• $$\mu$$ is a set of initial probabilities, $$\mu_i^0 = P(X_0=i)$$ for all $$i$$

A Markov reward process is a Markov process with values. A Markov Reward Process is a tuple $$(S, P, R, \gamma, \mu)$$

• S is a (finite) set of states
• P is a state transition probability matrix, $$P_{ss'} = P(s'\vert s)$$
• R is a reward function, $$R_s = E[r \vert s]$$
• $$\gamma$$ is a discount factor, $$\gamma \in [0,1]$$
• $$\mu$$ is a set of initial probabilities, $$\mu_i^0 = P(X_0=i)$$ for all $$i$$

The return $$v_t$$ is the total discounted reward from time-step t: $$v_t = r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+\dots = \sum_{k=0}^{\infty}\gamma ^k r_{t+k+1}$$
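
As a quick illustrative check (the numbers here are chosen just for the example): if every reward is $$r_{t+k+1}=1$$ and $$\gamma = 0.9$$, the return is the geometric series $$v_t = \sum_{k=0}^{\infty} 0.9^k = \frac{1}{1-0.9} = 10$$.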

The state value function V(s) of an MRP is the expected return starting from state s $$V(s) = E[v_{t}\vert s_{t}=s]$$

### Bellman Equation

$$\begin{split} V(s) &= E[v_{t}\vert s_{t}=s] \\ &= E[r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+ \dots\vert s_{t}=s]\\ &= E[r_{t+1}+\gamma (r_{t+2}+\gamma r_{t+3}+ \dots) \vert s_{t}=s]\\ &= E[r_{t+1}+\gamma v_{t+1}\vert s_{t}=s]\\ &= E[r_{t+1}+\gamma V(s_{t+1})\vert s_{t}=s] \end{split}$$

So, we get the Bellman equation:

$$V(s) = E[r+ \gamma V(s')\vert s] = R_{s}+\gamma \sum_{s'\in S} P_{ss'}V(s')$$

The Bellman equation can be expressed concisely using matrices.

$V =R+\gamma PV$ $\begin{bmatrix} V(1) \\ \vdots\\ V(n) \end{bmatrix} = \begin{bmatrix} R(1) \\ \vdots\\ R(n) \end{bmatrix} +\gamma \begin{bmatrix} P_{11} &\dots & P_{1n}\\ \vdots &\ddots &\vdots \\ P_{n1} &\dots & P_{nn} \end{bmatrix} \begin{bmatrix} V(1) \\ \vdots\\ V(n) \end{bmatrix}$

### Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly: $$V =R+\gamma PV$$

$(I-\gamma P)V =R$ $V = (I-\gamma P)^{-1}R$

Computational complexity is $$O(n^3)$$ for $$n$$ states, so a direct solution is only possible for small MRPs; a minimal sketch of the direct solution is given after the list below. There are many iterative methods for large MRPs, e.g.,

• Dynamic programming
• Monte-Carlo evaluation
• Temporal-Difference learning
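
To make the direct solution concrete, here is a minimal sketch in Python/NumPy (the two-state chain and all its numbers are illustrative assumptions, not from the text):

```python
import numpy as np

# A toy 2-state MRP; all numbers here are illustrative assumptions.
P = np.array([[0.9, 0.1],    # P[s, s']: transition matrix, rows sum to 1
              [0.5, 0.5]])
R = np.array([1.0, -1.0])    # R[s]: expected immediate reward in state s
gamma = 0.9                  # discount factor

# Direct solution of V = R + gamma * P * V, i.e. (I - gamma * P) V = R.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```

Using `np.linalg.solve` avoids forming the matrix inverse explicitly, but the cost is still cubic in the number of states, which is why the iterative methods above matter for large problems.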

### Discrete-time Finite Markov Decision Processes (MDP)

A Markov decision process (MDP) is a Markov reward process with decisions. It models an environment in which all states are Markov and time is divided into stages.

A Markov Decision Process is a tuple $$(S, A, P, R, \gamma, \mu)$$

• S is a (finite) set of states
• A is a (finite) set of actions
• P is a state transition probability matrix, $$P(s' \vert s, a)$$
• R is a reward function, $$R(s, a) = E[r \vert s, a]$$
• $$\gamma$$ is a discount factor, $$\gamma \in [0,1]$$
• $$\mu$$ is a set of initial probabilities, $$\mu_i^0= P(X_0 = i)$$ for all $$i$$

The goal is a scalar reward: goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (the reward).

Policies: a policy, at any given point in time, decides which action the agent selects. A policy fully defines the behavior of an agent. Policies can be:

• Markovian $$\subset$$ History-dependent
• Deterministic $$\subset$$ Stochastic
• Stationary $$\subset$$ Non-stationary

#### Stationary Stochastic Markovian Policies

A policy $$\pi$$ is a distribution over actions given the state:

$$\pi (s,a) = P(a\vert s)$$

• MDP policies depend on the current state (not the history)
• i.e., policies are stationary (time-independent)
• Given an MDP $$M= (S, A, P, R, \gamma, \mu)$$ and a policy $$\pi$$:
• the state sequence $$s_1, s_2, \dots$$ is a Markov process $$(S, P^{\pi}, \mu)$$
• the state and reward sequence $$s_1, r_2, s_2, \dots$$ is a Markov reward process $$(S, P^{\pi}, R^{\pi}, \gamma, \mu)$$, where
$P^{\pi}_{ss'} = \sum_{a \in A} \pi(s,a)P(s'\vert s,a)$ $R^{\pi}_s = \sum_{a \in A} \pi(s,a)R(s,a)$
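
As a sketch of how the induced MRP arises in code (array shapes and all numbers below are assumptions for illustration): with the transition model stored as `P[s, a, s']` and a stochastic policy `pi[s, a]`, the two sums above are just policy-weighted averages:

```python
import numpy as np

# Illustrative 2-state, 2-action MDP; all numbers are assumptions.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0,  0.0],                # R[s, a]
              [0.5, -1.0]])
pi = np.array([[0.5, 0.5],                # pi[s, a], rows sum to 1
               [0.2, 0.8]])

# Induced MRP: average transitions and rewards over the policy.
P_pi = np.einsum('sa,sat->st', pi, P)   # P^pi[s,s'] = sum_a pi(s,a) P(s'|s,a)
R_pi = np.einsum('sa,sa->s', pi, R)     # R^pi[s]    = sum_a pi(s,a) R(s,a)
```

With `P_pi` and `R_pi` in hand, $$V^{\pi}$$ can be computed exactly as for an MRP, either iteratively or via the direct solution shown below.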

#### Value Functions

Given a policy $$\pi$$, it is possible to define the utility of each state; this is the problem of policy evaluation.

• The state-value function $$V^{\pi}(s)$$ of an MDP is the expected return starting from state s and then following policy $$\pi$$: $$V^{\pi}(s) = E_{\pi}[v_t\vert s_t=s]$$

For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state.

The action-value function $$Q^{\pi}(s, a)$$ is the expected return starting from state s, taking action a, and then following policy $$\pi$$: $$Q^{\pi}(s, a) = E_{\pi}[v_t\vert s_t=s,a_t=a]$$

#### Bellman Expectation Equation

$$\begin{split} V^{\pi}(s) &= E_{\pi}[r_{t+1}+ \gamma V^{\pi}(s_{t+1}) \vert s_t=s]\\ &=\sum_{a\in A} \pi (a\vert s)\Big(R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) V^{\pi}(s')\Big) \end{split}$$ $$\begin{split} Q^{\pi}(s, a) &= E_{\pi}[r_{t+1}+ \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \vert s_t=s, a_t=a]\\ &= R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) V^{\pi}(s') \\ &= R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) \sum_{a'\in A} \pi(a'\vert s')Q^{\pi}(s',a') \end{split}$$

The Bellman expectation equation can be expressed concisely using the induced MRP:

$V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}$

with direct solution

$V^{\pi} = (I- \gamma P^{\pi})^{-1} R^{\pi}$

#### Bellman operators for $$V^{\pi}$$

The Bellman operator for $$V^{\pi}$$ is defined as $$T^{\pi}$$ : $$R^{S}\rightarrow R^{S}$$ (maps value functions to value functions):

$(T^{\pi}V)(s) =\sum_{a\in A} \pi (a\vert s)\Big(R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) V(s')\Big)$

Using the Bellman operator, the Bellman expectation equation can be written compactly as:

$T^{\pi}V^{\pi} = V^{\pi}$
• $$V^{\pi}$$ is a fixed point of the Bellman operator $$T^{\pi}$$
• This is a linear equation in $$V^{\pi}$$ and $$T^{\pi}$$
• If $$0 < \gamma < 1$$, then $$T^{\pi}$$ is a contraction w.r.t. the maximum norm
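
The contraction property is what makes simple fixed-point iteration converge. A minimal sketch (reusing induced-MRP quantities of the kind computed above; all numbers are illustrative assumptions):

```python
import numpy as np

# Illustrative induced MRP (P^pi, R^pi); numbers are assumptions.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
R_pi = np.array([1.0, -1.0])
gamma = 0.9

# Fixed-point iteration: repeatedly apply T^pi(V) = R^pi + gamma * P^pi V.
# Since T^pi is a gamma-contraction in the max norm, the iterates converge
# to the unique fixed point V^pi from any initial guess.
V = np.zeros(2)
for _ in range(10_000):
    V_new = R_pi + gamma * P_pi @ V
    if np.max(np.abs(V_new - V)) < 1e-10:   # sup-norm stopping rule
        break
    V = V_new
print(V_new)
```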

#### Bellman operator for $$Q^{\pi}$$

The Bellman operator for $$Q^{\pi}$$ is defined as $$T^{\pi}$$ : $$R^{S\times A} \rightarrow R^{S\times A}$$ (maps action-value functions to action-value functions):

$(T^{\pi}Q)(s,a) = R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) \sum_{a'\in A} \pi(a'\vert s')Q(s',a')$

Using the Bellman operator, the Bellman expectation equation can be written compactly as:

$T^{\pi}Q^{\pi} = Q^{\pi}$

### Optimal Value Function and Optimal Policy

#### Optimal Value Function

The optimal state-value function $$V^*(s)$$ is the maximum value function over all policies

$V^{*}(s) = \max_{\pi}V^{\pi}(s)$

The optimal action-value function $$Q^*(s, a)$$ is the maximum action-value function over all policies

$Q^{*}(s,a) = \max_{\pi}Q^{\pi}(s,a)$
• The optimal value function specifies the best possible performance in the MDP
• An MDP is “solved” when we know the optimal value function

#### Optimal Policy

Value functions define a partial ordering over policies

$$\pi\geq\pi^\prime$$ if $$V^{\pi}(s) \geq V^{\pi^\prime}(s)$$, $$\forall s\in S$$

##### Theorem

For any Markov Decision Process

• There exists an optimal policy $$\pi^*$$ that is better than or equal to all other policies: $$\pi^*\geq\pi, \forall \pi$$
• All optimal policies achieve the optimal value function, $$V^{\pi^*}(s) = V^*(s)$$
• All optimal policies achieve the optimal action–value function, $$Q^{\pi^*}(s,a) = Q^*(s,a)$$
• There is always a deterministic optimal policy for any MDP

A deterministic optimal policy can be found by maximizing over $$Q^*(s,a)$$

$$\pi^*( a\vert s) = \begin{cases} 1& \text{if } a= \arg\max_{a\in A}Q^*(s,a)\\ 0& \text{otherwise} \end{cases}$$
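
In code, extracting this deterministic greedy policy is one argmax per state. A sketch assuming $$Q^*$$ is stored as an (states × actions) array (the table values are made up for illustration):

```python
import numpy as np

# Illustrative optimal action-value table Q*[s, a]; numbers are assumptions.
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.3]])

# Deterministic optimal policy: act greedily w.r.t. Q* in every state.
pi_star = np.argmax(Q_star, axis=1)   # pi_star[s] is the chosen action
print(pi_star)                        # -> [1 0]
```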

### Bellman Optimality Equation

#### Bellman Optimality Equation for $$V^*$$

$$V^*(s) = \max_{a}Q^*(s,a) = \max_{a}\Big\{ R(s,a)+ \gamma \sum_{s' \in S} P(s' \vert s,a) V^*(s') \Big\}$$

#### Bellman Optimality Equation for $$Q^*$$

$$\begin{split} Q^*(s,a) &= R(s,a)+ \gamma \sum_{s' \in S} P(s' \vert s,a) V^*(s')\\ &= R(s,a)+ \gamma \sum_{s' \in S} P(s' \vert s,a) \max_{a'}Q^*(s',a') \end{split}$$
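
The Bellman optimality equation suggests value iteration: repeatedly apply the optimality update until the values stop changing. A minimal sketch, using the same illustrative MDP shapes as above (all numbers are assumptions):

```python
import numpy as np

# Illustrative 2-state, 2-action MDP; all numbers are assumptions.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0,  0.0],                # R[s, a]
              [0.5, -1.0]])
gamma = 0.9

# Value iteration: V <- max_a { R(s,a) + gamma * sum_s' P(s'|s,a) V(s') }.
V = np.zeros(2)
for _ in range(10_000):
    Q = R + gamma * np.einsum('sat,t->sa', P, V)   # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V_new, np.argmax(Q, axis=1))   # V* and a greedy optimal policy
```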

### Principle of Optimality

The tail of an optimal policy is optimal for the "tail" problem: regardless of the initial state and of the decisions already taken, the remaining decisions of an optimal policy are an optimal policy for the remaining subproblem.

Note: the vertical bar in math expressions sometimes needs to be written with `\vert`.

See also: List of Greek letters and math symbols
