# Markov Decision Process (MDP)

## MDP

Posted by xuepro on May 31, 2018

A stochastic process is an indexed collection of random variables $$\{X_t\}$$, e.g., the time series of weekly demands for a product.

A stochastic process $$X_t$$ is said to be Markovian if and only if

$P(X_{t+1}=j\vert X_t=i,X_{t-1}=k_{t-1},X_{t-2}=k_{t-2},...,X_1=k_1,X_0=k_0) = P(X_{t+1}=j\vert X_t=i)$

“The future is independent of the past given the present”

A Markov process (or Markov chain) is a memoryless stochastic process, i.e., a sequence of random states $$s_1, s_2, \dots$$ with the Markov property.

A Markov process is a tuple $$(S, P, \mu)$$

• S is a (finite) set of states
• P is a state transition probability matrix, $$P_{ss'} = P(s'\vert s)$$
• $$\mu$$ is a set of initial probabilities, $$\mu_i^0 = P(X_0=i)$$ for all $$i$$

A Markov reward process is a Markov process with values. A Markov Reward Process is a tuple $$(S, P, R, \gamma, \mu)$$

• S is a (finite) set of states
• P is a state transition probability matrix, $$P_{ss'} = P(s'\vert s)$$
• R is a reward function, $$R_s = E[r \vert s]$$
• $$\gamma$$ is a discount factor, $$\gamma \in [0,1]$$
• $$\mu$$ is a set of initial probabilities, $$\mu_i^0 = P(X_0=i)$$ for all $$i$$

The return $$v_t$$ is the total discounted reward from time-step t: $$v_t = r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+\dots = \sum_{k=0}^{\infty}\gamma ^k r_{t+k+1}$$
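
As a quick illustrative check (the numbers here are chosen just for the example): if every reward is $$r_{t+k+1}=1$$ and $$\gamma = 0.9$$, the return is the geometric series $$v_t = \sum_{k=0}^{\infty} 0.9^k = \frac{1}{1-0.9} = 10$$.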

The state value function V(s) of an MRP is the expected return starting from state s $$V(s) = E[v_{t}\vert s_{t}=s]$$

### Bellman Equation

$$\begin{split} V(s) &= E[v_{t}\vert s_{t}=s] \\ &= E[r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+ \dots\vert s_{t}=s]\\ &= E[r_{t+1}+\gamma (r_{t+2}+\gamma r_{t+3}+ \dots) \vert s_{t}=s]\\ &= E[r_{t+1}+\gamma v_{t+1}\vert s_{t}=s]\\ &= E[r_{t+1}+\gamma V(s_{t+1})\vert s_{t}=s] \end{split}$$

So, we get the Bellman equation:

$$V(s) = E[r+ \gamma V(s')\vert s] = R_{s}+\gamma \sum_{s'\in S} P_{ss'}V(s')$$

The Bellman equation can be expressed concisely using matrices.

$V =R+\gamma PV$ $\begin{bmatrix} V(1) \\ \vdots\\ V(n) \end{bmatrix} = \begin{bmatrix} R(1) \\ \vdots\\ R(n) \end{bmatrix} +\gamma \begin{bmatrix} P_{11} &\dots & P_{1n}\\ \vdots &\ddots &\vdots \\ P_{n1} &\dots & P_{nn} \end{bmatrix} \begin{bmatrix} V(1) \\ \vdots\\ V(n) \end{bmatrix}$

### Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly: $$V =R+\gamma PV$$

$(I-\gamma P)V =R$ $V = (I-\gamma P)^{-1}R$

Computational complexity is $$O(n^3)$$ for $$n$$ states, so a direct solution is only possible for small MRPs; a minimal sketch of the direct solution is given after the list below. There are many iterative methods for large MRPs, e.g.,

• Dynamic programming
• Monte-Carlo evaluation
• Temporal-Difference learning
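
To make the direct solution concrete, here is a minimal sketch in Python/NumPy (the two-state chain and all its numbers are illustrative assumptions, not from the text):

```python
import numpy as np

# A toy 2-state MRP; all numbers here are illustrative assumptions.
P = np.array([[0.9, 0.1],    # P[s, s']: transition matrix, rows sum to 1
              [0.5, 0.5]])
R = np.array([1.0, -1.0])    # R[s]: expected immediate reward in state s
gamma = 0.9                  # discount factor

# Direct solution of V = R + gamma * P * V, i.e. (I - gamma * P) V = R.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```

Using `np.linalg.solve` avoids forming the matrix inverse explicitly, but the cost is still cubic in the number of states, which is why the iterative methods above matter for large problems.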

### Discrete-time Finite Markov Decision Processes (MDP)

A Markov decision process (MDP) is a Markov reward process with decisions. It models an environment in which all states are Markov and time is divided into stages.

A Markov Decision Process is a tuple $$(S, A, P, R, \gamma, \mu)$$

• S is a (finite) set of states
• A is a (finite) set of actions
• P is a state transition probability matrix, $$P(s' \vert s, a)$$
• R is a reward function, $$R(s, a) = E[r \vert s, a]$$
• $$\gamma$$ is a discount factor, $$\gamma \in [0,1]$$
• $$\mu$$ is a set of initial probabilities, $$\mu_i^0= P(X_0 = i)$$ for all $$i$$

The goal is a scalar reward: goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (the reward).

Policies: a policy, at any given point in time, decides which action the agent selects. A policy fully defines the behavior of an agent. Policies can be:

• Markovian $$\subset$$ History-dependent
• Deterministic $$\subset$$ Stochastic
• Stationary $$\subset$$ Non-stationary

#### Stationary Stochastic Markovian Policies

A policy $$\pi$$ is a distribution over actions given the state:

$$\pi (s,a) = P(a\vert s)$$

• MDP policies depend on the current state (not the history)
• i.e., policies are stationary (time-independent)
• Given an MDP $$M= (S, A, P, R, \gamma, \mu)$$ and a policy $$\pi$$:
• the state sequence $$s_1, s_2, \dots$$ is a Markov process $$(S, P^{\pi}, \mu)$$
• the state and reward sequence $$s_1, r_2, s_2, \dots$$ is a Markov reward process $$(S, P^{\pi}, R^{\pi}, \gamma, \mu)$$, where
$P^{\pi}_{ss'} = \sum_{a \in A} \pi(s,a)P(s'\vert s,a)$ $R^{\pi}_s = \sum_{a \in A} \pi(s,a)R(s,a)$
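
As a sketch of how the induced MRP arises in code (array shapes and all numbers below are assumptions for illustration): with the transition model stored as `P[s, a, s']` and a stochastic policy `pi[s, a]`, the two sums above are just policy-weighted averages:

```python
import numpy as np

# Illustrative 2-state, 2-action MDP; all numbers are assumptions.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0,  0.0],                # R[s, a]
              [0.5, -1.0]])
pi = np.array([[0.5, 0.5],                # pi[s, a], rows sum to 1
               [0.2, 0.8]])

# Induced MRP: average transitions and rewards over the policy.
P_pi = np.einsum('sa,sat->st', pi, P)   # P^pi[s,s'] = sum_a pi(s,a) P(s'|s,a)
R_pi = np.einsum('sa,sa->s', pi, R)     # R^pi[s]    = sum_a pi(s,a) R(s,a)
```

With `P_pi` and `R_pi` in hand, $$V^{\pi}$$ can be computed exactly as for an MRP, either iteratively or via the direct solution shown below.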

#### Value Functions

Given a policy $$\pi$$, it is possible to define the utility of each state; this is the problem of policy evaluation.

• The state-value function $$V^{\pi}(s)$$ of an MDP is the expected return starting from state s and then following policy $$\pi$$: $$V^{\pi}(s) = E_{\pi}[v_t\vert s_t=s]$$

For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state.

The action-value function $$Q^{\pi}(s, a)$$ is the expected return starting from state s, taking action a, and then following policy $$\pi$$: $$Q^{\pi}(s, a) = E_{\pi}[v_t\vert s_t=s,a_t=a]$$

#### Bellman Expectation Equation

$$\begin{split} V^{\pi}(s) &= E_{\pi}[r_{t+1}+ \gamma V^{\pi}(s_{t+1}) \vert s_t=s]\\ &=\sum_{a\in A} \pi (a\vert s)\Big(R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) V^{\pi}(s')\Big) \end{split}$$ $$\begin{split} Q^{\pi}(s, a) &= E_{\pi}[r_{t+1}+ \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \vert s_t=s, a_t=a]\\ &= R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) V^{\pi}(s') \\ &= R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) \sum_{a'\in A} \pi(a'\vert s')Q^{\pi}(s',a') \end{split}$$

The Bellman expectation equation can be expressed concisely using the induced MRP:

$V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}$

with direct solution

$V^{\pi} = (I- \gamma P^{\pi})^{-1} R^{\pi}$

#### Bellman operators for $$V^{\pi}$$

The Bellman operator for $$V^{\pi}$$ is defined as $$T^{\pi}$$ : $$R^{S}\rightarrow R^{S}$$ (maps value functions to value functions):

$(T^{\pi}V)(s) =\sum_{a\in A} \pi (a\vert s)\Big(R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) V(s')\Big)$

Using the Bellman operator, the Bellman expectation equation can be written compactly as:

$T^{\pi}V^{\pi} = V^{\pi}$
• $$V^{\pi}$$ is a fixed point of the Bellman operator $$T^{\pi}$$
• This is a linear equation in $$V^{\pi}$$ and $$T^{\pi}$$
• If $$0 < \gamma < 1$$, then $$T^{\pi}$$ is a contraction w.r.t. the maximum norm
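
The contraction property is what makes simple fixed-point iteration converge. A minimal sketch (reusing induced-MRP quantities of the kind computed above; all numbers are illustrative assumptions):

```python
import numpy as np

# Illustrative induced MRP (P^pi, R^pi); numbers are assumptions.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
R_pi = np.array([1.0, -1.0])
gamma = 0.9

# Fixed-point iteration: repeatedly apply T^pi(V) = R^pi + gamma * P^pi V.
# Since T^pi is a gamma-contraction in the max norm, the iterates converge
# to the unique fixed point V^pi from any initial guess.
V = np.zeros(2)
for _ in range(10_000):
    V_new = R_pi + gamma * P_pi @ V
    if np.max(np.abs(V_new - V)) < 1e-10:   # sup-norm stopping rule
        break
    V = V_new
print(V_new)
```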

#### Bellman operator for $$Q^{\pi}$$

The Bellman operator for $$Q^{\pi}$$ is defined as $$T^{\pi}$$ : $$R^{S\times A} \rightarrow R^{S\times A}$$ (maps action-value functions to action-value functions):

$(T^{\pi}Q)(s,a) = R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) \sum_{a'\in A} \pi(a'\vert s')Q(s',a')$

Using the Bellman operator, the Bellman expectation equation can be written compactly as:

$T^{\pi}Q^{\pi} = Q^{\pi}$

### Optimal Value Function and Optimal Policy

#### Optimal Value Function

The optimal state-value function $$V^*(s)$$ is the maximum value function over all policies

$V^{*}(s) = \max_{\pi}V^{\pi}(s)$

The optimal action-value function $$Q^*(s, a)$$ is the maximum action-value function over all policies

$Q^{*}(s,a) = \max_{\pi}Q^{\pi}(s,a)$
• The optimal value function specifies the best possible performance in the MDP
• An MDP is “solved” when we know the optimal value function

#### Optimal Policy

Value functions define a partial ordering over policies

$$\pi\geq\pi^\prime$$ if $$V^{\pi}(s) \geq V^{\pi^\prime}(s)$$, $$\forall s\in S$$

##### Theorem

For any Markov Decision Process

• There exists an optimal policy $$\pi^*$$ that is better than or equal to all other policies: $$\pi^*\geq\pi, \forall \pi$$
• All optimal policies achieve the optimal value function, $$V^{\pi^*}(s) = V^*(s)$$
• All optimal policies achieve the optimal action–value function, $$Q^{\pi^*}(s,a) = Q^*(s,a)$$
• There is always a deterministic optimal policy for any MDP

A deterministic optimal policy can be found by maximizing over $$Q^*(s,a)$$

$$\pi^*( a\vert s) = \begin{cases} 1& \text{if } a= \arg\max_{a\in A}Q^*(s,a)\\ 0& \text{otherwise} \end{cases}$$
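
In code, extracting this deterministic greedy policy is one argmax per state. A sketch assuming $$Q^*$$ is stored as an (states × actions) array (the table values are made up for illustration):

```python
import numpy as np

# Illustrative optimal action-value table Q*[s, a]; numbers are assumptions.
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.3]])

# Deterministic optimal policy: act greedily w.r.t. Q* in every state.
pi_star = np.argmax(Q_star, axis=1)   # pi_star[s] is the chosen action
print(pi_star)                        # -> [1 0]
```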

### Bellman Optimality Equation

#### Bellman Optimality Equation for $$V^*$$

$$V^*(s) = \max_{a}Q^*(s,a) = \max_{a}\Big\{ R(s,a)+ \gamma \sum_{s' \in S} P(s' \vert s,a) V^*(s') \Big\}$$

#### Bellman Optimality Equation for $$Q^*$$

$$\begin{split} Q^*(s,a) &= R(s,a)+ \gamma \sum_{s' \in S} P(s' \vert s,a) V^*(s')\\ &= R(s,a)+ \gamma \sum_{s' \in S} P(s' \vert s,a) \max_{a'}Q^*(s',a') \end{split}$$
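
The Bellman optimality equation suggests value iteration: repeatedly apply the optimality update until the values stop changing. A minimal sketch, using the same illustrative MDP shapes as above (all numbers are assumptions):

```python
import numpy as np

# Illustrative 2-state, 2-action MDP; all numbers are assumptions.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0,  0.0],                # R[s, a]
              [0.5, -1.0]])
gamma = 0.9

# Value iteration: V <- max_a { R(s,a) + gamma * sum_s' P(s'|s,a) V(s') }.
V = np.zeros(2)
for _ in range(10_000):
    Q = R + gamma * np.einsum('sat,t->sa', P, V)   # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V_new, np.argmax(Q, axis=1))   # V* and a greedy optimal policy
```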

### Principle of Optimality

The tail of an optimal policy is optimal for the "tail" problem: regardless of the initial state and of the decisions already taken, the remaining decisions of an optimal policy are an optimal policy for the remaining subproblem.

Note: the vertical bar in math expressions sometimes needs to be written with `\vert`.

See also: List of Greek letters and math symbols
