A stochastic process is an indexed collection of random variables \({X_t}\). e.g., time series of weekly demands for a product
A stochastic process \(X_t\) is said to be Markovian if and only if
\[P(X_{t+1}=j X_t=i,X_{t1}=k_{t1},X_{t2}=k_{t2},...X_1=k_1,X_0=k_0) = P(X_{t+1}=jX_0=k_0)\]“The future is independent of the past given the present”
A Markov process (or Markov chain) is a memoryless stochastic process, i.e., a sequence of random states \(s_1; s_2; : : :\) with the Markov property
A Markov process is a is a tuple \((S;P;\mu)\)
 S is a (finite) set of states

P is a state transition probability matrix, \(P_{ss'} = P(s'\vert s)\)
 a set of initial probabilities \(\mu_j^0 = P(X_0=i)\) for all i
A Markov reward process is a Markov process with values. A Markov Reward Process is a tuple \(S;P;R;\gamma; \mu\)
 S is a (finite) set of states

P is a state transition probability matrix, \(P_{ss'} = P(s'\vert s)\)
 R is a reward function, \(R_s = E[r \vert s]\)
 \(\gamma\) is a discount factor, \(\gamma \in[0,0]\)
 a set of initial probabilities \(\mu_j^0 = P(X_0=i)\) for all i
The return \(v_t\) is the total discounted reward from time–step t. \(v_t = r_{t+1}+\gamma r_{t+2}+\gamma r^2 r_{t+3}+ = \sum_{k=0}^{\infty}\gamma ^k r_{t+k+1}\)
The state value function V(s) of an MRP is the expected return starting from state s \(V(s) = E[v_{t}\vert s_{t}=s]\)
Bellman Equation
\[\begin{equation}\begin{split} V(s) = E[v_{t}\vert s_{t}=s] \\ = E[r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+ ...\vert s_{t}=s]\\ = E[r_{t+1}+\gamma (r_{t+2}+\gamma r_{t+3}+ ...) \vert s_{t}=s]\\ = E[r_{t+1}+\gamma v_{t+1}\vert s_{t}=s]\\ = E[r_{t+1}+\gamma V(s_{t+1})\vert s_{t}=s] \end{split}\end{equation}\]So, We get the Bellman Equation:
\[\begin{equation}\begin{split} V(s) = E[r+ \gamma V(s')\vert s]= R_{s}+\gamma \sum_{s'\in S} P_{ss'}V(s') \end{split}\end{equation}\]The Bellman equation can be expressed concisely using matrices.
\[V =R+\gamma PV\] \[\begin{bmatrix} V(1) \\ \vdots\\ V(n) \end{bmatrix} = \begin{bmatrix} R(1) \\ \vdots\\ R(n) \end{bmatrix} +\gamma \begin{bmatrix} P_{11} &\dots & P_{1n}\\ \vdots &\ddots &\vdots \\ P_{n1} &\dots & P_{nn} \end{bmatrix} \begin{bmatrix} V(1) \\ \vdots\\ V(n) \end{bmatrix}\]Solving the Bellman Equation
The Bellman equation is a linear equation. It can be solved directly \(V =R+\gamma PV\)
\[(1\gamma P)V =R\] \[V = (1\gamma P)^{1}R\]Computational complexity is \(O(n^3)\) for n states. Direct solution only possible for small MRPs There are many iterative methods for large MRPs,e.g.,
 Dynamic programming
 Monte–Carlo evaluation
 Temporal–Difference learning
Discrete–time Finite Markov Decision Processes (MDP)
A Markov decision process (MDP) is Markov reward process with decisions. It models an environment in which all states are Markov and time is divided into stages.
A Markov Process is a tuple \((S;A;P;R; \gamma; \mu)\)
 S is a (finite) set of states
 A is a (finite) set of actions
 P is a state transition probability matrix, \(P(s' \vert s; a)\)
 R is a reward function, \(R(s; a) = E[r \vert s; a]\)
 \(\gamma\) is a discount factor, \(\gamma \in [0,1]\)
 a set of initial probabilities \(\mu_i^0= P(X_0 = i)\) for all i
Goal is a scalar reward:goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (reward).
Policies: A policy, at any given point in time, decides which action the agent selects A policy fully defines the behavior of an agent Policies can be:
 Markovian \(\subset\) History–dependent
 Deterministic \(\subset\) Stochastic
 Stationary \(\subset\) Non–stationary
Stationary Stochastic Markovian Policies
A policy \(\pi\) is a distribution over actions given the state:
\(\pi (s,a) = P(a\vert s)\)
 MDP policies depend on the current state (not the history)
 i.e., Policies are stationary (time–independent)
 Given an MDP \(M= (S;A;P;R;\gamma;\mu)\) and a policy \(\pi\)
 The state sequence s1; s2; … is a Markov process \((S;P^{\pi};\gamma \mu)\)
 The state and reward sequence s1; r2; s2; … is a Markov reward process \((S;P^{\pi};R^{\pi};\gamma; \mu)\), where
Value Functions
Given a policy \(\pi\), it is possible to define the utility of each state: Policy Evaluation
 The state–value function \(V^{\pi}(s)\) of an MDP is the expected return starting from state s, and then following policy \(\pi\) \(V^{\pi}(s) = E_{\pi}[v_t\vert s_t=s]\)
For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state.
The action–value function \(Q^{\pi}(s; a)\) is the expected return starting from state s, taking action a, and then following policy \(\pi\) \(Q^{\pi}(s; a) = E_{\pi}[v_t\vert s_t=s,a_t=a]\)
Bellman Expectation Equation
\[\begin{equation}\begin{split} V^{\pi}(s) = E_{\pi}[r_{t+1}+ \gamma V^{\pi}(s_{t+1}) \vert s_t=s]\\ =\sum_{a\in A} \pi (a\vert s)(R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) V^{\pi}(s')) \end{split}\end{equation}\] \[\begin{equation}\begin{split} Q^{\pi}(s; a) = E_{\pi}[r_{t+1}+ \gamma Q^{\pi}(s_{t+1}; a_{t+1}) \vert s_t=s, a_t=a]\\ = R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) V^{\pi}(s') \\ = R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) \sum_{a'\in A} \pi(a'\vert s')Q^{\pi}(s',a') \end{split}\end{equation}\]The Bellman expectation equation can be expressed concisely using the induced MRP
\[V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}\]with direct solution
\[V^{\pi} = (1 \gamma P^{\pi})^{1} R^{\pi}\]Bellman operators for \(V^{\pi}\)
The Bellman operator for \(V^{\pi}\) is defined as \(T^{\pi}\) : \(R^{S}\rightarrow R^{S}\) (maps value functions to value functions):
\[(T^{\pi}V^{\pi})(s) =\sum_{a\in A} \pi (a\vert s)(R(s,a)+\gamma \sum_{s' \in S}P(s' \vert s,a) V^{\pi}(s'))\]Using Bellman operator, Bellman expectation equation can be compactly written as:
\[T^{\pi}V^{\pi} = V^{\pi}\] \(V^{\pi}\) is a fixed point of the Bellman operator \(T^{\pi}\)
 This is a linear equation in\(V^{\pi}\) and \(T^{\pi}\)
 If 0 <\(\gamma<1\) then \(T^{\pi}\) is a contraction w.r.t. the maximum norm
The Bellman operator for \(Q^{\pi}\) is defined as
\(T^{\pi}\) : \(R^{S\times A} \rightarrow R^{S\times A}\) (maps action–value functions to action–value functions):
\[(T^{\pi}Q^{\pi})(a,a) =\sum_{s' \in S}P(s' \vert s,a) \sum_{a'\in A} \pi(a'\vert s')Q^{\pi}(s',a')\]Using Bellman operator, Bellman expectation equation can be compactly written as:
\[T^{\pi}Q^{\pi} = Q^{\pi}\]Optimal Value Function and optical policy
Optimal Value Function
The optimal state–value function \(V^*(s)\) is the maximum value function over all policies
\[V^{*}(s) = \max_{\pi}V^{\pi}(s)\]The optimal action–value function \(^*(s; a)\) is the maximum action–value function over all policies
\[Q^{*}(s,a) = \max_{\pi}Q^{\pi}(s,a)\] The optimal value function specifies the best possible performance in the MDP
 An MDP is “solved” when we know the optimal value function
Optimal Policy
Value functions define a partial ordering over policies
\(\pi\geq\pi^\prime\) if \(V^{\pi}(s) \geq V^{\pi^\prime}(s)\), \(\forall s\in S\)
Theorem
For any Markov Decision Process
 There exists an optimal policy \(\pi^*\) that is better than or equal to all other policies \(\pi^*>\pi, \forall \pi\)
 All optimal policies achieve the optimal value function, \(V^{\pi^*}(s) = V^*(s)>\)
 All optimal policies achieve the optimal action–value function, \(Q^{\pi^*}(s,a) = Q^*(s,a)\)
 There is always a deterministic optimal policy for any MDP
A deterministic optimal policy can be found by maximizing over \(Q^*(s,a)\)
\[\begin{equation} \pi^*( a\vert s) = \begin{cases} 1& \text{if } a= arg \max_{a\in A}Q^*(s,a)\\ 0& \text{otherwise} \end{cases} \end{equation}\]Bellman Optimality Equation
Bellman Optimality Equation for \(V^*\)
\(V^*(s) = \max_{a}Q^*(s,a) = \max_{a}\{ R(s,a)+ \gamma \sum_{s^{\prime} \in S} P(s\prime \vert s,a) V^*(s\prime) \}\)
Bellman Optimality Equation for \(Q^*\)
\[\begin{equation}\begin{split} Q^*(s,a) = R(s,a)+ \gamma \sum_{s^{\prime} \in S} P(s\prime \vert s,a) V^*(s\prime)\\ = R(s,a)+ \gamma \sum_{s^{\prime} \in S} P(s\prime \vert s,a) \max_{a^\prime}Q^*(s\prime,a\prime) \end{split}\end{equation}\]Principle of optimality:
the tail of an optimal policy is optimal for the “tail” problem.（最优策略的尾策略对于尾问题也是最优的。或者说，不管初始状态和前面已经发生的，对于后续子问题,最优策略对剩余子问题的策略（tail policy）仍然是最优的）
数学符号 refer:
数学符号的竖线  有时需要用\vert 表示 
特殊符号 ‘要用\prime
List of Greek letters and math symbols
您的打赏是对我最大的鼓励！