Reinforcement Learning
Markov Decision Processes (MDPs)
An MDP is defined by the tuple (S, A, P, R, gamma):
- S: set of states
- A: set of actions
- P(s'|s, a): transition probability
- R(s, a, s'): reward function
- gamma in [0, 1): discount factor
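As a concrete sketch, the tuple can be stored as plain arrays. The two-state, two-action MDP below is hypothetical and the numbers are illustrative only:

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = transition probability; each P[s, a] must sum to 1
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2]   # state 0, action 0: mostly stay
P[0, 1] = [0.1, 0.9]   # state 0, action 1: mostly move to state 1
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.0, 1.0]

# R[s, a, s'] = reward for the transition (s, a) -> s'
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0       # reward 1 for landing in state 1

assert np.allclose(P.sum(axis=2), 1.0)  # valid transition kernel
```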
Key Quantities
Policy: pi(a|s) = probability of taking action a in state s.
State-value function:
V^pi(s) = E_pi[sum_{t=0}^{inf} gamma^t * R_t | S_0 = s]
Action-value function:
Q^pi(s, a) = E_pi[sum_{t=0}^{inf} gamma^t * R_t | S_0 = s, A_0 = a]
Optimal value functions: V*(s) = max_pi V^pi(s), Q*(s, a) = max_pi Q^pi(s, a).
Bellman Equations
Bellman Expectation Equation
V^pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V^pi(s')]
Q^pi(s,a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * sum_{a'} pi(a'|s') * Q^pi(s',a')]
Bellman Optimality Equation
V*(s) = max_a sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V*(s')]
Q*(s,a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * max_{a'} Q*(s',a')]
The optimal policy is greedy with respect to Q*: pi*(a|s) = 1 if a = argmax_{a'} Q*(s, a'), else 0.
Dynamic Programming
Requires complete knowledge of MDP (model-based).
Policy Iteration
Alternate between evaluation and improvement:
import numpy as np

def policy_iteration(P, R, gamma, n_states, n_actions):
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    while True:
        # Policy Evaluation: iteratively solve V^pi for the current policy
        V = np.zeros(n_states)
        for _ in range(1000):
            for s in range(n_states):
                a = pi[s]
                V[s] = sum(P[s, a, s_] * (R[s, a, s_] + gamma * V[s_])
                           for s_ in range(n_states))
        # Policy Improvement: act greedily with respect to V
        stable = True
        for s in range(n_states):
            old_action = pi[s]
            q = [sum(P[s, a, s_] * (R[s, a, s_] + gamma * V[s_])
                     for s_ in range(n_states))
                 for a in range(n_actions)]
            pi[s] = int(np.argmax(q))
            if pi[s] != old_action:
                stable = False
        if stable:
            break
    return pi, V
Value Iteration
Directly compute optimal value function:
V_{k+1}(s) = max_a sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V_k(s')]
Converges to V* as k -> infinity. Extract policy: pi(s) = argmax_a Q(s, a).
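The update above is short enough to implement directly. A minimal sketch, assuming P and R are numpy arrays indexed as P[s, a, s'] and R[s, a, s']:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1), V  # greedy policy and optimal values

# Tiny deterministic chain: action 1 in state 0 moves to absorbing state 1
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0  # action 0 stays in state 0
P[0, 1, 1] = 1.0  # action 1 moves to state 1
P[1, :, 1] = 1.0  # state 1 is absorbing
R = np.zeros((2, 2, 2))
R[0, 1, 1] = 1.0  # reward only for the 0 -> 1 transition
pi, V = value_iteration(P, R, gamma=0.9)
```

In this toy chain the only reward is on the 0 -> 1 transition, so V*(0) = 1, V*(1) = 0, and the greedy policy takes action 1 in state 0.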
Monte Carlo Methods
Learn from complete episodes without requiring model knowledge.
First-Visit MC
import numpy as np
from collections import defaultdict

def mc_prediction(env, policy, gamma, n_episodes):
    V = defaultdict(float)
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = generate_episode(env, policy)  # list of (s, a, r) tuples
        G = 0
        first_return = {}
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # Sweeping backward, later overwrites win, so the stored value
            # ends up being the return from the *first* visit to s
            first_return[s] = G
        for s, G in first_return.items():
            returns[s].append(G)
            V[s] = np.mean(returns[s])
    return V
MC Control: use epsilon-greedy policy, update Q(s,a) from returns.
Properties: unbiased but high variance. Requires complete episodes (episodic tasks only).
Temporal Difference Learning
Bootstrap: update value estimates using other estimates (no need for complete episodes).
TD(0)
V(s_t) <- V(s_t) + alpha * [r_t + gamma * V(s_{t+1}) - V(s_t)]
Here r_t + gamma * V(s_{t+1}) is the TD target; the bracketed difference (TD target minus current estimate) is the TD error, often written delta_t.
Biased but lower variance than MC. Can learn online, step by step.
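A sketch of the tabular TD(0) update; the (s, r, s_next, done) tuple format and the toy chain below are assumptions for illustration:

```python
import numpy as np

def td0_prediction(transitions, alpha=0.1, gamma=1.0, n_states=5):
    """Tabular TD(0) over a stream of (s, r, s_next, done) transitions."""
    V = np.zeros(n_states)
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])  # move toward the TD target
    return V

# Hypothetical 3-state chain: 0 -> 1 -> 2 (terminal), reward 1 on the last step
episode = [(0, 0.0, 1, False), (1, 1.0, 2, True)]
V = td0_prediction(episode * 200, alpha=0.1, gamma=1.0, n_states=3)
```

With gamma = 1 both non-terminal states should converge toward value 1, since the only return is the final reward.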
SARSA (On-Policy TD Control)
Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]
On-policy: updates use the action actually taken under the current policy. SARSA stands for (S, A, R, S', A').
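A sketch of tabular SARSA, assuming the same hypothetical env interface as the Q-learning code below (reset(), step() returning (s, r, done), and an n_actions attribute):

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_episodes, gamma=0.99, alpha=0.1, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.n_actions))

    def eps_greedy(s):
        if np.random.random() < epsilon:
            return np.random.randint(env.n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)  # action actually taken next (on-policy)
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```

The only difference from Q-learning is the bootstrap term: SARSA uses Q(s', a') for the action the policy actually takes, not max_{a'} Q(s', a').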
Q-Learning (Off-Policy TD Control)
Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [r_t + gamma * max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]
Off-policy: uses max over actions regardless of behavior policy. Converges to Q* under mild conditions (all state-action pairs visited infinitely often, learning rate decay).
import numpy as np
from collections import defaultdict

def q_learning(env, n_episodes, gamma=0.99, alpha=0.1, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                a = np.random.randint(env.n_actions)  # explore
            else:
                a = int(np.argmax(Q[s]))              # exploit
            s_next, r, done = env.step(a)
            # Off-policy update: bootstrap from the greedy next action
            Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q
MC vs TD Comparison
| Property    | Monte Carlo       | TD Learning        |
|-------------|-------------------|--------------------|
| Bias        | Unbiased          | Biased (bootstrap) |
| Variance    | High              | Lower              |
| Episodes    | Complete required | Step-by-step ok    |
| Convergence | To V^pi           | To V^pi (TD(0))    |
Deep Q-Network (DQN)
Use a neural network to approximate Q(s, a; theta):
Loss = E[(r + gamma * max_{a'} Q(s', a'; theta^-) - Q(s, a; theta))^2]
Key Innovations
- Experience replay: store transitions (s, a, r, s') in a buffer, sample mini-batches uniformly. Breaks correlation between consecutive samples.
- Target network: use a slowly-updated copy theta^- for the TD target. Prevents the moving-target problem. Updated periodically or via Polyak averaging: theta^- = tau*theta + (1-tau)*theta^-.
Extensions:
- Double DQN: use online network to select action, target network to evaluate: r + gamma * Q(s', argmax_{a'} Q(s', a'; theta); theta^-)
- Dueling DQN: decompose Q = V(s) + A(s, a) - mean(A)
- Prioritized replay: sample transitions proportional to TD error magnitude
- Rainbow: combines 6 improvements (Double, Dueling, Prioritized, n-step, distributional, noisy nets)
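The difference between the vanilla DQN target and the Double DQN target can be seen with plain arrays; q_online and q_target below are hypothetical Q-value outputs of the online and target networks for s':

```python
import numpy as np

gamma, r = 0.99, 1.0
q_online = np.array([2.0, 5.0, 3.0])   # online network Q(s', .; theta)
q_target = np.array([4.0, 1.0, 3.5])   # target network Q(s', .; theta^-)

# Vanilla DQN: max over the *target* network (prone to overestimation)
dqn_target = r + gamma * q_target.max()

# Double DQN: online network selects the action, target network evaluates it
a_star = int(np.argmax(q_online))
double_dqn_target = r + gamma * q_target[a_star]
```

Here the two targets disagree: the online network prefers action 1, which the target network rates much lower than its own max, so decoupling selection from evaluation tempers overestimation.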
Policy Gradient Methods
Directly optimize the policy pi(a|s; theta).
REINFORCE
grad J(theta) = E_pi[sum_t grad log pi(a_t|s_t; theta) * G_t]
where G_t = sum_{k=t}^{T} gamma^{k-t} r_k is the return from time t.
def reinforce(env, policy_net, optimizer, gamma, n_episodes):
    for _ in range(n_episodes):
        states, actions, rewards = collect_episode(env, policy_net)
        # Compute discounted returns G_t for every timestep
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        # Variance reduction: standardize returns (subtract mean, divide by std)
        returns = normalize(returns)
        # Policy gradient update: maximize sum_t log pi(a_t|s_t) * G_t
        loss = 0
        for s, a, G in zip(states, actions, returns):
            log_prob = policy_net.log_prob(s, a)
            loss -= log_prob * G
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
High variance. Use baselines (subtract V(s)) to reduce variance without introducing bias.
Advantage Actor-Critic (A2C)
Two networks: actor (policy) and critic (value function).
Advantage: A(s, a) = Q(s, a) - V(s) ~ r + gamma*V(s') - V(s) # TD error
Actor update: grad J = E[grad log pi(a|s; theta) * A(s, a)]
Critic update: minimize (r + gamma*V(s'; w) - V(s; w))^2
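The advantage-as-TD-error estimate vectorizes directly over a batch; the rewards, critic values, and done flags below are hypothetical:

```python
import numpy as np

gamma = 0.99
# One transition batch: rewards, critic values V(s) and V(s'), done flags
r      = np.array([1.0, 0.0, 1.0])
v_s    = np.array([0.5, 0.8, 0.2])
v_next = np.array([0.8, 0.2, 0.0])
done   = np.array([0.0, 0.0, 1.0])   # 1.0 where the episode ended

# Advantage estimate = TD error (no bootstrap past terminal states)
advantage = r + gamma * v_next * (1.0 - done) - v_s

# Critic loss: mean squared TD error; the actor scales -log pi by the advantage
critic_loss = np.mean(advantage ** 2)
```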
Proximal Policy Optimization (PPO)
Constrain policy updates to prevent destructive large steps:
Clipped surrogate objective:
r_t(theta) = pi(a_t|s_t; theta) / pi(a_t|s_t; theta_old)
L^CLIP = E[min(r_t * A_t, clip(r_t, 1-epsilon, 1+epsilon) * A_t)]
epsilon = 0.2 is typical. Simple, stable, and widely used (OpenAI's default algorithm).
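The clipped surrogate is a few lines of array code (a sketch; the log-probabilities and advantages passed in below are hypothetical inputs):

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    ratio = np.exp(log_prob_new - log_prob_old)          # r_t(theta)
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)
    # Pessimistic (min) of the clipped and unclipped surrogate, averaged
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))

# A ratio of 2 with positive advantage is clipped at 1 + epsilon = 1.2,
# so the objective gives no incentive to push the ratio further up
obj = ppo_clip_objective(np.log([2.0]), np.log([1.0]), np.array([1.0]))
```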
Soft Actor-Critic (SAC)
Maximum entropy RL: maximize reward + entropy of the policy:
J(pi) = sum_t E[r_t + alpha * H(pi(.|s_t))]
- Encourages exploration (high entropy = more random)
- alpha is automatically tuned
- Off-policy (uses replay buffer)
- Continuous action spaces. State-of-the-art for robotics.
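SAC targets continuous actions, but the entropy bonus itself is easy to illustrate with a discrete policy (the probabilities below are hypothetical):

```python
import numpy as np

def entropy(probs):
    # H(pi) = -sum_a pi(a) * log pi(a)
    return -np.sum(probs * np.log(probs))

alpha = 0.2
uniform = np.ones(4) / 4                      # maximally random policy
peaked  = np.array([0.97, 0.01, 0.01, 0.01])  # nearly deterministic policy

# The bonus alpha * H(pi) rewards stochastic policies, driving exploration
bonus_uniform = alpha * entropy(uniform)
bonus_peaked  = alpha * entropy(peaked)
```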
Model-Based RL
Learn a dynamics model and use it for planning.
MuZero
Learns three components instead of an explicit environment model:
- Representation: h(observation) -> hidden state
- Dynamics: g(hidden_state, action) -> next_hidden_state, reward
- Prediction: f(hidden_state) -> policy, value
Planning: use learned model with Monte Carlo Tree Search (MCTS) at each step.
Achieved superhuman performance in Go, chess, Atari, and other games without knowing the rules.
Dyna Architecture
Interleave real experience with simulated experience from the learned model:
1. Take real action, observe transition
2. Update Q from real experience
3. Update model from real experience
4. For k steps: sample from model, update Q from simulated experience
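The loop above can be sketched in tabular form (a Dyna-Q step with a deterministic model table; k and the learning parameters are illustrative):

```python
import numpy as np
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s_next, alpha=0.1, gamma=0.95, k=10):
    """Steps 2-4 of Dyna-Q for a real transition (s, a, r, s_next)."""
    # 2. Update Q from real experience
    Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])
    # 3. Update the (deterministic) model from real experience
    model[(s, a)] = (r, s_next)
    # 4. Planning: k simulated updates from remembered transitions
    seen = list(model.keys())
    for _ in range(k):
        ps, pa = seen[np.random.randint(len(seen))]
        pr, ps_next = model[(ps, pa)]
        Q[ps][pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps][pa])
```

The planning sweeps reuse stored transitions, so each unit of real experience produces k extra Q updates for free.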
Exploration Strategies
| Strategy             | Description                                    |
|----------------------|------------------------------------------------|
| Epsilon-greedy       | Random action with probability epsilon         |
| Boltzmann            | Sample from softmax(Q/tau)                     |
| UCB                  | Select argmax Q(a) + c*sqrt(ln(t)/N(a))        |
| Intrinsic motivation | Reward curiosity/novelty (ICM, RND)            |
| Thompson sampling    | Sample from posterior over Q-values            |
| Count-based          | Bonus inversely proportional to visit count    |
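The UCB row, for instance, can be sketched as follows (the exploration constant c and the counts are illustrative):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    # Unvisited actions get an infinite bonus, so they are tried first
    bonus = np.where(N > 0, c * np.sqrt(np.log(t) / np.maximum(N, 1)), np.inf)
    return int(np.argmax(Q + bonus))

Q = np.array([0.5, 0.4, 0.0])
N = np.array([10, 2, 0])     # visit counts per action
a = ucb_select(Q, N, t=12)   # the unvisited action wins the bonus
```

Once every action has been visited, the bonus shrinks as counts grow, so selection gradually shifts from exploration toward exploiting high Q estimates.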