Reinforcement Learning

Markov Decision Processes (MDPs)

An MDP is defined by the tuple (S, A, P, R, gamma):

  • S: set of states
  • A: set of actions
  • P(s'|s, a): transition probability
  • R(s, a, s'): reward function
  • gamma in [0, 1): discount factor

Key Quantities

Policy: pi(a|s) = probability of taking action a in state s.

State-value function:

V^pi(s) = E_pi[sum_{t=0}^{inf} gamma^t * R_t | S_0 = s]

Action-value function:

Q^pi(s, a) = E_pi[sum_{t=0}^{inf} gamma^t * R_t | S_0 = s, A_0 = a]

Optimal value functions: V*(s) = max_pi V^pi(s), Q*(s, a) = max_pi Q^pi(s, a).

Bellman Equations

Bellman Expectation Equation

V^pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V^pi(s')]

Q^pi(s,a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * sum_{a'} pi(a'|s') * Q^pi(s',a')]

Bellman Optimality Equation

V*(s) = max_a sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V*(s')]

Q*(s,a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * max_{a'} Q*(s',a')]

The optimal policy is greedy with respect to Q*: pi*(a|s) = 1 if a = argmax_{a'} Q*(s, a'), else 0.
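Extracting this greedy policy from a tabular Q* is a one-line argmax; a minimal sketch, assuming Q is stored as a NumPy array of shape (n_states, n_actions):

```python
import numpy as np

def greedy_policy(Q):
    """Deterministic greedy policy: pi*(s) = argmax_a Q*(s, a)."""
    return np.argmax(Q, axis=1)

# Toy Q-table: 3 states, 2 actions
Q = np.array([[1.0, 0.5],
              [0.2, 0.9],
              [0.0, 0.0]])
greedy_policy(Q)  # one action index per state
```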

Dynamic Programming

Requires complete knowledge of the MDP's dynamics (model-based).

Policy Iteration

Alternate between evaluation and improvement:

import numpy as np

def policy_iteration(P, R, gamma, n_states, n_actions):
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial policy

    while True:
        # Policy evaluation: iterate the Bellman expectation equation for V^pi
        V = np.zeros(n_states)
        for _ in range(1000):  # fixed number of sweeps; a tolerance check also works
            for s in range(n_states):
                a = pi[s]
                V[s] = sum(P[s, a, s_] * (R[s, a, s_] + gamma * V[s_])
                           for s_ in range(n_states))

        # Policy improvement: act greedily with respect to V
        stable = True
        for s in range(n_states):
            old_action = pi[s]
            q = [sum(P[s, a, s_] * (R[s, a, s_] + gamma * V[s_])
                     for s_ in range(n_states))
                 for a in range(n_actions)]
            pi[s] = int(np.argmax(q))
            if pi[s] != old_action:
                stable = False

        if stable:  # policy unchanged => converged to an optimal policy
            break
    return pi, V

Value Iteration

Directly compute optimal value function:

V_{k+1}(s) = max_a sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V_k(s')]

Converges to V* as k -> infinity. Extract the greedy policy: pi*(s) = argmax_a sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V*(s')].
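The update above maps directly onto a few lines of NumPy; a sketch assuming the transition tensor P and the reward tensor R both have shape (n_states, n_actions, n_states):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate V_{k+1}(s) = max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma V_k(s')]."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
        Q = np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # contraction => converged
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)  # optimal values and a greedy policy

# Two-state toy MDP: action 0 stays, action 1 toggles; reward 1 for landing in state 1
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1; P[0, 1, 1] = 1; P[1, 0, 1] = 1; P[1, 1, 0] = 1
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0
V, pi = value_iteration(P, R, gamma=0.9)
```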

Monte Carlo Methods

Learn from complete episodes without requiring model knowledge.

First-Visit MC

from collections import defaultdict

import numpy as np

def mc_prediction(env, policy, gamma, n_episodes):
    V = defaultdict(float)
    returns = defaultdict(list)

    for _ in range(n_episodes):
        episode = generate_episode(env, policy)  # list of (s, a, r) tuples
        G = 0.0
        first_return = {}

        # Walk the episode backwards, accumulating the discounted return G
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            first_return[s] = G  # overwritten until only the first visit's G remains

        for s, G in first_return.items():  # first-visit: one return per state per episode
            returns[s].append(G)
            V[s] = np.mean(returns[s])

    return V

MC Control: use epsilon-greedy policy, update Q(s,a) from returns.

Properties: unbiased but high variance. Requires complete episodes (episodic tasks only).

Temporal Difference Learning

Bootstrap: update value estimates using other estimates (no need for complete episodes).

TD(0)

V(s_t) <- V(s_t) + alpha * [r_t + gamma * V(s_{t+1}) - V(s_t)]
                           \_________________________________/
                            TD error = TD target - V(s_t), where TD target = r_t + gamma * V(s_{t+1})

Biased but lower variance than MC. Can learn online, step by step.
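The TD(0) rule fits in a one-line update per transition; a minimal tabular sketch, where `transitions` is an assumed list of (s, r, s_next, done) tuples collected under the policy being evaluated:

```python
import numpy as np

def td0_prediction(transitions, n_states, gamma=0.99, alpha=0.1):
    """Tabular TD(0): V(s) += alpha * (r + gamma * V(s') - V(s))."""
    V = np.zeros(n_states)
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * V[s_next])  # TD target
        V[s] += alpha * (target - V[s])  # step toward the target by the TD error
    return V
```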

SARSA (On-Policy TD Control)

Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]

On-policy: updates use the action actually taken under the current policy. SARSA stands for (S, A, R, S', A').
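In code, the only difference from Q-learning's update is that the target bootstraps from the action a' actually chosen in s'; a sketch of the single-step update for a tabular Q array:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy update: the target uses Q(s', a') for the action a' actually taken."""
    target = r + (0.0 if done else gamma * Q[s_next, a_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```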

Q-Learning (Off-Policy TD Control)

Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [r_t + gamma * max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]

Off-policy: uses max over actions regardless of behavior policy. Converges to Q* under mild conditions (all state-action pairs visited infinitely often, learning rate decay).

from collections import defaultdict

import numpy as np

def q_learning(env, n_episodes, gamma=0.99, alpha=0.1, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(env.n_actions))

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                a = np.random.randint(env.n_actions)  # explore
            else:
                a = np.argmax(Q[s])                   # exploit

            s_next, r, done = env.step(a)  # assumed interface: (next state, reward, done)
            # Off-policy target: bootstrap from the greedy action in s_next
            Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])
            s = s_next

    return Q

MC vs TD Comparison

| Property    | Monte Carlo       | TD Learning        |
|-------------|-------------------|--------------------|
| Bias        | Unbiased          | Biased (bootstrap) |
| Variance    | High              | Lower              |
| Episodes    | Complete required | Step-by-step OK    |
| Convergence | To V^pi           | To V^pi (TD(0))    |

Deep Q-Network (DQN)

Use a neural network to approximate Q(s, a; theta):

Loss = E[(r + gamma * max_{a'} Q(s', a'; theta^-) - Q(s, a; theta))^2]

Key Innovations

  1. Experience replay: store transitions (s, a, r, s') in buffer, sample mini-batches uniformly. Breaks correlation between consecutive samples.

  2. Target network: use a slowly-updated copy theta^- for the TD target. Prevents moving target problem. Updated periodically or via Polyak averaging: theta^- = tau*theta + (1-tau)*theta^-.

  3. Extensions:

    • Double DQN: use online network to select action, target network to evaluate: r + gamma * Q(s', argmax_{a'} Q(s', a'; theta); theta^-)
    • Dueling DQN: decompose Q = V(s) + A(s, a) - mean(A)
    • Prioritized replay: sample transitions proportional to TD error magnitude
    • Rainbow: combines 6 improvements (Double, Dueling, Prioritized, n-step, distributional, noisy nets)
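The first two innovations are small pieces of machinery; a minimal sketch of a uniform replay buffer and a Polyak target update, with network parameters represented as plain lists of floats for illustration:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample decorrelated mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions evicted automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # uniform, without replacement

def polyak_update(target_params, online_params, tau=0.005):
    """theta^- <- tau * theta + (1 - tau) * theta^-, elementwise."""
    return [tau * p + (1 - tau) * tp for tp, p in zip(target_params, online_params)]
```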

Policy Gradient Methods

Directly optimize the policy pi(a|s; theta).

REINFORCE

grad J(theta) = E_pi[sum_t grad log pi(a_t|s_t; theta) * G_t]

where G_t = sum_{k=t}^{T} gamma^{k-t} r_k is the return from time t.

def reinforce(env, policy_net, optimizer, gamma, n_episodes):
    for _ in range(n_episodes):
        states, actions, rewards = collect_episode(env, policy_net)

        # Compute discounted returns G_t, working backwards through the episode
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = normalize(returns)  # variance reduction: subtract mean, divide by std

        # Policy gradient update: ascend E[sum_t log pi(a_t|s_t) * G_t]
        loss = 0
        for s, a, G in zip(states, actions, returns):
            log_prob = policy_net.log_prob(s, a)
            loss -= log_prob * G  # negated because the optimizer minimizes

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

High variance. Use baselines (subtract V(s)) to reduce variance without introducing bias.

Advantage Actor-Critic (A2C)

Two networks: actor (policy) and critic (value function).

Advantage: A(s, a) = Q(s, a) - V(s) ~ r + gamma*V(s') - V(s)   # TD error

Actor update:  grad J = E[grad log pi(a|s; theta) * A(s, a)]
Critic update: minimize (r + gamma*V(s'; w) - V(s; w))^2
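The advantage used by both updates is just the one-step TD error; a small sketch, where `values` is an assumed array of critic estimates with one extra bootstrap entry for the final state:

```python
import numpy as np

def td_advantages(rewards, values, gamma=0.99):
    """One-step advantages A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has length len(rewards) + 1: it includes the bootstrap value of
    the final state (use 0.0 when that state is terminal).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]
```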

Proximal Policy Optimization (PPO)

Constrain policy updates to prevent destructive large steps:

Clipped surrogate objective:

r_t(theta) = pi(a_t|s_t; theta) / pi(a_t|s_t; theta_old)

L^CLIP = E[min(r_t * A_t,  clip(r_t, 1-epsilon, 1+epsilon) * A_t)]

epsilon = 0.2 typical. Simple, stable, widely used (OpenAI default).
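The clipped objective is straightforward to compute from log-probabilities and advantages; a NumPy sketch that returns the negated objective (i.e. a loss to minimize):

```python
import numpy as np

def ppo_clip_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate: mean of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    ratio = np.exp(np.asarray(log_probs) - np.asarray(old_log_probs))  # r_t(theta)
    adv = np.asarray(advantages)
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * adv
    return -np.mean(np.minimum(ratio * adv, clipped))
```

Note how a ratio far outside [1 - epsilon, 1 + epsilon] contributes no extra gradient incentive: the clipped term caps the objective.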

Soft Actor-Critic (SAC)

Maximum entropy RL: maximize reward + entropy of the policy:

J(pi) = sum_t E[r_t + alpha * H(pi(.|s_t))]
  • Encourages exploration (high entropy = more random)
  • alpha is automatically tuned
  • Off-policy (uses replay buffer)
  • Designed for continuous action spaces; among the strongest methods for robotics control.
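The entropy bonus H(pi(.|s)) is the Shannon entropy of the action distribution; for a discrete policy it is simply (illustrative sketch):

```python
import numpy as np

def entropy(probs):
    """Policy entropy H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s)."""
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs + 1e-12))  # epsilon guards log(0)
```

A uniform policy maximizes this bonus, so the alpha-weighted term pushes against premature determinism.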

Model-Based RL

Learn a dynamics model and use it for planning.

MuZero

Learns three components without explicit environment model:

  1. Representation: h(observation) -> hidden state
  2. Dynamics: g(hidden_state, action) -> next_hidden_state, reward
  3. Prediction: f(hidden_state) -> policy, value

Planning: use learned model with Monte Carlo Tree Search (MCTS) at each step.

Achieved superhuman performance in Go, chess, Atari, and other games without knowing the rules.

Dyna Architecture

Interleave real experience with simulated experience from the learned model:

1. Take real action, observe transition
2. Update Q from real experience
3. Update model from real experience
4. For k steps: sample from model, update Q from simulated experience

Exploration Strategies

| Strategy             | Description                                  |
|----------------------|----------------------------------------------|
| Epsilon-greedy       | Random action with probability epsilon       |
| Boltzmann            | Sample from softmax(Q/tau)                   |
| UCB                  | Select argmax Q(a) + c*sqrt(ln(t)/N(a))      |
| Intrinsic motivation | Reward curiosity/novelty (ICM, RND)          |
| Thompson sampling    | Sample from posterior over Q-values          |
| Count-based          | Bonus inversely proportional to visit count  |
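The first three strategies are one-liners over a Q-table; a sketch of epsilon-greedy, Boltzmann, and UCB selection, assuming `counts` tracks per-action visit counts and `rng` is a NumPy Generator:

```python
import numpy as np

def select_action(Q_s, counts, t, rng, strategy="epsilon",
                  epsilon=0.1, tau=1.0, c=2.0):
    """Pick an action from the Q-values Q_s for one state."""
    if strategy == "epsilon":       # random with probability epsilon, else greedy
        if rng.random() < epsilon:
            return int(rng.integers(len(Q_s)))
        return int(np.argmax(Q_s))
    if strategy == "boltzmann":     # sample from softmax(Q / tau)
        logits = Q_s / tau
        p = np.exp(logits - np.max(logits))
        p /= p.sum()
        return int(rng.choice(len(Q_s), p=p))
    if strategy == "ucb":           # optimism bonus for rarely tried actions
        bonus = c * np.sqrt(np.log(t) / np.maximum(counts, 1e-9))
        return int(np.argmax(Q_s + bonus))
    raise ValueError(strategy)
```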