Robot Perception and Control

Reinforcement Learning

Last updated: Jul 25, 2024
Kashu Yamazaki
kyamazak@andrew.cmu.edu

Applications of Reinforcement Learning

Game Agents

#center

Kashu Yamazaki, 2024

Alignment of LLMs

#center

Kashu Yamazaki, 2024

Robot Control

#center

#center
#center

Kashu Yamazaki, 2024

Basics of Reinforcement Learning

Expectation

Expectation is the probability-weighted average of all possible values.

Let $X$ be a random variable and $g$ be any function.

  • For a continuous distribution, the expectation of $g(X)$ is:

    $$\mathbb{E}[g(X)] = \int_{\mathcal{X}} g(x)\, f(x)\, dx$$

    where $f(x)$ is the probability density function of $X$.

  • For a discrete distribution, the expectation of $g(X)$ is:

    $$\mathbb{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x)\, p(x)$$

    where $p(x)$ is the probability mass function of $X$ and $\mathcal{X}$ is the support (domain) of $X$.
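
As a quick numerical check, a minimal Python sketch (the fair-die example is illustrative) that evaluates a discrete expectation exactly and by sampling:

```python
import random

# E[g(X)] for a fair six-sided die with g(x) = x^2
support = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in support}          # probability mass function p(x)
g = lambda x: x ** 2

exact = sum(g(x) * pmf[x] for x in support)                               # probability-weighted average
estimate = sum(g(random.choice(support)) for _ in range(100_000)) / 100_000

print(exact, estimate)   # both are close to 15.17
```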

Kashu Yamazaki, 2024

Expectation and Independence

Let $X$ and $Y$ be independent random variables. Then, the two random variables are mean independent, and for any functions $g$ and $h$:

$$\mathbb{E}[g(X)\, h(Y)] = \mathbb{E}[g(X)]\; \mathbb{E}[h(Y)]$$

  • Let $X$ and $Y$ be independent random variables. Then, $g(X)$ and $h(Y)$ are also independent for any functions $g$ and $h$.
Kashu Yamazaki, 2024

Expectation and Inequalities

Chebychev’s Inequality: Let $X$ be a random variable and let $g$ be a nonnegative function. Then, for any positive real number $r$,

$$P\big(g(X) \geq r\big) \leq \frac{\mathbb{E}[g(X)]}{r}$$

  • when $g(x) = |x|$, it is called Markov’s inequality.

Jensen’s Inequality: Let $X$ be a random variable with $\mathbb{E}[|X|] < \infty$. If $g$ is a convex function, then:

$$g\big(\mathbb{E}[X]\big) \leq \mathbb{E}[g(X)]$$

  • for a concave function, the inequality is reversed.
Kashu Yamazaki, 2024

Markov Property (MP)

A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present values) depends only upon the present state.
The Markov property is defined by the following condition:

$$P(s_{t+1} \mid s_t, s_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t)$$

  • memoryless property of a stochastic process
Kashu Yamazaki, 2024

Markov Decision Process (MDP)

A standard discrete-time MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, r, P, \rho_0)$ where:

  • $\mathcal{S}$ is the set of all valid states (state space)
  • $\mathcal{A}$ is the set of all valid actions (action space)
  • $r$ is the reward function, with $r_t = r(s_t, a_t, s_{t+1})$ (or $c$: a task cost function)
  • $P(s' \mid s, a)$ is the transition probability function (transition model)
  • $\rho_0$ is the starting state distribution
Kashu Yamazaki, 2024

Partially Observable MDP (POMDP)

When all states are observable such that $o_t = s_t$, this can be considered a Markov Decision Process (MDP). When there is unobservable information, such as external forces or full terrain information in robotics, the dynamics are modeled as a Partially Observable Markov Decision Process (POMDP): the agent does not have direct access to the state $s_t$ but receives an observation $o_t$ that is determined by the state $s_t$ (and, in general, the previous action $a_{t-1}$).

This is often overcome by constructing a belief state from a history of observations in an attempt to capture the full state. In deep reinforcement learning, this is frequently done by stacking a sequence of previous observations [1] or by using architectures which can compress past information such as Recurrent Neural Networks (RNNs) or Temporal Convolutional Networks (TCNs).
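
A minimal sketch of the observation-stacking idea; the Gym-style reset()/step() interface and the wrapper name are illustrative assumptions:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Approximate a belief state by stacking the last k observations."""

    def __init__(self, env, k=4):
        self.env, self.k = env, k
        self.frames = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        for _ in range(self.k):                        # pad the history with the first observation
            self.frames.append(obs)
        return np.concatenate(list(self.frames), axis=0)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)                        # the oldest frame is dropped automatically
        return np.concatenate(list(self.frames), axis=0), reward, done, info
```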

Kashu Yamazaki, 2024

constrained Markov Decision Process (cMDP)

Another challenge is to ensure the safety of the robot during the entire learning process. Safety in RL can be formulated as a constrained Markov Decision Process (cMDP), which can be solved by the Lagrangian relaxation procedure.

Kashu Yamazaki, 2024

Reinforcement Learning

#center

At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take based on a policy $\pi(a \mid s)$. After acting, the agent receives a reward $r_t = r(s_t, a_t)$ and the next state $s_{t+1}$. The environment changes when the agent acts on it, but may also change on its own. The state usually follows a discrete-time Markov Decision Process (MDP).

The objective of the agent is to learn a policy that maximizes the return (or total discounted reward), or an objective function related to the return.

Kashu Yamazaki, 2024

States and Action Space

The environment is fully observable when the agent is able to observe the complete state of the environment. The environment is partially observable when the agent is able to observe only a partial state of the environment. We highlight the difference between state and observation as:

state ($s$): a complete description of the state of the world.
observation ($o$): a partial description of a state, which may omit information.

The set of all valid actions in a given environment is often called the action space.
An action space is either discrete (e.g., Atari and Go) or continuous (e.g., robot control).

Kashu Yamazaki, 2024

Policies

A policy is a rule used by an agent to decide what actions to take. It can be deterministic ($a = \mu(s)$) or stochastic ($a \sim \pi(\cdot \mid s)$).

Deterministic Policies: a function $\mu : \mathcal{S} \rightarrow \mathcal{A}$ that maps the set of states of the environment to the set of actions.

Stochastic Policies: can be represented as a family of conditional probability distributions $\pi(\cdot \mid s)$. For a fixed state $s$, $\pi(\cdot \mid s)$ is a (possibly distinct) conditional probability distribution over actions.

  • categorical policies can be used in discrete action spaces.
  • diagonal Gaussian policies are used in continuous action spaces.
Kashu Yamazaki, 2024

Categorical Policies

A categorical policy is like a classifier over discrete actions that returns softmax probabilities of each action.

  • Sampling: given the probabilities for each action, an action can be sampled via torch.multinomial or the sample() method of the Categorical distribution in PyTorch.

  • Log-Likelihood: denote the last layer of probabilities as $P_\theta(s)$. We can treat the actions as indices for $P_\theta(s)$. The log-likelihood for an action $a$ can then be obtained by indexing into the vector:

    $$\log \pi_\theta(a \mid s) = \log \left[P_\theta(s)\right]_a$$
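
A minimal PyTorch sketch of sampling and log-likelihood for a categorical policy, wired into a one-step REINFORCE-style update; the tiny network, state, and scalar reward are illustrative placeholders:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: 4-dimensional observations, 2 discrete actions.
policy_network = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))
state = torch.randn(4)

probs = policy_network(state)                   # softmax probabilities over actions
m = torch.distributions.Categorical(probs)      # categorical policy pi_theta(. | s)
action = m.sample()                             # Sampling: a ~ pi_theta(. | s)
log_prob = m.log_prob(action)                   # Log-likelihood: log pi_theta(a | s)

reward = 1.0                                    # placeholder reward from the environment
loss = -log_prob * reward                       # surrogate loss for a single-step update
loss.backward()                                 # gradients flow into policy_network
```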

Kashu Yamazaki, 2024

Diagonal Gaussian Policies

A multivariate Gaussian distribution is described by a mean vector $\mu$ and a covariance matrix $\Sigma$. A diagonal Gaussian distribution is a special case where the covariance matrix only has entries on the diagonal. As a result, we can represent it by a vector of standard deviations.

mean vector: A diagonal Gaussian policy always has a neural network that maps from observations to mean actions $\mu_\theta(s)$.
covariance matrix: typically represented as standalone parameters $\log \sigma$ or a neural network that maps from states to log standard deviations $\log \sigma_\theta(s)$, which may optionally share some layers with the mean network.

Note that in both cases of the covariance matrix, log standard deviations are used instead of standard deviations directly. This is because log stds are free to take on any values in $(-\infty, \infty)$, while stds must be nonnegative.

Kashu Yamazaki, 2024

Diagonal Gaussian Policies

  • Sampling: given the mean action $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$, and a vector of noise $z \sim \mathcal{N}(0, I)$ from a spherical Gaussian, an action sample can be computed with

    $$a = \mu_\theta(s) + \sigma_\theta(s) \odot z.$$

    This can be done by using the rsample() method of the Normal distribution in PyTorch.

  • Log-Likelihood: the log-likelihood of a $k$-dimensional action $a$, for a diagonal Gaussian with mean $\mu = \mu_\theta(s)$ and standard deviation $\sigma = \sigma_\theta(s)$, is given by:

    $$\log \pi_\theta(a \mid s) = -\frac{1}{2}\left(\sum_{i=1}^{k}\left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i\right) + k \log 2\pi\right)$$
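
A minimal PyTorch sketch of sampling and log-likelihood for a diagonal Gaussian policy; the small mean network, standalone log-std parameter, and observation are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: 8-dimensional observations, 2-dimensional actions.
mean_network = nn.Linear(8, 2)                   # maps observations to mean actions mu_theta(s)
log_std = nn.Parameter(-0.5 * torch.ones(2))     # standalone log standard deviations
obs = torch.randn(8)

mu = mean_network(obs)                           # mean action
std = torch.exp(log_std)                         # std is always positive
dist = torch.distributions.Normal(mu, std)

action = dist.rsample()                          # reparameterized sample: mu + std * z, z ~ N(0, I)
log_prob = dist.log_prob(action).sum(-1)         # sum over dimensions for a diagonal Gaussian
```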

Kashu Yamazaki, 2024

Gradient Estimation

It is not possible to directly backpropagate through random samples. There are two main methods for creating surrogate functions that allow backpropagation through random samples [1].

Score function estimator (likelihood ratio estimator/ REINFORCE): the partial derivative of the log-likelihood function commonly seen as the basis for policy gradient methods. When the probability density function is differentiable with respect to its parameters, we only need sample() and log_prob() methods.

Pathwise derivative estimator: commonly seen in the reparameterization trick in variational autoencoders. The parameterized random variable can be constructed via a parameterized deterministic function of a parameter-free random variable. This can be done by using the rsample() method.
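
A minimal PyTorch sketch contrasting the two estimators on a toy objective E[x^2] with x ~ N(mu, 1); the toy objective and variable names are illustrative:

```python
import torch

mu = torch.zeros(1, requires_grad=True)
dist = torch.distributions.Normal(mu, torch.ones(1))

# Score function (REINFORCE) estimator: only needs sample() and log_prob().
x = dist.sample()                        # non-differentiable sample
f = (x ** 2).detach()                    # treat f(x) as a constant weight
(dist.log_prob(x) * f).sum().backward()  # gradient flows through log_prob only
print(mu.grad)

# Pathwise derivative estimator: reparameterization via rsample().
mu.grad = None
x = dist.rsample()                       # x = mu + eps, differentiable in mu
(x ** 2).sum().backward()                # gradient flows through the sample itself
print(mu.grad)
```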

Kashu Yamazaki, 2024

Trajectories

A trajectory $\tau$ is a sequence of states and actions in the world:

$$\tau = (s_0, a_0, s_1, a_1, \dots)$$

where the first state of the world, $s_0$, is randomly sampled from the start-state distribution $\rho_0$: $s_0 \sim \rho_0(\cdot)$.

Trajectories are also frequently called episodes or rollouts.

State transitions (what happens to the world between the state at time $t$, $s_t$, and the state at $t+1$, $s_{t+1}$) are governed by the natural laws of the environment, and depend on only the most recent action, $a_t$. They can be either deterministic ($s_{t+1} = f(s_t, a_t)$) or stochastic ($s_{t+1} \sim P(\cdot \mid s_t, a_t)$).

Kashu Yamazaki, 2024

Reward and Return

The reward function $r_t = r(s_t, a_t, s_{t+1})$ depends on the current state of the world, the action just taken, and the next state of the world. The goal of the agent is to maximize some notion of cumulative reward over a trajectory $\tau$.

finite-horizon undiscounted return: the sum of rewards obtained in a fixed window of $T$ steps.

$$R_t = r_{t+1} + r_{t+2} + \dots + r_T = \sum_{k=0}^{T-t-1} r_{t+k+1}$$

infinite-horizon discounted return: the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they’re obtained, using a discount factor $0 \leq \gamma \leq 1$:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

we can also represent this in a recursive way:

$$R_t = r_{t+1} + \gamma R_{t+1}$$
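
A minimal Python sketch that computes the discounted return for every time step of one trajectory using the recursive form; the reward list and discount factor are illustrative:

```python
def discounted_returns(rewards, gamma=0.99):
    """R_t = r_{t+1} + gamma * R_{t+1}, computed backwards over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # recursive form of the discounted return
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # [2.62, 1.8, 2.0]
```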

Kashu Yamazaki, 2024

Reinforcement Learning

Suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a $T$-step trajectory is:

$$P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$$

The expected return, denoted by $J(\pi)$, is then:

$$J(\pi) = \int_\tau P(\tau \mid \pi)\, R(\tau) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right]$$

The central optimization problem in RL is to find the policy that maximizes the expected return $J(\pi)$:

$$\pi^* = \arg\max_\pi J(\pi)$$

Kashu Yamazaki, 2024

Model-free vs Model-based RL

The difference between model-free and model-based RL is whether the agent has access to (or learns) a model of the environment. A model is usually a function that predicts state transitions or rewards.

With collected data $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}$, the problem setup is:

Model-free: learn the policy directly or indirectly from the data ($\mathcal{D} \rightarrow \pi$)

Model-based: learn a model, then use it to learn or improve a policy ($\mathcal{D} \rightarrow f \rightarrow \pi$)

  • substantial improvement in sample efficiency.
  • ground-truth model of the environment is usually not available to the agent.
Kashu Yamazaki, 2024

Model-Free Reinforcement Learning

Model-Free Reinforcement Learning

Value-based methods: methods in this family learn an approximator $Q_\theta(s, a)$ for the optimal action-value function $Q^*(s, a)$.

  • Typically use an objective function based on the Bellman equation.
  • optimization is almost always performed off-policy: each update can use data collected at any point during training (substantially more sample efficient).

Policy-based methods: methods in this family represent a policy explicitly as $\pi_\theta(a \mid s)$.

  • usually involves learning an approximator $V_\phi(s)$ for the on-policy value function.
  • optimization is almost always performed on-policy: each update only uses data collected while acting according to the most recent policy.

Trade-offs: policy optimization methods are principled, in the sense that you directly optimize for the thing you want; Q-learning methods, when they do work, gain the advantage of being substantially more sample efficient because they can reuse data more effectively.
Kashu Yamazaki, 2024

Value-based methods

Value Functions

Action-value function $Q^\pi(s, a)$: expected value of the return that can be obtained when selecting action $a$ in state $s$ and then acting according to a policy $\pi$:

$$Q^\pi(s_t, a_t) = r_t + \mathbb{E}_{\pi}\left[\sum_{k=1}^{\infty} r\left(s_{t+k}, a_{t+k}\right)\right]$$

  • $Q^\pi(s, a)$ evaluates how good it is for an agent to pick action $a$ while being in state $s$.

(State) Value function $V^\pi(s)$: expected value of the return that can be obtained when acting according to a policy $\pi$ from state $s$:

$$V^\pi(s_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} r\left(s_{t+k}, a_{t+k}\right)\right] = \mathbb{E}_\pi\left[Q^\pi(s_t, a_t)\right]$$

  • For a fixed policy $\pi$, $V^\pi(s)$ evaluates how good the situation is in state $s$.
  • Averaged over start states, $V^\pi$ evaluates how good the policy $\pi$ is.
Kashu Yamazaki, 2024

Optimal Value Functions

Optimal Action-Value Function $Q^*(s, a)$: expected value of the return that can be obtained when selecting action $a$ in state $s$ and then acting according to an optimal policy:

$$Q^*(s, a) = \max_\pi \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]$$

  • $Q^*(s, a)$ evaluates how good it is for an agent to pick action $a$ while being in state $s$, no matter what the policy is.

Optimal (State) Value function $V^*(s)$: expected value of the return that can be obtained when acting according to an optimal policy from state $s$:

$$V^*(s) = \max_\pi \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s\right]$$

Kashu Yamazaki, 2024

Advantage Functions

The advantage function $A^\pi(s, a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot \mid s)$, assuming you act according to $\pi$ forever after (i.e., the relative advantage of that action over the others on average). The advantage function is defined by:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

  • the advantage function is crucially important to policy gradient methods.
  • note that $\mathbb{E}_{a \sim \pi}\left[A^\pi(s, a)\right] = 0$.
Kashu Yamazaki, 2024

The Optimal Q-Function and the Optimal Action

By definition, $Q^*(s, a)$ gives the expected return for starting in state $s$, taking (arbitrary) action $a$, and then acting according to the optimal policy forever after.
The optimal policy in $s$ will select whichever action maximizes the expected return from starting in $s$. As a result, if we have $Q^*$, we can directly obtain the optimal action, $a^*(s)$, via:

$$a^*(s) = \arg\max_a Q^*(s, a)$$

Value-based methods learn an approximator $Q_\theta(s, a)$ for the optimal action-value function, and the actions taken by the agent are given by:

$$a(s) = \arg\max_a Q_\theta(s, a)$$

Kashu Yamazaki, 2024

What is a good policy?

greedy policy $\pi(s) = \arg\max_a Q(s, a)$: if the optimal action-value function $Q^*$ is known, this is the optimal policy.

$\epsilon$-greedy policy: a stochastic policy that selects a random action with probability $\epsilon$ and the action with the highest expected return otherwise.

Boltzmann policy (softmax policy): a policy that samples actions according to a Boltzmann distribution (also called a Gibbs distribution) with temperature $\tau$:

$$\pi(a \mid s) = \frac{\exp\left(Q(s, a) / \tau\right)}{\sum_{a'} \exp\left(Q(s, a') / \tau\right)}$$

  • When $\tau \rightarrow 0$, the Boltzmann policy becomes a greedy policy (see the sketch below).
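
A minimal PyTorch sketch of the $\epsilon$-greedy and Boltzmann action-selection rules over a vector of Q-values; the function names and default hyperparameters are illustrative:

```python
import torch

def epsilon_greedy(q_values, epsilon=0.1):
    """Random action with probability epsilon, otherwise argmax_a Q(s, a)."""
    if torch.rand(()) < epsilon:
        return int(torch.randint(len(q_values), ()))
    return int(torch.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action from softmax(Q(s, .) / temperature); temperature -> 0 approaches greedy."""
    probs = torch.softmax(q_values / temperature, dim=-1)
    return int(torch.distributions.Categorical(probs).sample())

q = torch.tensor([0.1, 0.5, 0.2])
print(epsilon_greedy(q), boltzmann(q, temperature=0.5))
```
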
Kashu Yamazaki, 2024

Bellman Equation

All four of the value functions obey special self-consistency equations called Bellman equations. An intuition behind Bellman equation is:

The value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.

The Bellman equations for the on-policy value functions are:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[r(s, a) + \gamma V^\pi(s')\right], \qquad Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^\pi(s', a')\right]\right]$$

The Bellman equations for the optimal value functions are:

$$V^*(s) = \max_a \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma V^*(s')\right], \qquad Q^*(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma \max_{a'} Q^*(s', a')\right]$$

Kashu Yamazaki, 2024

Optimal Value Functions

The backup diagrams below show graphically the spans of future states and actions considered in the Bellman optimality equations for $V^*$ and $Q^*$.

#center

The arcs at the agent’s choice points represent that the maximum over that choice is taken rather than the expected value given some policy.

Kashu Yamazaki, 2024

Temporal Difference (TD) Learning

A method to update the estimated action-value function with the experience obtained at time $t+1$:

$$Q(s_t, a_t) \longleftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$$

TD Error ($\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$): the difference of the expected return between time $t$ and time $t+1$.
TD Target ($r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$): the teacher signal in TD learning.

  • SARSA (on-policy TD control): $Q(s_t, a_t) \longleftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$
  • Q-Learning (off-policy TD control) [1]: $Q(s_t, a_t) \longleftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$ (see the sketch below)
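
A minimal sketch of the two TD targets; next_q_values stands for the row Q(s_{t+1}, ·) of a table or network output, and all names are illustrative:

```python
import torch

def sarsa_target(reward, next_q_values, next_action, gamma=0.99):
    """On-policy target: bootstraps with the action actually chosen by the current policy."""
    return reward + gamma * next_q_values[next_action]

def q_learning_target(reward, next_q_values, gamma=0.99):
    """Off-policy target: bootstraps with the greedy action, whatever the behavior policy did."""
    return reward + gamma * next_q_values.max()

# TD update with step size alpha (tabular form):
#   Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
q_next = torch.tensor([0.2, 0.8, 0.5])
print(sarsa_target(1.0, q_next, next_action=0), q_learning_target(1.0, q_next))
```
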
Kashu Yamazaki, 2024

TD($\lambda$)

forward view: use future experience to update the current value function. Note that there will be a delay in update of value to obtain the future experience.

  • n-step (truncated) return: utilization of multi-step experience to obtain the target

    $$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$$

  • $\lambda$-return: any average of n-step returns can be used as the target

    $$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

    • TD learning: TD(0), $V(s_t) \longleftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$
    • Monte Carlo Methods: TD(1), $V(s_t) \longleftarrow V(s_t) + \alpha\left[\left(r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T\right) - V(s_t)\right]$
Kashu Yamazaki, 2024

TD($\lambda$)

#center

Kashu Yamazaki, 2024

TD($\lambda$)

backward view: use past experience to update the current value function.

  • Eligibility Trace: $e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbb{1}\left[s = s_t\right]$, with the update $V(s) \longleftarrow V(s) + \alpha\, \delta_t\, e_t(s)$ applied to every state $s$.
Kashu Yamazaki, 2024

DQN

Approximate the action-value function with a deep neural network $Q_\theta(s, a)$ with parameters $\theta$, utilizing the Q-learning setup as an objective function to train the network with SGD:

$$\mathcal{L}(\theta) = \left(\underbrace{r_t + \gamma \max_a Q_{\theta^-}(s_{t+1}, a)}_{\text{TD target}} - \underbrace{Q_\theta(s_t, a_t)}_{\text{prediction}}\right)^2$$

  • the target network is used to stabilize the learning process.
  • the target network’s parameters are not trained, but they are periodically synchronized with the parameters of the main Q-network
Kashu Yamazaki, 2024

DQN

  1. observe state $s_t$ and perform action $a_t$
  2. predict the value: $q_t = Q_\theta(s_t, a_t)$
  3. differentiate the value network: $g_t = \left.\dfrac{\partial Q_\theta(s_t, a_t)}{\partial \theta}\right|_{\theta = \theta_t}$
  4. receive new state $s_{t+1}$ and reward $r_t$
  5. compute the TD target: $y_t = r_t + \gamma \max_a Q_\theta(s_{t+1}, a)$
  6. update the Q-network (gradient descent): $\theta_{t+1} = \theta_t - \alpha\, (q_t - y_t)\, g_t$ (a PyTorch sketch of this update follows)
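
A minimal PyTorch sketch of one mini-batch DQN update with a target network; q_net, target_net, optimizer, and the batch layout are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One SGD step on the squared TD error."""
    s, a, r, s_next, done = batch                               # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # prediction Q_theta(s, a)
    with torch.no_grad():                                       # the target network is not trained
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, y)                                  # (prediction - TD target)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
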
Kashu Yamazaki, 2024

Experience Replay

Store experiences $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer and randomly sample multiple experiences for mini-batch training.
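
A minimal sketch of a uniform replay buffer; the capacity, batch size, and tuple layout are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)             # old experiences are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # uniform random mini-batch
        return tuple(zip(*batch))                        # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```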

Kashu Yamazaki, 2024

Double DQN

The Q-learning algorithm is known to overestimate action values under certain conditions [1].
Double DQN can be implemented using the existing architecture of the DQN algorithm without requiring additional networks or parameters, by selecting the action with the online network and evaluating it with the target network:

$$y_t = r_t + \gamma\, Q_{\theta^-}\!\left(s_{t+1},\ \arg\max_a Q_\theta(s_{t+1}, a)\right)$$
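
A minimal PyTorch sketch of the Double DQN target: the online network selects the action, the target network evaluates it; q_net and target_net are illustrative:

```python
import torch

def double_dqn_target(reward, next_state, done, q_net, target_net, gamma=0.99):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    with torch.no_grad():
        best_action = q_net(next_state).argmax(dim=1, keepdim=True)            # selection: online network
        next_value = target_net(next_state).gather(1, best_action).squeeze(1)  # evaluation: target network
    return reward + gamma * (1 - done) * next_value
```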

Kashu Yamazaki, 2024

Prioritized Experience Replay

Prioritized experience replay assigns a priority $p_i$ to each experience in the replay buffer according to the TD error (e.g., $p_i = |\delta_i| + \epsilon$).

The priority is used to sample experiences with a probability proportional to the priority, $P(i) = p_i^\alpha / \sum_k p_k^\alpha$.

Use importance sampling (multiply the loss with the following weight) to deal with the bias introduced by the prioritized experience replay:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}$$

where $N$ is the replay buffer size (in practice, the weights are normalized by $\max_i w_i$).
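
A minimal NumPy sketch of proportional priorities and the corresponding importance-sampling weights; alpha, beta, and the epsilon offset are illustrative hyperparameters:

```python
import numpy as np

def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) proportional to (|TD error_i| + eps)^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, buffer_size, beta=0.4):
    """w_i = (1 / (N * P(i)))^beta, normalized by the maximum weight."""
    weights = (buffer_size * probs) ** (-beta)
    return weights / weights.max()

p = sampling_probabilities(np.array([0.5, 0.1, 2.0]))
print(p, importance_weights(p, buffer_size=3))
```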

Kashu Yamazaki, 2024

Policy-based Methods

Policy Gradient

In value-based methods, the policies would not even exist without the action-value estimate. Consider methods that instead learn a parameterized policy $\pi_\theta(a \mid s)$ that can select actions without consulting a value function. A value function may still be used to learn the policy parameters, but it is not required for action selection.

As we aim to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$, we would like to optimize the policy by gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha \left.\nabla_\theta J(\pi_\theta)\right|_{\theta_k}$$

  • The gradient of policy performance, $\nabla_\theta J(\pi_\theta)$, is called the policy gradient.
  • The goal of policy-based RL is to find: $\theta^* = \arg\max_\theta J(\pi_\theta)$

Kashu Yamazaki, 2024

Objective of Optimizing Policy

Given a policy approximator $\pi_\theta(a \mid s)$, we want to measure the quality of the policy $\pi_\theta$.

In episodic environments, we can use the start value:

$$J_1(\theta) = V^{\pi_\theta}(s_1)$$

In continuing environments, we can use either the average state value or the average reward per time-step:

$$J_{\text{avg}V}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad J_{\text{avg}r}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r(s, a)$$

where $d^{\pi_\theta}(s)$ is the stationary distribution of states under $\pi_\theta$.

Kashu Yamazaki, 2024

Log-derivative Trick

The log-derivative trick is based on a simple rule from calculus: the derivative of $\log x$ with respect to $x$ is $1/x$. When rearranged and combined with the chain rule, we get:

$$\nabla_\theta P_\theta(\tau) = P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)$$

Log-probability of a Trajectory:

$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T}\left(\log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\right)$$

Since the environment has no dependence on $\theta$,

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Kashu Yamazaki, 2024

Basic Policy Gradient

Putting it all together, we derive the following:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$

This is an expectation, which means that we can estimate it with a sample mean.

Kashu Yamazaki, 2024

Estimating the Policy Gradient

If we collect a set of trajectories $\mathcal{D} = \{\tau_i\}_{i=1,\dots,N}$ where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_\theta$, the policy gradient can be estimated with:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$

Assuming that we have represented our policy in a way which allows us to calculate $\nabla_\theta \log \pi_\theta(a \mid s)$, and if we are able to run the policy in the environment to collect the trajectory dataset, we can compute the policy gradient and take an update step.
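
A minimal PyTorch sketch of the sample-mean estimator, written as a surrogate loss whose gradient equals the estimate above; policy_network and the trajectory layout are illustrative:

```python
import torch

def policy_gradient_loss(policy_network, trajectories):
    """Each trajectory is (states, actions, R) with R the total return R(tau)."""
    losses = []
    for states, actions, total_return in trajectories:
        dist = torch.distributions.Categorical(policy_network(states))
        log_probs = dist.log_prob(actions)                    # log pi_theta(a_t | s_t) for every step
        losses.append(-(log_probs.sum() * total_return))      # minus sign: we do gradient ascent on J
    return torch.stack(losses).mean()                         # average over the collected trajectories
```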

Kashu Yamazaki, 2024

Does past Reward matter?

Agents should really only reinforce actions on the basis of their consequences. Rewards obtained before taking an action have no bearing on how good that action was (only rewards that come after).

The policy gradient can also be expressed by:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r(s_{t'}, a_{t'}, s_{t'+1})\right]$$

In this form, actions are only reinforced based on rewards obtained after they are taken.

  • The reward sum in this form, $\hat{R}_t = \sum_{t'=t}^{T} r(s_{t'}, a_{t'}, s_{t'+1})$, is called the reward-to-go.
Kashu Yamazaki, 2024

Baselines

Any function $b(s_t)$, called a baseline, can be added or subtracted from the expression for the policy gradient without changing it in expectation:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)\right)\right]$$

  • The baseline can be any function, even a random variable, as long as it does not vary with the action $a$, because of the Expected Grad-Log-Prob (EGLP) lemma: for a parameterized probability distribution $P_\theta$ over a random variable $x$,

    $$\mathbb{E}_{x \sim P_\theta}\left[\nabla_\theta \log P_\theta(x)\right] = 0, \quad\text{so}\quad \mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = 0.$$

  • The most common choice of baseline is the on-policy value function $V^\pi(s_t)$, because it can reduce variance in the sample estimate of the policy gradient (faster and more stable policy learning).
Kashu Yamazaki, 2024

Policy Gradient

Approximate the state-value function with a policy network $\pi(a \mid s; \theta)$:

$$V(s; \theta) = \sum_a \pi(a \mid s; \theta)\, Q_\pi(s, a)$$

  • the policy function is a probability density function (i.e., $\sum_a \pi(a \mid s; \theta) = 1$).
  • policy-based learning: learn $\theta$ that maximizes $J(\theta) = \mathbb{E}_S\left[V(S; \theta)\right]$
  • policy gradient: $\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi(a \mid s; \theta)\, Q_\pi(s, a)\right]$

Kashu Yamazaki, 2024

Policy Gradient

  1. observe state $s_t$
  2. randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \theta_t)$
  3. compute $q_t \approx Q_\pi(s_t, a_t)$ (e.g., from the observed return)
  4. differentiate the policy network: $d_{\theta, t} = \left.\dfrac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta}\right|_{\theta = \theta_t}$
  5. approximate the policy gradient: $g(a_t, \theta_t) = q_t\, d_{\theta, t}$
  6. update the policy network (gradient ascent): $\theta_{t+1} = \theta_t + \beta\, g(a_t, \theta_t)$
Kashu Yamazaki, 2024

Proximal Policy Optimization (PPO)

Motivation

PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse?
Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old.

PPO is an on-policy algorithm that can be used for environments with either discrete or continuous action spaces. There are two primary variants of PPO:

PPO-Penalty: approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it’s scaled appropriately.

PPO-Clip: doesn’t have a KL-divergence term in the objective and doesn’t have a constraint at all. Instead relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy.

Reference: https://spinningup.openai.com/en/latest/algorithms/ppo.html

Kashu Yamazaki, 2024

PPO-Clip

PPO-Clip updates policies via:

$$\theta_{k+1} = \arg\max_\theta\, \mathbb{E}_{s, a \sim \pi_{\theta_k}}\left[L(s, a, \theta_k, \theta)\right],$$

typically taking multiple steps of SGD to maximize the objective, where

$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\ \operatorname{clip}\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon\right) A^{\pi_{\theta_k}}(s, a)\right)$$

and $\epsilon$ is a small hyperparameter that determines how far away the new policy is allowed to go from the old.

The PPO-Clip objective can be simplified to:

$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\ g\left(\epsilon, A^{\pi_{\theta_k}}(s, a)\right)\right)$$

where

$$g(\epsilon, A) = \begin{cases} (1 + \epsilon)\, A & A \geq 0 \\ (1 - \epsilon)\, A & A < 0 \end{cases}$$
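
A minimal PyTorch sketch of the clipped surrogate loss (negated for use with a minimizer); the log-probabilities and advantages are assumed to be precomputed tensors:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """-E[min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)]."""
    ratio = torch.exp(log_probs_new - log_probs_old)               # pi_theta(a|s) / pi_theta_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```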

Kashu Yamazaki, 2024

PPO-Clip

#center

Kashu Yamazaki, 2024

Actor-Critic

Approximate the state-value function $V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s, a)$ with neural networks:

$$V(s; \theta, w) = \sum_a \pi(a \mid s; \theta)\, q(s, a; w)$$

  • policy network (actor): $\pi(a \mid s; \theta)$ — try to increase the state value
    • supervision is from the critic
  • value network (critic): $q(s, a; w)$ — better estimate the expected return
    • supervision is from the rewards
Kashu Yamazaki, 2024

Actor-Critic

  1. observe state $s_t$ and randomly sample action $a_t \sim \pi(\cdot \mid s_t; \theta_t)$
  2. receive new state $s_{t+1}$ and reward $r_t$
  3. randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \theta_t)$ (without performing it)
  4. evaluate the value network: $q_t = q(s_t, a_t; w_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; w_t)$
  5. compute the TD error: $\delta_t = q_t - \left(r_t + \gamma\, q_{t+1}\right)$
  6. update the value network (gradient descent): $w_{t+1} = w_t - \alpha\, \delta_t\, \left.\dfrac{\partial q(s_t, a_t; w)}{\partial w}\right|_{w = w_t}$
  7. update the policy network (gradient ascent): $\theta_{t+1} = \theta_t + \beta\, q_t\, \left.\dfrac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta}\right|_{\theta = \theta_t}$ (see the sketch below)
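
A minimal PyTorch sketch of a closely related one-step variant that uses a state-value critic V(s; w) and the TD error as the learning signal; actor, critic, and the optimizers are illustrative:

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, gamma=0.99):
    """One TD(0) update of a state-value critic and a categorical actor."""
    v_s, v_next = critic(s), critic(s_next).detach()
    td_error = r + gamma * v_next - v_s                            # delta_t

    critic_loss = td_error.pow(2).mean()                           # move V(s) toward the TD target
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    dist = torch.distributions.Categorical(actor(s))
    actor_loss = -(dist.log_prob(a) * td_error.detach()).mean()    # policy gradient weighted by delta_t
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```
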
Kashu Yamazaki, 2024

References

Kashu Yamazaki, 2024

Model-based Reinforcement Learning

What is the model?

Model: a model is a representation that explicitly encodes knowledge about the structure of the environment and task.

  • A transition/dynamics model: $s_{t+1} \sim f_\theta(s_{t+1} \mid s_t, a_t)$
  • A model of rewards: $r_{t+1} \sim f_\theta(r_{t+1} \mid s_t, a_t)$
  • An inverse transition/dynamics model: $a_t \sim f_\theta(a_t \mid s_t, s_{t+1})$
  • A model of distance: $d_{ij} \sim f_\theta(d_{ij} \mid s_i, s_j)$
  • A model of future returns: $G_t \sim f_\theta(G_t \mid s_t, a_t)$ or $G_t \sim f_\theta(G_t \mid s_t)$
Kashu Yamazaki, 2024

Humans are the ultimate model-based reasoners

  • Motor control: forward kinematics models in the cerebellum
  • Language comprehension: models of what is communicated
  • Pragmatics: models of listener & speaker beliefs
  • Theory of mind: models of other agents’ beliefs and behavior
  • Decision making: model-based reinforcement learning
  • Intuitive physics: forward models of physical dynamics
  • Scientific reasoning: mental models of scientific phenomena
  • Creativity: being able to imagine novel combinations of things
    … and much more!
Kashu Yamazaki, 2024

Resources: Books

#center

Reinforcement Learning: An Introduction
by Richard S. Sutton and Andrew G. Barto

Kashu Yamazaki, 2024
