Robot Perception and Control

Reinforcement Learning

Last updated: Jul 25, 2024
Kashu Yamazaki
kyamazak@andrew.cmu.edu

Applications of Reinforcement Learning

Game Agents

#center

Kashu Yamazaki, 2024

Alignment of LLMs

#center

Kashu Yamazaki, 2024

Robot Control

#center

#center
#center

Kashu Yamazaki, 2024

Basics of Reinforcement Learning

Expectation

Expectation is the probability-weighted average of all possible values.

Let $X$ be a random variable and $g$ be any function.

  • For a continuous distribution, the expectation of $g(X)$ is:

    $$\mathbb{E}[g(X)] = \int_{\mathcal{X}} g(x)\, f(x)\, dx$$

    where $f(x)$ is the probability density function of $X$.

  • For a discrete distribution, the expectation of $g(X)$ is:

    $$\mathbb{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x)\, p(x)$$

    where $p(x)$ is the probability mass function of $X$ and $\mathcal{X}$ is the support (domain) of $X$.
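
As a quick numerical check, a minimal Python sketch (the fair-die example is illustrative) that evaluates a discrete expectation exactly and by sampling:

```python
import random

# E[g(X)] for a fair six-sided die with g(x) = x^2
support = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in support}          # probability mass function p(x)
g = lambda x: x ** 2

exact = sum(g(x) * pmf[x] for x in support)                               # probability-weighted average
estimate = sum(g(random.choice(support)) for _ in range(100_000)) / 100_000

print(exact, estimate)   # both are close to 15.17
```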

Kashu Yamazaki, 2024

Expectation and Independence

Let $X$ and $Y$ be independent random variables. Then, the two random variables are mean independent, and for any functions $g$ and $h$:

$$\mathbb{E}[g(X)\, h(Y)] = \mathbb{E}[g(X)]\; \mathbb{E}[h(Y)]$$

  • Let $X$ and $Y$ be independent random variables. Then, $g(X)$ and $h(Y)$ are also independent for any functions $g$ and $h$.
Kashu Yamazaki, 2024

Expectation and Inequalities

Chebychev’s Inequality: Let $X$ be a random variable and let $g$ be a nonnegative function. Then, for any positive real number $r$,

$$P\big(g(X) \geq r\big) \leq \frac{\mathbb{E}[g(X)]}{r}$$

  • when $g(x) = |x|$, it is called Markov’s inequality.

Jensen’s Inequality: Let $X$ be a random variable with $\mathbb{E}[|X|] < \infty$. If $g$ is a convex function, then:

$$g\big(\mathbb{E}[X]\big) \leq \mathbb{E}[g(X)]$$

  • for a concave function, the inequality is reversed.
Kashu Yamazaki, 2024

Markov Property (MP)

A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present values) depends only upon the present state.
The Markov property is defined by the following condition:

$$P(s_{t+1} \mid s_t, s_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t)$$

  • memoryless property of a stochastic process
Kashu Yamazaki, 2024

Markov Decision Process (MDP)

A standard discrete-time MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, r, P, \rho_0)$ where:

  • $\mathcal{S}$ is the set of all valid states (state space)
  • $\mathcal{A}$ is the set of all valid actions (action space)
  • $r$ is the reward function, with $r_t = r(s_t, a_t, s_{t+1})$ (or $c$: a task cost function)
  • $P(s' \mid s, a)$ is the transition probability function (transition model)
  • $\rho_0$ is the starting state distribution
Kashu Yamazaki, 2024

Partially Observable MDP (POMDP)

When all states are observable such that $o_t = s_t$, this can be considered a Markov Decision Process (MDP). When there is unobservable information, such as external forces or full terrain information in robotics, the dynamics are modeled as a Partially Observable Markov Decision Process (POMDP): the agent does not have direct access to the state $s_t$ but receives an observation $o_t$ that is determined by the state $s_t$ (and, in general, the previous action $a_{t-1}$).

This is often overcome by constructing a belief state from a history of observations in an attempt to capture the full state. In deep reinforcement learning, this is frequently done by stacking a sequence of previous observations [1] or by using architectures which can compress past information such as Recurrent Neural Networks (RNNs) or Temporal Convolutional Networks (TCNs).
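
A minimal sketch of the observation-stacking idea; the Gym-style reset()/step() interface and the wrapper name are illustrative assumptions:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Approximate a belief state by stacking the last k observations."""

    def __init__(self, env, k=4):
        self.env, self.k = env, k
        self.frames = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        for _ in range(self.k):                        # pad the history with the first observation
            self.frames.append(obs)
        return np.concatenate(list(self.frames), axis=0)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)                        # the oldest frame is dropped automatically
        return np.concatenate(list(self.frames), axis=0), reward, done, info
```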

Kashu Yamazaki, 2024

constrained Markov Decision Process (cMDP)

Another challenge is to ensure the safety of the robot during the entire learning process. Safety in RL can be formulated as a constrained Markov Decision Process (cMDP), which can be solved by the Lagrangian relaxation procedure.

Kashu Yamazaki, 2024

Reinforcement Learning

#center

At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take based on a policy $\pi(a \mid s)$. After acting, the agent receives a reward $r_t = r(s_t, a_t)$ and the next state $s_{t+1}$. The environment changes when the agent acts on it, but may also change on its own. The state usually follows a discrete-time Markov Decision Process (MDP).

The objective of the agent is to learn a policy that maximizes the return (or total discounted reward), or an objective function related to the return.

Kashu Yamazaki, 2024

States and Action Space

The environment is fully observable when the agent is able to observe the complete state of the environment. The environment is partially observable when the agent is able to observe only a partial state of the environment. We highlight the difference between state and observation as:

state ($s$): a complete description of the state of the world.
observation ($o$): a partial description of a state, which may omit information.

The set of all valid actions in a given environment is often called the action space.
An action space is either discrete (e.g., Atari and Go) or continuous (e.g., robot control).

Kashu Yamazaki, 2024

Policies

A policy is a rule used by an agent to decide what actions to take. It can be deterministic ($a = \mu(s)$) or stochastic ($a \sim \pi(\cdot \mid s)$).

Deterministic Policies: a function $\mu : \mathcal{S} \rightarrow \mathcal{A}$ that maps the set of states of the environment to the set of actions.

Stochastic Policies: can be represented as a family of conditional probability distributions $\pi(\cdot \mid s)$. For a fixed state $s$, $\pi(\cdot \mid s)$ is a (possibly distinct) conditional probability distribution over actions.

  • categorical policies can be used in discrete action spaces.
  • diagonal Gaussian policies are used in continuous action spaces.
Kashu Yamazaki, 2024

Categorical Policies

A categorical policy is like a classifier over discrete actions that returns softmax probabilities of each action.

  • Sampling: given the probabilities for each action, an action can be sampled via torch.multinomial or the sample() method of the Categorical distribution in PyTorch.

  • Log-Likelihood: denote the last layer of probabilities as $P_\theta(s)$. We can treat the actions as indices for $P_\theta(s)$. The log-likelihood for an action $a$ can then be obtained by indexing into the vector:

    $$\log \pi_\theta(a \mid s) = \log \left[P_\theta(s)\right]_a$$
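
A minimal PyTorch sketch of sampling and log-likelihood for a categorical policy, wired into a one-step REINFORCE-style update; the tiny network, state, and scalar reward are illustrative placeholders:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: 4-dimensional observations, 2 discrete actions.
policy_network = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))
state = torch.randn(4)

probs = policy_network(state)                   # softmax probabilities over actions
m = torch.distributions.Categorical(probs)      # categorical policy pi_theta(. | s)
action = m.sample()                             # Sampling: a ~ pi_theta(. | s)
log_prob = m.log_prob(action)                   # Log-likelihood: log pi_theta(a | s)

reward = 1.0                                    # placeholder reward from the environment
loss = -log_prob * reward                       # surrogate loss for a single-step update
loss.backward()                                 # gradients flow into policy_network
```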

Kashu Yamazaki, 2024

Diagonal Gaussian Policies

A multivariate Gaussian distribution is described by a mean vector $\mu$ and a covariance matrix $\Sigma$. A diagonal Gaussian distribution is a special case where the covariance matrix only has entries on the diagonal. As a result, we can represent it by a vector of standard deviations.

mean vector: A diagonal Gaussian policy always has a neural network that maps from observations to mean actions $\mu_\theta(s)$.
covariance matrix: typically represented as standalone parameters $\log \sigma$ or a neural network that maps from states to log standard deviations $\log \sigma_\theta(s)$, which may optionally share some layers with the mean network.

Note that in both cases of the covariance matrix, log standard deviations are used instead of standard deviations directly. This is because log stds are free to take on any values in $(-\infty, \infty)$, while stds must be nonnegative.

Kashu Yamazaki, 2024

Diagonal Gaussian Policies

  • Sampling: given the mean action $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$, and a vector of noise $z \sim \mathcal{N}(0, I)$ from a spherical Gaussian, an action sample can be computed with

    $$a = \mu_\theta(s) + \sigma_\theta(s) \odot z.$$

    This can be done by using the rsample() method of the Normal distribution in PyTorch.

  • Log-Likelihood: the log-likelihood of a $k$-dimensional action $a$, for a diagonal Gaussian with mean $\mu = \mu_\theta(s)$ and standard deviation $\sigma = \sigma_\theta(s)$, is given by:

    $$\log \pi_\theta(a \mid s) = -\frac{1}{2}\left(\sum_{i=1}^{k}\left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i\right) + k \log 2\pi\right)$$
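
A minimal PyTorch sketch of sampling and log-likelihood for a diagonal Gaussian policy; the small mean network, standalone log-std parameter, and observation are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: 8-dimensional observations, 2-dimensional actions.
mean_network = nn.Linear(8, 2)                   # maps observations to mean actions mu_theta(s)
log_std = nn.Parameter(-0.5 * torch.ones(2))     # standalone log standard deviations
obs = torch.randn(8)

mu = mean_network(obs)                           # mean action
std = torch.exp(log_std)                         # std is always positive
dist = torch.distributions.Normal(mu, std)

action = dist.rsample()                          # reparameterized sample: mu + std * z, z ~ N(0, I)
log_prob = dist.log_prob(action).sum(-1)         # sum over dimensions for a diagonal Gaussian
```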

Kashu Yamazaki, 2024

Gradient Estimation

It is not possible to directly backpropagate through random samples. There are two main methods for creating surrogate functions that allow backpropagation through random samples [1].

Score function estimator (likelihood ratio estimator/ REINFORCE): the partial derivative of the log-likelihood function commonly seen as the basis for policy gradient methods. When the probability density function is differentiable with respect to its parameters, we only need sample() and log_prob() methods.

Pathwise derivative estimator: commonly seen in the reparameterization trick in variational autoencoders. The parameterized random variable can be constructed via a parameterized deterministic function of a parameter-free random variable. This can be done by using the rsample() method.
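
A minimal PyTorch sketch contrasting the two estimators on a toy objective E[x^2] with x ~ N(mu, 1); the toy objective and variable names are illustrative:

```python
import torch

mu = torch.zeros(1, requires_grad=True)
dist = torch.distributions.Normal(mu, torch.ones(1))

# Score function (REINFORCE) estimator: only needs sample() and log_prob().
x = dist.sample()                        # non-differentiable sample
f = (x ** 2).detach()                    # treat f(x) as a constant weight
(dist.log_prob(x) * f).sum().backward()  # gradient flows through log_prob only
print(mu.grad)

# Pathwise derivative estimator: reparameterization via rsample().
mu.grad = None
x = dist.rsample()                       # x = mu + eps, differentiable in mu
(x ** 2).sum().backward()                # gradient flows through the sample itself
print(mu.grad)
```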

Kashu Yamazaki, 2024

Trajectories

A trajectory $\tau$ is a sequence of states and actions in the world:

$$\tau = (s_0, a_0, s_1, a_1, \dots)$$

where the first state of the world, $s_0$, is randomly sampled from the start-state distribution $\rho_0$: $s_0 \sim \rho_0(\cdot)$.

Trajectories are also frequently called episodes or rollouts.

State transitions (what happens to the world between the state at time $t$, $s_t$, and the state at $t+1$, $s_{t+1}$) are governed by the natural laws of the environment, and depend on only the most recent action, $a_t$. They can be either deterministic ($s_{t+1} = f(s_t, a_t)$) or stochastic ($s_{t+1} \sim P(\cdot \mid s_t, a_t)$).

Kashu Yamazaki, 2024

Reward and Return

The reward function $r_t = r(s_t, a_t, s_{t+1})$ depends on the current state of the world, the action just taken, and the next state of the world. The goal of the agent is to maximize some notion of cumulative reward over a trajectory $\tau$.

finite-horizon undiscounted return: the sum of rewards obtained in a fixed window of $T$ steps.

$$R_t = r_{t+1} + r_{t+2} + \dots + r_T = \sum_{k=0}^{T-t-1} r_{t+k+1}$$

infinite-horizon discounted return: the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they’re obtained, using a discount factor $0 \leq \gamma \leq 1$:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

we can also represent this in a recursive way:

$$R_t = r_{t+1} + \gamma R_{t+1}$$
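
A minimal Python sketch that computes the discounted return for every time step of one trajectory using the recursive form; the reward list and discount factor are illustrative:

```python
def discounted_returns(rewards, gamma=0.99):
    """R_t = r_{t+1} + gamma * R_{t+1}, computed backwards over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # recursive form of the discounted return
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # [2.62, 1.8, 2.0]
```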

Kashu Yamazaki, 2024

Reinforcement Learning

Suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a $T$-step trajectory is:

$$P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$$

The expected return, denoted by $J(\pi)$, is then:

$$J(\pi) = \int_\tau P(\tau \mid \pi)\, R(\tau) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right]$$

The central optimization problem in RL is to find the policy that maximizes the expected return $J(\pi)$:

$$\pi^* = \arg\max_\pi J(\pi)$$

Kashu Yamazaki, 2024

Model-free vs Model-based RL

The difference between model-free and model-based RL is whether the agent has access to (or learns) a model of the environment. A model is usually a function that predicts state transitions or rewards.

With collected data $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}$, the problem setup is:

Model-free: learn the policy directly or indirectly from the data ($\mathcal{D} \rightarrow \pi$)

Model-based: learn a model, then use it to learn or improve a policy ($\mathcal{D} \rightarrow f \rightarrow \pi$)

  • substantial improvement in sample efficiency.
  • ground-truth model of the environment is usually not available to the agent.
Kashu Yamazaki, 2024

Model-Free Reinforcement Learning

Model-Free Reinforcement Learning

Value-based methods: methods in this family learn an approximator $Q_\theta(s, a)$ for the optimal action-value function $Q^*(s, a)$.

  • Typically use an objective function based on the Bellman equation.
  • optimization is almost always performed off-policy: each update can use data collected at any point during training (substantially more sample efficient).

Policy-based methods: methods in this family represent a policy explicitly as $\pi_\theta(a \mid s)$.

  • usually involves learning an approximator $V_\phi(s)$ for the on-policy value function.
  • optimization is almost always performed on-policy: each update only uses data collected while acting according to the most recent policy.

Trade-offs: policy optimization methods are principled, in the sense that you directly optimize for the thing you want; Q-learning methods, when they do work, gain the advantage of being substantially more sample efficient because they can reuse data more effectively.
Kashu Yamazaki, 2024

Value-based methods

Value Functions

Action-value function $Q^\pi(s, a)$: expected value of the return that can be obtained when selecting action $a$ in state $s$ and then acting according to a policy $\pi$:

$$Q^\pi(s_t, a_t) = r_t + \mathbb{E}_{\pi}\left[\sum_{k=1}^{\infty} r\left(s_{t+k}, a_{t+k}\right)\right]$$

  • $Q^\pi(s, a)$ evaluates how good it is for an agent to pick action $a$ while being in state $s$.

(State) Value function $V^\pi(s)$: expected value of the return that can be obtained when acting according to a policy $\pi$ from state $s$:

$$V^\pi(s_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} r\left(s_{t+k}, a_{t+k}\right)\right] = \mathbb{E}_\pi\left[Q^\pi(s_t, a_t)\right]$$

  • For a fixed policy $\pi$, $V^\pi(s)$ evaluates how good the situation is in state $s$.
  • Averaged over start states, $V^\pi$ evaluates how good the policy $\pi$ is.
Kashu Yamazaki, 2024

Optimal Value Functions

Optimal Action-Value Function $Q^*(s, a)$: expected value of the return that can be obtained when selecting action $a$ in state $s$ and then acting according to an optimal policy:

$$Q^*(s, a) = \max_\pi \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]$$

  • $Q^*(s, a)$ evaluates how good it is for an agent to pick action $a$ while being in state $s$, no matter what the policy is.

Optimal (State) Value function $V^*(s)$: expected value of the return that can be obtained when acting according to an optimal policy from state $s$:

$$V^*(s) = \max_\pi \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s\right]$$

Kashu Yamazaki, 2024

Advantage Functions

The advantage function $A^\pi(s, a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot \mid s)$, assuming you act according to $\pi$ forever after (i.e., the relative advantage of that action over the others on average). The advantage function is defined by:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

  • the advantage function is crucially important to policy gradient methods.
  • note that $\mathbb{E}_{a \sim \pi}\left[A^\pi(s, a)\right] = 0$.
Kashu Yamazaki, 2024

The Optimal Q-Function and the Optimal Action

By definition, $Q^*(s, a)$ gives the expected return for starting in state $s$, taking (arbitrary) action $a$, and then acting according to the optimal policy forever after.
The optimal policy in $s$ will select whichever action maximizes the expected return from starting in $s$. As a result, if we have $Q^*$, we can directly obtain the optimal action, $a^*(s)$, via:

$$a^*(s) = \arg\max_a Q^*(s, a)$$

Value-based methods learn an approximator $Q_\theta(s, a)$ for the optimal action-value function, and the actions taken by the agent are given by:

$$a(s) = \arg\max_a Q_\theta(s, a)$$

Kashu Yamazaki, 2024

What is a good policy?

greedy policy $\pi(s) = \arg\max_a Q(s, a)$: if the optimal action-value function $Q^*$ is known, this is the optimal policy.

$\epsilon$-greedy policy: a stochastic policy that selects a random action with probability $\epsilon$ and the action with the highest expected return otherwise.

Boltzmann policy (softmax policy): a policy that samples actions according to a Boltzmann distribution (also called a Gibbs distribution) with temperature $\tau$:

$$\pi(a \mid s) = \frac{\exp\left(Q(s, a) / \tau\right)}{\sum_{a'} \exp\left(Q(s, a') / \tau\right)}$$

  • When $\tau \rightarrow 0$, the Boltzmann policy becomes a greedy policy (see the sketch below).
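
A minimal PyTorch sketch of the $\epsilon$-greedy and Boltzmann action-selection rules over a vector of Q-values; the function names and default hyperparameters are illustrative:

```python
import torch

def epsilon_greedy(q_values, epsilon=0.1):
    """Random action with probability epsilon, otherwise argmax_a Q(s, a)."""
    if torch.rand(()) < epsilon:
        return int(torch.randint(len(q_values), ()))
    return int(torch.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action from softmax(Q(s, .) / temperature); temperature -> 0 approaches greedy."""
    probs = torch.softmax(q_values / temperature, dim=-1)
    return int(torch.distributions.Categorical(probs).sample())

q = torch.tensor([0.1, 0.5, 0.2])
print(epsilon_greedy(q), boltzmann(q, temperature=0.5))
```
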
Kashu Yamazaki, 2024

Bellman Equation

All four of the value functions obey special self-consistency equations called Bellman equations. An intuition behind Bellman equation is:

The value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.

The Bellman equations for the on-policy value functions are:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[r(s, a) + \gamma V^\pi(s')\right], \qquad Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^\pi(s', a')\right]\right]$$

The Bellman equations for the optimal value functions are:

$$V^*(s) = \max_a \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma V^*(s')\right], \qquad Q^*(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma \max_{a'} Q^*(s', a')\right]$$

Kashu Yamazaki, 2024

Optimal Value Functions

The backup diagrams below show graphically the spans of future states and actions considered in the Bellman optimality equations for $V^*$ and $Q^*$.

#center

The arcs at the agent’s choice points represent that the maximum over that choice is taken rather than the expected value given some policy.

Kashu Yamazaki, 2024

Temporal Difference (TD) Learning

A method to update the estimated action-value function with the experience obtained at time $t+1$:

$$Q(s_t, a_t) \longleftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$$

TD Error ($\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$): the difference of the expected return between time $t$ and time $t+1$.
TD Target ($r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$): the teacher signal in TD learning.

  • SARSA (on-policy TD control): $Q(s_t, a_t) \longleftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$
  • Q-Learning (off-policy TD control) [1]: $Q(s_t, a_t) \longleftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$ (see the sketch below)
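
A minimal sketch of the two TD targets; next_q_values stands for the row Q(s_{t+1}, ·) of a table or network output, and all names are illustrative:

```python
import torch

def sarsa_target(reward, next_q_values, next_action, gamma=0.99):
    """On-policy target: bootstraps with the action actually chosen by the current policy."""
    return reward + gamma * next_q_values[next_action]

def q_learning_target(reward, next_q_values, gamma=0.99):
    """Off-policy target: bootstraps with the greedy action, whatever the behavior policy did."""
    return reward + gamma * next_q_values.max()

# TD update with step size alpha (tabular form):
#   Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
q_next = torch.tensor([0.2, 0.8, 0.5])
print(sarsa_target(1.0, q_next, next_action=0), q_learning_target(1.0, q_next))
```
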
Kashu Yamazaki, 2024

TD($\lambda$)

forward view: use future experience to update the current value function. Note that there will be a delay in update of value to obtain the future experience.

  • n-step (truncated) return: utilization of multi-step experience to obtain the target

    $$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$$

  • $\lambda$-return: any average of n-step returns can be used as the target

    $$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

    • TD learning: TD(0), $V(s_t) \longleftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$
    • Monte Carlo Methods: TD(1), $V(s_t) \longleftarrow V(s_t) + \alpha\left[\left(r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{T-t-1} r_T\right) - V(s_t)\right]$
Kashu Yamazaki, 2024

TD($\lambda$)

#center

Kashu Yamazaki, 2024

TD($\lambda$)

backward view: use past experience to update the current value function.

  • Eligibility Trace: $e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbb{1}\left[s = s_t\right]$, with the update $V(s) \longleftarrow V(s) + \alpha\, \delta_t\, e_t(s)$ applied to every state $s$.
Kashu Yamazaki, 2024

DQN

Approximate the action-value function with a deep neural network $Q_\theta(s, a)$ with parameters $\theta$, utilizing the Q-learning setup as an objective function to train the network with SGD:

$$\mathcal{L}(\theta) = \left(\underbrace{r_t + \gamma \max_a Q_{\theta^-}(s_{t+1}, a)}_{\text{TD target}} - \underbrace{Q_\theta(s_t, a_t)}_{\text{prediction}}\right)^2$$

  • the target network is used to stabilize the learning process.
  • the target network’s parameters are not trained, but they are periodically synchronized with the parameters of the main Q-network
Kashu Yamazaki, 2024

DQN

  1. observe state $s_t$ and perform action $a_t$
  2. predict the value: $q_t = Q_\theta(s_t, a_t)$
  3. differentiate the value network: $g_t = \left.\dfrac{\partial Q_\theta(s_t, a_t)}{\partial \theta}\right|_{\theta = \theta_t}$
  4. receive new state $s_{t+1}$ and reward $r_t$
  5. compute the TD target: $y_t = r_t + \gamma \max_a Q_\theta(s_{t+1}, a)$
  6. update the Q-network (gradient descent): $\theta_{t+1} = \theta_t - \alpha\, (q_t - y_t)\, g_t$ (a PyTorch sketch of this update follows)
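
A minimal PyTorch sketch of one mini-batch DQN update with a target network; q_net, target_net, optimizer, and the batch layout are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One SGD step on the squared TD error."""
    s, a, r, s_next, done = batch                               # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # prediction Q_theta(s, a)
    with torch.no_grad():                                       # the target network is not trained
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, y)                                  # (prediction - TD target)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
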
Kashu Yamazaki, 2024

Experience Replay

Store experiences $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer and randomly sample multiple experiences for mini-batch training.
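
A minimal sketch of a uniform replay buffer; the capacity, batch size, and tuple layout are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)             # old experiences are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # uniform random mini-batch
        return tuple(zip(*batch))                        # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```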

Kashu Yamazaki, 2024

Double DQN

The Q-learning algorithm is known to overestimate action values under certain conditions [1].
Double DQN can be implemented using the existing architecture of the DQN algorithm without requiring additional networks or parameters, by selecting the action with the online network and evaluating it with the target network:

$$y_t = r_t + \gamma\, Q_{\theta^-}\!\left(s_{t+1},\ \arg\max_a Q_\theta(s_{t+1}, a)\right)$$
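
A minimal PyTorch sketch of the Double DQN target: the online network selects the action, the target network evaluates it; q_net and target_net are illustrative:

```python
import torch

def double_dqn_target(reward, next_state, done, q_net, target_net, gamma=0.99):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    with torch.no_grad():
        best_action = q_net(next_state).argmax(dim=1, keepdim=True)            # selection: online network
        next_value = target_net(next_state).gather(1, best_action).squeeze(1)  # evaluation: target network
    return reward + gamma * (1 - done) * next_value
```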

Kashu Yamazaki, 2024

Prioritized Experience Replay

Prioritized experience replay assigns a priority $p_i$ to each experience in the replay buffer according to the TD error (e.g., $p_i = |\delta_i| + \epsilon$).

The priority is used to sample experiences with a probability proportional to the priority, $P(i) = p_i^\alpha / \sum_k p_k^\alpha$.

Use importance sampling (multiply the loss with the following weight) to deal with the bias introduced by the prioritized experience replay:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}$$

where $N$ is the replay buffer size (in practice, the weights are normalized by $\max_i w_i$).
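
A minimal NumPy sketch of proportional priorities and the corresponding importance-sampling weights; alpha, beta, and the epsilon offset are illustrative hyperparameters:

```python
import numpy as np

def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) proportional to (|TD error_i| + eps)^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, buffer_size, beta=0.4):
    """w_i = (1 / (N * P(i)))^beta, normalized by the maximum weight."""
    weights = (buffer_size * probs) ** (-beta)
    return weights / weights.max()

p = sampling_probabilities(np.array([0.5, 0.1, 2.0]))
print(p, importance_weights(p, buffer_size=3))
```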

Kashu Yamazaki, 2024

Policy-based Methods

Policy Gradient

In value-based methods, the policies would not even exist without the action-value estimate. Consider methods that instead learn a parameterized policy $\pi_\theta(a \mid s)$ that can select actions without consulting a value function. A value function may still be used to learn the policy parameters, but it is not required for action selection.

As we aim to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$, we would like to optimize the policy by gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha \left.\nabla_\theta J(\pi_\theta)\right|_{\theta_k}$$

  • The gradient of policy performance, $\nabla_\theta J(\pi_\theta)$, is called the policy gradient.
  • The goal of policy-based RL is to find: $\theta^* = \arg\max_\theta J(\pi_\theta)$

Kashu Yamazaki, 2024

Objective of Optimizing Policy

Given a policy approximator $\pi_\theta(a \mid s)$, we want to measure the quality of the policy $\pi_\theta$.

In episodic environments, we can use the start value:

$$J_1(\theta) = V^{\pi_\theta}(s_1)$$

In continuing environments, we can use either the average state value or the average reward per time-step:

$$J_{\text{avg}V}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad J_{\text{avg}r}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r(s, a)$$

where $d^{\pi_\theta}(s)$ is the stationary distribution of states under $\pi_\theta$.

Kashu Yamazaki, 2024

Log-derivative Trick

The log-derivative trick is based on a simple rule from calculus: the derivative of $\log x$ with respect to $x$ is $1/x$. When rearranged and combined with the chain rule, we get:

$$\nabla_\theta P_\theta(\tau) = P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)$$

Log-probability of a Trajectory:

$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T}\left(\log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\right)$$

Since the environment has no dependence on $\theta$,

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Kashu Yamazaki, 2024

Basic Policy Gradient

Putting it all together, we derive the following:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$

This is an expectation, which means that we can estimate it with a sample mean.

Kashu Yamazaki, 2024

Estimating the Policy Gradient

If we collect a set of trajectories $\mathcal{D} = \{\tau_i\}_{i=1,\dots,N}$ where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_\theta$, the policy gradient can be estimated with:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$

Assuming that we have represented our policy in a way which allows us to calculate $\nabla_\theta \log \pi_\theta(a \mid s)$, and if we are able to run the policy in the environment to collect the trajectory dataset, we can compute the policy gradient and take an update step.
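
A minimal PyTorch sketch of the sample-mean estimator, written as a surrogate loss whose gradient equals the estimate above; policy_network and the trajectory layout are illustrative:

```python
import torch

def policy_gradient_loss(policy_network, trajectories):
    """Each trajectory is (states, actions, R) with R the total return R(tau)."""
    losses = []
    for states, actions, total_return in trajectories:
        dist = torch.distributions.Categorical(policy_network(states))
        log_probs = dist.log_prob(actions)                    # log pi_theta(a_t | s_t) for every step
        losses.append(-(log_probs.sum() * total_return))      # minus sign: we do gradient ascent on J
    return torch.stack(losses).mean()                         # average over the collected trajectories
```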

Kashu Yamazaki, 2024

Does past Reward matter?

Agents should really only reinforce actions on the basis of their consequences. Rewards obtained before taking an action have no bearing on how good that action was (only rewards that come after).

The policy gradient can also be expressed by:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r(s_{t'}, a_{t'}, s_{t'+1})\right]$$

In this form, actions are only reinforced based on rewards obtained after they are taken.

  • The reward sum in this form, $\hat{R}_t = \sum_{t'=t}^{T} r(s_{t'}, a_{t'}, s_{t'+1})$, is called the reward-to-go.
Kashu Yamazaki, 2024

Baselines

Any function $b(s_t)$, called a baseline, can be added or subtracted from the expression for the policy gradient without changing it in expectation:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)\right)\right]$$

  • The baseline can be any function, even a random variable, as long as it does not vary with the action $a$, because of the Expected Grad-Log-Prob (EGLP) lemma: for a parameterized probability distribution $P_\theta$ over a random variable $x$,

    $$\mathbb{E}_{x \sim P_\theta}\left[\nabla_\theta \log P_\theta(x)\right] = 0, \quad\text{so}\quad \mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = 0.$$

  • The most common choice of baseline is the on-policy value function $V^\pi(s_t)$, because it can reduce variance in the sample estimate of the policy gradient (faster and more stable policy learning).
Kashu Yamazaki, 2024

Policy Gradient

Approximate the state-value function with a policy network $\pi(a \mid s; \theta)$:

$$V(s; \theta) = \sum_a \pi(a \mid s; \theta)\, Q_\pi(s, a)$$

  • the policy function is a probability density function (i.e., $\sum_a \pi(a \mid s; \theta) = 1$).
  • policy-based learning: learn $\theta$ that maximizes $J(\theta) = \mathbb{E}_S\left[V(S; \theta)\right]$
  • policy gradient: $\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi(a \mid s; \theta)\, Q_\pi(s, a)\right]$

Kashu Yamazaki, 2024

Policy Gradient

  1. observe state $s_t$
  2. randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \theta_t)$
  3. compute $q_t \approx Q_\pi(s_t, a_t)$ (e.g., from the observed return)
  4. differentiate the policy network: $d_{\theta, t} = \left.\dfrac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta}\right|_{\theta = \theta_t}$
  5. approximate the policy gradient: $g(a_t, \theta_t) = q_t\, d_{\theta, t}$
  6. update the policy network (gradient ascent): $\theta_{t+1} = \theta_t + \beta\, g(a_t, \theta_t)$
Kashu Yamazaki, 2024

Proximal Policy Optimization (PPO)

Motivation

PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse?
Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old.

PPO is an on-policy algorithm that can be used for environments with either discrete or continuous action spaces. There are two primary variants of PPO:

PPO-Penalty: approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it’s scaled appropriately.

PPO-Clip: doesn’t have a KL-divergence term in the objective and doesn’t have a constraint at all. Instead relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy.

Reference: https://spinningup.openai.com/en/latest/algorithms/ppo.html

Kashu Yamazaki, 2024

PPO-Clip

PPO-Clip updates policies via:

$$\theta_{k+1} = \arg\max_\theta\, \mathbb{E}_{s, a \sim \pi_{\theta_k}}\left[L(s, a, \theta_k, \theta)\right],$$

typically taking multiple steps of SGD to maximize the objective, where

$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\ \operatorname{clip}\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon\right) A^{\pi_{\theta_k}}(s, a)\right)$$

and $\epsilon$ is a small hyperparameter that determines how far away the new policy is allowed to go from the old.

The PPO-Clip objective can be simplified to:

$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\ g\left(\epsilon, A^{\pi_{\theta_k}}(s, a)\right)\right)$$

where

$$g(\epsilon, A) = \begin{cases} (1 + \epsilon)\, A & A \geq 0 \\ (1 - \epsilon)\, A & A < 0 \end{cases}$$
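
A minimal PyTorch sketch of the clipped surrogate loss (negated for use with a minimizer); the log-probabilities and advantages are assumed to be precomputed tensors:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """-E[min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)]."""
    ratio = torch.exp(log_probs_new - log_probs_old)               # pi_theta(a|s) / pi_theta_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```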

Kashu Yamazaki, 2024

PPO-Clip

#center

Kashu Yamazaki, 2024

Actor-Critic

Approximate the state-value function $V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s, a)$ with neural networks:

$$V(s; \theta, w) = \sum_a \pi(a \mid s; \theta)\, q(s, a; w)$$

  • policy network (actor): $\pi(a \mid s; \theta)$ — try to increase the state value
    • supervision is from the critic
  • value network (critic): $q(s, a; w)$ — better estimate the expected return
    • supervision is from the rewards
Kashu Yamazaki, 2024

Actor-Critic

  1. observe state $s_t$ and randomly sample action $a_t \sim \pi(\cdot \mid s_t; \theta_t)$
  2. receive new state $s_{t+1}$ and reward $r_t$
  3. randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \theta_t)$ (without performing it)
  4. evaluate the value network: $q_t = q(s_t, a_t; w_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; w_t)$
  5. compute the TD error: $\delta_t = q_t - \left(r_t + \gamma\, q_{t+1}\right)$
  6. update the value network (gradient descent): $w_{t+1} = w_t - \alpha\, \delta_t\, \left.\dfrac{\partial q(s_t, a_t; w)}{\partial w}\right|_{w = w_t}$
  7. update the policy network (gradient ascent): $\theta_{t+1} = \theta_t + \beta\, q_t\, \left.\dfrac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta}\right|_{\theta = \theta_t}$ (see the sketch below)
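
A minimal PyTorch sketch of a closely related one-step variant that uses a state-value critic V(s; w) and the TD error as the learning signal; actor, critic, and the optimizers are illustrative:

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, gamma=0.99):
    """One TD(0) update of a state-value critic and a categorical actor."""
    v_s, v_next = critic(s), critic(s_next).detach()
    td_error = r + gamma * v_next - v_s                            # delta_t

    critic_loss = td_error.pow(2).mean()                           # move V(s) toward the TD target
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    dist = torch.distributions.Categorical(actor(s))
    actor_loss = -(dist.log_prob(a) * td_error.detach()).mean()    # policy gradient weighted by delta_t
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```
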
Kashu Yamazaki, 2024

References

Kashu Yamazaki, 2024

Model-based Reinforcement Learning

What is the model?

Model: a model is a representation that explicitly encodes knowledge about the structure of the environment and task.

  • A transition/dynamics model: $s_{t+1} \sim f_\theta(s_{t+1} \mid s_t, a_t)$
  • A model of rewards: $r_{t+1} \sim f_\theta(r_{t+1} \mid s_t, a_t)$
  • An inverse transition/dynamics model: $a_t \sim f_\theta(a_t \mid s_t, s_{t+1})$
  • A model of distance: $d_{ij} \sim f_\theta(d_{ij} \mid s_i, s_j)$
  • A model of future returns: $G_t \sim f_\theta(G_t \mid s_t, a_t)$ or $G_t \sim f_\theta(G_t \mid s_t)$
Kashu Yamazaki, 2024

Humans are the ultimate model-based reasoners

  • Motor control: forward kinematics models in the cerebellum
  • Language comprehension: models of what is communicated
  • Pragmatics: models of listener & speaker beliefs
  • Theory of mind: models of other agents’ beliefs and behavior
  • Decision making: model-based reinforcement learning
  • Intuitive physics: forward models of physical dynamics
  • Scientific reasoning: mental models of scientific phenomena
  • Creativity: being able to imagine novel combinations of things
    … and much more!
Kashu Yamazaki, 2024

Resources: Books

#center

Reinforcement Learning: An Introduction
by Richard S. Sutton and Andrew G. Barto

Kashu Yamazaki, 2024
