VERL: Volcano Engine Reinforcement Learning for LLMs

Why Get Started with RL Now

Inspired by an article from Gao Ce, I've been wanting to dive deeper into Agent RL: understand the current state of Agent RL infrastructure and get hands-on with a full training run.

After DeepSeek R1, it's clear to everyone that reinforcement learning can substantially improve the reasoning ability of LLMs. As someone who works with AI infrastructure every day, I naturally wanted to try it myself.


One-Minute RL 101: Concepts and Framework Overview

Let's start with a quick overview of the main concepts and frameworks in current RL training:

Basic Concepts

| Concept | One-sentence explanation |
|---|---|
| RLHF | Reinforcement Learning from Human Feedback: aligns LLMs with human preferences using reinforcement learning on human feedback signals |
| PPO | Proximal Policy Optimization: the classic RL algorithm; requires a separate critic (value) network, so it uses more memory, but training is stable |
| GRPO | Group Relative Policy Optimization: removes the critic and computes advantages by normalizing rewards within a group of samples from the same prompt; saves memory, currently very popular for reasoning-model training |
| DAPO | Decoupled Clip and Dynamic Sampling Policy Optimization: further improves on GRPO, with reported better results than vanilla GRPO |
| GDPO | Group Divergence Policy Optimization: another GRPO-style improvement aimed at better control of KL divergence |
| FSDP | Fully Sharded Data Parallel: memory sharding for distributed large-model training; now standard |
| vLLM / SGLang | High-performance inference engines used for rollout (sample generation) during RL training; much faster than native Hugging Face generation |
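The group normalization at the heart of GRPO can be sketched in a few lines. This is a toy illustration, not verl's actual implementation; the `1e-6` epsilon is an arbitrary stabilizer, and real implementations differ in details such as how the std is estimated:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: z-score each rollout's reward against
    the other rollouts sampled from the same prompt. No critic needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    eps = 1e-6                        # avoid division by zero
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, scored 0/1 by a rule-based reward:
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Rollouts above the group mean get a positive advantage,
# those below get a negative one; the advantages sum to ~0.
```

Because the baseline comes from sibling rollouts rather than a learned value network, the whole critic model (and its optimizer state) disappears from memory, which is exactly where GRPO's savings over PPO come from.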
Frameworks

| Framework | Key features | Who it's for |
|---|---|---|
| VERL | Open-sourced by ByteDance's Seed team; supports all major algorithms (PPO/GRPO/DAPO/GDPO); excellent engineering; FSDP2/vLLM/SGLang support; the most popular in the community right now | You want complete features and the latest algorithms like DAPO/GRPO, and you're okay with setting up your own environment |
| OpenRLHF | Older project with a mature community; supports PPO/GRPO; comprehensive documentation | You prefer stability and want more documentation and examples |
| TinyZero | Minimal GRPO reproduction; concise code; great for learning the principles | Beginners who want a minimal working example |
| EasyR1 | One-click startup; provides ready-to-use datasets and configurations; works out of the box | You want to get running quickly without fiddling with the environment |

These tables should be enough to quickly pick the framework that fits you.
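To give a feel for what a verl run looks like, a GRPO launch is a single entry point plus Hydra-style config overrides. The sketch below is written from memory of verl's example scripts: the dataset and model paths are placeholders, and the exact option names should be checked against the verl documentation for your installed version:

```shell
# Hypothetical GRPO launch with verl; paths and values are placeholders.
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.n=5 \
    trainer.n_gpus_per_node=1 \
    trainer.total_epochs=1
```

Note how the algorithm choice is just a config switch (`algorithm.adv_estimator=grpo`) on the same PPO trainer entry point, and the rollout engine (vLLM here) is likewise selected by configuration rather than code changes.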