VERL: Volcano Engine Reinforcement Learning for LLMs

Why Get Started with RL Now

Inspired by an article from Gao Ce, I've been wanting to dive deeper into Agent RL: understand the current state of Agent RL infrastructure and get hands-on with a full training run.

After DeepSeek R1, it's clear to everyone that reinforcement learning can substantially improve the reasoning ability of LLMs. As someone who works with AI infrastructure every day, I naturally wanted to try it myself.


One-Minute RL 101: Concepts and Framework Overview

Let's start with a quick overview of the main concepts and frameworks in current RL training:

Basic Concepts

| Concept | One-sentence explanation |
|---|---|
| RLHF | Reinforcement Learning from Human Feedback: aligns LLMs with human preferences using reinforcement learning on human feedback signals |
| PPO | Proximal Policy Optimization: the classic RL algorithm; requires a separate critic (value) network, so it uses more memory, but training is stable |
| GRPO | Group Relative Policy Optimization: removes the critic and computes advantages by normalizing rewards within a group of samples from the same prompt; saves memory, currently very popular for reasoning-model training |
| DAPO | Decoupled Clip and Dynamic Sampling Policy Optimization: further improves on GRPO, with reported better results than vanilla GRPO |
| GDPO | Group Divergence Policy Optimization: another GRPO-style improvement aimed at better control of KL divergence |
| FSDP | Fully Sharded Data Parallel: memory sharding for distributed large-model training; now standard |
| vLLM / SGLang | High-performance inference engines used for rollout (sample generation) during RL training; much faster than native Hugging Face generation |
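The group normalization at the heart of GRPO can be sketched in a few lines. This is a toy illustration, not verl's actual implementation; the `1e-6` epsilon is an arbitrary stabilizer, and real implementations differ in details such as how the std is estimated:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: z-score each rollout's reward against
    the other rollouts sampled from the same prompt. No critic needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    eps = 1e-6                        # avoid division by zero
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, scored 0/1 by a rule-based reward:
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Rollouts above the group mean get a positive advantage,
# those below get a negative one; the advantages sum to ~0.
```

Because the baseline comes from sibling rollouts rather than a learned value network, the whole critic model (and its optimizer state) disappears from memory, which is exactly where GRPO's savings over PPO come from.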
Frameworks

| Framework | Key features | Who it's for |
|---|---|---|
| VERL | Open-sourced by ByteDance's Seed team; supports all major algorithms (PPO/GRPO/DAPO/GDPO); excellent engineering; FSDP2/vLLM/SGLang support; the most popular in the community right now | You want complete features and the latest algorithms like DAPO/GRPO, and you're okay with setting up your own environment |
| OpenRLHF | Older project with a mature community; supports PPO/GRPO; comprehensive documentation | You prefer stability and want more documentation and examples |
| TinyZero | Minimal GRPO reproduction; concise code; great for learning the principles | Beginners who want a minimal working example |
| EasyR1 | One-click startup; provides ready-to-use datasets and configurations; works out of the box | You want to get running quickly without fiddling with the environment |

These tables should be enough to quickly pick the framework that fits you.
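To give a feel for what a verl run looks like, a GRPO launch is a single entry point plus Hydra-style config overrides. The sketch below is written from memory of verl's example scripts: the dataset and model paths are placeholders, and the exact option names should be checked against the verl documentation for your installed version:

```shell
# Hypothetical GRPO launch with verl; paths and values are placeholders.
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.n=5 \
    trainer.n_gpus_per_node=1 \
    trainer.total_epochs=1
```

Note how the algorithm choice is just a config switch (`algorithm.adv_estimator=grpo`) on the same PPO trainer entry point, and the rollout engine (vLLM here) is likewise selected by configuration rather than code changes.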