One-Minute RL 101: Concepts and Framework Overview for Agent RL

Why Get Started with RL Now
Inspired by this article from Gao Ce, I've been wanting to dive deeper into Agent RL, understand the current progress of Agent RL infrastructure, and get hands-on with a full training experiment.
After DeepSeek R1, everyone now sees that reinforcement learning can really improve reasoning ability in LLMs. As someone who works with AI infrastructure every day, I naturally wanted to try it myself.
One-Minute RL 101: Concepts and Framework Overview
Let's start with a quick overview of the main concepts and frameworks in current RL training:
Basic Concepts
| Concept | One-sentence explanation |
|---|---|
| RLHF | Reinforcement Learning from Human Feedback, aligns large language models with human preferences via reinforcement learning against a learned reward signal |
| PPO | Proximal Policy Optimization, classic RL algorithm, requires a separate critic value network, uses more memory but trains stably |
| GRPO | Group Relative Policy Optimization, removes the critic network, computes advantages via normalization within a group of samples from the same prompt, saves memory, currently very popular for reasoning model training |
| DAPO | Decoupled Clip and Dynamic Sampling Policy Optimization, further improves on GRPO, reported better results than vanilla GRPO |
| GDPO | Group Divergence PO, another improvement that better controls KL divergence |
| FSDP | Fully Sharded Data Parallel, shards parameters, gradients, and optimizer states across GPUs for distributed large-model training, standard now |
| vLLM/SGLang | High-performance inference engines, used for rollout sample generation during RL training, much faster than native Hugging Face generation |
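To make the GRPO row concrete, here is a minimal sketch of the group-relative advantage idea: sample several completions for the same prompt, score them, then z-score the rewards within the group instead of using a critic network. The function name and the epsilon term are my own choices for illustration, not taken from any particular framework.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against the
    mean and std of its own group (completions of one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps guards against division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are relative within the group, completions that beat their siblings get positive advantage and the rest get negative, which is what lets GRPO drop the critic that PPO needs.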
Popular Open-Source Frameworks
| Framework | Key Features | Who It's For |
|---|---|---|
| VERL | Open-source by ByteDance Seed team, supports all major algorithms (PPO/GRPO/DAPO/GDPO), excellent engineering, supports FSDP2/vLLM/SGLang, most popular in the community right now | If you want complete features, interested in the latest algorithms like DAPO/GRPO, okay with setting up your own environment |
| OpenRLHF | Older project, mature community, supports PPO/GRPO, comprehensive documentation | If you prefer stability and need more documentation and examples |
| TinyZero | Minimal GRPO reproduction, concise code, great for learning principles | Beginners learning, want a minimal working example |
| EasyR1 | One-click startup, provides ready-to-use datasets and configurations, works out of the box | If you want to get running quickly and don't want to fuss with environment setup |
With these tables in hand, you should be able to quickly pick the framework that fits you.