A Deep Dive into Parallel Training with the veRL Framework

Process Overview

  • Load the models
    veRL uses the Ray framework to schedule the models.
    init_workers initializes the policy model, reference model, value (critic) model, and reward model. Not all of them are used in every setup: many recipes drop the value model entirely, and the reward model is often replaced by rule-based rewards.
  • Training loop
    get_data_batch

generate_sequences (normal RL vs. agent RL)

normal-rl: a single-turn interaction (prompt –> model –> response); the episode ends there.
agent-rl: multi-turn interaction (prompt –> model –> response –> env (tool call, code exec, etc.) –> model –> response –> …)
reward: controlled by a custom reward function; no modification of the veRL framework itself is needed, it is wired in through the config file.
log_probs: compute the log-probability of each output token under the policy model and the reference model.
values: most mainstream RL algorithms are now dropping the value (critic) model.
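To make the log_probs step concrete, here is a minimal pure-Python sketch, not veRL's actual implementation (which operates on batched tensors); the function name and list-of-lists layout are illustrative only. It recovers the log-probability of each sampled token from raw logits:

```python
import math

def token_log_probs(logits, token_ids):
    """Per-token log-probabilities of the sampled tokens.

    logits:    one logit vector per generated position (over the vocabulary)
    token_ids: the token id actually sampled at each position
    """
    out = []
    for vec, tok in zip(logits, token_ids):
        # log-softmax with the max-subtraction trick for numerical stability
        m = max(vec)
        log_z = m + math.log(sum(math.exp(x - m) for x in vec))
        out.append(vec[tok] - log_z)
    return out

# Toy example: vocabulary of 3 tokens, 2 generated positions
lp = token_log_probs([[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]], [0, 2])
```

The same routine run twice (once with the policy model's logits, once with the reference model's) yields the two log-prob sequences used later for the KL term.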

adv: the main difference between RL algorithms is how the advantage is computed; this is the part to modify when switching algorithms.
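As one concrete instance of such an advantage rule, here is a sketch of a GRPO-style group-relative advantage, where each reward is z-scored within the group of n responses rolled out from the same prompt. The function name and the eps parameter are illustrative, not veRL's API:

```python
def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: z-score each rollout's reward within its group.

    rewards: scalar rewards for the n responses rolled out
             from one and the same prompt.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean, no learned value model is needed, which matches the note above that the critic is increasingly dropped.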

compute_loss (normal RL vs. agent RL)

normal-rl: no environment interaction; every token in the response is generated by the model, so all tokens contribute to the loss.
agent-rl: observations returned by the environment are not model-generated, so those tokens must be masked out when computing the loss.
update actor
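The observation masking described above can be sketched as follows; the mask convention (1 = model-generated token, 0 = environment observation) and the helper name are assumptions for illustration:

```python
def masked_mean_loss(token_losses, response_mask):
    """Average per-token losses over model-generated tokens only.

    response_mask: 1 for tokens sampled by the model, 0 for tokens that
    came from the environment (tool output, code-execution result, ...).
    """
    masked = [l * m for l, m in zip(token_losses, response_mask)]
    denom = sum(response_mask)
    # max(denom, 1) avoids division by zero if nothing is model-generated
    return sum(masked) / max(denom, 1)
```

In normal RL the mask is all ones, so this reduces to a plain mean; in agent RL the environment spans are zeroed out and excluded from the denominator.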

DeepSpeed

DeepSpeed official docs 👈 the official documentation
DeepSpeed configuration JSON 👈 usage only requires a JSON config file
"Training Large Language Models with Multiple GPUs" - YouTube 👈 Hung-yi Lee's YouTube lecture (about one hour)
The Ultra-Scale Playbook: Training LLMs on GPU Clusters 👈 a high-quality reference on parallel training

Hugging Face often publishes high-quality experiment write-ups; worth keeping an eye on.

train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs

train_batch_size: [integer]

The effective batch size for one optimizer step. Example: 32

train_micro_batch_size_per_gpu: [integer]

The batch size processed in a single forward/backward pass on one GPU, hence the name micro_batch_size.

gradient_accumulation_steps: [integer]

How many micro-batches are accumulated before each optimizer step.
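The relation between the three fields can be sanity-checked with illustrative numbers (these values are examples, not a recommended configuration):

```python
# Illustrative values only; DeepSpeed requires this relation to hold
# among the corresponding JSON config fields.
train_micro_batch_size_per_gpu = 2
gradient_accumulation_steps = 4
num_gpus = 4

train_batch_size = (train_micro_batch_size_per_gpu
                    * gradient_accumulation_steps
                    * num_gpus)
```

In a DeepSpeed JSON config you may specify any two of the three batch fields and let the third be derived.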

At the start of a step, batch-size prompts are rolled out, each producing n responses. Then every mini-batch-size prompts (together with their rolled-out responses) perform one gradient update; after batch-size / mini-batch-size updates, the step is complete.
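The rollout-and-update schedule above can be put into a schematic loop; batch_size, mini_batch_size, n_rollouts, and the rollout_fn/update_fn stubs are illustrative names rather than veRL's actual interfaces:

```python
def run_one_step(prompts, batch_size, mini_batch_size, n_rollouts,
                 rollout_fn, update_fn):
    """One training step: roll out batch_size prompts, then perform
    batch_size / mini_batch_size gradient updates over mini-batches."""
    batch = prompts[:batch_size]
    # each prompt is rolled out into n_rollouts responses
    samples = [(p, rollout_fn(p)) for p in batch for _ in range(n_rollouts)]
    group = mini_batch_size * n_rollouts  # samples per gradient update
    num_updates = 0
    for i in range(0, len(samples), group):
        update_fn(samples[i:i + group])  # one gradient update
        num_updates += 1
    return num_updates
```

With batch_size = 8 and mini_batch_size = 2, this performs 8 / 2 = 4 gradient updates per step, regardless of n_rollouts.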

Monitoring Module

You can use TensorBoard, wandb, Comet, and others; since these notes are for personal use, only swanlab (a Chinese mirror of wandb) is covered here.
Register and log in, set project_name and experiment_name, and you can watch your runs from a computer or a phone.
A very handy monitoring platform!
swanlab official docs 👈 the official documentation

The Ultra-Scale Playbook:Training LLMs on GPU Clusters


Finding the Best Training Configuration

Step 1: Fitting a training step in memory

First, let's think about how to fit a full training step of the model onto our GPU cluster. There are generally the following two cases:

  • GPU-rich case 🤑 - when you have plenty of GPUs available:

    • For models under 10B parameters, you can use a single parallelism technique, e.g. tensor parallelism or ZeRO-3/DP with full recompute across 8 GPUs.
    • Larger models are not covered in these notes; see the playbook itself for details.
  • GPU-poor case 😭 - when you might be low on GPU resources:

    • You can enable full activation recomputation to trade some compute for memory (and train a bit more slowly).
    • You can increase gradient accumulation to process larger batches with limited memory.

Now that we have a first model instance training, we need to make sure we have the right batch size.

Step 2: Achieving the target global batch size

To increase our current global batch size:

  • We can scale up data parallelism or gradient accumulation steps.
  • For long sequences, we can leverage context parallelism.

To decrease our current global batch size:

  • We can reduce data parallelism in favor of other parallelization strategies.
  • For long sequences, we can reduce context parallelism.

OK, now we have the model running in the general configuration we want in terms of model size and batch size - but are we training it the fastest way? The final step is to work on optimizing throughput.

Step 3: Optimizing training throughput

We want to make sure the training is running as fast as possible so all our precious GPUs are well utilized at all times. As long as memory and communication aren’t bottlenecks, we can try the following:

  • Scale up tensor parallelism (using the fast intra-node bandwidth) until we reach a degree close to the node size, so that we can reduce other forms of parallelism.
  • Increase data parallelism with ZeRO-3 while keeping the target batch size.
  • When data parallelism communication starts to become a bottleneck, transition to using pipeline parallelism.
  • Try scaling up different parallelisms one by one.
  • Experiment with micro-batch sizes (mbs) to aim for an optimal balance between max global batch size, model size, compute, and communication.

Benchmarking thousands of configurations

The playbook includes a heatmap showing the best configurations.

Lessons learned on benchmarking

What looks simple in theory often requires careful attention to many moving parts in practice.