LMSYS
AMD and Miles Team Enable ROCm Support for Large-Scale RL Post-Training on Instinct GPUs
· release · feature · platform · integration · open-source · lmsys.org

Miles RL Framework Now Native on AMD GPUs

AMD and the Miles team have announced full ROCm support for the Miles open-source reinforcement learning framework on AMD Instinct MI300 and MI350/355-class accelerators. Miles is a production-grade RL framework designed for large-scale post-training of language and multimodal models, building on SGLang and the broader RL ecosystem.

Why RL Workloads Fit AMD Hardware

Reinforcement learning post-training differs fundamentally from pretraining in that rollout generation dominates compute, consuming 70–90% of GPU time across thousands of parallel environments. This makes memory capacity and bandwidth critical performance factors. AMD Instinct MI-series GPUs are well-suited for these workloads due to their large HBM capacity, high memory bandwidth, efficient long-context inference, and strong multi-node scaling capabilities.

Architecture and Core Features

Miles uses a decoupled architecture that separates rollout generation from training:

  • Rollout plane: generates training data using SGLang
  • Training plane: updates model weights using Megatron-LM
  • Scheduler: coordinates interaction between the two planes for scalable post-training
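The control flow of this decoupled design can be sketched as follows. The class and method names here are illustrative stand-ins, not the actual Miles API; the point is the separation of concerns and the weight sync that keeps rollouts on-policy.

```python
# Minimal sketch of a decoupled rollout/training loop.
# Names are hypothetical, not the Miles API.
import random

class RolloutPlane:
    """Stands in for the SGLang-backed inference plane."""
    def __init__(self, policy_version=0):
        self.policy_version = policy_version

    def generate(self, num_rollouts):
        # In Miles this would be batched SGLang generation across
        # parallel environments; here we fake token counts and rewards.
        return [{"tokens": random.randint(100, 8192),
                 "reward": random.random(),
                 "policy_version": self.policy_version}
                for _ in range(num_rollouts)]

class TrainingPlane:
    """Stands in for the Megatron-LM training plane."""
    def __init__(self):
        self.step = 0

    def update(self, batch):
        # A real implementation would run a GRPO/PPO update here.
        self.step += 1
        return self.step

class Scheduler:
    """Coordinates the two planes and syncs weights between them."""
    def __init__(self, rollout, trainer):
        self.rollout, self.trainer = rollout, trainer

    def run(self, steps, batch_size):
        for _ in range(steps):
            batch = self.rollout.generate(batch_size)
            step = self.trainer.update(batch)
            # Push fresh weights so the next rollouts are on-policy.
            self.rollout.policy_version = step
        return self.trainer.step

sched = Scheduler(RolloutPlane(), TrainingPlane())
print(sched.run(steps=3, batch_size=4))  # → 3
```

The decoupling matters because the two planes have different resource profiles: the rollout plane is memory-bandwidth-bound inference, while the training plane is compute-bound, so each can be scaled and scheduled independently.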

The framework supports:

  • Distributed rollout generation and on-policy RL training loops
  • GRPO and PPO policy optimization
  • Ray-based orchestration
  • Integration with Megatron-LM and SGLang
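To illustrate what GRPO contributes here: unlike PPO, GRPO needs no learned value function. Each prompt's sampled responses form a group, and each response's advantage is its reward normalized against the group's mean and standard deviation. A minimal sketch of that computation (not Miles code):

```python
# GRPO group-relative advantage estimation: normalize each sampled
# response's reward against its prompt group's statistics.
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: one prompt, 4 sampled responses with binary rewards.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # correct samples get positive advantage, incorrect negative
```

By construction the advantages in each group sum to zero, so the policy gradient pushes probability mass toward the above-average responses within every group.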

Getting Started

Miles provides ROCm-ready containers with SGLang and Megatron-LM preinstalled. Users can pull GPU-specific images:

# MI300X
docker pull rlsys/miles:rocm7-MI300-sglang0.5.9-latest

# MI350X / MI355X
docker pull rlsys/miles:rocm7-MI350-355-sglang0.5.9-latest

The framework includes example workflows for launching a full RL pipeline with Ray cluster initialization, rollout generation, GRPO training loops, and on-policy update cycles. Models and datasets are available via Hugging Face.

Validated Performance Results

Testing on a single 8-GPU AMD Instinct MI300X node with Qwen3-30B-A3B using GRPO training (32×8 sampling, 8k response cap, global batch 256) showed:

  • Mean step time: 388.50 seconds
  • Rollout throughput: 1.1k–1.3k tokens/GPU/second
  • Train throughput: ~15–16k tokens/second
  • Model improvement: AIME accuracy increased from 0.665 (step 19) to 0.729 (step 139) with pass@16 reaching 0.890
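
A quick back-of-envelope check ties these figures together (derived arithmetic only, not additional measurements; the 32 × 8 = 256 rollouts-per-step reading of the batch shape is an assumption):

```python
# Sanity-check the reported single-node MI300X numbers.
gpus = 8
rollout_tps_per_gpu = (1100, 1300)   # reported tokens/GPU/second
step_time_s = 388.5                  # reported mean step time

# Aggregate rollout throughput across the node.
agg = tuple(t * gpus for t in rollout_tps_per_gpu)
print(f"aggregate rollout throughput: {agg[0]/1e3:.1f}-{agg[1]/1e3:.1f}k tok/s")

# Upper bound on rollout tokens per step if the whole step were rollout.
tokens_per_step = tuple(a * step_time_s for a in agg)
print(f"upper bound: {tokens_per_step[0]/1e6:.1f}-{tokens_per_step[1]/1e6:.1f}M tok/step")

# Assumed batch shape: 32 prompts x 8 samples = 256 rollouts,
# each capped at 8k (8192) response tokens.
max_response_tokens = 32 * 8 * 8192
print(f"max response tokens/step: {max_response_tokens/1e6:.1f}M")
```

Under these assumptions the 8.8–10.4k tok/s aggregate rollout throughput would generate the step's maximum ~2.1M response tokens in well under the 388.5 s mean step time, leaving the remainder for the training-plane update, which is consistent with rollout and training sharing each step.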

The framework demonstrates practical viability for multi-turn agent training and agentic task workflows on AMD hardware.