Miles RL Framework Now Native on AMD GPUs
AMD and the Miles team have announced full ROCm support for the Miles open-source reinforcement learning framework on AMD Instinct MI300 and MI350/355-class accelerators. Miles is a production-grade RL framework designed for large-scale post-training of language and multimodal models, building on SGLang and the broader RL ecosystem.
Why RL Workloads Fit AMD Hardware
Reinforcement learning post-training differs fundamentally from pretraining in that rollout generation dominates compute, consuming 70–90% of GPU time across thousands of parallel environments. This makes memory capacity and bandwidth critical performance factors. AMD Instinct MI GPUs are well-suited for these workloads due to their large HBM memory capacity, high memory bandwidth, efficient long-context inference, and strong multi-node scaling capabilities.
Architecture and Core Features
Miles uses a decoupled two-plane architecture:
- Rollout plane: generates training data using SGLang
- Training plane: updates model weights using Megatron-LM
- Scheduler: coordinates the interaction between the two planes for scalable post-training
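The decoupled design can be sketched schematically. This is a minimal toy model of the control flow, not Miles's actual API; all class and method names below are illustrative. The key idea it demonstrates is that rollouts are tagged with the policy version that produced them, and the scheduler syncs updated weights back to the rollout plane after every training step so data stays on-policy.

```python
import random

class RolloutPlane:
    """Stands in for SGLang-backed generation: samples trajectories from the current policy."""
    def __init__(self):
        self.weights_version = 0

    def generate(self, batch_size):
        # Tag each sample with the policy version that produced it.
        return [{"prompt_id": i,
                 "tokens": [random.randint(0, 9) for _ in range(4)],
                 "version": self.weights_version}
                for i in range(batch_size)]

class TrainingPlane:
    """Stands in for Megatron-LM: consumes rollouts and advances the policy version."""
    def __init__(self):
        self.version = 0

    def train_step(self, batch):
        # On-policy check: refuse data generated by a stale policy.
        assert all(s["version"] == self.version for s in batch), "stale rollouts"
        self.version += 1
        return self.version

class Scheduler:
    """Alternates generation and training, syncing weights between planes each step."""
    def __init__(self):
        self.rollout = RolloutPlane()
        self.trainer = TrainingPlane()

    def step(self, batch_size=8):
        batch = self.rollout.generate(batch_size)
        new_version = self.trainer.train_step(batch)
        self.rollout.weights_version = new_version  # weight sync back to rollout plane
        return new_version

sched = Scheduler()
for _ in range(3):
    sched.step()
print(sched.trainer.version)  # → 3
```

In the real framework the two planes run as separate distributed jobs and the weight sync crosses process and node boundaries; the loop structure, however, is the same.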
The framework supports:
- Distributed rollout generation and on-policy RL training loops
- GRPO and PPO policy optimization
- Ray-based orchestration
- Integration with Megatron-LM and SGLang
Getting Started
Miles provides ROCm-ready containers with SGLang and Megatron-LM preinstalled. Users can pull the image matching their accelerator:
# MI300X
docker pull rlsys/miles:rocm7-MI300-sglang0.5.9-latest
# MI350X / MI355X
docker pull rlsys/miles:rocm7-MI350-355-sglang0.5.9-latest
The framework includes example workflows for launching a full RL pipeline with Ray cluster initialization, rollout generation, GRPO training loops, and on-policy update cycles. Models and datasets are available via Hugging Face.
Validated Performance Results
Testing on a single 8-GPU AMD Instinct MI300X node with Qwen3-30B-A3B using GRPO training (32×8 sampling, 8k response cap, global batch 256) showed:
- Mean step time: 388.50 seconds
- Rollout throughput: 1.1k–1.3k tokens/GPU/second
- Train throughput: ~15–16k tokens/second
- Model improvement: AIME accuracy increased from 0.665 (step 19) to 0.729 (step 139) with pass@16 reaching 0.890
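A back-of-envelope check shows these figures are mutually consistent, assuming 32 prompts × 8 samples = 256 responses per step, each capped at 8,192 tokens:

```python
# Worst-case rollout volume per GRPO step on the 8-GPU MI300X node.
gpus = 8
responses_per_step = 32 * 8               # global batch 256
max_tokens_per_response = 8192            # 8k response cap
max_rollout_tokens = responses_per_step * max_tokens_per_response
print(max_rollout_tokens)                 # → 2097152 (~2.1M tokens)

# At the reported 1.1k–1.3k tokens/GPU/second, generating that many tokens takes:
for tput_per_gpu in (1100, 1300):
    seconds = max_rollout_tokens / (tput_per_gpu * gpus)
    print(f"{tput_per_gpu} tok/GPU/s -> {seconds:.0f}s")
```

Even the worst case (~238 s at 1.1k tokens/GPU/s) fits comfortably inside the 388.5 s mean step time, leaving the remainder for training and weight sync; in practice responses shorter than the 8k cap make the rollout phase faster still.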
The framework demonstrates practical viability for multi-turn agent training and agentic task workflows on AMD hardware.