LMSYS
AMD and Miles Team Enable ROCm Support for Large-Scale RL Post-Training on Instinct GPUs
· release · feature · platform · integration · open-source · lmsys.org

Miles RL Framework Now Native on AMD GPUs

AMD and the Miles team have announced full ROCm support for the Miles open-source reinforcement learning framework on AMD Instinct MI300 and MI350/355-class accelerators. Miles is a production-grade RL framework designed for large-scale post-training of language and multimodal models, building on SGLang and the broader RL ecosystem.

Why RL Workloads Fit AMD Hardware

Reinforcement learning post-training differs fundamentally from pretraining in that rollout generation dominates compute, consuming 70–90% of GPU time across thousands of parallel environments. This makes memory capacity and bandwidth critical performance factors. AMD Instinct MI-series GPUs are well-suited for these workloads due to their large HBM capacity, high memory bandwidth, efficient long-context inference, and strong multi-node scaling capabilities.

Architecture and Core Features

Miles uses a decoupled architecture that separates rollout generation from training:

  • Rollout plane: generates training data using SGLang
  • Training plane: updates model weights using Megatron-LM
  • Scheduler: coordinates interaction between the two planes for scalable post-training
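The control flow of this decoupled design can be sketched as follows. The class and method names here are illustrative stand-ins, not the actual Miles API; the point is the separation of concerns and the weight sync that keeps rollouts on-policy.

```python
# Minimal sketch of a decoupled rollout/training loop.
# Names are hypothetical, not the Miles API.
import random

class RolloutPlane:
    """Stands in for the SGLang-backed inference plane."""
    def __init__(self, policy_version=0):
        self.policy_version = policy_version

    def generate(self, num_rollouts):
        # In Miles this would be batched SGLang generation across
        # parallel environments; here we fake token counts and rewards.
        return [{"tokens": random.randint(100, 8192),
                 "reward": random.random(),
                 "policy_version": self.policy_version}
                for _ in range(num_rollouts)]

class TrainingPlane:
    """Stands in for the Megatron-LM training plane."""
    def __init__(self):
        self.step = 0

    def update(self, batch):
        # A real implementation would run a GRPO/PPO update here.
        self.step += 1
        return self.step

class Scheduler:
    """Coordinates the two planes and syncs weights between them."""
    def __init__(self, rollout, trainer):
        self.rollout, self.trainer = rollout, trainer

    def run(self, steps, batch_size):
        for _ in range(steps):
            batch = self.rollout.generate(batch_size)
            step = self.trainer.update(batch)
            # Push fresh weights so the next rollouts are on-policy.
            self.rollout.policy_version = step
        return self.trainer.step

sched = Scheduler(RolloutPlane(), TrainingPlane())
print(sched.run(steps=3, batch_size=4))  # → 3
```

The decoupling matters because the two planes have different resource profiles: the rollout plane is memory-bandwidth-bound inference, while the training plane is compute-bound, so each can be scaled and scheduled independently.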

The framework supports:

  • Distributed rollout generation and on-policy RL training loops
  • GRPO and PPO policy optimization
  • Ray-based orchestration
  • Integration with Megatron-LM and SGLang
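To illustrate what GRPO contributes here: unlike PPO, GRPO needs no learned value function. Each prompt's sampled responses form a group, and each response's advantage is its reward normalized against the group's mean and standard deviation. A minimal sketch of that computation (not Miles code):

```python
# GRPO group-relative advantage estimation: normalize each sampled
# response's reward against its prompt group's statistics.
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: one prompt, 4 sampled responses with binary rewards.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # correct samples get positive advantage, incorrect negative
```

By construction the advantages in each group sum to zero, so the policy gradient pushes probability mass toward the above-average responses within every group.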

Getting Started

Miles provides ROCm-ready containers with SGLang and Megatron-LM preinstalled. Users can pull GPU-specific images:

# MI300X
docker pull rlsys/miles:rocm7-MI300-sglang0.5.9-latest

# MI350X / MI355X
docker pull rlsys/miles:rocm7-MI350-355-sglang0.5.9-latest

The framework includes example workflows for launching a full RL pipeline with Ray cluster initialization, rollout generation, GRPO training loops, and on-policy update cycles. Models and datasets are available via Hugging Face.

Validated Performance Results

Testing on a single 8-GPU AMD Instinct MI300X node with Qwen3-30B-A3B using GRPO training (32×8 sampling, 8k response cap, global batch 256) showed:

  • Mean step time: 388.50 seconds
  • Rollout throughput: 1.1k–1.3k tokens/GPU/second
  • Train throughput: ~15–16k tokens/second
  • Model improvement: AIME accuracy increased from 0.665 (step 19) to 0.729 (step 139) with pass@16 reaching 0.890
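
A quick back-of-envelope check ties these figures together (derived arithmetic only, not additional measurements; the 32 × 8 = 256 rollouts-per-step reading of the batch shape is an assumption):

```python
# Sanity-check the reported single-node MI300X numbers.
gpus = 8
rollout_tps_per_gpu = (1100, 1300)   # reported tokens/GPU/second
step_time_s = 388.5                  # reported mean step time

# Aggregate rollout throughput across the node.
agg = tuple(t * gpus for t in rollout_tps_per_gpu)
print(f"aggregate rollout throughput: {agg[0]/1e3:.1f}-{agg[1]/1e3:.1f}k tok/s")

# Upper bound on rollout tokens per step if the whole step were rollout.
tokens_per_step = tuple(a * step_time_s for a in agg)
print(f"upper bound: {tokens_per_step[0]/1e6:.1f}-{tokens_per_step[1]/1e6:.1f}M tok/step")

# Assumed batch shape: 32 prompts x 8 samples = 256 rollouts,
# each capped at 8k (8192) response tokens.
max_response_tokens = 32 * 8 * 8192
print(f"max response tokens/step: {max_response_tokens/1e6:.1f}M")
```

Under these assumptions the 8.8–10.4k tok/s aggregate rollout throughput would generate the step's maximum ~2.1M response tokens in well under the 388.5 s mean step time, leaving the remainder for the training-plane update, which is consistent with rollout and training sharing each step.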

The framework demonstrates practical viability for multi-turn agent training and agentic task workflows on AMD hardware.