NVIDIA Megatron Core gains Falcon-H1 hybrid architecture and BitNet ternary quantization support
· sdk · feature · integration · open-source · developer.nvidia.com

Falcon-H1 Parallel Hybrid Architecture Integration

NVIDIA Megatron Core has been extended with support for the Falcon-H1 parallel hybrid architecture, a novel design that processes Transformer-based attention and Mamba-2 state-space model (SSM) layers simultaneously within each block, rather than sequentially stacking them. This parallel design allows the model to combine the long-context memory efficiency of SSMs with the long-range dependency modeling capabilities of attention mechanisms.
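The parallel combination can be pictured as follows. This is a toy sketch, not the Megatron Core implementation: the two branches are plain linear maps standing in for real attention and Mamba-2 mixers, and the class name is illustrative.

```python
import numpy as np

class ToyParallelHybridBlock:
    """Sketch of the parallel hybrid idea: the same input feeds an attention
    branch and an SSM branch, and their outputs are combined in one block,
    instead of stacking the two mixers one after the other."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # Toy stand-ins for the two mixers' parameters.
        self.w_attn = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.w_ssm = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    def __call__(self, x):
        attn_out = x @ self.w_attn   # stand-in for attention(x)
        ssm_out = x @ self.w_ssm     # stand-in for mamba(x)
        # Both branches read the same, unmixed input; their outputs are
        # summed into the residual stream within a single block.
        return x + attn_out + ssm_out

x = np.ones((4, 8))                  # (sequence_length, d_model)
y = ToyParallelHybridBlock(8)(x)
```

A sequential hybrid would instead compute `mamba(attention(x))` across two stacked blocks; the parallel form keeps both mixers in one block, so each branch sees the input before the other has transformed it.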

The integration spans two repositories with complementary responsibilities. In Megatron Core, the Technology Innovation Institute (TII) contributed:

  • ParallelHybridLayer: A foundational layer that runs Mamba and attention in parallel
  • Updated layer allocation logic supporting PARALLEL symbol alongside existing layer types
  • Checkpoint conversion tools for loading and saving hybrid models
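The updated allocation logic can be pictured as a small symbol-to-layer mapping. The symbols and type names below are assumptions for illustration only and may differ from Megatron Core's actual constants.

```python
# Hypothetical layer-pattern vocabulary: hybrid Megatron models describe the
# per-layer mixer type with a symbol string; "P" here stands for the new
# PARALLEL symbol alongside the existing layer types.
VALID = {
    "M": "mamba",            # pure Mamba layer
    "*": "attention",        # attention-only layer
    "-": "mlp",              # MLP-only layer
    "P": "parallel_hybrid",  # attention + Mamba in parallel
}

def allocate_layers(pattern):
    """Map a pattern string like 'PPM*' to a list of layer types."""
    try:
        return [VALID[s] for s in pattern]
    except KeyError as e:
        raise ValueError(f"unknown layer symbol {e.args[0]!r}") from e

print(allocate_layers("PPM*"))
# ['parallel_hybrid', 'parallel_hybrid', 'mamba', 'attention']
```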

In Megatron Bridge, TII built the complete Falcon-H1 model implementation, including:

  • FalconH1Layer: Extends the parallel design with an integrated MLP component
  • FalconH1Bridge: Provides bidirectional Hugging Face ↔ Megatron weight conversion
  • FalconH1ModelProvider: Pre-configured variants for 0.5B, 1.5B-Deep, 7B, and 34B models
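At its core, bidirectional weight conversion of the kind FalconH1Bridge provides amounts to translating parameter names between the two checkpoint layouts. The two rules below are invented for this sketch; the real conversion table in Megatron Bridge is larger and uses its own names.

```python
import re

# Illustrative (HF template, Megatron template) pairs; {i} is the layer index.
RULES = [
    ("model.layers.{i}.self_attn.o_proj.weight",
     "decoder.layers.{i}.self_attention.linear_proj.weight"),
    ("model.layers.{i}.mamba.in_proj.weight",
     "decoder.layers.{i}.mixer.in_proj.weight"),
]

def convert(name, to_mcore=True):
    """Translate one weight name in either direction from the same table."""
    for hf, mcore in RULES:
        src, dst = (hf, mcore) if to_mcore else (mcore, hf)
        # Turn the template into a regex that captures the layer index.
        pattern = re.escape(src).replace(r"\{i\}", r"(\d+)")
        m = re.fullmatch(pattern, name)
        if m:
            return dst.format(i=m.group(1))
    raise KeyError(name)

print(convert("model.layers.3.self_attn.o_proj.weight"))
# decoder.layers.3.self_attention.linear_proj.weight
```

Driving both directions from one table keeps the Hugging Face → Megatron and Megatron → Hugging Face paths consistent by construction.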

Flexible Architecture Configuration

Developers can now independently configure the ratio of parallel hybrid layers, pure Mamba layers, attention-only layers, and MLP-only layers within their models, enabling flexible architecture exploration and experimentation. The implementation includes non-learnable maximal update parametrization (µP) multipliers for stable and efficient training across heterogeneous components.
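As a rough sketch of what non-learnable µP multipliers look like, the function below applies the generic µP heuristic of scaling hidden-to-hidden outputs inversely with the width ratio to a small base model. The formulas and multiplier names are the textbook µP recipe, not Falcon-H1's actual values.

```python
# Sketch of fixed (non-learnable) muP-style multipliers: as d_model grows
# relative to a base width, output scales shrink proportionally so update
# sizes stay comparable across model widths.  Names and d_base are
# illustrative assumptions.
def mup_multipliers(d_model, d_base=256):
    width_mult = d_model / d_base
    return {
        "embedding": 1.0,              # embeddings keep unit scale
        "attn_out": 1.0 / width_mult,  # hidden-to-hidden outputs shrink with width
        "ssm_out": 1.0 / width_mult,
        "logits": 1.0 / width_mult,    # unembedding output also scales down
    }

print(mup_multipliers(1024)["attn_out"])
# 0.25
```

Because the multipliers are fixed constants rather than trained parameters, hyperparameters tuned on a narrow model transfer more reliably across the heterogeneous attention/SSM/MLP components at larger widths.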

BitNet Ternary Quantization Support

Megatron Core now integrates BitNet, enabling training with ternary (1.58-bit) quantized weights for edge-oriented models. This implementation:

  • Replaces standard linear layers with BitNetColumnParallelLinear and BitNetRowParallelLinear layers
  • Uses optimized Triton kernels for efficient computation
  • Maintains full tensor and pipeline parallelism support
  • Reduces memory and bandwidth usage while preserving model throughput
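The ternary quantization step itself follows the published BitNet b1.58 "absmean" recipe: scale by the mean absolute weight, then round each entry to the nearest of {-1, 0, +1}. The sketch below shows that recipe in plain NumPy; Megatron Core's `BitNetColumnParallelLinear`/`BitNetRowParallelLinear` wrap the same idea in optimized Triton kernels.

```python
import numpy as np

def ternary_quantize(w, eps=1e-6):
    """Absmean ternary quantization (BitNet b1.58 style):
    w_q = clip(round(w / mean(|w|)), -1, 1), returned with its scale."""
    scale = np.abs(w).mean() + eps     # eps guards against an all-zero matrix
    w_q = np.clip(np.round(w / scale), -1.0, 1.0)
    return w_q, scale

w = np.array([[0.9, -0.1, 0.4],
              [-0.8, 0.05, -0.45]])
w_q, scale = ternary_quantize(w)
print(w_q)
# [[ 1.  0.  1.]
#  [-1.  0. -1.]]
```

Since every weight lands in {-1, 0, +1}, matrix multiplies reduce to additions and subtractions scaled by a single per-tensor constant, which is where the memory and bandwidth savings come from.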

These contributions demonstrate how Megatron Core's extensible architecture enables community-driven enhancements for cutting-edge model training approaches.