NVIDIA Megatron Core gains Falcon-H1 hybrid architecture and BitNet ternary quantization support
· sdk · feature · integration · open-source · developer.nvidia.com

Falcon-H1 Parallel Hybrid Architecture Integration

NVIDIA Megatron Core has been extended with support for the Falcon-H1 parallel hybrid architecture, a novel design that processes Transformer-based attention and Mamba-2 state-space model (SSM) layers simultaneously within each block, rather than sequentially stacking them. This parallel design allows the model to combine the long-context memory efficiency of SSMs with the long-range dependency modeling capabilities of attention mechanisms.
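The parallel combination can be pictured as follows. This is a toy sketch, not the Megatron Core implementation: the two branches are plain linear maps standing in for real attention and Mamba-2 mixers, and the class name is illustrative.

```python
import numpy as np

class ToyParallelHybridBlock:
    """Sketch of the parallel hybrid idea: the same input feeds an attention
    branch and an SSM branch, and their outputs are combined in one block,
    instead of stacking the two mixers one after the other."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # Toy stand-ins for the two mixers' parameters.
        self.w_attn = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.w_ssm = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    def __call__(self, x):
        attn_out = x @ self.w_attn   # stand-in for attention(x)
        ssm_out = x @ self.w_ssm     # stand-in for mamba(x)
        # Both branches read the same, unmixed input; their outputs are
        # summed into the residual stream within a single block.
        return x + attn_out + ssm_out

x = np.ones((4, 8))                  # (sequence_length, d_model)
y = ToyParallelHybridBlock(8)(x)
```

A sequential hybrid would instead compute `mamba(attention(x))` across two stacked blocks; the parallel form keeps both mixers in one block, so each branch sees the input before the other has transformed it.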

The integration spans two repositories with complementary responsibilities. In Megatron Core, the Technology Innovation Institute (TII) contributed:

  • ParallelHybridLayer: A foundational layer that runs Mamba and attention in parallel
  • Updated layer allocation logic supporting PARALLEL symbol alongside existing layer types
  • Checkpoint conversion tools for loading and saving hybrid models
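The updated allocation logic can be pictured as a small symbol-to-layer mapping. The symbols and type names below are assumptions for illustration only and may differ from Megatron Core's actual constants.

```python
# Hypothetical layer-pattern vocabulary: hybrid Megatron models describe the
# per-layer mixer type with a symbol string; "P" here stands for the new
# PARALLEL symbol alongside the existing layer types.
VALID = {
    "M": "mamba",            # pure Mamba layer
    "*": "attention",        # attention-only layer
    "-": "mlp",              # MLP-only layer
    "P": "parallel_hybrid",  # attention + Mamba in parallel
}

def allocate_layers(pattern):
    """Map a pattern string like 'PPM*' to a list of layer types."""
    try:
        return [VALID[s] for s in pattern]
    except KeyError as e:
        raise ValueError(f"unknown layer symbol {e.args[0]!r}") from e

print(allocate_layers("PPM*"))
# ['parallel_hybrid', 'parallel_hybrid', 'mamba', 'attention']
```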

In Megatron Bridge, TII built the complete Falcon-H1 model implementation, including:

  • FalconH1Layer: Extends the parallel design with an integrated MLP component
  • FalconH1Bridge: Provides bidirectional Hugging Face ↔ Megatron weight conversion
  • FalconH1ModelProvider: Pre-configured variants for 0.5B, 1.5B-Deep, 7B, and 34B models
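At its core, bidirectional weight conversion of the kind FalconH1Bridge provides amounts to translating parameter names between the two checkpoint layouts. The two rules below are invented for this sketch; the real conversion table in Megatron Bridge is larger and uses its own names.

```python
import re

# Illustrative (HF template, Megatron template) pairs; {i} is the layer index.
RULES = [
    ("model.layers.{i}.self_attn.o_proj.weight",
     "decoder.layers.{i}.self_attention.linear_proj.weight"),
    ("model.layers.{i}.mamba.in_proj.weight",
     "decoder.layers.{i}.mixer.in_proj.weight"),
]

def convert(name, to_mcore=True):
    """Translate one weight name in either direction from the same table."""
    for hf, mcore in RULES:
        src, dst = (hf, mcore) if to_mcore else (mcore, hf)
        # Turn the template into a regex that captures the layer index.
        pattern = re.escape(src).replace(r"\{i\}", r"(\d+)")
        m = re.fullmatch(pattern, name)
        if m:
            return dst.format(i=m.group(1))
    raise KeyError(name)

print(convert("model.layers.3.self_attn.o_proj.weight"))
# decoder.layers.3.self_attention.linear_proj.weight
```

Driving both directions from one table keeps the Hugging Face → Megatron and Megatron → Hugging Face paths consistent by construction.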

Flexible Architecture Configuration

Developers can now independently configure the ratio of parallel hybrid layers, pure Mamba layers, attention-only layers, and MLP-only layers within their models, enabling flexible architecture exploration and experimentation. The implementation includes non-learnable maximal update parametrization (µP) multipliers for stable and efficient training across heterogeneous components.
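As a rough sketch of what non-learnable µP multipliers look like, the function below applies the generic µP heuristic of scaling hidden-to-hidden outputs inversely with the width ratio to a small base model. The formulas and multiplier names are the textbook µP recipe, not Falcon-H1's actual values.

```python
# Sketch of fixed (non-learnable) muP-style multipliers: as d_model grows
# relative to a base width, output scales shrink proportionally so update
# sizes stay comparable across model widths.  Names and d_base are
# illustrative assumptions.
def mup_multipliers(d_model, d_base=256):
    width_mult = d_model / d_base
    return {
        "embedding": 1.0,              # embeddings keep unit scale
        "attn_out": 1.0 / width_mult,  # hidden-to-hidden outputs shrink with width
        "ssm_out": 1.0 / width_mult,
        "logits": 1.0 / width_mult,    # unembedding output also scales down
    }

print(mup_multipliers(1024)["attn_out"])
# 0.25
```

Because the multipliers are fixed constants rather than trained parameters, hyperparameters tuned on a narrow model transfer more reliably across the heterogeneous attention/SSM/MLP components at larger widths.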

BitNet Ternary Quantization Support

Megatron Core now integrates BitNet, enabling training with ternary (1.58-bit) quantized weights for edge-oriented models. This implementation:

  • Replaces standard linear layers with BitNetColumnParallelLinear and BitNetRowParallelLinear layers
  • Uses optimized Triton kernels for efficient computation
  • Maintains full tensor and pipeline parallelism support
  • Reduces memory and bandwidth usage while preserving model throughput
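The ternary quantization step itself follows the published BitNet b1.58 "absmean" recipe: scale by the mean absolute weight, then round each entry to the nearest of {-1, 0, +1}. The sketch below shows that recipe in plain NumPy; Megatron Core's `BitNetColumnParallelLinear`/`BitNetRowParallelLinear` wrap the same idea in optimized Triton kernels.

```python
import numpy as np

def ternary_quantize(w, eps=1e-6):
    """Absmean ternary quantization (BitNet b1.58 style):
    w_q = clip(round(w / mean(|w|)), -1, 1), returned with its scale."""
    scale = np.abs(w).mean() + eps     # eps guards against an all-zero matrix
    w_q = np.clip(np.round(w / scale), -1.0, 1.0)
    return w_q, scale

w = np.array([[0.9, -0.1, 0.4],
              [-0.8, 0.05, -0.45]])
w_q, scale = ternary_quantize(w)
print(w_q)
# [[ 1.  0.  1.]
#  [-1.  0. -1.]]
```

Since every weight lands in {-1, 0, +1}, matrix multiplies reduce to additions and subtractions scaled by a single per-tensor constant, which is where the memory and bandwidth savings come from.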

These contributions demonstrate how Megatron Core's extensible architecture enables community-driven enhancements for cutting-edge model training approaches.