NVIDIA
NVIDIA Megatron Core adds Falcon-H1 hybrid architecture support for parallel Transformer-Mamba processing
· feature · platform · api · integration · open-source · developer.nvidia.com ↗

Falcon-H1 Hybrid Architecture Integration

NVIDIA Megatron Core now supports the Falcon-H1 parallel hybrid architecture developed by the Technology Innovation Institute (TII). This architecture represents a significant departure from sequential layer stacking, instead letting Transformer attention and Mamba-2 state-space model (SSM) components process the input simultaneously within each core block.

Key Technical Innovations

Parallel Processing Design: Unlike sequential hybrid models, Falcon-H1 runs its attention and SSM branches in parallel on the same input and concatenates their outputs before the final projection. This design combines the SSM branch's efficient long-context memory with attention's precise long-range dependency modeling.
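The branch-concatenate-project flow can be sketched in a few lines of PyTorch. This is a toy illustration of the parallel idea only, not Megatron Core's `ParallelHybridLayer`: the "SSM" branch is stood in for by a gated depthwise convolution, since a real Mamba-2 mixer would be far longer.

```python
import torch
import torch.nn as nn

class ParallelHybridBlockSketch(nn.Module):
    """Toy parallel hybrid block: an attention branch and an SSM-like
    branch process the same normalized input simultaneously; their
    outputs are concatenated and projected back to the model width."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Attention branch (stand-in for Megatron's attention layer).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # SSM branch stand-in: a gated depthwise 1-D convolution plays
        # Mamba's sequence-mixing role for illustration only.
        self.ssm_conv = nn.Conv1d(d_model, d_model, kernel_size=4,
                                  padding=3, groups=d_model)
        self.ssm_gate = nn.Linear(d_model, d_model)
        # Final projection over the concatenated branch outputs.
        self.out_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)            # (B, T, D)
        c = self.ssm_conv(h.transpose(1, 2))        # left-padded conv
        c = c[..., : h.size(1)].transpose(1, 2)     # trim to seq length
        ssm_out = torch.sigmoid(self.ssm_gate(h)) * c
        # Concatenate both branches, then project: the Falcon-H1 idea.
        return x + self.out_proj(torch.cat([attn_out, ssm_out], dim=-1))

block = ParallelHybridBlockSketch(d_model=32)
y = block(torch.randn(2, 16, 32))
print(y.shape)  # torch.Size([2, 16, 32])
```

The key structural point is that both branches read the same hidden state, so neither is conditioned on the other's output, unlike a sequential stack.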

Flexible Architecture Configuration: The ratio of parallel hybrid layers, pure Mamba layers, attention-only layers, and MLP-only layers can be independently configured, enabling flexible exploration and optimization of model architectures.
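Per-layer allocation like this is often driven by a pattern string, one symbol per layer. The sketch below shows how such a string could map to the four layer types the article lists; the symbol set here is hypothetical, not Megatron Core's actual pattern vocabulary.

```python
# Hypothetical symbol table: one character per layer type.
LAYER_TYPES = {
    "P": "parallel_hybrid",   # attention + Mamba in parallel
    "M": "mamba",             # pure Mamba layer
    "*": "attention",         # attention-only layer
    "-": "mlp",               # MLP-only layer
}

def allocate_layers(pattern: str) -> list[str]:
    """Map a pattern string such as 'PPM*-' to a list of layer types,
    rejecting any symbol outside the table."""
    try:
        return [LAYER_TYPES[s] for s in pattern]
    except KeyError as e:
        raise ValueError(f"unknown layer symbol: {e.args[0]!r}") from None

layout = allocate_layers("PPM*-PPM*-")
print(layout[:3])  # ['parallel_hybrid', 'parallel_hybrid', 'mamba']
```

Changing the ratio of `P`, `M`, `*`, and `-` symbols is then all it takes to sweep over architecture variants.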

Two-Repository Integration:

  • Megatron Core contributions: ParallelHybridLayer for running Mamba and attention in parallel, updated layer allocation logic with PARALLEL symbol support, and checkpoint conversion tools
  • Megatron Bridge contributions: FalconH1Layer extending the parallel design with MLP components, bidirectional Hugging Face weight conversion, and model providers for 0.5B, 1.5B-Deep, 7B, and 34B variants

BitNet Ternary Quantization Support

The update includes BitNet integration, enabling ternary (1.58-bit) quantized weight training for Falcon Edge models. Implemented via specialized BitNetColumnParallelLinear and BitNetRowParallelLinear layers backed by Triton kernels, this approach preserves tensor and pipeline parallelism while reducing memory and bandwidth usage without sacrificing model throughput.
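The core of ternary quantization can be shown with the absmean scheme used in BitNet-style training: scale weights by their mean absolute value, round, and clip to {-1, 0, +1}. This is a plain PyTorch sketch of the numerics only; the Triton kernels in the actual integration fuse these steps for speed.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization: divide by the mean absolute
    weight, round, and clip so every entry lands in {-1, 0, +1}.
    Returns the ternary tensor and the per-tensor scale."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

def ternary_linear(x: torch.Tensor, w_q: torch.Tensor,
                   scale: torch.Tensor) -> torch.Tensor:
    # Dequantize on the fly; a fused kernel would avoid materializing
    # the full-precision product and exploit the ternary structure.
    return x @ (w_q * scale).t()

w = torch.randn(8, 16)
w_q, s = ternary_quantize(w)
y = ternary_linear(torch.randn(4, 16), w_q, s)
print(sorted(w_q.unique().tolist()))  # subset of [-1.0, 0.0, 1.0]
```

Because each weight needs only ~1.58 bits (log2 of three states) plus one scale per tensor, memory and bandwidth drop sharply relative to 16-bit weights.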

Developer Action Items

  • Review the ParallelHybridLayer implementation in Megatron-LM for custom hybrid architecture development
  • Use the checkpoint conversion tools to load pre-trained Hugging Face Falcon-H1 models into Megatron
  • Explore configurable layer ratios to optimize model architectures for specific use cases
  • Consider BitNet quantization for edge deployment scenarios