Overview
H Company has released Holotron-12B, a 12-billion-parameter multimodal model designed specifically for computer-use agent applications. The model is now available on Hugging Face. Unlike most multimodal models, which are optimized for static vision tasks or instruction following, Holotron-12B targets the distinct demands of agentic workloads that require perception, decision-making, and action in interactive environments.
Architecture & Performance
The key innovation is Holotron-12B's hybrid State-Space Model (SSM) and attention architecture, which delivers significant efficiency gains over purely transformer-based approaches:
- Superior long-context handling: SSMs avoid the quadratic computation costs of full attention mechanisms, making them ideal for agentic workflows involving multiple images and lengthy interaction histories
- Dramatically reduced memory footprint: Unlike vanilla attention, which caches key (K) and value (V) activations for every token at every layer, SSMs maintain only a fixed-size state per layer per sequence, independent of context length
- High-throughput inference: The architecture is optimized for production-scale serving with efficient token generation
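The memory claim above can be made concrete with a back-of-the-envelope calculation. The sketch below compares the linearly growing KV cache of an attention layer with the fixed-size recurrent state of an SSM layer; all hyperparameters (layer count, head count, head and state dimensions) are illustrative assumptions, not Holotron-12B's actual configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Attention caches one K and one V vector per token, per layer.
    # Memory therefore grows linearly with context length.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

def ssm_state_bytes(n_layers, d_inner, d_state, dtype_bytes=2):
    # An SSM layer keeps a single fixed-size recurrent state per sequence,
    # regardless of how many tokens have been processed.
    return n_layers * d_inner * d_state * dtype_bytes

# Hypothetical config: 32 layers, 8 KV heads of dim 128,
# SSM inner dim 8192, state dim 128, fp16 (2 bytes per element).
ctx = 131_072  # a long agentic interaction history
kv = kv_cache_bytes(ctx, 32, 8, 128)    # tens of gigabytes at this context
ssm = ssm_state_bytes(32, 8192, 128)    # tens of megabytes, context-independent
```

Doubling the context doubles the attention cache but leaves the SSM state unchanged, which is why the hybrid architecture suits multi-image, long-history agent workloads.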
Foundation & Development
The model was post-trained from NVIDIA's open-source Nemotron-Nano-2 VL foundation model using H Company's proprietary data mixture, demonstrating how targeted post-training can significantly expand a base model's capabilities. H Company is part of the NVIDIA Inception Program.
Getting Started
Developers can download and evaluate Holotron-12B from the Hugging Face model repository. The model is ready for integration into computer-use agent systems and supports long contexts with multiple concurrent images.
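A loading sketch using the Hugging Face `transformers` auto classes is shown below. The repo id `"HCompany/Holotron-12B"` is a placeholder guess, and whether the model uses `AutoProcessor`/`AutoModelForCausalLM` with `trust_remote_code` is an assumption; check the actual model card for the correct repository name and loading recipe.

```python
def load_holotron(repo_id: str = "HCompany/Holotron-12B"):
    """Download the model from Hugging Face and return (model, processor).

    The default repo_id is hypothetical; verify it against the real
    Hugging Face repository before running.
    """
    # Imported lazily so the sketch can be defined without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        trust_remote_code=True,  # custom hybrid SSM/attention code likely ships with the repo
        device_map="auto",       # spread the 12B weights across available GPUs
    )
    return model, processor

if __name__ == "__main__":
    model, processor = load_holotron()
```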