Overview
H Company has released Holotron-12B, a multimodal computer-use agent model now available on Hugging Face. Unlike traditional vision-language models optimized for static image understanding, Holotron-12B is purpose-built as a policy model for interactive agents that must perceive, decide, and act efficiently in real-time environments.
Architecture & Performance
The key innovation behind Holotron-12B is its hybrid State-Space Model (SSM) and attention architecture, which dramatically improves inference efficiency:
- Reduced memory footprint: each SSM layer keeps a fixed-size recurrent state, whereas a transformer's KV cache grows with both token count and layer count
- Eliminated quadratic complexity: avoids the O(n²) cost of full attention over the sequence, which matters most for agentic workloads that accumulate multiple images and long interaction histories
- Long-context handling: Enables efficient processing of extended interaction histories and multiple images without memory explosion
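To make the memory contrast above concrete, here is a back-of-the-envelope sketch comparing how a transformer's KV cache grows with sequence length while an SSM's recurrent state stays fixed. The layer, head, and dimension counts below are illustrative placeholders, not Holotron-12B's actual configuration.

```python
# Rough memory comparison: transformer KV cache vs. SSM recurrent state.
# All model-shape numbers are made-up placeholders for illustration only.

def kv_cache_bytes(seq_len: int, num_layers: int = 40, num_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Transformer KV cache: keys + values for every token at every layer."""
    return 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_elem

def ssm_state_bytes(num_layers: int = 40, d_model: int = 4096,
                    state_dim: int = 16, bytes_per_elem: int = 2) -> int:
    """SSM state: one fixed-size tensor per layer, independent of sequence length."""
    return num_layers * d_model * state_dim * bytes_per_elem

if __name__ == "__main__":
    for seq_len in (1_000, 10_000, 100_000):
        kv = kv_cache_bytes(seq_len) / 2**20   # MiB
        ssm = ssm_state_bytes() / 2**20        # MiB, same for every seq_len
        print(f"{seq_len:>7} tokens: KV cache {kv:>9.1f} MiB, SSM state {ssm:.1f} MiB")
```

Under these placeholder shapes, the KV cache scales linearly with context length while the SSM state is constant, which is the core of the long-context argument above.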
Foundation & Training
Holotron-12B was built by post-training NVIDIA's open-source Nemotron-Nano-2 VL model on H Company's proprietary data mixture. The model is the product of a collaboration between H Company's research labs and NVIDIA; H Company participates in the NVIDIA Inception Program.
Action Items for Developers
- Access the model on Hugging Face to evaluate its performance on agent benchmarks
- Consider Holotron-12B for computer-use agent deployments requiring high throughput and long-context support
- Benchmark against existing multimodal models in your specific interactive/agentic use cases
The model is positioned as a production-grade alternative to larger, less efficient vision-language models for agent workloads.