Overview
H Company has released Holotron-12B, a 12-billion-parameter multimodal model designed specifically for computer-use agent applications. The model is now available on Hugging Face. Unlike most multimodal models, which are optimized for static vision tasks or instruction following, Holotron-12B targets the distinct demands of agentic workloads that require perception, decision-making, and action in interactive environments.
Architecture & Performance
The key innovation is Holotron-12B's hybrid State-Space Model (SSM) and attention architecture, which delivers significant efficiency gains over purely transformer-based approaches:
- Superior long-context handling: SSMs avoid the quadratic computation costs of full attention mechanisms, making them ideal for agentic workflows involving multiple images and lengthy interaction histories
- Dramatically reduced memory footprint: Unlike vanilla attention, which caches key (K) and value (V) activations for every token at every layer, SSMs maintain only a fixed-size state per layer per sequence, independent of context length
- High-throughput inference: The architecture is optimized for production-scale serving with efficient token generation
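The memory claim above can be made concrete with a back-of-the-envelope calculation. The sketch below compares the linearly growing KV cache of an attention layer with the fixed-size recurrent state of an SSM layer; all hyperparameters (layer count, head count, head and state dimensions) are illustrative assumptions, not Holotron-12B's actual configuration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Attention caches one K and one V vector per token, per layer.
    # Memory therefore grows linearly with context length.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

def ssm_state_bytes(n_layers, d_inner, d_state, dtype_bytes=2):
    # An SSM layer keeps a single fixed-size recurrent state per sequence,
    # regardless of how many tokens have been processed.
    return n_layers * d_inner * d_state * dtype_bytes

# Hypothetical config: 32 layers, 8 KV heads of dim 128,
# SSM inner dim 8192, state dim 128, fp16 (2 bytes per element).
ctx = 131_072  # a long agentic interaction history
kv = kv_cache_bytes(ctx, 32, 8, 128)    # tens of gigabytes at this context
ssm = ssm_state_bytes(32, 8192, 128)    # tens of megabytes, context-independent
```

Doubling the context doubles the attention cache but leaves the SSM state unchanged, which is why the hybrid architecture suits multi-image, long-history agent workloads.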
Foundation & Development
The model was post-trained from NVIDIA's open-source Nemotron-Nano-2 VL foundation model using H Company's proprietary data mixture, demonstrating how targeted post-training can significantly expand a base model's capabilities. H Company is part of the NVIDIA Inception Program.
Getting Started
Developers can download and evaluate Holotron-12B from the Hugging Face model repository. The model is ready for integration into computer-use agent systems and supports long contexts with multiple concurrent images.
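A loading sketch using the Hugging Face `transformers` auto classes is shown below. The repo id `"HCompany/Holotron-12B"` is a placeholder guess, and whether the model uses `AutoProcessor`/`AutoModelForCausalLM` with `trust_remote_code` is an assumption; check the actual model card for the correct repository name and loading recipe.

```python
def load_holotron(repo_id: str = "HCompany/Holotron-12B"):
    """Download the model from Hugging Face and return (model, processor).

    The default repo_id is hypothetical; verify it against the real
    Hugging Face repository before running.
    """
    # Imported lazily so the sketch can be defined without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        trust_remote_code=True,  # custom hybrid SSM/attention code likely ships with the repo
        device_map="auto",       # spread the 12B weights across available GPUs
    )
    return model, processor

if __name__ == "__main__":
    model, processor = load_holotron()
```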