A Better Way to Point
Most vision-language models locate objects by generating coordinates as text or using token bins—an unnatural approach that requires models to memorize external coordinate systems. MolmoPoint replaces this with a more intuitive mechanism: instead of spelling out coordinates, the model points by directly selecting visual tokens from its internal representations.
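As a rough illustration of the idea (not MolmoPoint's actual implementation), selecting a visual token amounts to choosing a patch in the image grid, which maps directly back to pixel coordinates. The two-stage coarse/fine scheme and grid sizes below are assumptions for the sketch:

```python
# Hypothetical sketch of token-selection pointing. Grid sizes and the
# coarse/fine staging are illustrative assumptions, not MolmoPoint's
# actual architecture.

def patch_center(index: int, grid: int, size: float, origin=(0.0, 0.0)):
    """Pixel center of the patch at a flat index within a grid x grid layout."""
    row, col = divmod(index, grid)
    cell = size / grid
    return (origin[0] + (col + 0.5) * cell, origin[1] + (row + 0.5) * cell)

# Stage 1: a coarse token picks one of 8x8 regions of a 1024px image.
cx, cy = patch_center(index=19, grid=8, size=1024)
# Stage 2: a fine token picks a sub-cell inside that 128px region.
region_origin = (cx - 64, cy - 64)
fx, fy = patch_center(index=5, grid=4, size=128, origin=region_origin)
```

The point is that each token is a choice among positions the model already represents internally, rather than digits it must spell out and the reader must parse.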
The architecture uses three special tokens to achieve coarse-to-fine grounding, narrowing from a broad region of the image down to a precise point.
Three Specialized Models and New Datasets
AI2 is releasing three open-source models optimized for different tasks:
- MolmoPoint-8B: General-purpose image and video grounding
- MolmoPoint-GUI-8B: Specialized for UI elements in apps and websites
- MolmoPoint-Vid-4B: Optimized for video tracking and pointing
The release includes MolmoPoint-GUISyn, a new synthetic dataset of 36,000 high-resolution screenshots with over 2 million annotated UI points, and MolmoPoint-TrackData, an augmented tracking dataset with human annotations and synthetic video sequences featuring complex occlusions and motion.
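To make the shape of such data concrete, a point-annotation record for a screenshot might look like the following. The field names here are illustrative assumptions, not the released MolmoPoint-GUISyn schema:

```python
# Illustrative record shape for a screenshot point-annotation dataset.
# Field names are assumptions, not the actual MolmoPoint-GUISyn schema.
record = {
    "image": "screens/00001.png",
    "width": 2560,
    "height": 1440,
    "points": [
        {"label": "Submit button", "x": 1310, "y": 902},
        {"label": "Search field", "x": 640, "y": 88},
    ],
}

def normalize(rec):
    """Scale pixel points to [0, 1] so annotations are resolution-independent."""
    return [(p["x"] / rec["width"], p["y"] / rec["height"]) for p in rec["points"]]
```

Normalized coordinates let the same annotation apply across the varying resolutions a high-resolution screenshot corpus contains.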
Performance Gains Across Benchmarks
MolmoPoint-8B achieves state-of-the-art results on standard grounding benchmarks:
- PointBench: 70.7% average accuracy (up from 68.7% for Molmo 2 8B), with especially strong gains in spatial reasoning (~5-point improvement)
- PixMo-Points: 89.2 F1 score (vs. 85.2 for Molmo 2)
- GUI grounding: MolmoPoint-GUI-8B reaches 61.1 on ScreenSpot-Pro and 70.0 on OSWorldG, state-of-the-art among fully open models
- Video tracking: MolmoPoint-8B shows a +5.7 J&F improvement on Molmo2-Track, with state-of-the-art performance on MeViS
The grounding token approach also improves training efficiency—with just 8,192 examples, it outperforms coordinate-based baselines by ~20 F1 points and reaches peak performance faster during full pretraining.
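For reference, a point-set F1 of the kind reported above can be computed by matching predictions one-to-one to ground-truth points within a distance threshold. The greedy matching and 10-pixel threshold below are assumptions; the benchmarks' exact protocols may differ:

```python
# Minimal point-matching F1 sketch: greedy one-to-one matching within a
# pixel-distance threshold. The matching rule and threshold are assumptions.
import math

def point_f1(pred, gold, thresh=10.0):
    """F1 between predicted and ground-truth (x, y) points."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        # Match to the nearest still-unmatched ground-truth point.
        best = min(unmatched, key=lambda g: math.dist(p, g), default=None)
        if best is not None and math.dist(p, best) <= thresh:
            unmatched.remove(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A metric like this rewards both finding every target (recall) and not hallucinating extra points (precision), which is why it is a natural fit for multi-point grounding tasks.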
Why This Matters
Accurate pointing is foundational for practical AI applications: robotics grasping, autonomous UI navigation, object tracking across video frames, and visual reasoning. By using native grounding tokens instead of coordinate text generation, MolmoPoint is simpler for models to learn, uses fewer tokens per point (3 vs 8), and produces cleaner behavior with better localization of small targets. All code, models, and datasets are open source and available on Hugging Face.