A Better Way to Point
Traditional vision-language models handle spatial grounding by generating coordinates as text or token bins—an approach that requires models to learn an artificial coordinate system, consumes many output tokens, and degrades at high resolutions. AI2's MolmoPoint takes a fundamentally different approach by letting models point directly to regions within their internal visual feature space.
The architecture uses a coarse-to-fine selection mechanism built around three special tokens: instead of emitting coordinate text, the model selects directly among its own visual features to localize each point.
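To make the contrast with coordinate-as-text generation concrete, here is a minimal sketch of pointing by selecting over visual features: a coarse patch is chosen by similarity, then refined within a sub-grid. The grid size, feature dimension, and two-stage split are illustrative assumptions, not the released architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 24x24 grid of coarse patches, each holding a
# d-dimensional feature from the vision encoder (values faked here).
GRID, DIM = 24, 64
patch_feats = rng.standard_normal((GRID * GRID, DIM))

def point(query):
    """Coarse-to-fine pointing sketch: pick the best-matching coarse
    patch by dot-product similarity, then refine within a 4x4 sub-grid
    of that patch. Stage granularity is an assumption for illustration."""
    # Coarse stage: one selection picks a patch from the visual features.
    coarse = int(np.argmax(patch_feats @ query))
    row, col = divmod(coarse, GRID)
    # Fine stage: a second selection refines to a sub-cell; the
    # sub-grid features are faked with random values here.
    sub_feats = rng.standard_normal((16, DIM))
    fine = int(np.argmax(sub_feats @ query))
    frow, fcol = divmod(fine, 4)
    # Recover fractional image coordinates from the two selections.
    y = (row + (frow + 0.5) / 4) / GRID
    x = (col + (fcol + 0.5) / 4) / GRID
    return x, y

x, y = point(rng.standard_normal(DIM))
assert 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
```

Because each stage is a selection over features the model already computes, no artificial coordinate vocabulary has to be learned.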
Key Improvements and Technical Advances
The system introduces several important refinements:
- Rotary embeddings encode patch distances, helping models generate points in consistent order and avoid redundant selections
- No-more-points class allows explicit stopping rather than forced continued selection
- Token efficiency: reduced from 8 tokens per point to just 3
- Faster learning: MolmoPoint outperforms coordinate baselines by ~20 F1 points with only 8,192 training examples
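The relative-position property that rotary embeddings provide can be shown in a minimal 2D sketch: rotating patch features by their row and column positions makes pairwise similarities depend only on the offset between patches, which is what lets the model reason about patch distances. The dimension split and frequency base below are common RoPE conventions, not MolmoPoint's exact configuration.

```python
import numpy as np

def rope_2d(feats, rows, cols, base=10000.0):
    """Minimal 2D rotary-embedding sketch: rotate the first half of the
    feature dims by row position and the second half by column position.
    Split and base are generic conventions, assumed for illustration."""
    d = feats.shape[-1]
    half = d // 2
    out = feats.copy()
    for offset, pos in ((0, rows), (half, cols)):
        quarter = half // 2
        freqs = base ** (-np.arange(quarter) / quarter)
        angles = pos[:, None] * freqs[None, :]          # (n, quarter)
        cos, sin = np.cos(angles), np.sin(angles)
        a = feats[:, offset:offset + quarter]
        b = feats[:, offset + quarter:offset + half]
        out[:, offset:offset + quarter] = a * cos - b * sin
        out[:, offset + quarter:offset + half] = a * sin + b * cos
    return out

# Property check: similarity depends only on the relative offset.
rng = np.random.default_rng(2)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))

def sim(rq, cq, rk, ck):
    return float(rope_2d(q, np.array([rq]), np.array([cq]))
                 @ rope_2d(k, np.array([rk]), np.array([ck])).T)

assert np.isclose(sim(0, 0, 2, 3), sim(5, 1, 7, 4))  # same (+2, +3) offset
```

Because similarity is translation-invariant over the patch grid, nearby patches score consistently regardless of where in the image they fall, which supports ordered, non-redundant point generation.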
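The explicit stopping behavior can be sketched as a greedy decoding loop whose output vocabulary carries one extra no-more-points class; the vocabulary size and greedy rule are assumptions for illustration, and a single index stands in for the three tokens a real point costs.

```python
import numpy as np

rng = np.random.default_rng(1)

NUM_PATCHES = 576   # hypothetical 24x24 coarse grid
STOP = NUM_PATCHES  # extra "no-more-points" class appended to the classes

def decode_points(logits_per_step, max_points=10):
    """Greedy decoding sketch with an explicit stop class: instead of
    being forced to emit another point, the model may select the
    no-more-points class and end the sequence."""
    points = []
    for logits in logits_per_step[:max_points]:
        choice = int(np.argmax(logits))
        if choice == STOP:
            break
        points.append(choice)
    return points

# Fake three decoding steps that pick two patches, then stop.
steps = []
for target in (100, 407, STOP):
    logits = rng.standard_normal(NUM_PATCHES + 1)
    logits[target] = 100.0  # force the intended choice
    steps.append(logits)

print(decode_points(steps))  # → [100, 407]
```

An explicit stop class lets the model express "nothing further to point at," which matters for counting-style queries where the number of valid targets is unknown in advance.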
Three Specialized Models and New Datasets
AI2 released three open-source models optimized for different tasks:
- MolmoPoint-8B: General-purpose image and video grounding
- MolmoPoint-GUI-8B: Specialized for software interfaces, apps, and websites
- MolmoPoint-Vid-4B: Optimized for video pointing and tracking
Supporting these models are two major datasets:
- MolmoPoint-GUISyn: 36,000 high-resolution synthetic screenshots spanning desktop, web, and mobile environments with 2+ million annotated points
- MolmoPoint-TrackData: Enhanced video tracking dataset with human-annotated tracks and synthetically generated sequences featuring complex occlusion and motion dynamics
Benchmark Results
MolmoPoint demonstrates substantial improvements across diverse evaluation domains:
- Natural images (PointBench): 70.7% accuracy vs. 68.7% for Molmo 2 (8B), with particularly strong gains in reasoning tasks (+5 points)
- Image grounding (PixMo-Points): 89.2 F1 vs. 85.2 for Molmo 2 (8B)
- GUI grounding (ScreenSpot-Pro/OSWorldG): 61.1 and 70.0 respectively—state-of-the-art among fully open models
- Video: Improved counting metrics and 59.1% human preference win rate; MolmoPoint-Vid-4B achieves 58.7% close-accuracy on video counting
- Tracking (MeViS/Molmo2-Track): State-of-the-art results with +5.7 J&F improvement
Why This Matters
Accurate spatial grounding underpins critical capabilities including software interface navigation, robotic manipulation, object tracking in video, and visual reasoning. By replacing coordinate text generation with direct visual token selection, MolmoPoint offers a more natural, efficient, and learnable pointing mechanism. The approach suggests the field may have been using an unnecessarily complex abstraction for spatial grounding, and similar principles could extend to grounding in other modalities.