A Better Way to Point
Traditional vision-language models handle spatial grounding by generating coordinates as text or token bins—an approach that requires models to learn an artificial coordinate system, consumes many output tokens, and degrades at high resolutions. AI2's MolmoPoint takes a fundamentally different approach by letting models point directly to regions within their internal visual feature space.
The architecture uses a coarse-to-fine selection mechanism built around three special tokens: instead of emitting coordinate text, the model selects directly among its own visual features to localize each point.
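To make the contrast with coordinate-as-text generation concrete, here is a minimal sketch of pointing by selecting over visual features: a coarse patch is chosen by similarity, then refined within a sub-grid. The grid size, feature dimension, and two-stage split are illustrative assumptions, not the released architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 24x24 grid of coarse patches, each holding a
# d-dimensional feature from the vision encoder (values faked here).
GRID, DIM = 24, 64
patch_feats = rng.standard_normal((GRID * GRID, DIM))

def point(query):
    """Coarse-to-fine pointing sketch: pick the best-matching coarse
    patch by dot-product similarity, then refine within a 4x4 sub-grid
    of that patch. Stage granularity is an assumption for illustration."""
    # Coarse stage: one selection picks a patch from the visual features.
    coarse = int(np.argmax(patch_feats @ query))
    row, col = divmod(coarse, GRID)
    # Fine stage: a second selection refines to a sub-cell; the
    # sub-grid features are faked with random values here.
    sub_feats = rng.standard_normal((16, DIM))
    fine = int(np.argmax(sub_feats @ query))
    frow, fcol = divmod(fine, 4)
    # Recover fractional image coordinates from the two selections.
    y = (row + (frow + 0.5) / 4) / GRID
    x = (col + (fcol + 0.5) / 4) / GRID
    return x, y

x, y = point(rng.standard_normal(DIM))
assert 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
```

Because each stage is a selection over features the model already computes, no artificial coordinate vocabulary has to be learned.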
Key Improvements and Technical Advances
The system introduces several important refinements:
- Rotary embeddings encode patch distances, helping models generate points in consistent order and avoid redundant selections
- No-more-points class allows explicit stopping rather than forced continued selection
- Token efficiency: reduced from 8 tokens per point to just 3
- Faster learning: MolmoPoint outperforms coordinate baselines by ~20 F1 points with only 8,192 training examples
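The relative-position property that rotary embeddings provide can be shown in a minimal 2D sketch: rotating patch features by their row and column positions makes pairwise similarities depend only on the offset between patches, which is what lets the model reason about patch distances. The dimension split and frequency base below are common RoPE conventions, not MolmoPoint's exact configuration.

```python
import numpy as np

def rope_2d(feats, rows, cols, base=10000.0):
    """Minimal 2D rotary-embedding sketch: rotate the first half of the
    feature dims by row position and the second half by column position.
    Split and base are generic conventions, assumed for illustration."""
    d = feats.shape[-1]
    half = d // 2
    out = feats.copy()
    for offset, pos in ((0, rows), (half, cols)):
        quarter = half // 2
        freqs = base ** (-np.arange(quarter) / quarter)
        angles = pos[:, None] * freqs[None, :]          # (n, quarter)
        cos, sin = np.cos(angles), np.sin(angles)
        a = feats[:, offset:offset + quarter]
        b = feats[:, offset + quarter:offset + half]
        out[:, offset:offset + quarter] = a * cos - b * sin
        out[:, offset + quarter:offset + half] = a * sin + b * cos
    return out

# Property check: similarity depends only on the relative offset.
rng = np.random.default_rng(2)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))

def sim(rq, cq, rk, ck):
    return float(rope_2d(q, np.array([rq]), np.array([cq]))
                 @ rope_2d(k, np.array([rk]), np.array([ck])).T)

assert np.isclose(sim(0, 0, 2, 3), sim(5, 1, 7, 4))  # same (+2, +3) offset
```

Because similarity is translation-invariant over the patch grid, nearby patches score consistently regardless of where in the image they fall, which supports ordered, non-redundant point generation.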
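The explicit stopping behavior can be sketched as a greedy decoding loop whose output vocabulary carries one extra no-more-points class; the vocabulary size and greedy rule are assumptions for illustration, and a single index stands in for the three tokens a real point costs.

```python
import numpy as np

rng = np.random.default_rng(1)

NUM_PATCHES = 576   # hypothetical 24x24 coarse grid
STOP = NUM_PATCHES  # extra "no-more-points" class appended to the classes

def decode_points(logits_per_step, max_points=10):
    """Greedy decoding sketch with an explicit stop class: instead of
    being forced to emit another point, the model may select the
    no-more-points class and end the sequence."""
    points = []
    for logits in logits_per_step[:max_points]:
        choice = int(np.argmax(logits))
        if choice == STOP:
            break
        points.append(choice)
    return points

# Fake three decoding steps that pick two patches, then stop.
steps = []
for target in (100, 407, STOP):
    logits = rng.standard_normal(NUM_PATCHES + 1)
    logits[target] = 100.0  # force the intended choice
    steps.append(logits)

print(decode_points(steps))  # → [100, 407]
```

An explicit stop class lets the model express "nothing further to point at," which matters for counting-style queries where the number of valid targets is unknown in advance.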
Three Specialized Models and New Datasets
AI2 released three open-source models optimized for different tasks:
- MolmoPoint-8B: General-purpose image and video grounding
- MolmoPoint-GUI-8B: Specialized for software interfaces, apps, and websites
- MolmoPoint-Vid-4B: Optimized for video pointing and tracking
Supporting these models are two major datasets:
- MolmoPoint-GUISyn: 36,000 high-resolution synthetic screenshots spanning desktop, web, and mobile environments with 2+ million annotated points
- MolmoPoint-TrackData: Enhanced video tracking dataset with human-annotated tracks and synthetically generated sequences featuring complex occlusion and motion dynamics
Benchmark Results
MolmoPoint demonstrates substantial improvements across diverse evaluation domains:
- Natural images (PointBench): 70.7% accuracy vs. 68.7% for Molmo 2 (8B), with particularly strong gains in reasoning tasks (+5 points)
- Image grounding (PixMo-Points): 89.2 F1 vs. 85.2 for Molmo 2 (8B)
- GUI grounding (ScreenSpot-Pro/OSWorldG): 61.1 and 70.0 respectively—state-of-the-art among fully open models
- Video: Improved counting metrics and 59.1% human preference win rate; MolmoPoint-Vid-4B achieves 58.7% close-accuracy on video counting
- Tracking (MeViS/Molmo2-Track): State-of-the-art results with +5.7 J&F improvement
Why This Matters
Accurate spatial grounding underpins critical capabilities including software interface navigation, robotic manipulation, object tracking in video, and visual reasoning. By replacing coordinate text generation with direct visual token selection, MolmoPoint offers a more natural, efficient, and learnable pointing mechanism. The approach suggests the field may have been using an unnecessarily complex abstraction for spatial grounding, and similar principles could extend to grounding in other modalities.