A Better Way to Point
Most vision-language models locate objects by generating coordinates as text or using token bins—an unnatural approach that requires models to memorize external coordinate systems. MolmoPoint replaces this with a more intuitive mechanism: instead of spelling out coordinates, the model points by directly selecting visual tokens from its internal representations.
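As a rough illustration of the idea (not MolmoPoint's actual implementation), selecting a visual token amounts to choosing a patch in the image grid, which maps directly back to pixel coordinates. The two-stage coarse/fine scheme and grid sizes below are assumptions for the sketch:

```python
# Hypothetical sketch of token-selection pointing. Grid sizes and the
# coarse/fine staging are illustrative assumptions, not MolmoPoint's
# actual architecture.

def patch_center(index: int, grid: int, size: float, origin=(0.0, 0.0)):
    """Pixel center of the patch at a flat index within a grid x grid layout."""
    row, col = divmod(index, grid)
    cell = size / grid
    return (origin[0] + (col + 0.5) * cell, origin[1] + (row + 0.5) * cell)

# Stage 1: a coarse token picks one of 8x8 regions of a 1024px image.
cx, cy = patch_center(index=19, grid=8, size=1024)
# Stage 2: a fine token picks a sub-cell inside that 128px region.
region_origin = (cx - 64, cy - 64)
fx, fy = patch_center(index=5, grid=4, size=128, origin=region_origin)
```

The point is that each token is a choice among positions the model already represents internally, rather than digits it must spell out and the reader must parse.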
The architecture uses three special tokens to achieve coarse-to-fine grounding, narrowing from a broad region of the image down to a precise point.
Three Specialized Models and New Datasets
AI2 is releasing three open-source models optimized for different tasks:
- MolmoPoint-8B: General-purpose image and video grounding
- MolmoPoint-GUI-8B: Specialized for UI elements in apps and websites
- MolmoPoint-Vid-4B: Optimized for video tracking and pointing
The release includes MolmoPoint-GUISyn, a new synthetic dataset of 36,000 high-resolution screenshots with over 2 million annotated UI points, and MolmoPoint-TrackData, an augmented tracking dataset with human annotations and synthetic video sequences featuring complex occlusions and motion.
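To make the shape of such data concrete, a point-annotation record for a screenshot might look like the following. The field names here are illustrative assumptions, not the released MolmoPoint-GUISyn schema:

```python
# Illustrative record shape for a screenshot point-annotation dataset.
# Field names are assumptions, not the actual MolmoPoint-GUISyn schema.
record = {
    "image": "screens/00001.png",
    "width": 2560,
    "height": 1440,
    "points": [
        {"label": "Submit button", "x": 1310, "y": 902},
        {"label": "Search field", "x": 640, "y": 88},
    ],
}

def normalize(rec):
    """Scale pixel points to [0, 1] so annotations are resolution-independent."""
    return [(p["x"] / rec["width"], p["y"] / rec["height"]) for p in rec["points"]]
```

Normalized coordinates let the same annotation apply across the varying resolutions a high-resolution screenshot corpus contains.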
Performance Gains Across Benchmarks
MolmoPoint-8B achieves state-of-the-art results on standard grounding benchmarks:
- PointBench: 70.7% average accuracy (up from 68.7% for Molmo 2 8B), with especially strong gains in spatial reasoning (~5-point improvement)
- PixMo-Points: 89.2 F1 score (vs. 85.2 for Molmo 2)
- GUI grounding: MolmoPoint-GUI-8B reaches 61.1 on ScreenSpot-Pro and 70.0 on OSWorldG, state-of-the-art among fully open models
- Video tracking: MolmoPoint-8B shows a +5.7 J&F improvement on Molmo2-Track, with state-of-the-art performance on MeViS
The grounding token approach also improves training efficiency—with just 8,192 examples, it outperforms coordinate-based baselines by ~20 F1 points and reaches peak performance faster during full pretraining.
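For reference, a point-set F1 of the kind reported above can be computed by matching predictions one-to-one to ground-truth points within a distance threshold. The greedy matching and 10-pixel threshold below are assumptions; the benchmarks' exact protocols may differ:

```python
# Minimal point-matching F1 sketch: greedy one-to-one matching within a
# pixel-distance threshold. The matching rule and threshold are assumptions.
import math

def point_f1(pred, gold, thresh=10.0):
    """F1 between predicted and ground-truth (x, y) points."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        # Match to the nearest still-unmatched ground-truth point.
        best = min(unmatched, key=lambda g: math.dist(p, g), default=None)
        if best is not None and math.dist(p, best) <= thresh:
            unmatched.remove(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A metric like this rewards both finding every target (recall) and not hallucinating extra points (precision), which is why it is a natural fit for multi-point grounding tasks.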
Why This Matters
Accurate pointing is foundational for practical AI applications: robotics grasping, autonomous UI navigation, object tracking across video frames, and visual reasoning. By using native grounding tokens instead of coordinate text generation, MolmoPoint is simpler for models to learn, uses fewer tokens per point (3 vs 8), and produces cleaner behavior with better localization of small targets. All code, models, and datasets are open source and available on Hugging Face.