New Grounding Architecture for Vision-Language Models
AI2 has introduced MolmoPoint, a family of open-source vision-language models that fundamentally rethinks how models perform visual grounding and pointing tasks. Rather than generating coordinates as text or emitting tokens tied to fixed coordinate bins, MolmoPoint lets the model point by directly selecting regions of its own visual input features, a mechanism that keeps pointing aligned with what the model actually sees.
Three Specialized Models and New Datasets
The release includes three purpose-built models:
- MolmoPoint-8B: General-purpose model for image and video tasks
- MolmoPoint-GUI-8B: Specialized for software interfaces, apps, and websites
- MolmoPoint-Vid-4B: Optimized for video understanding
AI2 also released MolmoPoint-GUISyn, a new synthetic dataset of 36,000 high-resolution screenshots with over 2 million annotated points across desktop, web, and mobile environments. A second dataset, MolmoPoint-TrackData, augments previous video data with human-annotated tracks and synthetically generated sequences featuring complex occlusion and motion dynamics.
How It Works
MolmoPoint introduces a coarse-to-fine grounding mechanism using three special tokens: <PATCH>, <SUBPATCH>, and <LOCATION>. The model first selects a coarse image patch, refines it to a finer-grained subpatch, and finally predicts a precise location within that subpatch. This design directly ties pointing to the model's internal visual representations rather than to external coordinate formats.
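As a rough illustration, the coarse-to-fine triple can be mapped back to a pixel coordinate by composing offsets at each level. The grid sizes, index layout, and image resolution below are illustrative assumptions for the sketch, not the released implementation:

```python
# Hypothetical sketch of coarse-to-fine point decoding. A 16x16 coarse
# patch grid, 4x4 subpatch grid, and 4x4 location grid over a 1024 px
# image are assumptions chosen for readability.

def decode_point(patch_idx, subpatch_idx, loc_idx,
                 image_size=1024, patch_grid=16, subpatch_grid=4, loc_grid=4):
    """Map a (<PATCH>, <SUBPATCH>, <LOCATION>) index triple to pixel (x, y)."""
    patch_px = image_size // patch_grid   # 64 px per coarse patch
    sub_px = patch_px // subpatch_grid    # 16 px per subpatch
    loc_px = sub_px // loc_grid           # 4 px per location cell

    # Each index addresses a cell in a row-major 2D grid at its level.
    py, px = divmod(patch_idx, patch_grid)
    sy, sx = divmod(subpatch_idx, subpatch_grid)
    ly, lx = divmod(loc_idx, loc_grid)

    # Compose the offsets, taking the centre of the finest cell.
    x = px * patch_px + sx * sub_px + lx * loc_px + loc_px // 2
    y = py * patch_px + sy * sub_px + ly * loc_px + loc_px // 2
    return x, y
```

The key property the sketch captures is that each successive token narrows the same spatial hierarchy the vision encoder already uses, rather than mapping indices through an external coordinate format.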
Key technical innovations include:
- Rotary embeddings that encode distances between patches, helping the model generate points in a consistent order and avoid duplicate selections
- A dedicated no-more-points class that lets the model explicitly stop once all relevant elements have been identified
- Token efficiency: Reduces pointing expressions from 8 tokens to just 3
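To make the token-count comparison concrete, here is a toy contrast between a textual coordinate and a three-token grounding expression. The piece-wise tokenizer and the token names are illustrative assumptions; real tokenizers split coordinate strings differently, which is part of why textual coordinates are expensive:

```python
# Toy illustration of the token-cost difference between textual
# coordinates and grounding tokens. The regex "tokenizer" below is a
# simplifying assumption, not a real model vocabulary.
import re

def text_point_tokens(x, y):
    """Split a coordinate string like '(x=512, y=384)' into pieces."""
    return re.findall(r"\d+|[^\s]", f"(x={x}, y={y})")

def grounding_point_tokens(patch, subpatch, loc):
    """MolmoPoint-style pointing: one token per grounding level."""
    return [f"<PATCH:{patch}>", f"<SUBPATCH:{subpatch}>", f"<LOCATION:{loc}>"]

print(text_point_tokens(512, 384))        # nine pieces for one point
print(grounding_point_tokens(17, 5, 10))  # always exactly three tokens
```

The textual form also scales with the number of digits, whereas the grounding form is a fixed three tokens per point regardless of image resolution.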
Impressive Benchmark Performance
MolmoPoint demonstrates significant improvements across multiple evaluation frameworks:
Natural Image Grounding:
- PointBench: 70.7% accuracy (up from 68.7% for Molmo 2 8B)
- PixMo-Points: 89.2 F1 (up from 85.2 for Molmo 2 8B)
- Notable 5-point improvements in spatial reasoning tasks
GUI and Interface Grounding:
- ScreenSpot-Pro: 61.1 (state-of-the-art among open models)
- OSWorldG: 70.0 (state-of-the-art among open models)
- 2-9 point gains over an identically trained coordinate-based baseline
Video and Tracking:
- Wins human preference comparisons 59.1% of the time
- State-of-the-art results on MeViS benchmark
- +5.7 J&F improvement on Molmo2-Track
Training Efficiency and Practical Impact
Beyond benchmark numbers, grounding tokens prove significantly easier for models to learn. With just 8,192 training examples, MolmoPoint outperformed coordinate-based baselines by approximately 20 F1 points. During full pretraining, the model reaches peak pointing performance faster, suggesting that the token-based approach better matches how models naturally learn visual representations.
All models, code, and datasets are available open-source on Hugging Face, with interactive demos for both the general and GUI-specialized variants.