New Grounding Architecture for Vision-Language Models
AI2 has introduced MolmoPoint, a family of open-source vision-language models that fundamentally rethinks how models perform visual grounding and pointing tasks. Rather than generating coordinates as text or emitting tokens tied to fixed coordinate bins, MolmoPoint lets the model point by directly selecting regions of its own visual input features, a mechanism that keeps pointing aligned with what the model actually sees.
Three Specialized Models and New Datasets
The release includes three purpose-built models:
- MolmoPoint-8B: General-purpose model for image and video tasks
- MolmoPoint-GUI-8B: Specialized for software interfaces, apps, and websites
- MolmoPoint-Vid-4B: Optimized for video understanding
AI2 also released MolmoPoint-GUISyn, a new synthetic dataset of 36,000 high-resolution screenshots with over 2 million annotated points across desktop, web, and mobile environments. A second dataset, MolmoPoint-TrackData, augments previous video data with human-annotated tracks and synthetically generated sequences featuring complex occlusion and motion dynamics.
How It Works
MolmoPoint introduces a coarse-to-fine grounding mechanism using three special tokens: <PATCH>, <SUBPATCH>, and <LOCATION>. The model first selects a coarse image patch, refines it to a finer-grained subpatch, and finally predicts a precise location within that subpatch. This design directly ties pointing to the model's internal visual representations rather than to external coordinate formats.
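As a rough illustration, the coarse-to-fine triple can be mapped back to a pixel coordinate by composing offsets at each level. The grid sizes, index layout, and image resolution below are illustrative assumptions for the sketch, not the released implementation:

```python
# Hypothetical sketch of coarse-to-fine point decoding. A 16x16 coarse
# patch grid, 4x4 subpatch grid, and 4x4 location grid over a 1024 px
# image are assumptions chosen for readability.

def decode_point(patch_idx, subpatch_idx, loc_idx,
                 image_size=1024, patch_grid=16, subpatch_grid=4, loc_grid=4):
    """Map a (<PATCH>, <SUBPATCH>, <LOCATION>) index triple to pixel (x, y)."""
    patch_px = image_size // patch_grid   # 64 px per coarse patch
    sub_px = patch_px // subpatch_grid    # 16 px per subpatch
    loc_px = sub_px // loc_grid           # 4 px per location cell

    # Each index addresses a cell in a row-major 2D grid at its level.
    py, px = divmod(patch_idx, patch_grid)
    sy, sx = divmod(subpatch_idx, subpatch_grid)
    ly, lx = divmod(loc_idx, loc_grid)

    # Compose the offsets, taking the centre of the finest cell.
    x = px * patch_px + sx * sub_px + lx * loc_px + loc_px // 2
    y = py * patch_px + sy * sub_px + ly * loc_px + loc_px // 2
    return x, y
```

The key property the sketch captures is that each successive token narrows the same spatial hierarchy the vision encoder already uses, rather than mapping indices through an external coordinate format.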
Key technical innovations include:
- Rotary embeddings that encode distances between patches, helping the model generate points in a consistent order and avoid duplicate selections
- A dedicated no-more-points class that lets the model explicitly stop once all relevant elements have been identified
- Token efficiency: Reduces pointing expressions from 8 tokens to just 3
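To make the token-count comparison concrete, here is a toy contrast between a textual coordinate and a three-token grounding expression. The piece-wise tokenizer and the token names are illustrative assumptions; real tokenizers split coordinate strings differently, which is part of why textual coordinates are expensive:

```python
# Toy illustration of the token-cost difference between textual
# coordinates and grounding tokens. The regex "tokenizer" below is a
# simplifying assumption, not a real model vocabulary.
import re

def text_point_tokens(x, y):
    """Split a coordinate string like '(x=512, y=384)' into pieces."""
    return re.findall(r"\d+|[^\s]", f"(x={x}, y={y})")

def grounding_point_tokens(patch, subpatch, loc):
    """MolmoPoint-style pointing: one token per grounding level."""
    return [f"<PATCH:{patch}>", f"<SUBPATCH:{subpatch}>", f"<LOCATION:{loc}>"]

print(text_point_tokens(512, 384))        # nine pieces for one point
print(grounding_point_tokens(17, 5, 10))  # always exactly three tokens
```

The textual form also scales with the number of digits, whereas the grounding form is a fixed three tokens per point regardless of image resolution.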
Impressive Benchmark Performance
MolmoPoint demonstrates significant improvements across multiple evaluation frameworks:
Natural Image Grounding:
- PointBench: 70.7% accuracy (up from 68.7% for Molmo 2 8B)
- PixMo-Points: 89.2 F1 (up from 85.2 for Molmo 2 8B)
- Notable 5-point improvements in spatial reasoning tasks
GUI and Interface Grounding:
- ScreenSpot-Pro: 61.1 (state-of-the-art among open models)
- OSWorldG: 70.0 (state-of-the-art among open models)
- 2-9 point gains over an identically trained coordinate-based baseline
Video and Tracking:
- Wins human preference comparisons 59.1% of the time
- State-of-the-art results on MeViS benchmark
- +5.7 J&F improvement on Molmo2-Track
Training Efficiency and Practical Impact
Beyond benchmark numbers, grounding tokens prove significantly easier for models to learn. With just 8,192 training examples, MolmoPoint outperformed coordinate-based baselines by approximately 20 F1 points. During full pretraining, the model reaches peak pointing performance faster, suggesting that the token-based approach better matches how models naturally learn visual representations.
All models, code, and datasets are available open-source on Hugging Face, with interactive demos for both the general and GUI-specialized variants.