Healthcare Robotics Dataset Now Open
NVIDIA, in collaboration with a 35-organization steering committee, has released Open-H-Embodiment, the first large-scale open dataset designed specifically for training physical AI systems in healthcare robotics. The dataset comprises 778 hours of CC-BY-4.0-licensed training data spanning surgical robotics, ultrasound autonomy, and colonoscopy procedures. It mixes simulated and real clinical procedures, captured using commercial surgical robots (CMR Surgical, Rob Surgical, Tuodao) and research platforms (dVRK, Franka, Kuka).
The consortium includes leading academic and healthcare institutions: Johns Hopkins University, Stanford, UC Berkeley, Technical University of Munich, and 28 additional organizations worldwide. This collaborative effort addresses a critical gap in healthcare AI—while perception-based models dominate the field, real surgical autonomy requires embodied learning with contact dynamics, force feedback, and closed-loop control.
GR00T-H: Vision-Language-Action Model for Surgery
The first released model, GR00T-H, is a specialized Vision-Language-Action (VLA) policy model trained on approximately 600 hours of Open-H-Embodiment data. Built on NVIDIA's Isaac GR00T architecture with Cosmos Reason 2 2B as its vision-language backbone, GR00T-H introduces novel design choices to handle surgical robotics' unique challenges:
- Embodiment Projectors: Learnable MLPs map each robot's specific kinematics to a normalized action space, enabling cross-platform training
- State Dropout: Proprioceptive input is dropped entirely (100% dropout) at inference time, forcing the policy to rely on vision, which improves real-world performance
- Relative End-Effector Actions: Addresses kinematic inconsistencies across different surgical robots
- Metadata Injection: Instrument names and control mappings embedded in task prompts
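To make the first two design choices concrete, here is a minimal NumPy sketch of an embodiment projector and state dropout. All names, dimensions, and the MLP shape are illustrative assumptions, not the released architecture: the point is only that a small learnable MLP per robot can map differently-sized, robot-specific actions into one shared normalized action space, and that proprioceptive state can be zeroed out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(in_dim, hidden, out_dim):
    # Random initialization; in the real model these weights are learned.
    return {"w1": rng.standard_normal((in_dim, hidden)) * 0.1,
            "b1": np.zeros(hidden),
            "w2": rng.standard_normal((hidden, out_dim)) * 0.1,
            "b2": np.zeros(out_dim)}

def mlp_forward(p, x):
    h = np.tanh(x @ p["w1"] + p["b1"])
    return h @ p["w2"] + p["b2"]

SHARED_ACTION_DIM = 16  # hypothetical size of the normalized action space

# One projector per embodiment; the input dimensions below are made up
# (each platform has its own joint count / gripper parameterization).
projectors = {
    "dVRK":   mlp_params(in_dim=14, hidden=64, out_dim=SHARED_ACTION_DIM),
    "Franka": mlp_params(in_dim=8,  hidden=64, out_dim=SHARED_ACTION_DIM),
}

def project_action(robot, raw_action):
    """Map a robot-specific action into the shared normalized space."""
    return mlp_forward(projectors[robot], raw_action)

def drop_state(state, p=1.0):
    """State dropout: with probability p, zero the proprioceptive input
    so the policy must rely on vision (the release reports p = 1.0)."""
    return np.zeros_like(state) if rng.random() < p else state

# Actions from two different robots land in the same space.
a = project_action("dVRK", rng.standard_normal(14))
b = project_action("Franka", rng.standard_normal(8))
assert a.shape == b.shape == (SHARED_ACTION_DIM,)
```

Because both projectors emit vectors of the same size, a single downstream policy head can be trained across all platforms, which is what enables the cross-embodiment training the release describes.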
A prototype has completed end-to-end suturing tasks on the SutureBot benchmark, demonstrating long-horizon dexterity in complex surgical scenarios.
Cosmos-H: Surgical Simulation Without Real-World Constraints
The second release, Cosmos-H-Surgical-Simulator, is a World Foundation Model that generates physically plausible surgical video from kinematic actions alone. This addresses a major bottleneck: traditional physics simulators fail under real surgical conditions (soft tissue deformation, blood, smoke, reflections).
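The key idea, action-conditioned video generation with no physics engine in the loop, can be sketched as an autoregressive rollout. Everything below is a toy stand-in: `predict_next_frame` is a placeholder for the learned world model, and the resolution and action dimension are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64  # toy resolution, purely illustrative

def predict_next_frame(frame, action):
    """Stand-in for the world model's one-step prediction. In the real
    model a learned network maps (frame, kinematic action) -> next frame;
    here we only perturb the frame so the loop runs end to end."""
    return np.clip(frame + 0.01 * action.mean(), 0.0, 1.0)

def rollout(first_frame, actions):
    """Autoregressive video generation conditioned on a kinematic action
    sequence -- tissue behavior is learned implicitly, not simulated."""
    frames = [first_frame]
    for a in actions:
        frames.append(predict_next_frame(frames[-1], a))
    return np.stack(frames)

# 30 kinematic actions (7-DoF, an assumption) -> 31 video frames.
video = rollout(rng.random((H, W, 3)), [rng.standard_normal(7) for _ in range(30)])
assert video.shape == (31, H, W, 3)
```

The structure is what matters: because each frame is predicted from the previous frame plus a commanded action, effects like deformation, blood, and smoke come from the training data rather than from hand-built physics.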
Fine-tuned from NVIDIA's Cosmos Predict 2.5 2B model, Cosmos-H dramatically accelerates development cycles: generating 600 simulation rollouts takes about 40 minutes, versus roughly 2 days using physical benchtop setups. The model implicitly learns tissue mechanics and tool interactions from training data, enabling synthetic data generation for underrepresented surgical scenarios.
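A quick back-of-envelope check on those throughput figures, taking "2 days" as wall-clock time:

```python
# Figures from the announcement: 600 rollouts in 40 min vs. 2 days benchtop.
ROLLOUTS = 600
COSMOS_MINUTES = 40
BENCHTOP_MINUTES = 2 * 24 * 60  # "2 days" of wall-clock time

speedup = BENCHTOP_MINUTES / COSMOS_MINUTES          # how many times faster
per_rollout_s = COSMOS_MINUTES * 60 / ROLLOUTS       # seconds per rollout

print(f"~{speedup:.0f}x faster, ~{per_rollout_s:.0f} s per rollout")
# → ~72x faster, ~4 s per rollout
```

So the quoted numbers work out to roughly a 72x speedup, or about four seconds of generation per rollout.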
Immediate Impact and Next Steps
Both models and the full dataset are available as open-source releases, enabling researchers and developers to build surgical automation systems without proprietary constraints. Developers can access the data and models through NVIDIA's official channels and build custom surgical AI applications using standardized benchmarks and cross-embodiment evaluation frameworks.