NVIDIA releases Open-H-Embodiment, 778-hour surgical robotics dataset with GR00T-H policy model
· release · feature · model · dataset · open-source · api · huggingface.co

Open-H-Embodiment: Foundation for Physical AI in Healthcare

NVIDIA has released Open-H-Embodiment, an open-source dataset initiative created with 35 organizations including Johns Hopkins, Stanford, UC Berkeley, and major surgical robotics companies. This represents the first large-scale, standardized dataset designed specifically for training embodied AI systems in healthcare robotics—moving beyond perception-only models to systems that can perform physical tasks with precision.

The Dataset

The dataset comprises 778 hours of CC-BY-4.0 licensed training data, including:

  • Real surgical procedures from commercial robots (CMR Surgical, Rob Surgical, Tuodao)
  • Research platforms (dVRK, Franka, Kuka)
  • Simulation environments, benchtop exercises (e.g., suturing), and clinical procedures
  • Vision, force, and kinematics data synchronized across multiple robot embodiments
  • Ultrasound and colonoscopy autonomy data alongside surgical robotics
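Pairing modalities recorded at different rates (camera frames, kinematics, force) requires resampling them onto a shared clock. The sketch below shows one common approach, nearest-neighbor alignment; the rates and field names are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal sketch of synchronizing multi-rate streams onto a common clock.
# Rates (30/100/500 Hz) and stream names are illustrative assumptions.
import bisect

def nearest(timestamps, t):
    """Index of the timestamp in a sorted list closest to t."""
    i = bisect.bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def synchronize(streams, rate_hz, duration_s):
    """Resample each (timestamps, values) stream to a shared clock."""
    ticks = [k / rate_hz for k in range(int(duration_s * rate_hz))]
    return [
        {name: vals[nearest(ts, t)] for name, (ts, vals) in streams.items()}
        for t in ticks
    ]

# Toy streams: vision at 30 Hz, kinematics at 100 Hz, force at 500 Hz.
streams = {
    "vision":     ([k / 30 for k in range(30)],   list(range(30))),
    "kinematics": ([k / 100 for k in range(100)], list(range(100))),
    "force":      ([k / 500 for k in range(500)], list(range(500))),
}
# Resample everything to the slowest (vision) rate over one second.
frames = synchronize(streams, rate_hz=30, duration_s=1.0)
print(len(frames), frames[0])
```

In practice the lower-rate stream usually sets the shared clock, as here, and high-rate signals like force can instead be windowed rather than subsampled.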

GR00T-H: Surgical Policy Model

GR00T-H is a Vision-Language-Action (VLA) model trained on ~600 hours of the dataset, designed to overcome the unique challenges of surgical robotics:

  • Embodiment Projectors: Learnable mappings that translate each robot's specific kinematics to a shared, normalized action space
  • State Dropout: Randomly drops proprioceptive (state) input during training so the policy cannot over-rely on it, improving robustness at deployment
  • Relative End-Effector Actions: Represents actions as relative end-effector motions in a common action space, smoothing over kinematic inconsistencies across different robots
  • Task Metadata Injection: Embeds instrument names and control mappings directly into prompts
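The embodiment-projector idea can be illustrated with a toy example: each robot gets its own learnable map from its native action dimension into one shared space, so a single downstream policy serves every platform. Everything below (dimensions, robot names, random weights) is an assumption for illustration, not NVIDIA's implementation.

```python
# Illustrative sketch of per-robot "embodiment projectors" into a shared,
# normalized action space. Weights would be learned jointly with the
# policy; here they are random stand-ins.
import random

random.seed(0)

SHARED_DIM = 7  # assumed shared space, e.g. relative EE pose + gripper

class EmbodimentProjector:
    """Linear projection from a robot's native action space to the shared one."""
    def __init__(self, native_dim, shared_dim=SHARED_DIM):
        self.w = [[random.gauss(0, 0.1) for _ in range(native_dim)]
                  for _ in range(shared_dim)]

    def project(self, native_action):
        # Plain matrix-vector product: shared = W @ native
        return [sum(wi * a for wi, a in zip(row, native_action))
                for row in self.w]

# One projector per embodiment; all outputs land in the same 7-D space,
# so one policy can consume data from every robot.
projectors = {
    "dVRK":   EmbodimentProjector(native_dim=6),
    "Franka": EmbodimentProjector(native_dim=7),
    "Kuka":   EmbodimentProjector(native_dim=7),
}

shared = projectors["dVRK"].project([0.1] * 6)
print(len(shared))
```

The design choice this models: heterogeneity is pushed into small per-robot adapters at the edges, keeping the bulk of the VLA embodiment-agnostic.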

Early prototypes have successfully executed end-to-end suturing in the SutureBot benchmark, demonstrating long-horizon dexterity and precision control.

Cosmos-H-Surgical-Simulator: World Foundation Model

Cosmos-H-Surgical-Simulator is a generative world model that predicts realistic surgical video from robotic actions. Key advantages:

  • Sim-to-Real Bridging: Learned implicitly from real data, handling complexities like soft tissue deformation, blood, smoke, and light reflections
  • Efficiency: Completes 600 simulation rollouts in 40 minutes vs. 2 days on real benchtop systems
  • Synthetic Data Generation: Creates realistic video-action pairs to augment sparse datasets
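The efficiency claim comes from running policy rollouts inside the learned world model instead of on hardware: the policy proposes an action, the model predicts the next observation, and the loop never touches a robot. The sketch below uses trivial stand-in stubs for both models, not the Cosmos API.

```python
# Hedged sketch of closed-loop rollouts inside a learned world model.
# Both functions are stand-ins: the real world model predicts video
# frames, and the real policy is a VLA conditioned on images + language.
def world_model_step(obs, action):
    # Stand-in for video prediction of the next observation.
    return obs + action

def policy(obs):
    # Stand-in for the surgical policy's action output.
    return 1

def rollout(horizon=10, obs=0):
    trajectory = [obs]
    for _ in range(horizon):
        action = policy(obs)
        obs = world_model_step(obs, action)
        trajectory.append(obs)
    return trajectory

# 600 rollouts of this shape are the unit of comparison in the
# 40-minutes-vs-2-days figure: simulation parallelizes, hardware doesn't.
trajs = [rollout() for _ in range(600)]
print(len(trajs), len(trajs[0]))
```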

The model was fine-tuned on all 32 Open-H-Embodiment datasets across 9 robot embodiments using 64 A100 GPUs over ~10,000 GPU-hours.

Getting Started

Developers can access the dataset and models immediately:

  • Dataset: Hugging Face / GitHub
  • GR00T-H Model: Available on Hugging Face
  • Cosmos-H: Available for fine-tuning and deployment

The initiative invites community contributions toward version 2, which will focus on reasoning-capable autonomy for surgical robotics.