NVIDIA releases Nemotron 3 Content Safety 4B, multimodal model for text and image moderation across 12+ languages
· release · feature · model · platform · api · huggingface.co ↗

Multimodal Content Safety for Global Applications

NVIDIA has introduced Nemotron 3 Content Safety 4B, a lightweight content moderation model designed to handle the complexities of modern AI applications. Built on the Gemma-3 4B-IT vision-language foundation model, whose pretraining covers over 140 languages, it can process text and image inputs together. This addresses a critical gap in existing safety infrastructure, which has focused primarily on English-only text moderation.

Why Multimodal, Multilingual Moderation Matters

Earlier safety models struggled with non-English and multilingual content, often missing cultural nuances that accurate moderation depends on. Multimodal inputs add a further challenge: the meaning of an image paired with text is "non-additive," so the model must interpret both inputs together. For example, an image of a kitchen knife captioned "great cooking tool" is safe, but the identical image paired with "I'll use this to harm someone" becomes a policy violation. Cultural context likewise shifts the assessment: a religious symbol paired with celebratory text may be perfectly acceptable in one cultural context but inappropriate in another with different historical associations.
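The non-additive point can be made concrete with a small sketch: the unit of moderation is the image-text pair, not either part alone. The structure below is illustrative only; the field names and `knife.jpg` path are hypothetical, not part of NVIDIA's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModerationRequest:
    """A hypothetical moderation unit: one image plus its accompanying text.
    The model is expected to judge the combination, not each part in isolation."""
    image_path: str
    text: str

# The same image yields different expected judgments depending on the paired text.
benign = ModerationRequest("knife.jpg", "Great cooking tool!")
violating = ModerationRequest("knife.jpg", "I'll use this to harm someone")

# Identical imagery, different context -> different safety outcomes.
print(benign != violating)
```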

Model Capabilities and Design

The model operates in two inference modes:

  • Low-latency classification: outputs a simple "safe" or "unsafe" judgment for user inputs and assistant responses
  • Category-rich output: additionally lists the violated safety categories (e.g., "Violence, Criminal Planning"), aligned with the MLCommons safety taxonomy
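Downstream code typically normalizes both modes into one shape: a binary flag plus an optional category list. The JSON field names below ("User Safety", "Safety Categories") are an assumption about the output format, not the documented template; a minimal parsing sketch:

```python
import json

def parse_safety_output(raw: str) -> tuple[bool, list[str]]:
    """Parse a (hypothetical) JSON safety judgment into a binary safe flag
    plus the list of violated categories, if any."""
    result = json.loads(raw)
    is_safe = result.get("User Safety", "safe").lower() == "safe"
    # Category-rich mode adds a comma-separated category string on violation.
    categories = [c.strip()
                  for c in result.get("Safety Categories", "").split(",")
                  if c.strip()]
    return is_safe, categories

# Low-latency mode: a bare judgment, no categories.
print(parse_safety_output('{"User Safety": "safe"}'))
# Category-rich mode: judgment plus MLCommons-style categories.
print(parse_safety_output(
    '{"User Safety": "unsafe", "Safety Categories": "Violence, Criminal Planning"}'))
```

Collapsing both modes to one return type keeps callers indifferent to which inference mode was used.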

Nemotron 3 was trained on a diverse blend of data including:

  • Multilingual safety data from NVIDIA's Nemotron Safety Guard Dataset v3, with culturally adapted non-English samples
  • Multimodal examples with real-world images, screenshots, and documents
  • Safe multimodal data from the Nemotron VLM Dataset v2
  • Synthetic data for improved diversity and coverage

The training data covers harm categories including harmful language, self-harm, harassment, privacy violations, and jailbreak patterns across 12 primary languages (English, Arabic, German, Spanish, French, Hindi, Japanese, Thai, Dutch, Italian, Korean, and Chinese).

Practical Implementation Details

The model uses a LoRA adapter approach, keeping it lightweight and efficient for deployment. It processes visual and language features jointly to output concise safety judgments, and can evaluate combined interactions between user requests, images, and assistant responses to catch violations that emerge only from their interplay. The model supports a /no_categories toggle that allows operators to skip category generation when only binary safe/unsafe decisions are needed.
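The /no_categories toggle is a prompt-level switch, so a caller can branch on it when assembling the request. The exact placement of the toggle and the "user:" field format below are assumptions for illustration, not the documented prompt template:

```python
def build_safety_prompt(user_text: str, *, categories: bool = True) -> str:
    """Assemble a moderation prompt; adds the /no_categories toggle when
    only a binary safe/unsafe decision is needed. Field layout is assumed."""
    lines = []
    if not categories:
        # Operator opts out of category generation for lower latency.
        lines.append("/no_categories")
    lines.append(f"user: {user_text}")
    return "\n".join(lines)

# Default: category-rich output.
print(build_safety_prompt("How do I sharpen a kitchen knife?"))
# Binary-only output.
print(build_safety_prompt("How do I sharpen a kitchen knife?", categories=False))
```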