Prompt Injection Attacks Are Evolving
AI agents are increasingly capable of browsing the web, retrieving information, and taking actions on users' behalf. These same capabilities, however, expose them to prompt injection attacks: instructions embedded in external content and designed to manipulate the model into performing unintended actions. Early attacks were simple, often consisting of direct instructions planted in Wikipedia articles or other web content. As AI models have become more sophisticated, so have the attacks; modern prompt injection increasingly resembles social engineering rather than simple instruction injection.
The Social Engineering Parallel
OpenAI frames this problem through the lens of human social engineering. Rather than treating prompt injection as a purely technical problem of filtering malicious inputs, the company recommends designing systems to constrain the impact of manipulation even when attacks succeed. This approach mirrors real-world customer service systems where agents have limited authority and are subject to oversight mechanisms to prevent abuse. The key insight: detecting every malicious input is as difficult as detecting lies or misinformation, but limiting damage through system design is achievable.
Defense Mechanisms in ChatGPT
OpenAI's defense strategy combines traditional security engineering with social engineering principles:
- Safety Training: Models are trained to refuse requests that involve exfiltrating sensitive information
- Source-Sink Analysis: Attacks are viewed as requiring both a way to influence the system (source) and a dangerous capability (sink)—typically combining untrusted external content with actions like transmitting information, following links, or using tools
- Safe-URL Mechanism: When the assistant might transmit information learned from conversations to third parties, the system either requests user confirmation before proceeding or blocks the action and suggests alternatives
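The source-sink framing above can be sketched as a simple policy gate. This is an illustrative sketch, not OpenAI's implementation: the action names, the `tainted` flag, and the decision labels are all assumptions. The idea it demonstrates is that an action is escalated only when an untrusted source and a dangerous sink coincide.

```python
from dataclasses import dataclass

# Hypothetical source-sink gate: an agent action is risky only when untrusted
# external content (a source) could influence a dangerous capability (a sink).
# Capability names and decision strings below are illustrative assumptions.

DANGEROUS_SINKS = {"send_email", "open_url", "run_tool"}

@dataclass
class Action:
    name: str      # capability being invoked
    tainted: bool  # was this step influenced by untrusted external content?

def assess(action: Action) -> str:
    """Return a coarse policy decision for a pending agent action."""
    if action.name in DANGEROUS_SINKS and action.tainted:
        return "require_confirmation"  # source + sink together: gate it
    return "allow"                     # no source-sink pairing: proceed

print(assess(Action("send_email", tainted=True)))   # require_confirmation
print(assess(Action("summarize", tainted=True)))    # allow
```

Note that the gate never tries to decide whether the untrusted content *is* malicious; it only tracks whether untrusted content could reach a dangerous capability, which is the damage-limiting design the article describes.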
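The safe-URL confirmation step might be approximated as follows. The function name, the secret list, and the substring-matching rule are hypothetical; the sketch only shows the shape of the check: before transmitting a URL, test whether its query string appears to embed data learned in the conversation, and if so pause for user confirmation.

```python
from urllib.parse import urlparse

# Hedged sketch of a "safe URL" gate (not OpenAI's implementation): flag any
# outbound URL whose query string seems to carry conversation-derived data.

def url_needs_confirmation(url: str, conversation_data: list[str]) -> bool:
    """True if the URL's query string appears to embed conversation data."""
    query = urlparse(url).query.lower()
    return any(item.lower() in query for item in conversation_data)

# Hypothetical data the model learned during the session:
secrets = ["alice@example.com"]

print(url_needs_confirmation("https://evil.example/log?x=alice@example.com", secrets))  # True
print(url_needs_confirmation("https://docs.example/guide", secrets))                    # False
```

A production gate would need to be far more conservative than this naive substring test, since data can be smuggled through URL encoding, paths, subdomains, or redirects; the sketch is only meant to show where the confirmation decision sits in the flow.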
Real-World Impact
In testing, a sophisticated 2025 prompt injection attack against ChatGPT succeeded approximately 50% of the time when targeting deep research features. OpenAI also notes that "AI firewall" intermediaries, which attempt to classify inputs as malicious, rarely catch fully developed attacks. The company's approach therefore prioritizes designing constraints into agent capabilities rather than relying on input filtering alone.