Two New Compact Models for Speed-Critical Applications
OpenAI has released GPT-5.4 mini and GPT-5.4 nano, two new smaller models optimized for workloads where latency and cost are paramount. These models bring much of GPT-5.4's capability to faster, more efficient packages suitable for coding assistants, subagent systems, and multimodal applications that require real-time responsiveness.
Performance and Benchmarks
GPT-5.4 mini significantly improves over GPT-5 mini across multiple dimensions:
- Coding: Achieves 54.4% accuracy on SWE-Bench Pro (vs. 45.7% for GPT-5 mini), approaching GPT-5.4's 57.7%
- Tool use: Scores 42.9% on Toolathlon and 57.7% on MCP Atlas
- Speed: Runs more than 2x faster than GPT-5 mini
- Computer use: Reaches 72.1% on OSWorld-Verified, nearly matching GPT-5.4's 75.0%
GPT-5.4 nano serves as the smallest and cheapest option:
- Recommended for classification, data extraction, ranking, and simpler coding subagents
- Significant upgrade over GPT-5 nano across all evaluated tasks
Availability and Pricing
GPT-5.4 mini is available today across three platforms:
- API: Supports text/image inputs, tool use, function calling, web search, file search, computer use, and skills. 400k context window. $0.75 per 1M input tokens and $4.50 per 1M output tokens
- Codex: Consumes only 30% as much quota as GPT-5.4, making it a cheaper option for coding tasks. Available across the Codex app, CLI, IDE extensions, and web
- ChatGPT: Available to Free and Go users via the "Thinking" feature; other users receive it as a fallback for GPT-5.4 Thinking
GPT-5.4 nano is API-only: $0.20 per 1M input tokens and $1.25 per 1M output tokens
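Using the list prices above, per-request costs are easy to estimate. The snippet below is a back-of-the-envelope helper, not an SDK call; the model identifiers are shorthand for the two models and may not match the actual API model IDs.

```python
# USD per 1M tokens, as quoted in this announcement.
# Model keys are illustrative labels, not official API model IDs.
PRICES = {
    "gpt-5.4-mini": {"input": 0.75, "output": 4.50},
    "gpt-5.4-nano": {"input": 0.20, "output": 1.25},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed per-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 10k-token prompt producing a 2k-token completion.
print(estimate_cost("gpt-5.4-mini", 10_000, 2_000))  # 0.0165
print(estimate_cost("gpt-5.4-nano", 10_000, 2_000))  # 0.0045
```

At these rates, nano is roughly a quarter the cost of mini for the same token mix, which is why the announcement steers classification and extraction workloads toward it.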
Use Cases and Architecture Patterns
These models excel in latency-sensitive scenarios where response time directly impacts user experience. Developers can now compose multi-model systems in which larger models handle planning and coordination while delegating focused subtasks to GPT-5.4 mini subagents running in parallel. For example, GPT-5.4 can manage overall strategy while mini models simultaneously search codebases, review files, or process supporting documents. This pattern scales more efficiently as smaller models improve in both speed and capability.
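The planner/subagent pattern can be sketched as follows. This is a minimal illustration under stated assumptions: `run_subagent` is a placeholder for a real mini-model API call (the announcement does not prescribe an SDK or call signature), and the subtask list stands in for output the planning model would produce.

```python
# Sketch of the pattern: a larger model plans, then fans focused subtasks
# out to fast mini-model subagents running in parallel.
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    # Placeholder: in a real system this would call the mini model with a
    # narrowly scoped prompt for this one subtask and return its answer.
    return f"[mini-model result for: {task}]"

def delegate(subtasks: list[str]) -> list[str]:
    # Subagent calls are I/O-bound network requests, so a thread pool is
    # enough to run them concurrently; results come back in input order.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(run_subagent, subtasks))

# Subtasks here are hypothetical output from the planning model.
results = delegate([
    "search the codebase for config parsing",
    "review the changed test files",
    "summarize the design document",
])
```

Because each subagent handles a small, independent task, overall wall-clock time is bounded by the slowest subtask rather than the sum of all of them, which is where the mini model's latency advantage compounds.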