Realtime API · Voice intelligence · May 2026 generation

Advancing voice intelligence with GPT Realtime 2

GPT Realtime 2 is part of a new generation of realtime audio models that can reason while people speak, keep context through interruptions, call tools in parallel, and sound natural in production settings. This page explains what changed, why it matters for voice products, and how teams typically roll it out.

Model class: GPT-5-class voice reasoning
Context window: up to 128K tokens
Default reasoning: low (tunable)
Companion models: Translate · Whisper

Why voice is shifting from playback to real work

Voice is one of the most natural ways to use software: someone can ask for help while driving, change a travel plan while walking through an airport, or get support in their preferred language without stopping to type. The hard part is not audio quality alone—it is maintaining intent across corrections, using tools while the conversation continues, and recovering gracefully when a request changes mid-sentence.

Alongside GPT Realtime 2, the broader realtime audio lineup includes live translation and streaming transcription so developers can ship multilingual experiences and low-latency captions in the same architectural family. Together, these models move realtime audio from simple call-and-response toward interfaces that can listen, reason, translate, transcribe, and take action as a session unfolds.

From fast turn-taking to dependable agent behavior

Reasoning that keeps the conversation moving

GPT Realtime 2 is built for live interactions where the model must reason through a request while staying responsive—handling corrections, overlapping speech, and mid-session plan changes without losing the thread.

Tool use you can hear

Parallel tool calls and short preambles help users understand what the agent is doing. Phrases like “checking your calendar” or “one moment while I look into it” make latency feel intentional instead of empty.
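As a sketch of how this is typically wired up, tool availability and preamble behavior are set at session configuration time. The function-tool schema below follows the common JSON-schema tool shape; the `check_calendar` tool name and the instruction wording are illustrative, not part of any official spec.

```python
# Illustrative session configuration for a realtime voice agent.
# The tool schema mirrors the familiar function-tool shape; the tool
# name and the preamble instruction text are hypothetical examples.
session_config = {
    "instructions": (
        "Before any tool call that may take more than a moment, say a "
        "short preamble such as 'checking your calendar' so the user "
        "knows why there is a pause."
    ),
    "tools": [
        {
            "type": "function",
            "name": "check_calendar",  # hypothetical tool
            "description": "Look up the user's calendar for a given day.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string", "description": "ISO 8601 date"},
                },
                "required": ["date"],
            },
        }
    ],
}
```

The point of keeping the preamble policy in the session instructions, rather than hard-coding phrases in application code, is that the model can vary the wording naturally per turn.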

Production-grade recovery

When something fails, the model is designed to acknowledge it in natural language rather than going silent. That behavior matters for trust in customer support, healthcare-adjacent flows, and any high-stakes voice channel.

Realtime voice: reasoning, tools, and control

GPT Realtime 2 bundles several practical controls that teams asked for when shipping voice agents at scale: tunable reasoning effort, parallel tool calls with spoken preambles, and session-level guardrails.

On benchmarks that track production-like voice behavior, OpenAI reports meaningful gains versus the prior generation—for example, higher scores on Big Bench Audio for audio intelligence and on Audio MultiChallenge for instruction following, context management, and natural corrections. Treat these figures as vendor-reported signals and validate on your own transcripts and tools.

“What stood out about GPT-Realtime-2 was the intelligence and tool-calling reliability it brings to complex voice interactions… The combination of agentic competence and guardrail strength is what makes it viable for production voice…”

Josh Weisberg, SVP and Head of AI at Zillow (early testing quote, OpenAI announcement)

Three shapes of voice products that show up in the field

Voice-to-action

A user states a goal; the system reasons, calls tools, and completes work—searching inventory, scheduling, or filing structured updates—while keeping the dialogue coherent.

Systems-to-voice

Backend events become proactive spoken guidance: flight changes, order status, or workflow nudges delivered as concise audio rather than forcing the user back to a screen.

Voice-to-voice

Live conversations continue across languages or roles, with translation and turn management handled in realtime so participants can stay in their preferred language without awkward handoffs.

Translation and transcription in the same realtime stack

GPT Realtime Translate targets live multilingual sessions: broad input language coverage with a smaller set of output languages, tuned so meaning is preserved while keeping pace with natural speech. Use cases include support queues, events, education, and creator platforms with global audiences.

GPT Realtime Whisper is a streaming speech-to-text model aimed at captions, live notes, and agent pipelines that need tokens to arrive continuously as audio plays—useful for meetings, classrooms, broadcasts, and compliance-sensitive workflows that depend on timely text.
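A consumer of a streaming transcription feed usually folds incremental deltas into caption lines as they arrive. The sketch below simulates that pattern; the event names (`transcript.delta`, `transcript.done`) are illustrative stand-ins, not the official wire format.

```python
def accumulate_captions(events):
    """Fold streaming transcript events into finished caption lines.

    `events` is an iterable of dicts shaped like {"type": ..., "text": ...}.
    The event names here are illustrative, not an official schema.
    """
    current, lines = [], []
    for ev in events:
        if ev["type"] == "transcript.delta":
            current.append(ev["text"])      # token text arrives incrementally
        elif ev["type"] == "transcript.done":
            lines.append("".join(current))  # flush the completed utterance
            current = []
    return lines

# Simulated stream: tokens arrive continuously as the audio plays.
stream = [
    {"type": "transcript.delta", "text": "Flight 212 "},
    {"type": "transcript.delta", "text": "is delayed."},
    {"type": "transcript.done"},
]
print(accumulate_captions(stream))  # ['Flight 212 is delayed.']
```

The same fold works for live notes or compliance logs; only the sink changes.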

A practical rollout path for a voice agent

Most successful deployments treat voice as a full stack problem: audio capture, session state, tool schemas, observability, and safety policies all need to align. The steps below mirror how engineering teams iterate without boiling the ocean.

01

Define the job-to-be-done

Start with a narrow scenario—password reset, appointment change, inventory lookup—and write the success criteria in plain language, including what “done” sounds like on the phone.

02

Model tools and guardrails explicitly

Specify which APIs the agent may call, what parameters are required, and how errors should be spoken back. Pair model-level mitigations with your own classifiers where policy demands it.
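One way to make "how errors should be spoken back" explicit is a small lookup from tool error codes to recovery copy, so failures are acknowledged rather than silent. The error codes and phrasings below are illustrative examples, not a prescribed taxonomy.

```python
# Sketch: translate tool failures into concise spoken recovery copy
# instead of letting the agent go silent. Error codes are examples.
SPOKEN_ERRORS = {
    "timeout": "That lookup is taking longer than expected. Want me to keep trying?",
    "not_found": "I couldn't find a matching record. Could you spell the name for me?",
    "permission_denied": "I'm not allowed to change that. I can connect you with someone who is.",
}

def spoken_recovery(error_code: str) -> str:
    # Fall back to a generic acknowledgement for unmapped errors.
    return SPOKEN_ERRORS.get(
        error_code,
        "Something went wrong on my end. Let me try that a different way.",
    )
```

Keeping this table in application code also gives reviewers a single place to audit the agent's failure language.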

03

Tune reasoning and preambles

Match reasoning effort to latency budgets. Use preambles when users benefit from hearing progress; skip them when brevity matters more than narration.
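The matching of effort and preambles to flows can be made explicit as a small policy function. The flow names, effort labels, and preamble rule below are illustrative defaults under the "low (tunable)" framing above, not an official API.

```python
# Sketch: pick per-flow session knobs. Flow names, effort labels, and
# the preamble policy are illustrative, not an official interface.
def session_knobs(flow: str) -> dict:
    heavy = {"itinerary_change", "policy_exception"}  # multi-step planning
    effort = "high" if flow in heavy else "low"
    return {
        "reasoning_effort": effort,
        # Narrate progress only when the user will actually wait on tools.
        "use_preambles": effort == "high",
    }
```

A table like this also doubles as documentation of your latency budget per flow.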

04

Evaluate on real audio

Synthetic text prompts miss disfluencies, accents, and background noise. Record redacted production-like clips, measure task completion, and track silent failures as first-class defects.
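Treating silent failures as first-class defects means scoring them alongside task completion. A minimal scoring pass over labeled session records might look like the sketch below; the record fields are assumptions for illustration.

```python
# Sketch: score a batch of labeled, redacted audio sessions. Each record
# marks whether the task completed and whether a failure was "silent"
# (the agent never spoke an acknowledgement of the problem).
def score_sessions(records):
    total = len(records)
    completed = sum(r["completed"] for r in records)
    silent = sum((not r["completed"]) and r["silent_failure"] for r in records)
    return {
        "task_completion": completed / total,
        "silent_failure_rate": silent / total,
    }

sessions = [
    {"completed": True,  "silent_failure": False},
    {"completed": False, "silent_failure": True},   # the defect to track
    {"completed": True,  "silent_failure": False},
    {"completed": False, "silent_failure": False},  # failed, but spoken
]
print(score_sessions(sessions))
```

Running the same scorer over each model or prompt revision gives you a regression signal grounded in real audio rather than synthetic text.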

Build trust with layered controls

Realtime sessions can include active classifiers that halt conversations when policies are violated. Developers can add guardrails through agent frameworks and make AI involvement obvious to end users when it is not already clear from context. For regulated deployments, review OpenAI’s usage policies, data residency options, and enterprise privacy commitments as part of your design review—not as an afterthought.
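The layering described above can be sketched as a wrapper that consults a policy classifier before the model responds. Here `classify` and `respond` are placeholders for your own moderation check and response path; the verdict string and the halt message are illustrative.

```python
# Sketch: halt a turn when a policy classifier flags it. `classify` and
# `respond` stand in for your own moderation and response functions.
def guarded_turn(transcript: str, classify, respond) -> str:
    verdict = classify(transcript)
    if verdict == "violation":
        # Speak the halt rather than dropping the call silently.
        return "I can't continue with this request. Ending the session."
    return respond(transcript)
```

The important design choice is that the guard sits outside the model, so your own classifiers can enforce policies the model-level mitigations do not cover.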

How billing is structured (check the console for live rates)

As announced alongside the model family, GPT Realtime 2 is priced on audio tokens in the Realtime API (with lower cached input pricing), GPT Realtime Translate is billed per minute of audio, and GPT Realtime Whisper is billed per minute for streaming transcription. Rates can change; verify current numbers in your OpenAI billing dashboard before forecasting.
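For forecasting, the billing structure above implies a simple shape: token-priced audio for the core model plus per-minute charges for Translate and Whisper. The rates in the sketch below are PLACEHOLDERS invented purely to show the calculation; real numbers live in your billing dashboard.

```python
# Sketch: forecast spend from usage. All rates here are MADE-UP
# placeholders; check the OpenAI billing dashboard for live pricing.
def monthly_cost(audio_in_tok, cached_in_tok, audio_out_tok,
                 translate_min, transcribe_min, rates):
    return (
        audio_in_tok / 1_000_000 * rates["audio_in_per_1m"]      # token-priced
        + cached_in_tok / 1_000_000 * rates["cached_in_per_1m"]  # cheaper cached input
        + audio_out_tok / 1_000_000 * rates["audio_out_per_1m"]
        + translate_min * rates["translate_per_min"]             # per-minute
        + transcribe_min * rates["transcribe_per_min"]           # per-minute
    )

# Hypothetical rates, for shape only:
rates = {
    "audio_in_per_1m": 30.0, "cached_in_per_1m": 3.0,
    "audio_out_per_1m": 60.0, "translate_per_min": 0.1,
    "transcribe_per_min": 0.05,
}
```

Separating the rate table from the formula makes it trivial to re-forecast when published prices change.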

Quick answers before you prototype

What is GPT Realtime 2 in one sentence?

It is OpenAI’s realtime voice model line built for live, spoken interactions where the assistant must reason, use tools, and stay coherent through interruptions—aimed at production agents rather than short demos.

How is it different from text-only GPT models?

The session is anchored in streaming audio end-to-end. That changes latency budgets, error handling, and UX patterns: users cannot skim a transcript unless you add separate transcription, and spoken recovery copy must be concise.

When should I raise reasoning effort?

Increase effort when tasks involve multi-step planning, fragile tool chains, or nuanced policy interpretation. Keep it low for confirmations, lookups, and predictable flows where milliseconds matter.

Do I need separate models for translation or captions?

GPT Realtime 2 focuses on spoken assistance. If you need live multilingual relay or streaming text, evaluate GPT Realtime Translate and GPT Realtime Whisper alongside your architecture—often combined in a single product surface.

Verify details on the official announcement and docs