Voice is one of the most natural ways to use software: someone can ask for help while driving, change a travel plan while walking through an airport, or get support in
their preferred language without stopping to type. The hard part is not audio quality alone. It is maintaining intent across corrections, using tools while the
conversation continues, and recovering gracefully when a request changes mid-sentence.
Alongside GPT Realtime 2, the broader realtime audio lineup includes live translation and streaming transcription so developers can ship multilingual experiences and
low-latency captions in the same architectural family. Together, these models move realtime audio from simple call-and-response toward interfaces that can listen, reason,
translate, transcribe, and take action as a session unfolds.