OpenAI Build Hour Deep Analysis: GPT-Realtime-2 How to Reshape Voice Interaction and Real-Time Agents

Written by: Techub News Organized

In the latest session of OpenAI's "Build Hour" technical sharing, Product Manager Terry, Solutions Engineer Erica, along with engineers Ken and Research Lead Soham from the AI entity company Sierra, delved into the newly released GPT-Realtime-2 model and its related technologies. This discussion not only showcased the leap this new model has made in real-time translation, voice interaction, and intelligent tool invocation but also revealed how enterprises can leverage these capabilities to build reliable, scalable production-grade AI applications.

GPT-Realtime-2: Infusing Top-Level Reasoning Capabilities into Real-Time Voice

Last week, OpenAI simultaneously released three new audio-related models: a real-time translation model, the GPT-Realtime Whisper model, and the core GPT-Realtime-2. This release signifies OpenAI's push to extend its mature and powerful reasoning capabilities in the text domain to the forefront of real-time voice interaction.

The real-time translation model supports over 70 input languages and 13 output languages, achieving low-latency streaming translations aimed at eliminating language barriers in video calls, live broadcasts, and customer service. Meanwhile, the GPT-Realtime Whisper model offers adjustable streaming functionalities with latencies as low as 200 milliseconds, supporting 80 input languages, significantly enhancing responsiveness and accuracy in scenarios like real-time subtitles and meeting notes.

But the real focus is on GPT-Realtime-2. It is officially described as “the smartest voice model,” which brings GPT-4o level reasoning capabilities into the voice domain. This means the model is incredibly powerful in following instructions, calling tools, and performing in multiple languages, features that are pivotal for voice production workflows. The new model comes with several key upgrades:

Significantly Expanded Context Window: The scale has been increased fourfold from the previous size, reaching 128K, which roughly corresponds to handling nearly an hour's worth of conversational content, greatly enhancing instruction following and intelligence levels, as it minimizes the need to frequently truncate context.
Parallel Tool Invocation: The model no longer needs to call tools sequentially (waterfall-style); it can handle multiple tool requests in parallel, making the interaction process much more efficient.
Controllable Expressiveness: Developers can instruct the model to speak in specific emotions or styles, such as "speak softly," "speak excitedly," or "speak with a hint of jealousy," opening up possibilities for creating more natural and personalized voice interactions.
Enhanced Domain Vocabulary Understanding: It performs better in understanding vocabulary related to specialized fields like healthcare and AI terminologies.
Context-Aware Terminology: This brings about better agent behavior.

These improvements have resulted in a significant leap in intelligence, instruction-following, and tool invocation capabilities in benchmarking tests. OpenAI believes this heralds a new wave of AGI that will be more global, hands-free, and easier to use for people around the world.

Real-World Demonstrations: From Voice Shopping to Data Analysis, Agents Become the Core of Operations

To concretely showcase the capabilities of GPT-Realtime-2, Erica conducted two vivid real-time demonstrations. The first demonstration built a voice-driven e-commerce search agent.

In a demonstration on an e-commerce website named Supply Co, which sells hiking gear, users can naturally converse with the shopping assistant through voice. The assistant not only understands user needs (such as “remind me what I still need from my last shopping trip,” “find a tent under $450 suitable for 3-4 people”) but can also actively manipulate the website interface: execute searches, filter products, view product details, read user reviews (even specifically look at low-star ratings), and add products to the shopping cart. Even more impressively, when a user is unsure if the weather is suitable for the chosen tent, the assistant can invoke external tools (check the weather) to obtain the weekend weather forecast for Seattle and give purchase advice (“the risk of storms is moderate, that tent may suit an ordinary rainy weekend, but I recommend using a ground tarp and sturdy stakes”). The entire process is smooth and natural, with the agent concurrently invoking up to 15-20 tools in the background and updating the user's visual experience based on reasoning results.

Erica emphasized that this is no longer a simple "voice input, single action output" old mode, but a truly capable shopping assistant that can reason across tools and update visual experiences in real-time.

The second demonstration switched perspectives to showcase the application of "voice-driven actions" in internal workflows. Playing the role of Product Manager at Supply Co, Erica faced a complex product analysis dashboard and used voice commands to have the AI assistant assist in analyzing recent drops in user activation rates in the European market.

She simply needed to state commands: “filter European data,” “view the past seven days and compare with the previous seven days,” “point out other relevant filtering conditions,” “start a root cause investigation,” “compare mobile Safari and Chrome.” The AI assistant executed these operations silently—filtering data, generating charts, and conducting comparison analyses—only providing vocal summaries when explicitly requested (“the investigation shows it’s a problem unique to mobile Safari; the product detail page size selector validation failed to update correctly…”). The model’s excellent instruction-following capabilities allow it to speak only when necessary rather than confirming every step, making it like an analyst embedded in the workflow capable of handling large volumes of simulated data, writing code, and creating dashboards.

These two demonstrations together highlight the core value of GPT-Realtime-2: it is a real-time agent capable of cross-tool routing, maintaining inquiry status, and transforming real-time workflows into conversational forms, while developers still have complete control over data and UI.

Sierra's Practice: From Powerful Models to Trustworthy Production-Grade Agents

OpenAI's models provide a strong foundation, but transforming them into enterprise-level, highly reliable applications requires additional architecture and experience. Ken and Soham from Sierra shared how they utilized GPT-Realtime-2 to build customer service agents that serve numerous Fortune 100 companies.

Sierra's agents not only need to generate responses but also decide when to act, what tools to invoke, how to use vast knowledge base information, and determine whether a certain action is permitted. In customer-facing scenarios, even small error rates can quickly translate to real business risks. An agent that has even a 0.1% chance of violating policy or taking incorrect actions is deemed undeliverable.

Thus, Sierra built an "agent management framework" to handle all the additional infrastructure necessary to ensure the agents operate reliably and safely in production environments. This framework defines the required workflows for each client, including tools available for the agent to use, language and brand standards to follow, necessary guardrails, and grounding to ensure compliance with specific customer policies.

In voice interactions, Sierra employs a custom-tuned VAD (Voice Activity Detection) model to determine when users stop speaking. This provides better control compared to general models in real-world audio with high background noise, accents, frequent interruptions, and topic changes. Additionally, the framework addresses sensitive information tracking and masking, PCI-compliant payment processes, among various production-grade functionalities.

Ken pointed out that it is this production layer that turns powerful foundational models (like Realtime-2) into controllable, observable, and safe systems trusted by the world's largest companies for direct interaction with customers.

In preliminary testing, Sierra has already observed significant performance improvements: compared to traditional cascaded systems (voice to text + text model + text to voice), calls using Realtime-2 are about 30% faster in P50 (median) latency and can be up to 200% faster in P90 (high latency scenarios). The voice quality is also strong and can compete with some professional synthesis providers they use.

However, Sierra places more value on the end-to-end task success rate. They utilize simulation testing to replay actual customer calls tailored to each client's workflow to assess whether the agent truly completes tasks rather than just sounding good. Soham elaborated on the unique challenges faced by voice agents: production calls filled with interruptions, accents, and background noise (such as near highways or in cars with children) do not transmit messages clearly and flawlessly. Agents need to complete tasks while managing these complex situations.

Common failure modes include: spelling mistakes when recording names and numbers (a single letter error can lead to subsequent tool invocation failures), logical misunderstandings (like canceling the wrong flight), and difficulty handling common "back channeling" in human conversations (like whether to ignore or respond to signals such as "uh-huh," "ah," "right," etc.). Soham excitedly noted that the emergence of "Thinking Models" represents a true transformation for voice-to-voice models. Realtime-2 as a thinking model can "think" during voice conversations (for example, say "give me a minute"), maintaining state even when interrupted, which is particularly important in multilingual scenarios. He believes OpenAI has made significant progress in this area, with Realtime-2 showing considerably better performance, making the mainstreaming of voice-to-voice models exciting.

Developer Q&A: Handling Interruptions, Context Management, and Production Considerations

In the Q&A session, the team addressed several core concerns developers have.

Regarding interruption handling, Sierra recommends choosing based on specific use cases. Realtime-2 comes with a built-in voice activity detection (VAD) model that performs well, but Sierra uses a custom model because of its suitability for customer service call audio (high background noise, etc.). OpenAI states that developers can disable interruption features turn-by-turn. For instance, if it is necessary for the model to complete a disclaimer, the VAD can be disabled for that turn to ensure users cannot interrupt and then re-enabled for subsequent turns. This provides more reliable control than solely relying on instruction-following.

Regarding the trade-offs between voice and text, the team believes voice has a clear advantage in "rapid capturing and quick intent recognition" scenarios, being more convenient for driving, moving contexts, or when users only have vague ideas. Voice allows for a closer, more casual interaction, enabling users to provide more context in a "stream of consciousness" manner (like describing the house or car they want to buy). Furthermore, in some "voice-first" countries (like Brazil and India), voice interactions may be a more natural entry point.

For long conversations over an hour, it is recommended to save the session state and, when needed, start a new session and "inject" (hydrate) previous context. Thanks to the 128K extended context window, more historical information can now be carried into new sessions.

About when to "upgrade" to more powerful models (like GPT-4o), Sierra shared two production strategies: one is to set up an asynchronous supervisor to monitor the conversation and inject additional context as necessary to correct the trajectory; the other is to choose a model based on agent complexity—using Realtime-2 for scenarios requiring extremely fast responses, and traditional text models for more complex agents. OpenAI added that developers can inject context at any time through "conversation item create" without triggering model responses, enabling background asynchronous tool invocations.

Finally, regarding how real-time agents maintain context and decision consistency during multi-tool interactions, the team stated that Realtime-2 itself is a reasoning model with "train of thought" ability, and its parallel tool invocation method aligns with other cutting-edge models without losing context during calls. By default, the model automatically maintains state and context between turns. Developers can also dynamically manage context engineering. The team particularly emphasized the model's exceptional instruction-following capabilities, even suggesting that developers allow the model to optimize prompts by itself, checking for conflicting instructions.

免责声明：本文章仅代表作者个人观点，不代表本平台的立场和观点。本文章仅供信息分享，不构成对任何人的任何投资建议。用户与作者之间的任何争议，与本平台无关。如网页中刊载的文章或图片涉及侵权，请提供相关的权利证明和身份证明发送邮件到support@aicoin.com，本平台相关工作人员将会进行核查。

OpenAI Build Hour Deep Analysis: GPT-Realtime-2 How to Reshape Voice Interaction and Real-Time Agents

GPT-Realtime-2: Infusing Top-Level Reasoning Capabilities into Real-Time Voice

Real-World Demonstrations: From Voice Shopping to Data Analysis, Agents Become the Core of Operations

Sierra's Practice: From Powerful Models to Trustworthy Production-Grade Agents

Developer Q&A: Handling Interruptions, Context Management, and Production Considerations

Selected Articles by Techub News

Table of Contents

Related Articles