
ChatGPT-4o Can Speak With You, OpenAI Admits Voice Features “Present a Variety of Novel Risks”

James Morales
Published 7 days ago

Key Takeaways

  • OpenAI has announced a new voice-enabled AI model, GPT-4o.
  • The speed with which GPT-4o can respond to voice prompts is on par with human response times.
  • However, the new feature isn’t being released immediately.
  • Acknowledging the model’s “novel risks,” OpenAI has committed to further safety testing before a full public rollout.

OpenAI has upgraded the GPT-4 large language model, incorporating more multimodal functionalities. The update, GPT-4o (“o” for “omni”), introduces a key new feature: the ability to converse using natural speech rather than text-based prompts and responses. 

While this development promises a more intuitive and engaging user experience, OpenAI acknowledges that voice capabilities also introduce “a variety of novel risks.”

Improvements to Chatbot Voice Response

In an official announcement, OpenAI described GPT-4o as “a step towards much more natural human-computer interaction.”

Unlike the standard GPT-4, OpenAI trained the new model on text, image, and audio data simultaneously. This means it can process voice inputs natively, rather than relying on separate AI modules for speech-to-text transcription and text-to-speech synthesis.

Whereas ChatGPT's previous Voice Mode had an average latency of 5.4 seconds when running on GPT-4, GPT-4o takes an average of just 320 milliseconds to generate a response. This, OpenAI observed, puts it on par with human response times in conversation.
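The difference between the two designs can be sketched with stub functions (all names here are hypothetical stand-ins, not OpenAI's actual API; real systems involve far more machinery). The point is structural: the cascaded design chains three separate models, each adding its own delay, and the transcription step discards tone and emotion in the input audio, while an end-to-end multimodal model makes a single pass.

```python
# A minimal sketch, assuming a simplified three-stage cascade. Each function
# is a stand-in that tags its input, so the returned string records which
# stages the audio passed through.

def speech_to_text(audio: str) -> str:
    """Stand-in for a separate ASR model (transcription stage)."""
    return f"transcript({audio})"

def text_llm(prompt: str) -> str:
    """Stand-in for a text-only language model."""
    return f"reply({prompt})"

def text_to_speech(text: str) -> str:
    """Stand-in for a separate TTS model (synthesis stage)."""
    return f"speech({text})"

def cascaded_voice_reply(audio: str) -> str:
    # Pre-GPT-4o pipeline: three models in sequence, latencies add up,
    # and anything not captured in the transcript is lost.
    return text_to_speech(text_llm(speech_to_text(audio)))

def native_voice_reply(audio: str) -> str:
    # GPT-4o-style design: one model consumes and produces audio directly,
    # with no intermediate transcription round-trip.
    return f"speech(reply({audio}))"
```

Comparing `cascaded_voice_reply("hello.wav")` with `native_voice_reply("hello.wav")` makes the extra transcription round-trip visible in the output string.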

The company started integrating GPT-4o’s text and image capabilities into ChatGPT on Monday, May 13. But before it rolls out the new voice features, the model will undergo further safety testing.

OpenAI Acknowledges Potential Risks

Commenting on the new model, OpenAI said, “we recognize that GPT-4o’s audio modalities present a variety of novel risks.”

Reading between the lines, this could be a reference to unauthorized AI impersonation.

In recent months, the capacity of modern generative AI to convincingly emulate real voices has sparked concerns over deceptive deepfakes and the rights of voice artists.

In a move that could help prevent the new model from being abused, OpenAI said GPT-4o's audio outputs will be limited to a selection of preset voices.

Potential Applications of Voice AI

While the new voice-enabled version of ChatGPT has its limitations, its natural-language capabilities far exceed those of the previous generation of voice assistants.

Assistants like Siri and Alexa can't really be called conversational, and are mostly used to carry out a limited range of tasks: searching for basic information online, setting alarms, or taking notes, for example.

However, GPT-4o suggests the future of AI voice interaction will be far more personalized and intuitive.

One area where the technology holds significant potential is customer service, where AI voice assistants can replace clunky IVR (Interactive Voice Response) systems. 

Voice AI could also be a powerful tool for blind people, helping them navigate challenging situations and environments more independently.

Finally, models like GPT-4o could drastically improve the performance of real-time AI translation services.

The Future of Human-AI Interactions

Emerging devices like the Rabbit R1 and Humane's Ai Pin are betting on the next stage of human-AI interaction being more voice-driven.

These devices envisage a post-smartphone technological landscape in which voice-enabled AI supplants screen-based media as the dominant mode of interacting with digital information.

Models like GPT-4o will be powerful enablers if this vision is to become a reality. But they still have a long way to go if they are to change behavioral patterns that have been entrenched over decades.
