Ever since OpenAI released the first clips showcasing its new video-generating AI model SORA, viewers have been stunned by their lifelike quality. But how exactly does the latest AI generate such realistic videos?
Although OpenAI has yet to confirm a release date for SORA, it has shared certain details of its architecture. Under the hood, SORA is powered by a machine learning (ML) process known as a diffusion transformer. Combining aspects of earlier models such as DALL·E and GPT, the firm's latest video engine incorporates several cutting-edge technologies to make its outputs as close to real footage as possible.
To generate realistic images and videos, diffusion models start with random noise and a target sample. For example, the diffusion process might start with something like television static on one side and a video it intends to replicate on the other.
Through successive iterations, diffusion models adjust the random noise to resemble the target sample. At each step, small changes nudge the noise closer to the desired data, gradually improving the generated sample.
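The snippet below is a deliberately simplified sketch of that iterative refinement, assuming NumPy and a target that is already known. Real diffusion models learn a neural denoiser rather than nudging noise toward a stored sample, so treat this purely as an illustration of the loop described above.

```python
import numpy as np

# Toy illustration of the iterative refinement described above.
# Production diffusion models learn a neural denoiser; here we simply
# nudge random noise a small step toward a known target at each iteration.

rng = np.random.default_rng(0)
target = rng.random((64, 64))        # stand-in for the sample being replicated
sample = rng.normal(size=(64, 64))   # start from pure noise ("television static")

num_steps = 50
for step in range(num_steps):
    # The "small changes" made at each step: a gentle correction toward the target.
    sample = sample + 0.1 * (target - sample)

print("remaining difference:", np.abs(sample - target).mean())
```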
During training, diffusion models minimize the difference between the generated and target data. But how does diffusion work when there isn’t an existing image or video the model has to aim for?
The most important thing to note about AI models like SORA is that the diffusion process doesn’t attempt to replicate all the information about an image in pixel space. Instead, it targets a condensed representation of that data in what’s known as latent space.
Because the datasets used to train diffusion models also include captions, latent representations of images encode both visual and semantic information.
To generate original images, diffusion models represent a text prompt in latent space, generate a new sample and then decode it back into pixel space.
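As a rough sketch of that text-to-image pipeline, the Python below wires together three placeholder components: a prompt encoder, a latent denoiser, and a decoder back to pixel space. The function names, shapes, and step count are illustrative assumptions, not OpenAI's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the three components of a latent diffusion pipeline.
def encode_prompt(prompt: str) -> np.ndarray:
    """Map the text prompt to a conditioning vector (a real system uses a trained text encoder)."""
    return rng.normal(size=(768,))

def denoise_step(latent: np.ndarray, condition: np.ndarray, step: int) -> np.ndarray:
    """One denoising update in latent space (a real system uses a trained neural network)."""
    predicted_noise = rng.normal(size=latent.shape) * 0.01  # placeholder prediction
    return latent - predicted_noise

def decode_latent(latent: np.ndarray) -> np.ndarray:
    """Decode the compact latent back into pixel space (a real system uses a learned decoder)."""
    return np.clip(latent.repeat(8, axis=0).repeat(8, axis=1), 0.0, 1.0)

# The pipeline described above: text prompt -> sample in latent space -> decode to pixels.
condition = encode_prompt("a corgi surfing at sunset")
latent = rng.normal(size=(64, 64))   # generation starts from noise in latent space, not pixel space
for step in range(50):
    latent = denoise_step(latent, condition, step)
image = decode_latent(latent)        # a 512x512 "image" in this toy example
print(image.shape)
```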
While the diffusion paradigm first emerged in the context of still images, it wasn’t long before AI researchers turned their attention to video.
To develop SORA, OpenAI took inspiration from the way large language models (LLMs) break down input text into generalized tokens. But whereas LLMs have text tokens, SORA has visual “spacetime patches.”
To generate these patches, the diffusion process doesn't just condense pixel space; it also encodes temporal information from the video training data.
Similar to the way LLMs tokenize text inputs, SORA breaks down latent representations of videos into bite-size chunks. After that, it follows a similar pattern, predicting how patches follow one another based on information absorbed during the training process.
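The sketch below shows one plausible way to cut a latent video into spacetime patches using NumPy. The tensor shapes and patch sizes are illustrative assumptions, since OpenAI has not published SORA's exact dimensions.

```python
import numpy as np

# Hypothetical latent representation of a short clip:
# 16 latent frames, each a 32x32 grid with 4 channels (sizes are illustrative).
latent_video = np.random.default_rng(0).normal(size=(16, 32, 32, 4))

def to_spacetime_patches(latent, pt=2, ph=4, pw=4):
    """Cut the latent video into (pt x ph x pw) spacetime patches and flatten each
    into a token-like vector, analogous to how an LLM tokenizes text."""
    t, h, w, c = latent.shape
    patches = latent.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch axes together
    return patches.reshape(-1, pt * ph * pw * c)        # one row per spacetime patch

tokens = to_spacetime_patches(latent_video)
print(tokens.shape)   # (8 * 8 * 8, 2 * 4 * 4 * 4) = (512, 128) "visual tokens"
```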
But to get to a stage where SORA can create realistic outputs, OpenAI first had to assemble the necessary training data.
To train the first generation of image diffusion models, researchers used large databases of labeled images. For instance, OpenAI trained DALL·E using 400 million pairs of images with text captions scraped from the internet.
However, amassing the equivalent training data for SORA required additional work.
In the wild, video captions are less likely to describe their content in the kind of neatly descriptive language that lends itself to AI training. To overcome this problem, OpenAI first built a dedicated "captioner" model. This was then used to produce highly descriptive text captions for the videos in SORA's training set.
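In pipeline terms, the re-captioning step might look something like the sketch below; `describe_video` and the file names are hypothetical placeholders standing in for OpenAI's unreleased captioner model.

```python
# Sketch of the re-captioning idea: run a captioning model over every training
# clip so that each video is paired with richly descriptive text.

def describe_video(video_path: str) -> str:
    """Placeholder for a trained video-captioning model (hypothetical)."""
    return "A golden retriever runs along a beach at sunset, kicking up sand."

raw_training_set = ["clip_001.mp4", "clip_002.mp4"]   # hypothetical file names

captioned_training_set = [
    {"video": path, "caption": describe_video(path)}
    for path in raw_training_set
]

for example in captioned_training_set:
    print(example["video"], "->", example["caption"])
```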
One of the most impressive features of modern text-to-video AI platforms is their ability to create detailed scenes from basic prompts.
As OpenAI researchers explained: “we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables SORA to generate high-quality videos that accurately follow user prompts.”
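A prompt-expansion step along those lines could be sketched with the OpenAI Python SDK as below. The model name and system prompt are assumptions for illustration; OpenAI has not published the exact configuration it uses for SORA.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_prompt(short_prompt: str) -> str:
    """Rewrite a terse user prompt into a long, detailed caption for a video model."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice, for illustration only
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's idea as a long, highly detailed caption "
                    "describing the scene, subjects, lighting, camera motion and style."
                ),
            },
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

detailed_caption = expand_prompt("a cat chasing a butterfly in a garden")
print(detailed_caption)
```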
Combining ML techniques honed in other contexts, SORA is the culmination of years of AI development that pulls together the two most influential research traditions in the field: natural language processing and computer vision.
As the AI spring continues to unfold, deeper integration of the visual and the linguistic will lead to an increasingly multimodal machine learning environment. Looking ahead, users can expect not just more realistic images and videos from AI models, but also better reasoning and a higher level of prompt interpretation.