A new artificial intelligence model called Sora, unveiled today by leading AI lab OpenAI, can generate high-quality videos from simple text prompts. Sora represents a major advance in AI’s ability to understand and simulate the physical world.
The San Francisco-based company says Sora can produce videos up to one minute long while maintaining coherence, visual quality, and adherence to the prompt. The model can generate complex scenes involving multiple characters, specific types of motion, accurate subject details, and appropriate backgrounds.
“Sora has a deep understanding not just of what the user has asked for, but of how things exist in the real, physical world,” the company said on its website. “It can interpret prompts and generate compelling, emotionally rich characters and scenes.”
Safety Steps Before Release
OpenAI says it will allow artists, designers and “red teamers” – experts who adversarially test for issues like bias and misinformation – to experiment with Sora and provide feedback before considering full public release.
The lab is also developing additional safety tools, including a classifier to detect Sora-generated videos and metadata standards for attribution. OpenAI says it will also reuse existing detection systems and usage policies built for its DALL-E image generator.
“Despite extensive research, we cannot predict all beneficial uses or abuses of this technology,” the company noted. “Learning from real-world testing is critical to releasing increasingly safe systems over time.”
How Sora Works
Sora uses “diffusion” technology, which starts from static noise and incrementally transforms it into a clear video over many steps. This approach lets the model produce entire videos at once and keep subjects consistent even when they temporarily leave the frame.
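The iterative refinement idea can be sketched in a few lines. This toy example is purely illustrative and is not OpenAI's implementation: a real diffusion model replaces the hypothetical `denoise_step` below with a learned neural network, whereas here it simply nudges a noisy array toward a fixed "clean" target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "video": 8 frames of 16x16 grayscale pixels.
target = np.full((8, 16, 16), 0.5)    # pretend this is the clean video
x = rng.standard_normal((8, 16, 16))  # start from pure static noise

def denoise_step(x, target, strength=0.1):
    """One refinement step: move the noisy sample slightly toward the target.
    (Hypothetical placeholder for a trained denoising network.)"""
    return x + strength * (target - x)

# Many small steps, as in diffusion sampling.
for _ in range(50):
    x = denoise_step(x, target)

# After enough steps, the sample is very close to the clean video.
print(float(np.abs(x - target).mean()))
```

Each pass shrinks the remaining noise by a constant factor, so after 50 steps the mean absolute error falls below 0.01, mimicking how repeated denoising gradually reveals a coherent result.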
The model encodes videos as collections of smaller “patches”, enabling training across diverse resolutions, durations and aspect ratios. Sora also applies AI techniques from past OpenAI models like DALL-E 3 and GPT language models.
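The patch-based encoding can also be sketched concretely. The patch sizes below are illustrative assumptions, not Sora's actual values: the point is that any video whose duration and resolution divide evenly into the patch grid yields a flat sequence of spacetime patches, which is what lets one model train on mixed resolutions, durations, and aspect ratios.

```python
import numpy as np

def to_patches(video, pt=2, ph=4, pw=4):
    """Split a (T, H, W) video into flat spacetime patches.

    pt/ph/pw are illustrative patch sizes along time, height, and width.
    Returns an array of shape (num_patches, pt * ph * pw).
    """
    T, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
    v = v.transpose(0, 2, 4, 1, 3, 5)   # group the three patch axes together
    return v.reshape(-1, pt * ph * pw)  # one row per spacetime patch

# An 8-frame, 16x16 video becomes 4 x 4 x 4 = 64 patches of 32 values each.
video = np.arange(8 * 16 * 16, dtype=float).reshape(8, 16, 16)
patches = to_patches(video)
print(patches.shape)  # (64, 32)
```

Because the patch sequence has no fixed length, videos of different shapes simply produce more or fewer patches, much as text of different lengths produces more or fewer tokens for a language model.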
Limits and Potential
OpenAI admits Sora still faces challenges accurately simulating complex physics and spatial/temporal details. Subjects may fail to interact properly with objects, confuse left and right directions, or misalign camera motions with prompts.
Nonetheless, the lab says mastering video synthesis marks a milestone on the path toward advanced AI that can model and reason about the real world, and perhaps one day toward artificial general intelligence surpassing human capabilities.
Further technical details will be published in a forthcoming paper. For now, OpenAI aims to gather more external feedback to ensure model safety and maximize societal benefit.
“This technology enables new creative possibilities, but also risks like misleading synthetic media,” the company said. “Engaging broadly will help us understand concerns and identify positive applications.”