OpenAI just dropped something big: Sora, its first text-to-video model, and it's turning heads for all the right reasons. This new tool isn't just good; it's blowing past even OpenAI's own expectations.

Just to give you an idea of how big a deal Sora is: actor and filmmaker Tyler Perry has put his $800 million studio expansion on hold after seeing OpenAI’s Sora in action, cautioning that “Jobs are going to be lost.”

Image: Tyler Perry puts his studio expansion on hold
Source

Ever since ChatGPT started chatting us up, the community has been waiting for OpenAI’s multi-modal advancements, and Sora is worth the wait. Imagine telling your computer to whip up a video of, say, a Dalmatian walking along the colorful buildings of Burano, Italy. And boom, there it is.

Prompt: The camera directly faces colorful buildings in Burano, Italy. An adorable dalmatian looks through a window on a building on the ground floor. Many people are walking and cycling along the canal streets in front of the buildings.

With Sora, we're not just talking about standard video generation. This model can craft videos up to a minute long, in any aspect ratio, and at resolutions up to the crisp clarity of 1080p, showcasing rare flexibility in today's AI landscape. But Sora's skills don't end there. From looping videos endlessly to extending scenes beyond their original frames and transforming mundane backgrounds into captivating scenes, Sora's toolkit is impressively vast, all thanks to the groundwork laid by the DALL·E and GPT models.

Trying Sora

Sora’s got a knack for “simulating digital worlds”. Throw in a prompt with "Minecraft," and it doesn't just play the game; it practically recreates it, complete with a HUD and physics that'll make you do a double-take. And it's all happening while keeping a virtual hand on the game controller.

Or here’s a prompt depicting a stylish woman walking down the streets of Tokyo. The realistic scenes and smooth motion of this video are mind-blowing.

Prompt: A stylish woman walks down a street in Tokyo filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, while carrying a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

Or maybe you need a video of archeologists unearthing a plastic chair in the desert? You probably don’t, but look at how realistic Sora gets with this prompt:

Prompt: Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care. Source

Going beyond text-to-video: video-to-video by Sora

We’ve hardly gotten used to the crazy flexibility of image generation tools, and here’s OpenAI changing the prompt on the exact same video:

Or the ability to blend separate videos together smoothly. See how a drone transforms into a butterfly that gradually goes underwater.

Look at the realistic, long, and dynamic camera movement in the next video. The people in it might be a little wonky, but the scene itself is mind-blowing.

Take a look at this incredibly natural morph between two subjects and the visual effects.

But, of course, it's not all sunshine and rainbows. Sora's still brushing up on getting the physics just right—like not forgetting bite marks on a burger.

Yet, the dream is big: Imagine games generated from thin air just from typing out a description. It's both awesome and a bit spooky when you think about the deepfake side of things. So, for now, OpenAI's keeping Sora under wraps in a limited access program.

Sora vs. Pika vs. RunwayML vs. Stable Diffusion

Gabor Cselle compared the output of different models and shared the results on his X page. He says, “I gave the other models SORA's starting frame. I tried my best prompting and camera motion techniques to get the other models to output something similar to SORA. SORA's just much better at longer scenes.”

Ethical concerns

In the midst of all this excitement, there's a serious conversation brewing about the darker side of AI's capabilities, especially with deepfaked media of celebrities and politicians becoming a growing concern. The Federal Trade Commission is stepping in, proposing rules to curb AI's potential for impersonation fraud.

Red teaming

The US Presidential Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence mentions red-teaming eight times, defining the practice as follows:

“The term ‘AI red-teaming’ means a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers of AI. Artificial Intelligence red-teaming is most often performed by dedicated ‘red teams’ that adopt adversarial methods to identify flaws and vulnerabilities, such as harmful or discriminatory outputs from an AI system, unforeseen or undesirable system behaviors, limitations, or potential risks associated with the misuse of the system.”

Red teaming is basically when experts play the role of potential attackers to test the limits and safety of AI systems. Think of it as a stress test to catch any issues before they become real problems. This is crucial because it helps prevent scenarios where AI could accidentally cause harm or spread misinformation.

OpenAI takes this seriously, and Sora’s not an exception. They're not just releasing it into the wild; instead, they've chosen a group of experts and red teamers to put Sora through its paces. This careful approach shows they’re really thinking about the impact of their tech.

OpenAI stands as a prime example of doing things thoughtfully in the AI world. They’re showing that it’s possible to lead the way in innovation while also being super careful about how technologies like Sora are used. This blend of excitement and caution is pretty much the vibe as we step into new territories of content creation with AI. Sora’s got a lot to offer – from making new forms of entertainment to possibly changing the way we learn and share stories across different cultures.

But as we move ahead, finding that sweet spot between innovation and doing things right is key. In a time when digital content can be so lifelike it’s hard to tell what’s real, paying attention to the ethics behind the tech is more important than ever. So, as we explore all the cool stuff AI can do, let’s not forget the importance of keeping things on the up and up.

After all, in a world where seeing is no longer believing, the truth behind the pixels matters more than ever.

Nuts and bolts of Sora

Peeling back the layers of how Sora operates reveals a sophisticated process at its core. Sora is built on a diffusion model, a technique that begins with an image resembling static noise, which it methodically refines into a clear, coherent video. This process is akin to watching an artist gradually bring a painting to life, starting from a blank canvas.
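To make that concrete, here's a minimal, purely illustrative sketch of a diffusion-style sampling loop in Python. It is not OpenAI's code: the `denoiser` is a stand-in for a learned network that predicts the noise in a video tensor, and the update rule is deliberately simplified.

```python
import torch

def generate(denoiser, shape=(1, 3, 16, 64, 64), steps=50):
    """Toy reverse-diffusion loop: start from pure noise and refine.

    `denoiser` is a hypothetical learned model that predicts the noise
    component of `x` at timestep `t`. `shape` is (batch, channels,
    frames, height, width) for a tiny video tensor.
    """
    x = torch.randn(shape)                      # begin as static-like noise
    for t in reversed(range(steps)):
        t_batch = torch.full((shape[0],), t)    # timestep conditioning
        predicted_noise = denoiser(x, t_batch)  # model's guess at the noise
        x = x - predicted_noise / steps         # peel away a slice of noise
    return x                                    # a progressively cleaner video
```

Real samplers such as DDPM or DDIM use carefully derived per-step coefficients; the point here is only the start-from-noise, refine-step-by-step loop the paragraph describes.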

Sora is remarkably flexible. It can not only generate videos from scratch but also extend existing ones, adding new content while maintaining consistency. This addresses a common challenge in video production: ensuring that characters or subjects remain consistent throughout the video, even if they temporarily leave the scene.

Figure: videos and images represented as spacetime patches
Source

At the heart of Sora's architecture is the transformer model, drawing inspiration from the GPT series known for their scalability and performance. This architecture allows Sora to handle a vast array of visual data, adapting to various durations, resolutions, and aspect ratios with ease.
Sora interprets videos and images as collections of smaller data units or patches, similar to how GPT models view text as tokens. This unified approach to data representation enables Sora to be trained on a diverse set of visual information, offering a broad understanding of visual content.
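To picture what "patches" could look like in code, here's a hedged sketch that slices a video tensor into small spacetime blocks, the visual analogue of text tokens. The patch sizes and tensor layout are illustrative assumptions, not Sora's actual configuration.

```python
import torch

def to_patches(video, pt=2, ph=16, pw=16):
    """Split a (frames, channels, height, width) video into spacetime patches.

    Each patch spans `pt` frames and a `ph` x `pw` spatial region; these
    sizes are illustrative guesses, not Sora's real parameters.
    """
    f, c, h, w = video.shape
    blocks = video.unfold(0, pt, pt).unfold(2, ph, ph).unfold(3, pw, pw)
    # blocks: (f//pt, c, h//ph, w//pw, pt, ph, pw); regroup so that each
    # row of the output is one flattened patch across all channels.
    blocks = blocks.permute(0, 2, 3, 1, 4, 5, 6)
    return blocks.reshape(-1, c * pt * ph * pw)

video = torch.randn(16, 3, 64, 64)   # a tiny 16-frame RGB clip
tokens = to_patches(video)
print(tokens.shape)                  # torch.Size([128, 1536])
```

In a transformer, each of these flattened patches would then be linearly projected into an embedding, just as word tokens are in GPT models.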

Figure: the diffusion process, refining noise into coherent frames
Source

Building upon the foundation laid by DALL·E and GPT, Sora incorporates the recaptioning technique from DALL·E 3, which generates detailed captions for visual training data. This method allows Sora to adhere to the user's text instructions closely, enhancing the relevance and accuracy of the generated video content.
Furthermore, Sora is equipped to animate still images or extend videos, filling in missing frames with precision. This ability showcases Sora's attention to detail and its potential as a tool for creative and practical applications.
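As a rough illustration of the recaptioning idea (the function names below are hypothetical, not an OpenAI API), a training pipeline pairs each video with a detailed model-generated caption, and at inference a language model can similarly expand a terse user prompt:

```python
def recaption_dataset(videos, captioner):
    """Pair each training video with a rich, detailed caption.

    `captioner` is a stand-in for a learned captioning model, in the
    spirit of the DALL·E 3 recaptioning technique described above.
    """
    return [(video, captioner(video)) for video in videos]

def expand_prompt(user_prompt, language_model):
    """Turn a short user prompt into a detailed caption before generation.

    `language_model` is assumed to map short text to a longer visual
    description, which helps the generator follow instructions closely.
    """
    return language_model(
        f"Describe this scene in rich visual detail: {user_prompt}"
    )
```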

Wrapping up

The big picture? Sora's not just about making videos; it's about reshaping our reality, opening doors to new ways of learning, entertaining, and maybe even dreaming. Sure, it's a bit of a wild ride, with all the ethical twists and turns, but that's the thrill of it.

With Sora, we're not just watching the future unfold; we're scripting it, one prompt at a time.
