Why OpenAI’s Sora is Secretly a Massive Step Towards AGI

Thomas Smith, February 29, 2024

Last week, OpenAI released examples from its new AI video generator, Sora. I’ve seen this coming for a long time, but like everyone else, I was blown away by the quality of Sora’s video creations.

OpenAI’s examples show the obligatory cat and puppy videos, but also things like a stunning aerial pan across a beach at California’s Big Sur, or woolly mammoths frolicking in the snow.

The tech is so good that filmmaker Tyler Perry reportedly put an $800 million expansion of his studio on hold after seeing Sora’s videos. There was no point in expanding his studio’s physical footprint, Perry said, since he would soon be able to replace a physical studio with AI.

Certainly, Sora is a huge step forward for AI video generation. But the real breakthrough behind Sora is much more dramatic — and more disruptive.

Models of the World

Generating AI videos is far harder than generating images.

To create an AI image, generative AI systems need to learn how to assemble pixels into something that resembles an object, place, person, or avocado chair.

It’s a tough technical challenge. But the system doesn’t necessarily need to understand what it’s creating in order to make an AI image. It just needs to gradually remove the non-cat-like or non-person-like features of the image until it reaches something that its internal systems judge to be acceptable.
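For the curious, that "gradually remove what doesn't belong" idea can be sketched in a few lines of Python. This is a toy illustration only, not how Sora or any real image generator is built: the five-"pixel" `TARGET` pattern, the `estimate_noise` helper, and the step size are all made-up assumptions. A real system learns its noise estimates from huge amounts of training data instead of being handed the answer.

```python
import random

# Hypothetical stand-in for a trained network. A real system learns to
# estimate the noise in an image; here we simply "know" the clean pattern.
TARGET = [0.0, 0.5, 1.0, 0.5, 0.0]  # a tiny 5-"pixel" image (assumed)


def estimate_noise(image):
    """Toy denoiser: report how far each pixel is from the clean pattern."""
    return [pixel - clean for pixel, clean in zip(image, TARGET)]


def generate(steps=50, step_size=0.1, seed=0):
    """Start from pure noise and remove a little estimated noise per step."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for _ in range(steps):
        noise = estimate_noise(image)
        image = [pixel - step_size * n for pixel, n in zip(image, noise)]
    return image


def distance(image):
    """Largest per-pixel gap between an image and the clean pattern."""
    return max(abs(pixel - clean) for pixel, clean in zip(image, TARGET))


if __name__ == "__main__":
    for steps in (1, 10, 50):
        print(steps, "steps, distance from target:", distance(generate(steps)))
```

Each pass strips away a fraction of whatever doesn't match the pattern, so the image drifts from static toward the target. Nothing in the loop knows what a cat is; it only knows how to subtract what looks wrong.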

Creating convincing videos is different. In a video, objects and their environment don’t simply exist as static entities; they interact with each other in rigid, rules-guided ways.

A ball rolling across a table, for example, is unlikely to suddenly fall through the table’s solid wood top. A race car driving through San Francisco isn’t going to suddenly levitate off the ground or transform into an elephant.

Likewise, if two people are walking, a third person generally won’t suddenly appear out of the ether and punch one of them in the face.

The limitations and physics-bound interactions of objects in the real world seem obvious to us as adult humans. But our knowledge of them is actually hard-won.

Studies have shown, for example, that babies spend an inordinate amount of time watching the world to understand its physical laws.

When babies are shown a scene that violates the laws of physics, their brains suddenly light up, as they try to integrate the strange new scene with their existing predictions about how the physical world works.

Even as adults, we make constant predictions about how objects around us will interact.

We’ve all had the experience of trying to pick up an object that turns out to be lighter than we expect (an empty milk jug we thought was full, say).

We grab it with more force than is necessary, violently (and often hilariously) jerking it into the air.

This happens because our brains subconsciously predict the weight of every object we see before we ever go to pick it up. Based on our previous experiences of the physical world, we make assumptions about how the world around us will behave.

In all these cases, we may not notice that we’re doing anything special. But in reality, as humans, we’re constantly making assumptions about how the physical world is set up, based on our extensive experience living in it.

Physical Computers

Unlike humans, computers have no innate knowledge of the physical world.

Researchers have tried to systematically teach them the kinds of basic physical knowledge that humans take for granted. The Kinetics database, for example, consists of 500,000 video clips capturing 600 different types of human movement. It’s often used to train machine learning systems to do things like predict when a hospital patient is at risk of falling.

Teaching these physical basics is slow going, though. Training a computer to identify a falling human is one thing. But training it to understand how dust might fly up from the wheels of a car driving on a dirt road, or how light might play through the translucent petals of a flower at sunrise, is a much tougher ask.

Ultimately, that’s what makes Sora so impressive. As OpenAI explains in its announcement about the system, Sora isn’t just creating cool videos; it’s creating those videos based on a model of the physical world that it’s developed entirely on its own.

Sora “…understands not only what the user has asked for in the prompt, but also how those things exist in the physical world,” OpenAI’s announcement says. “We’re teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction.”

Sora, in other words, has likely reviewed millions or billions of hours of video of the real world. It may also have been trained on the output of the game engines used to power modern videogames and special effects, such as Unreal Engine.

From all that watching, Sora has developed a detailed model of how the physical world works. Like a baby observing objects around it, Sora has learned that cars drive on roads, dogs make cute faces, and art galleries are filled with lovely directional light.

From this knowledge, it can create long and convincing videos. When you ask it to create a scene of a person walking through Tokyo, it draws on its derived knowledge and model of the world to create a nonexistent, digital place where that thing is happening. That’s the video.

Again, the videos would be impressive on their own. But the fact that Sora developed its own internal model of the world is far more impressive — and far more impactful.

Developing a world model is a huge step towards Artificial General Intelligence (AGI), the holy grail of AI research.

As OpenAI says in its announcement, “Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.”

If Sora truly understands how the world works, it could use that knowledge for way more than creating fun videos. It could guide a robot through a real-world environment, for example, or write like a person who has genuinely experienced life, in all its fully embodied glory.

Given the potential impact of an AI world model, it’s no surprise that other companies have tried to throw cold water on OpenAI’s discovery. Researchers from rival AI companies like Meta have argued that Sora doesn’t really understand the world but rather is simply mimicking patterns that it’s seen in its training data.

I don’t buy this argument.

Imagine that my dog Lance suddenly started reciting Shakespeare. A poet would probably argue that, being a dog, he doesn’t fully understand the words of Shakespeare.

But who cares? He’d still be a dog reciting Shakespeare.

Likewise, if AI develops a representation of the world that’s good enough to allow it to do useful things, the exact technical aspects of how it’s accomplished that task hardly matter.

If Sora has created a model of the world simply through observation that allows it to make good predictions and reason accurately, that’s as close to understanding as it needs to get. And based on the videos I’ve seen, it seems to have done just that.

The Future of Sora

Sora currently isn’t available to the general public. Given the potential destructive power of an AI system with a detailed model of the real world hidden in its silicon brain, OpenAI is understandably approaching Sora with some caution.

It’s currently red-teaming the system, bringing in trained experts to try to get Sora to do destructive things, so the company can fix those weaknesses before the system’s release to the general public.

Ultimately, though, if an AI system can develop a detailed and accurate model of the world simply through observation (and the early results from Sora suggest it can), OpenAI will not be the only company to develop such a system.

Training these models is likely enormously expensive — even more so than training generative AI image creators. In a few years, though, it’s likely that even open-source models will catch up to Sora’s capabilities, and we’ll be awash in world-mimicking AI.

That could unlock powerful video creation capabilities, as Perry predicts. But it could also do much more, revolutionizing robotics, making all AI systems better at logical reasoning, and taking us a step closer to AGI.

Lifelike AI cat videos are cool. But as with so many things in the artificial intelligence world, they’re merely a harbinger of much bigger things to come.

Discover more from Bay Area Telegraph