First Hands-On With RunwayML’s Gen-3 Alpha Video Generator

An adorable kitten plays outdoors, bathed in the diffuse glow of a beautiful sunset.

Suddenly, the kitten’s head pops off. Its front half and back half walk off in different directions. The video ends.

That’s just one example of the bizarre results I’ve seen from my initial testing of RunwayML’s new Gen-3 Alpha. Billed as the most advanced AI video generator currently on the market, Gen-3 Alpha just came out of private beta last week.

That makes it one of only two AI video generators you can actually use, alongside Luma’s Dream Machine. (OpenAI’s Sora, which dazzled people with cinematic films and stunning AI-created shots, is still not available to anyone beyond people like Tyler Perry.)

How does RunwayML’s new model do? Is my bifurcated kitten an aberration, or typical output? I tested the new model to see how it performs — and if it could be a Sora-killer.

Models of the World

Gen-3 Alpha is the newest model in RunwayML’s line of AI video generators. RunwayML made a splash with its previous Gen-2 model, which was the first even remotely viable video generator available to the public.

Gen-2 was cool, but very rudimentary. Its output was only vaguely video-like, and its creations often lacked detail and realism. Animals had multiple noses. Backgrounds were often shown in soft focus, probably to hide the fact that Gen-2 didn’t know what to put there.

Example from the Gen-3 model. Credit: RunwayML

Gen-3, RunwayML says, is “a major improvement in fidelity, consistency, and motion over Gen-2, and a step towards building General World Models.”

That’s quite an ambitious claim. General World Models are essentially computerized models of the physics, lighting, objects, and other elements of the real world.

OpenAI’s Sora is widely believed to rely on a General World Model built by watching billions of hours of videos, including most of YouTube.

If Gen-3 Alpha uses a similar model — and achieves good results — that would be a strong validation of the idea that these kinds of models are genuinely useful for creating cinematic AI-generated video.

Stunning Vistas, Tasty Chicken

I decided to put Gen-3 Alpha through its paces. I signed up for a paid account (it costs $15 per month) and began testing.

To start, I decided to give Gen-3 Alpha some fairly easy video prompts. Unlike Gen-2 and competing systems like Luma’s Dream Machine, Gen-3 Alpha doesn’t allow you to upload an image to prompt the system; it only allows text-based prompting.

To get the best results, RunwayML recommends writing prompts in a super specific way.

Basically, you need to describe the shot as if you’re a cinematographer, peppering in terms like “FPV,” “Macro cinematography,” or “SnorriCam,” or lighting styles like “Venetian lighting” or “Diffuse lighting,” to achieve the shot you want.
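In practice, the prompts end up following a rough pattern: [shot type or camera move]; [subject and action]; [setting, lighting, or style]. That template is my own shorthand based on Runway’s examples, not official syntax, but it yields prompts like “FPV; a hawk dives toward a river at dawn; diffuse lighting.”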

I started with a simple aerial shot of a golf course along the Pacific Ocean. Here’s my exact prompt: “Wide angle; The camera pans across a golf hole on a golf course overlooking the Pacific Ocean on a sunny day.”

So far so good! The shot looks incredibly realistic and captures the aesthetic I was going for. In my mind, I was thinking of Half Moon Bay Golf Links, a real Pacific Ocean golf course I’ve photographed many times.

Here’s the real-world place:

The vibe in Gen-3 Alpha’s video is spot on. We’ve got the manicured course, the cliffs, and even the crashing waves in the background.

Next, I tried another popular type of shot — the food video. In this case, I asked for: “Close-up; panning across a plate of fried chicken on a restaurant table”

That looks more like chicken piccata than fried chicken to me. Still, the detail here is impressive: Gen-3 even dreams up a beer and a plate of Brussels sprouts in the background to complement our chicken.

I also love that this one perfectly captures the hesitant, shaky camera work of a real food video. This feels like something a person could have taken in a restaurant — if you found it on social media, you’d definitely think it was real.

Finally, I asked Gen-3 Alpha to create something a little more bizarre — a prompt dreamed up by my seven-year-old. I find that kids come up with the best ideas for AI-generated photos and videos. My son wanted to see “a donkey riding on a train.”

Here’s the result:

Don’t Make Me Move

So far, my results were looking great. But to fully test the system, I decided to give it some more challenging prompts.

Specifically, I started to give it prompts that required more than a simple panning shot across an object, or a simple scene. Instead, I asked for dynamics — objects moving and interacting.

And that’s where things fell apart.

First, I asked for “Cellphone video; A Bichon Frise dog jumps from a couch onto a very high bookcase”

That’s very weird! First, we get what looks like a Yorkie — not a Bichon. For some reason, the video runs backwards.

And as the dog jumps, it somehow sprouts another head, which subsequently disappears (in slow motion!) as it lands on the bookshelf.

Next, I asked for a Millennial woman cooking. The results:

Again, asking objects to interact with each other within the scene seems to break the system.

The cooking implement keeps morphing, from a slotted spoon to some kind of metal spatula. And as the person’s hands stir the dish, a burned hole seems to appear on the bottom of the pan.

And then, of course, there’s Bifurcated Kitty!

For this one, I didn’t even ask for dynamics — I just wanted a cute kitten. But by trying to make the kitten play, Gen-3 Alpha inadvertently created a deeply disturbing (yet somehow, strangely compelling) clip.

What’s Happening?

Why is Gen-3 Alpha failing when it tries to make objects interact?

In general, AI video generators struggle with dynamics. As OpenAI discloses, even its flagship Sora “may struggle to simulate the physics of a complex scene, and may not comprehend specific instances of cause and effect (for example: a cookie might not show a mark after a character bites it).”

Sora’s example videos include an instance where a pack of puppies plays and new puppies suddenly appear at random. They also include a video of a grandmother blowing out candles on a cake; despite her blowing (and her family members applauding), the candles don’t go out.

These errors are very similar to the ones I saw in my testing of Gen-3 Alpha.

Interestingly, Luma Labs’ Dream Machine doesn’t appear to suffer from these same issues. In my testing, it made other mistakes, but it didn’t seem to struggle nearly as much with simulating dynamics.

Why’s that? These struggles may be specific to the General World Model approach for creating AI videos.

Models like Sora and Gen-3 Alpha attempt to simulate the physics of the scenes they’re creating. This is a powerful approach, since it allows the models to create extremely realistic videos of nearly anything. But it also means that when the physics fail, the videos come out looking crisp and clear, but absolutely ridiculous.

Luma Labs doesn’t talk much about its model’s approach. But from my testing, it appears that its video generator works by creating an AI image, and then evolving it into a video.

That’s very different from trying to simulate the entire world in the silicon brain of a machine. The physics and interactions are likely constrained by the specifics of the base image.

That makes systems like Dream Machine potentially less powerful, and also less capable of generating long videos. But it also makes them less prone to the kind of strange mistakes that Runway Gen-3 Alpha often makes.
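To make the difference concrete, here’s a toy Python sketch of the two pipelines as I understand them. Every function is a hypothetical stand-in for an enormous neural network; this shows the shape of the two approaches, not anyone’s actual code.

NUM_FRAMES = 24

def build_world_state(prompt):
    # Hypothetical: parse the prompt into objects, physics, and lighting.
    return {"prompt": prompt, "t": 0}

def simulate_step(scene):
    # Hypothetical physics update; in a real world model, an error here
    # propagates into every later frame.
    return {**scene, "t": scene["t"] + 1}

def render(scene):
    # Hypothetical renderer.
    return f"frame {scene['t']}"

def text_to_image(prompt):
    # Hypothetical text-to-image step.
    return f"still image of: {prompt}"

def evolve_frame(frame):
    # Hypothetical small-motion update, anchored to the previous frame.
    return frame

def world_model_generate(prompt):
    # World-model approach (Sora, Gen-3 Alpha): simulate the whole scene,
    # then render each step of the simulation.
    scene = build_world_state(prompt)
    frames = []
    for _ in range(NUM_FRAMES):
        scene = simulate_step(scene)
        frames.append(render(scene))
    return frames

def image_first_generate(prompt):
    # Image-first approach (how Dream Machine appears to work): generate one
    # still, then evolve it frame by frame, tethered to the base image.
    frame = text_to_image(prompt)
    frames = [frame]
    for _ in range(NUM_FRAMES - 1):
        frame = evolve_frame(frame)
        frames.append(frame)
    return frames

The point is the loop structure. In world_model_generate, every frame depends on an ever-evolving simulated state, so one physics mistake compounds. In image_first_generate, every frame is tethered to the same base image, which makes two-headed dogs harder to produce, but also rules out one-minute tracking shots.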

As with most things in generative AI, the problem likely comes down to training data. Runway Gen-3 Alpha and Sora created their General World Models by watching lots of videos. That works great when they’re simulating things that appear often in web videos — panning across some food, or showing a beautiful landscape.

But when they’re asked to create something totally new — like a Bichon jumping onto a bookshelf — the training data isn’t there. Their General World Models have no visibility into that specific part of the real world. And so they fail.

The Future

Over time, the gap will close. As systems like Gen-3 Alpha ingest more training video and their understanding of real-world physics improves, their General World Models will serve them extremely well.

Again, General World Models allow for simulating far more complex scenes. They also allow for longer video clips (Sora brags of generating clips up to one minute long). A General World Model is essentially creating a virtual copy of reality each time it dreams up a video, and it can thus spend as much time simulating things within that virtual world as it wants.

Gen-3 Alpha, despite its flaws, is also a huge step forward for the AI video space as a whole. RunwayML’s success (two-headed dogs aside) in launching a General World Model suggests that OpenAI will have healthy competition in the space if it finally does launch Sora. As RunwayML’s models improve, they could be Sora-killers.

Models like Gen-3 Alpha do extremely well at certain things (namely, creating videos that have limited dynamics and that show scenes commonly depicted in online videos), and fail spectacularly at others.

In the future, General World Models might be able to create movie-length videos, simulating complex virtual worlds with multiple settings, characters, and scenes.

For now, though, it’s Bifurcated Kitties all the way down!


Originally appeared on The Generator on Medium.

Thomas Smith

Thomas Smith is a food and travel photographer and writer based in the San Francisco Bay Area. His photographic work routinely appears in publications including Food and Wine, Conde Nast Traveler, and the New York Times and his writing appears in IEEE Spectrum, SFGate, the Bold Italic and more. Smith holds a degree in Cognitive Science (Neuroscience) and Anthropology from the Johns Hopkins University.
