We have seen OpenAI’s Sora, we have seen ByteDance’s Goku+, and we have seen Google’s Veo, so we know AI is getting better at generating video by the day. Now, researchers from NVIDIA and Stanford have released a paper demonstrating an AI system that generates a complete one-minute Tom & Jerry episode.
Researchers created a Tom & Jerry episode using AI
Tom & Jerry is perhaps the most recognizable cartoon in the world. The iconic cat-and-mouse duo has now officially entered the world of AI.
A team of researchers from NVIDIA, Stanford, UC Berkeley, UCSD, and UT Austin has unveiled a new AI project that uses Test-Time Training (TTT) to generate a complete one-minute video.
In the AI-generated video, Tom arrives at an office, takes the elevator, and gets to work at his desk. Then Jerry strikes, chewing a wire and kicking off a classic cat-and-mouse chase.
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training.
— Karan Dalal (@karansdalal) April 7, 2025
We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency.
Every video below is produced directly by…
Normally, making long AI videos is tricky. Most AI models today can barely manage a 10 to 20-second clip before losing track of the story. That’s because traditional AI video generators struggle to keep track of a large amount of context over time: a one-minute video corresponds to hundreds of thousands of tokens, and attending over all of them at once becomes incredibly slow and unwieldy.
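To make the scale concrete, here is a back-of-envelope calculation. The 300,000-token figure is an assumption on our part (roughly in line with modern video tokenizers, not a number stated in this article), but it illustrates why full self-attention, whose cost grows with the square of the sequence length, becomes impractical for minute-long videos.

```python
# Assumed figure: a one-minute video spans ~300,000 tokens after tokenization.
n_tokens = 300_000

# Full self-attention compares every token with every other token,
# so the number of pairwise attention scores grows quadratically.
pairwise_scores = n_tokens ** 2

print(f"{pairwise_scores:,} pairwise scores")  # 90,000,000,000 pairwise scores
```

That is tens of billions of comparisons for a single layer, which is why researchers look for alternatives whose memory cost does not balloon with sequence length.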
What’s this new AI video generation method?
The new method addresses this with Test-Time Training (TTT) layers. In simple terms, a TTT layer is a special kind of model component that updates its “memory” even while it is being used to generate output (or “tested” on new data). These new layers basically let the AI learn and adapt while it’s generating the video.
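The core idea can be sketched in a few lines. This toy version is our own simplification, not the paper’s implementation: the layer’s “memory” is the weight matrix of a tiny linear model, and for each incoming token the layer takes one gradient step on a self-supervised reconstruction loss before emitting its output. The learning rate and loss are illustrative choices.

```python
import numpy as np

def ttt_linear_forward(tokens, lr=0.1):
    """Toy Test-Time Training layer (simplified sketch, not the paper's code).

    The hidden state is a weight matrix W. For each token we take one
    gradient step on the self-supervised loss ||W x - x||^2, then emit
    the updated model's prediction. Memory cost stays constant no matter
    how long the sequence gets.
    """
    d = tokens.shape[1]
    W = np.zeros((d, d))                    # hidden state = learnable weights
    outputs = []
    for x in tokens:                        # process the sequence token by token
        grad = 2 * np.outer(W @ x - x, x)   # dL/dW for the reconstruction loss
        W = W - lr * grad                   # "train" while generating
        outputs.append(W @ x)               # predict with the updated state
    return np.stack(outputs), W

# Usage: feed the same token repeatedly; the layer adapts on the fly
# and its reconstruction steadily improves.
x = np.ones(4) / 2                          # unit-norm token
outs, W = ttt_linear_forward(np.tile(x, (20, 1)))
```

The key contrast with plain attention is that the state here is a fixed-size matrix that is *trained* as the sequence streams by, rather than a cache that grows with every token.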
The researchers tested the new layers on Tom & Jerry for a specific reason: the animated series is known for fast-paced action and quick scene changes, which makes long-range consistency especially hard.
They assembled a dataset of about 7 hours of these cartoons and broke each episode down into detailed 3-second segments. Human annotators wrote rich descriptions for these segments, providing everything from the background setting to the characters’ actions and even camera movements.
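The segmentation step described above can be sketched as follows. This is a hypothetical illustration of the data-prep bookkeeping (the function name and structure are ours, not from the paper): each episode is cut into consecutive 3-second windows, which annotators then caption.

```python
def segment_episode(duration_s: float, segment_s: float = 3.0):
    """Return (start, end) times, in seconds, for consecutive
    fixed-length segments of an episode. Hypothetical sketch of the
    dataset-prep step; any trailing remainder shorter than a full
    segment is dropped."""
    segments = []
    t = 0.0
    while t + segment_s <= duration_s:
        segments.append((t, t + segment_s))
        t += segment_s
    return segments

# Usage: a 10-second clip yields three full 3-second segments.
print(segment_episode(10))  # [(0.0, 3.0), (3.0, 6.0), (6.0, 9.0)]
```

At roughly 7 hours of footage, this scheme yields on the order of 8,000 segments to caption, which is why the annotations had to be short but dense.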
In human evaluations, the researchers compared the new system against alternative long-context methods such as Mamba and DeltaNet, and the TTT version came out ahead: its episodes followed the story more faithfully and showed smoother motion.
The researchers enhanced CogVideoX, a pre-trained model with 5 billion parameters initially limited to 3-second clips, by adding TTT layers. Through staged fine-tuning, they extended its capabilities to generate videos lasting 9, 18, 30, and eventually 63 seconds.
Despite the progress, the model still faces challenges, such as objects morphing between segments or showing sudden lighting shifts.
The researchers explained that their current experiments are restricted to one-minute videos because of computational limits. However, the method is theoretically capable of scaling to longer videos and more intricate storylines, with the potential to transform the animation and video production industries.
What is the public’s reaction?
The AI-generated Tom and Jerry video has sparked a spectrum of reactions online. Some viewers are amazed by the new technology’s ability to create a one-minute animated episode from a simple text prompt.
However, some viewers have expressed concerns about the animation’s quality, describing it as “soulless” and noting various errors.
There are also apprehensions regarding the potential impact on human animators, with fears that such technology might overshadow traditional artistic efforts.
Takeaways
The researchers believe that with more improvements, we could see even longer AI-generated cartoons, better storylines, and more detailed animations. You can read the research paper here. Note that the current system is a proof-of-concept focused on one-minute videos.