10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more
The cutting-edge technical discussions beneath the wow factor.
Author note: I normally post on Wednesdays, but when things are timely I often want to get them out into the world. When the content schedule is normal and standalone, you can still expect it on Wednesdays. Sometimes, like this, I may just send it early. These days, I’m worried about another major release taking this out of the mind-share before I post it.
With videos, I recommend viewing on the web rather than in email.
Most people agree that Sora will capture the general public mindshare due to the wow factor and developers will benefit most from the Gemini context length changes in the short term. In between the shock and awe, there’s so much to learn about ML from folks analyzing these releases. Here’s a ranked list of those that are most interesting to me. In case you missed it, I already covered the technical details and general overview of Gemini 1.5 and Sora in my previous post.
And oh, before we start, Mistral confirmed on Discord that the model in the arena is their next LLM.
1. Deepfake detection of Sora
The first thing that many people thought of when seeing Sora was deepfakes. This isn't where my brain landed, thinking about the immediate value gain, but it's obviously prescient. I give it less than a 1% chance that the general public will have access to Sora before the 2024 US elections for this reason: multimodal red teaming and safety are different and largely uncharted territory. We see this in the emergence of the multimodal RLHF space: we have very little idea how people will use these models, so how do we fine-tune them?
To start, let’s watch part of the video OpenAI used in the announcement tweet:
Soon after, a friend of mine now working on 3D rendering posted an analysis of the video put through a Gaussian splatting toolkit (from this project). In short, the Sora-generated world is not graphically coherent like the real world:
For comparison, here's a real-world video that has been Gaussian splatted.
Gaussian splatting is one of the best-named techniques in ML. Essentially, it represents a scene as a collection of 3D Gaussians and renders pixels by projecting, or splatting, those Gaussians onto the image plane and blending them. Here's an introductory blog post, video, or the paper website with more examples.
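For the curious, the rendering step boils down to alpha-compositing depth-sorted, projected Gaussians per pixel. A rough sketch of the standard formulation (notation mine, following the 3D Gaussian Splatting paper, not the splatting toolkit used above):

```latex
C(\mathbf{p}) = \sum_{i=1}^{N} c_i \,\alpha_i(\mathbf{p}) \prod_{j=1}^{i-1}\bigl(1-\alpha_j(\mathbf{p})\bigr),
\qquad
\alpha_i(\mathbf{p}) = o_i \exp\!\Bigl(-\tfrac{1}{2}\,(\mathbf{p}-\boldsymbol{\mu}_i)^{\top}\Sigma_i^{-1}(\mathbf{p}-\boldsymbol{\mu}_i)\Bigr)
```

Here each 3D Gaussian is projected to a 2D mean \(\boldsymbol{\mu}_i\) and covariance \(\Sigma_i\) on the image plane, \(o_i\) is its learned opacity, \(c_i\) its color, and the Gaussians are composited front to back at each pixel \(\mathbf{p}\).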
2. Playing with long-context, problem settings, and prompting
The first thing all of us nerds want to do with an ultra-long-context model is ask it questions with an entire code base in context (in the prompt). Even larger code bases like PyTorch will fit in Gemini 1.5's experimental 10-million-token context length (the 10-million results were in the paper; 1 million is what is available early).
Thankfully, someone already put this to the test! The TLDR is that it often ties and sometimes beats GPT-4 Turbo. This is with a Gemini "Pro" line model, which on benchmarks and most other tests was behind GPT-4. The Ultra model will take this to another level.
One of my most popular use cases for ChatGPT is repetitive data-processing tasks on relatively long segments. ChatGPT in recent days has yet to mess up copying simple data. With this type of performance, you can make a call to Gemini that is simply "document my code base," and it'll return your entire working code base with docstrings and type hints. Talk about immediate value.
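To make the "whole repo in the prompt" idea concrete, here's a minimal sketch of what that call could look like. The client call and the characters-per-token heuristic are placeholders I made up, not any specific API; the only real work is concatenating files and sanity-checking the rough token count against the context budget.

```python
import pathlib

# Rough heuristic: ~4 characters per token for code (an assumption, not a real tokenizer).
CHARS_PER_TOKEN = 4
CONTEXT_BUDGET = 1_000_000  # the publicly available Gemini 1.5 context length

def pack_repo(root: str, suffixes=(".py", ".md", ".toml")) -> str:
    """Concatenate a code base into a single prompt, file by file."""
    chunks = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            chunks.append(f"\n===== {path} =====\n{path.read_text(errors='ignore')}")
    return "".join(chunks)

repo_text = pack_repo("./my_project")
est_tokens = len(repo_text) // CHARS_PER_TOKEN
assert est_tokens < CONTEXT_BUDGET, f"repo is ~{est_tokens} tokens, too big for one call"

prompt = (
    "Here is my entire code base. Add docstrings and type hints to every function, "
    "and return the full files otherwise unchanged.\n" + repo_text
)
# response = client.generate(prompt)  # hypothetical long-context API call
```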
This sort of long-context capability is required for many non-text domains such as DNA processing. Stable and accessible long-context training ultimately means even more fields can get touched by the Attention Is All You Need curse (or the Bitter Lesson, really). Wide and exciting error bars are coming in the future here. We'll see if Gemini needs Anthropic's simple "prompting trick" for long context, which isn't much of a trick.
Someone was also impressed by 1.5's ability to analyze long-form videos, but this wasn't a comparison to existing models and ideas, so it's a little more exploratory.
3. Gemini paper snooping: contamination and citation games
While these labs are intentionally not sharing their core developments in their papers, I think all of the major reports from Google, OpenAI, or whoever have still held many super interesting details. It's funny, because if they published the full papers, all of these details would be known. Many people operate in the mindset that there is nothing to learn from them, but there's still value in reading.
In this case, the Gemini 1.5 report had two details that caught my eye. First, a clear comment on the practical challenges of evaluation contamination. Quoting the paper (emphasis mine):
An analysis of the test data leakage of Gemini 1.0 Ultra showed that continued pretraining on a dataset containing even a single epoch of the test split for HumanEval boosted scores from 74.4% to 89.0%, highlighting the danger of data contamination. We found that this sharp increase persisted even when examples were embedded in extraneous formats (e.g. JSON, HTML). We invite researchers assessing coding abilities of these models head-to-head to always maintain a small set of truly held-out test functions that are written in-house, thereby minimizing the risk of leakage.
They then proceed to discuss their proposed benchmark, Natural2Code, which follows the HumanEval format but doesn’t have this problem.
The second one is far funnier. Google cited the work of Liu et al. (2024) as "recent concurrent work" for a multimodal context length of up to 1 million tokens. The funny thing is, this reference doesn't appear in the references of the report, so finding it is left as an exercise to the reader. With a little back and forth from a reader, we think it's this paper for Ring Attention. If I had to guess, Google came up with something similar. If you look into all of these attention variants, they're getting very specific and there are likely subtle differences between some of them. Another long-context paper we found in this process proposed shifted sparse attention.
4. Training data and token estimates of YouTube
Estimating token counts is the greatest consulting-style (Fermi estimate) problem for ML practitioners. Guessing how many tokens OpenAI could easily extract from YouTube is the one of the day.
Most of the estimates rely on two key facts: 1) the number of videos on YouTube and 2) the number of tokens in any given video. Roughly, it seems like there are about 1 billion videos on YouTube. If you say the average video is 10 minutes and you tokenize a sample of frames rather than all 30 frames per second (300 frames per video works out to roughly one frame every two seconds), that's 300 billion frames, times 256 tokens per frame (with patching), which gives ~76.8 trillion tokens.
Another way to estimate the token count of video is through the Gemini 1.5 report, which logs 3 hours of video as 2.8 million tokens. That's about 15.5 thousand tokens per minute of video. YouTube gets about 500 hours of new video uploads per minute, so they're getting almost 500 million new tokens every real-time minute.
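Here's the same back-of-envelope math as a script, with every assumption spelled out so you can swap in your own numbers:

```python
# Fermi estimate of YouTube's token count. All inputs are rough assumptions.
num_videos = 1e9            # ~1 billion videos
frames_per_video = 300      # sampled frames (~1 every 2 seconds of a 10-minute video)
tokens_per_frame = 256      # with patching

tokens_from_frames = num_videos * frames_per_video * tokens_per_frame
print(f"Estimate 1: {tokens_from_frames:.3g} tokens")    # ~7.68e13, i.e. ~76.8 trillion

# Estimate 2: scale the Gemini 1.5 report's "3 hours -> 2.8M tokens" figure.
tokens_per_minute = 2.8e6 / 180                           # ~15.5k tokens per minute of video
upload_hours_per_minute = 500                             # YouTube's rough upload rate
new_tokens_per_minute = upload_hours_per_minute * 60 * tokens_per_minute
print(f"Estimate 2: ~{new_tokens_per_minute:.3g} new tokens per real-time minute")  # ~4.7e8
```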
I received multiple other estimates from folks in the 1 quadrillion range (10^15). It seems like a safe bet that the total amount of data available on YouTube is easily in the range of 100 trillion-plus tokens. However, the important question is how much models like Sora downsample for pretraining. To get high-quality video, you can't just train on all the 480p videos. OpenAI almost surely filtered by quality as one of the initial steps, in order to get the high-quality outputs they have today.
Regardless, moving these huge datasets around takes some wild infrastructure. Many companies in the LLM space are only now dealing with updating their training hardware for image capabilities (higher data storage and I/O requirements).
Many people brought up the fact that OpenAI acquired a game engine company recently, but that's likely a confounding variable. OpenAI's strengths have always been bridging from 0 to 1 on new scopes of internet-scale generative pretraining. The game engine acquisition was mostly an acquihire.
5. Unlocking model-based RL and downstream research
Sora's most likely area to unlock another step-function change in research progress is robotics. We know AI-generated video will put strong negative economic pressure on lots of existing industries, but it's more fun to consider those that'll potentially go from 0 to 1. In fact, Sora isn't the first world simulator to come out recently: the self-driving startup Wayve released a driving transformer-as-a-simulator called GAIA-1, which now looks quite similar. In the last few years, the defining success of deep RL has been the ability to use high-speed simulators to bootstrap policies in the real world. Sora is not yet high-speed, but it can address the problem of simulating complex dynamics reasonably well (or generalizing to new domains). It's in the 2-4 year bucket because we need the GPU build-out to proceed further before researchers have access to these text-to-video models, but it's a good time to place early bets.
There’s also a rabbit hole to go down with the fact that Sora is a latent diffusion model, so we could do planning in the latent space (like many popular RL approaches these days), but I think this is secondary to the points above.
6. Midjourney style matching, V-JEPA, replicating Sora in the open
Soon after the Sora launch, someone compared all the Sora videos to the results with the same prompts in Midjourney. The results are striking — Sora’s style is remarkably similar to the default Midjourney style. The video is shown below:
Some folks then wondered if this is due to OpenAI bootstrapping data from Midjourney, using the same architecture, or something else. If I had to put money on it, it's just because they're both scraping YouTube. All of our generative models are drinking from the same well, and for domains like video, where filtering is harder, we're going to have similar results.
What confuses me most here is how the Sora / DALL-E prompt re-writing via ChatGPT even makes this possible. When you pass a prompt into Sora, it's re-written by ChatGPT and then passed into the model. As far as we know with Midjourney, when you write a prompt, it's passed into the model directly. Even if they do some re-writing, I would guess it's not through ChatGPT. Then how do we end up with the same styles from the same prompt, if one is re-written? There's something much deeper at play here.
Meta's big announcement that got buried in the news cycle, V-JEPA, is a video understanding model that could actually help us answer some of these questions and build open video models.
7. Architectures and academic links
OpenAI has also said that the Sora model is a diffusion transformer (a vision transformer used in a latent diffusion process), but there's still lots of discussion on what the recent literature hints they could be doing. Here's a note from an author of the diffusion transformer that goes into much more detail on his thoughts on the process: video compressor networks, simplicity, scalability, and much more speculation on the future of image and video generation. The thread is a good read.
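For a sense of what "diffusion transformer" means in code, here's a heavily simplified sketch in PyTorch: embed latent spacetime patches, add a timestep embedding, run a plain transformer, and predict the noise. None of this reflects OpenAI's actual implementation or scale; it's the published DiT recipe in miniature, with text conditioning, the video VAE, and adaLN conditioning omitted and all layer sizes made up.

```python
import torch
import torch.nn as nn

class TinyVideoDiT(nn.Module):
    """Minimal diffusion-transformer sketch: latent patches in, predicted noise out."""

    def __init__(self, patch_dim=256, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.patch_in = nn.Linear(patch_dim, d_model)     # embed spacetime latent patches
        self.time_mlp = nn.Sequential(                    # embed the diffusion timestep
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.patch_out = nn.Linear(d_model, patch_dim)    # predict noise per patch

    def forward(self, latent_patches, t):
        # latent_patches: (batch, num_patches, patch_dim) from a (frozen) video VAE
        # t: (batch,) diffusion timesteps in [0, 1]
        x = self.patch_in(latent_patches) + self.time_mlp(t[:, None])[:, None, :]
        return self.patch_out(self.blocks(x))

# One denoising forward pass on random "latents," just to show the shapes.
model = TinyVideoDiT()
noisy = torch.randn(2, 128, 256)          # 2 videos, 128 spacetime patches each
pred_noise = model(noisy, torch.rand(2))  # (2, 128, 256)
```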
From the Sora technical report, you can also browse the papers OpenAI actually did cite, and Jia-Bin Huang made an explainer video covering the basic links in the denoising and video generation literature. I copied the slide below from that video:
8. Pixel peeping from the arts
As someone raised with a so-called sense of design and strong spatiotemporal intuitions, I'm really bummed I didn't notice how truly awful at perspective some shots in Sora are. This Twitter thread showcased a pretty foolproof way that the models distort reality. What amazes me most here is that the model can simultaneously generate real-enough visuals of turbulent water and snow, but has huge mismatches in basics like perspective. Outside of abstract art, there isn't much training data that gets perspective this wrong! It could point to the diffusion process; we see the same thing in plenty of AI-generated images, where grounding in the data only happens some of the time.
On the other hand, an animator posted a video and thread walking through the many edits a human animator would make to one of the Sora-generated videos.
For anyone deeply working in AI, it’s obvious all these details will get better, but they’re the sort of things that indicate where the state-of-the-art is progressing.
9. Inference costs
A lot of discussion that wasn't promoted by the algorithms was around the likely ridiculous inference costs of this model. Via the API, OpenAI currently charges $0.04 to $0.12 for a DALL-E 3 image. My guess is that a Sora video would cost between 5 and 100 times that amount, depending on the prompt and use case.
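The arithmetic behind that guess is trivial; the only interesting number is the multiplier, which is pure speculation on my part:

```python
# Back-of-envelope: scale DALL-E 3 API pricing by a speculative multiplier for video.
dalle3_price_per_image = (0.04, 0.12)   # current OpenAI API range, USD
sora_multiplier = (5, 100)              # my guess for a short video vs. one image

low = dalle3_price_per_image[0] * sora_multiplier[0]
high = dalle3_price_per_image[1] * sora_multiplier[1]
print(f"Guessed Sora cost per video: ${low:.2f} to ${high:.2f}")   # $0.20 to $12.00
```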
Text-to-video is actually the simplest use case. The technical blog post for Sora detailed other use cases like video editing or animation, which require grounding and tokenization of image or video inputs. There are architectural and functionality questions (e.g., maybe they didn't showcase more because it's not as good), but those capabilities would simply cost even more at inference time. In this environment, where I think Runway and co. still have a chance versus OpenAI, the GPU moat may kick in the hardest in the short term.
In a world where GPU data center hosts are bidding on electricity costs, we need nuclear energy ASAP to feed the coming wave of TikTok AI videos.
Inference costs also apply to ultra-long-context models, which shows up in the long-ass time they take to return the first token for new queries. In reality, there'll be different pricing models, and Google can KV cache your long prompt so the time to return tokens goes way down. This'll unfold very fast via the Gemini product and ChatGPT Plus soon.
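Here's what that prefix caching looks like mechanically, sketched with a small open model via Hugging Face transformers (gpt2 as a stand-in; Google's actual serving stack for Gemini is not public): pay the prefill cost on the long prompt once, keep the KV cache, then answer new queries against it cheaply.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Stand-in for a very long document or code base stuffed into the prompt.
long_prefix = "Background document: " + "lorem ipsum " * 100
prefix_ids = tok(long_prefix, return_tensors="pt").input_ids

# Prefill once: this is the slow, expensive pass over the long prompt.
with torch.no_grad():
    prefix_out = model(prefix_ids, use_cache=True)
cached_kv = prefix_out.past_key_values

# A follow-up query reuses the cached keys/values; only the new tokens are processed.
query_ids = tok(" Question: summarize the document.", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(query_ids, past_key_values=cached_kv, use_cache=True)
next_token_id = out.logits[:, -1].argmax(dim=-1)
print(tok.decode(next_token_id))
```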
10. Sound effects, physics, and the complete picture
A common experience reflecting on Sora is that the models are so good that we didn’t have time to notice a lack of sound. Audio is a very important part of video and one that’s deeply tied to physics as well.
11labs released a sound effects model (prompt to an audio clip, rather than text to speech) to capture some of the opportunistic energy. My question is: if generated video becomes a popular medium, what happens to standalone audio companies? I suspect the model responsible for generating the video will do much better at generating the audio too (rather than a separate model conditioning on the video), but time will tell.
The 11labs sound announcement is here and a good longer summary from Jim Fan is here.
Pressure on Llama and Mistral
Now that Mistral's and Meta's initial models are out, it'll be much harder for them to get the wow factor that Gemini and Sora have brought. Llama 3 now needs to do many things to keep up in the narrative department: increased context lengths, a mixture-of-experts architecture (which Gemini also uses now), license improvements, multimodal generation (which puts pressure on the GenAI organizational structure), and more. Llama 2 was probably the biggest event in AI last year; how do they match that? If they can't, how long are investors okay with the spending? (The answer is a long time, given how AI is heading.)
A lot of these discussion items unfolded in real time on the Interconnects Discord server. Upgrade to paid if you want to be part of it, and thanks to all of you who participated and made this fun follow-up post possible.
Audio of this post will be available later today on podcast players, for when you’re on the go, and YouTube, which I think is a better experience with the normal use of figures.
Looking for more content? Check out my podcast with Tom Gilbert, The Retort.
Newsletter stuff
Elsewhere from me
On Episode 20 of The Retort, we discussed the recent burning of the Waymo car and what it means.
Models, datasets, and other tools
In model-merging land, a new model came out that tops all the major benchmarks for 7B models, or at least gets close. I'm waiting to see if people get real value out of the model or if it's a new type of overfitting.
Links
I recently watched the Foom debate to learn about some original discussions in AI safety, the singularity, and some good counter arguments.
Housekeeping
An invite to the paid subscriber-only Discord server is in email footers.
Interconnects referrals: You’ll accumulate a free paid sub if you use a referral link from the Interconnects Leaderboard.
Student discounts: Want a large student discount on the paid tier? Go to the About page.