OpenAI’s Sora for video, Gemini 1.5's infinite context, and a secret Mistral model
Emergency blog! Three things you need to know from the ML world that arrived on Thursday.
Yesterday was one of those days in AI where everything happens at once. In short, you need to know about each of these:
OpenAI announced Sora, their video generation model. It’s remarkably good.
Google announced Gemini 1.5 Pro, with performance close to 1.0 Ultra and an almost infinite context length (up to 10 million tokens).
The community uncovered a model called Mistral-Next in the Chatbot Arena, hinting at a coming release. Initial tests show it is at least a solid model.
This post will try to be a to-the-point technical summary of what we know.
Sora: OpenAI’s text-to-video model
We’ve known this was coming for a long time, and I was still shocked by how good it is. You need to watch some of these AI-generated videos. OpenAI announced Sora, and Sam Altman spent all day on Twitter sharing videos of its magical generations. Later in the day, OpenAI released a slightly more technical blog post that confirmed most of the rumors folks had converged on.
In short, Sora is a combination of a Vision Transformer (ViT) and a diffusion model. The core idea behind vision transformers, and it seems Sora’s data processing, is to embed chunks of video into a latent space as patches, which then act like tokens.1
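OpenAI hasn’t shared the actual patch sizes or latent dimensions (and per their post, the video is first compressed into a latent space by a separate network), so treat this as a toy sketch of the patch-as-token idea with entirely made-up shapes:

```python
import torch

# Hypothetical shapes; OpenAI has not released Sora's patch sizes or dims.
B, T, C, H, W = 1, 16, 3, 256, 256      # batch, frames, channels, height, width
patch_t, patch_hw, d_model = 2, 16, 768

video = torch.randn(B, T, C, H, W)

# Carve the video into spacetime chunks, then flatten each chunk into a "patch" vector.
patches = video.unfold(1, patch_t, patch_t)        # split time
patches = patches.unfold(3, patch_hw, patch_hw)    # split height
patches = patches.unfold(4, patch_hw, patch_hw)    # split width
patches = patches.permute(0, 1, 3, 4, 2, 5, 6, 7).flatten(start_dim=4)
patches = patches.flatten(1, 3)                    # (B, num_patches, patch_dim)

# A linear projection turns each patch into a token the transformer can consume.
to_token = torch.nn.Linear(patches.shape[-1], d_model)
tokens = to_token(patches)
print(tokens.shape)                                # torch.Size([1, 2048, 768])
```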
Quoting from the OpenAI blog:
Sora is a diffusion model; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.
In this work, we find that diffusion transformers scale effectively as video models as well.
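We don’t know the real training setup, but the objective described in the quote, predicting the clean patches from noised ones given conditioning, maps onto a standard denoising step. A minimal sketch with a toy transformer and a made-up noise schedule:

```python
import torch
import torch.nn as nn

# Toy denoiser: a transformer over patch tokens, conditioned on text embeddings.
# All sizes and the noise schedule are invented for illustration.
d_model, n_heads, n_layers = 768, 12, 4
denoiser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers
)
out_proj = nn.Linear(d_model, d_model)

clean_patches = torch.randn(1, 2048, d_model)   # "clean" latent video patches
text_cond = torch.randn(1, 64, d_model)         # text-prompt embeddings

# Forward diffusion: mix in Gaussian noise at a random strength.
t = torch.rand(1, 1, 1)                          # noise level in [0, 1]
noise = torch.randn_like(clean_patches)
noisy_patches = (1 - t) * clean_patches + t * noise

# The model sees noisy patches plus conditioning and predicts the clean patches.
inputs = torch.cat([text_cond, noisy_patches], dim=1)
pred = out_proj(denoiser(inputs))[:, text_cond.shape[1]:]
loss = nn.functional.mse_loss(pred, clean_patches)
loss.backward()
```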
There are a bunch of interesting things in the blog post, but nothing truly important like model size, architecture, or data. For me, the data is almost surely a ton of YouTube plus some procedurally generated videos (from game engines or something custom; more on that later). Some things to know:
They train on multiple resolutions (most multimodal models are fixed at something like 256x256), including 1920x1080 in landscape or portrait.
“We apply the re-captioning technique introduced in DALL·E 3 to videos.” This matters for two reasons:
Having a language model mediate the prompting is still important for getting good outputs. People don’t do this unless they need to. I think this’ll be solved eventually with better data controls.
More importantly, it’s linked to their “highly descriptive captioner model” (think video-to-text) that is needed to label the data. This confirms that base GPT-4 can do this, or that OpenAI has other state-of-the-art models up their sleeve (a rough sketch of this pipeline follows the list).
Sora does animations, editing, and similar actions by taking in image inputs as well.
Sora does video-to-video editing with video inputs.
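The re-captioning detail is only described at a high level; here’s a rough sketch of what such a pipeline might look like, where the captioner and prompt-expanding LLM are placeholders, not OpenAI’s actual models:

```python
# Hypothetical re-captioning pipeline; names and models are placeholders.

def recaption_training_data(videos, captioner):
    """Replace short or noisy alt-text with long, highly descriptive captions."""
    return [(video, captioner.describe(video)) for video in videos]

def expand_user_prompt(prompt, llm):
    """At sampling time, an LLM rewrites the short user prompt into the same
    detailed caption style the model was trained on."""
    return llm.complete(
        f"Rewrite this video request as a long, highly detailed caption: {prompt}"
    )
```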
One of the based anonymous ML accounts on Twitter dug up a paper that studies a directionally similar architecture. I copied the architecture figure below.
Sora’s most impressive feature is its ability to realistically model the physical world (OpenAI describes this as “emerging simulation capabilities”). No text-to-video model before has been remotely close to this. Google’s Lumiere came out just weeks ago and was impressive, but it looks downright pedestrian compared to Sora.
Lots of rumors say that Neural Radiance Fields (NeRFs), a popular technique for 3D reconstruction from images, may be used under the hood, based on the characteristics of the videos (like the physical world), but we have no explicit evidence of this. My take is that it’s procedurally generated game-engine content. Just using games is not enough; you need a way to generate data diversity, as is the case with all things synthetic data (a toy sketch of what I mean is below). An example of how to think about this is what we were building at Hugging Face for RL agents. The diversity in data probably unlocked another level of performance in generation; we see this all the time in large models.
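To be clear, this is my speculation, but the recipe it implies (domain randomization on top of a renderer) is simple. A toy sketch with entirely hypothetical scene parameters:

```python
import random

# Hypothetical scene parameters for a game-engine renderer; the point is that
# diversity comes from sampling varied configurations, not identical rollouts.
def sample_scene():
    return {
        "camera_height_m": random.uniform(0.5, 30.0),
        "time_of_day": random.choice(["dawn", "noon", "dusk", "night"]),
        "weather": random.choice(["clear", "rain", "snow", "fog"]),
        "num_agents": random.randint(0, 50),
        "physics_friction": random.uniform(0.2, 1.0),
    }

scenes = [sample_scene() for _ in range(10_000)]  # each renders to a distinct video
```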
All the comments about the death of Pika and Runway ML (other popular ML video startups) are totally overblown. If the rate of progress is this high, we have so many more turns to come. If the best models come and go rapidly, the most important thing is a user touchpoint. This hasn’t been established for video, and heck, Midjourney still relies on Discord (though a web UI is in alpha)!
Gemini 1.5: Google’s effectively infinite context length
A few hours before Sora, Google shocked everyone by already shipping the next version of Gemini. The immediate changes this may bring to how people use LLMs are arguably way more impactful than Sora’s videos, but Sora has the captivating visual-demo quality.
In summary:
Gemini 1.5 Pro nears Gemini 1.0 Ultra performance with greater efficiency per parameter and moves to a mixture-of-experts (MoE) base architecture (a minimal routing sketch follows this list).
Gemini 1.5 Pro scales up to a 10 million token context length. For reference, it was a big deal when OpenAI increased GPT-4 to 128k. 10 million almost doesn’t make sense; it isn’t a Transformer. It can take in way more information than the average ChatGPT user ever considers.
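Google hasn’t published the architecture beyond calling it a mixture of experts, but the standard top-k token routing that phrase usually refers to looks roughly like this (all sizes made up):

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Minimal top-2 mixture-of-experts feed-forward layer (illustrative only)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        weights = self.router(x).softmax(dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Each token only runs through its top-k experts, so per-token compute
        # stays constant while total parameters grow with the number of experts.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += top_w[mask, k:k+1] * expert(x[mask])
        return out

layer = MoELayer()
print(layer(torch.randn(16, 512)).shape)   # torch.Size([16, 512])
```

The appeal is that total parameter count grows with the number of experts while per-token compute stays roughly flat, which lines up with the “greater efficiency per parameter” framing.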
Google could’ve found some new way to combine architecture ideas for long context with their TPU compute stack and gotten great results. According to one of the leads of long context on Gemini, Pranav Shyam, this idea only sprouted a few months ago. There’s surely more runway here, given it shipped in a minor version (v1.5) rather than a v2.
As a thought experiment, the communications around Gemini 1.5 tell you that you can include an entire production codebase in context for the model (see the examples provided by Google). This is truly life-changing for libraries that aren’t popular enough to get scraped hundreds of times for the next GPT version. As an enterprise tool, this is worth a ton of money. They visualize how much content 10 million tokens is, and it’s a ton: think 3 hours of video or 22 hours of audio being processed by a model with no segmentation or loss.
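To make the codebase idea concrete, here’s a back-of-envelope sketch for packing a repository into a single prompt. The ~4 characters per token heuristic is a rough assumption, not Gemini’s actual tokenizer:

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic; the real tokenizer will differ

def pack_repo(root: str, suffixes=(".py", ".md", ".toml")) -> str:
    """Concatenate a repo's source files into one prompt-sized string."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix in suffixes and path.is_file():
            parts.append(f"\n### {path}\n{path.read_text(errors='ignore')}")
    return "".join(parts)

repo_text = pack_repo(".")
approx_tokens = len(repo_text) // CHARS_PER_TOKEN
print(f"~{approx_tokens:,} tokens; fits in a 1M window: {approx_tokens < 1_000_000}")
```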
To be clear, the 1 million token context length is coming soon to paid Gemini users (similar to the ChatGPT Plus plan), and the 10 million window is mentioned in the technical report. I’m thinking it may be withheld for cost more than anything else at this time; that’s a lot of compute with any model.
This figure about the context length breaks my brain: the model gets more accurate as the context window grows.
Seeing this makes it clear that the model is not a Transformer: it has a way to route information through a non-attention mechanism (a toy illustration is below). A lot of people brought up Mamba, but it’s more likely that Google implemented its own architecture with optimized TPU code; Mamba shipped with custom Nvidia kernels and integrations.
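Nobody outside Google knows what the long-context machinery actually is. Purely as an illustration of what “routing information without attention” can mean, here’s a toy gated linear recurrence (the family Mamba belongs to), where the state carried per step is constant-size no matter how long the sequence gets:

```python
import torch

def gated_linear_recurrence(x, decay, gate):
    """Toy state-space-style scan: a fixed-size state per step instead of
    attending over the whole history. x, decay, gate: (seq_len, d)."""
    state = torch.zeros(x.shape[-1])
    outputs = []
    for t in range(x.shape[0]):
        # The state carries compressed history forward; in selective SSMs like
        # Mamba, decay/gate are input-dependent (here they are just given).
        state = decay[t] * state + gate[t] * x[t]
        outputs.append(state.clone())
    return torch.stack(outputs)

seq, d = 1024, 64
y = gated_linear_recurrence(torch.randn(seq, d), torch.rand(seq, d), torch.rand(seq, d))
print(y.shape)  # torch.Size([1024, 64])
```

Whether Gemini uses anything like this is unknown; the point is only that constant-size state is how you escape attention’s quadratic cost.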
This has me very excited for a future where the models we interact with route the compute to sub-models that specialize in different tasks. I expect if we were to see the Gemini 1.5 Pro architecture drawing, it would look more like a system than a normal language model diagram. This is what the development phase of research and development looks like.
The type of change this can cause was captured by the famous prompt engineer Riley Goodside:
So many implications here. Why [supervised fine-tune] when you can 100K-shot? And if it can translate Kalamang given a grammar and dictionary, what can’t the right words teach it?
Essentially, this is saying we can now just tell the models how to act in context; fine-tuning is no longer needed to add capabilities (a sketch of the many-shot idea is below). I expect there to be synergies here, and fine-tuning is still cheaper when inference is at scale, but it’s exciting.
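Riley’s “100K-shot” point is really just prompt construction at a scale that used to require fine-tuning. A hedged sketch (nothing here is a specific vendor API):

```python
# Placeholder prompt construction; not a specific vendor API.

def build_many_shot_prompt(task_description, examples, query):
    """With a ~1M-token window, thousands of labeled examples can live in the
    prompt itself instead of being baked in via fine-tuning."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{task_description}\n\n{shots}\n\nInput: {query}\nOutput:"

examples = [(f"example input {i}", f"example output {i}") for i in range(100_000)]
prompt = build_many_shot_prompt("Map inputs to outputs as shown.", examples, "new input")
print(len(prompt))  # the whole "training set" is just context now
```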
Read more in Google’s Gemini 1.5 blog post or technical report.
As a last note on how Google is hanging in there in ML: in this interview, the CEO of Perplexity says Google quadrupled the offer of someone he was trying to hire. This is nuts, and I’m not sure whether it’s a bullish or bearish signal for Google.
Mistral-next: Another funny release method
As if this wasn’t enough, I was pointed to the fact that there’s another Mistral model stealthily available for chat in the LMSYS arena. I’d heard rumors that another model was coming soon, but this is obviously more real. Basic tests show it’s a strong model. Surely the Twitter mobs will now go run even more vibes evals, but Mistral will just tell us soon. I’m guessing this is their API-based GPT-4 competitor.
Turns out it was added about a week ago, so it did a pretty good job of staying hidden.
Keep reading in a deeper discussion post!
Meta also had a major announcement on Thursday the 15th that got totally buried in the news cycle: V-JEPA, a video embedding model.
Wrapping up for a hopefully chill weekend in the ML world, here are two Twitter messages from Sasha Rush:
The interesting one: “Random Q: How many models that surpass Llama 7B have been trained, from scratch, in the world?”
”People seem to guess only ~30 orgs have trained a Llama2-7B level model, open or closed, but that those aiming higher have trained 100s. (Power-law kind of thing).”
The uplifting one: “Your research is cool, your problem is important, your work is worthwhile. There is so much we don't know.”
Audio of this post will be available later today on podcast players, for when you’re on the go, and on YouTube, which I think is a better experience given the use of figures.
Looking for more content? Check out my podcast with Tom Gilbert, The Retort.
Folks were annoyed that OpenAI did not cite recent work on the Vision Transformer (ViT) in the announcement, similar to the Matryoshka Representation Learning (MRL) citation issue that just happened in the embeddings world. They did cite it in the technical post.