GPT-4o's images and lessons from native input-output multimodality
Hints of a natively multimodal future.
One of the largest open questions on the trajectory of language modeling research over the last couple of years has been: Will natively multimodal models like GPT-4o, Chameleon, or Gemini have fundamental scaling advantages from drawing on more sources of pretraining data?
GPT-4o and Gemini are the flagship models surrounding this question. I remember it vividly — when GPT-4o launched in May of 2024, many people asked, “Wait, isn’t this model worse?” It was one of the many times when I had to have faith in the long trajectory of modeling progress and trust that it would get better. The other side was Gemini, where the blog post in December of 2023 stated:
It was built from the ground up to be multimodal, which means it can generalize and seamlessly understand, operate across and combine different types of information including text, code, audio, image and video.
Over the last 12-18 months, these natively multimodal models have definitely continued to improve in performance, and it’s no longer a reasonable complaint that the original GPT-4 (or Bard!?!?) is better. Open replications of this approach are arguably even further behind than those of text-only models and multimodal finetunes for vision.
These omni models are really not widely available. Chameleon had its generative abilities restricted for release by removing the weights for multimodal generation. Some serious new work just emerged last week from Qwen, but it’s still just a 7B omni model (there are other early models you can track in our Artifacts Log roundups).
All of this is to say that while some major model providers made the jump to native multimodal training, multimodal generation hadn’t broken through into the forefront of AI discourse to the extent that image generation did last week. Both OpenAI and Google have voice modes that interact directly with these multimodal models (e.g., as simply confirmed in OpenAI’s FAQs).
As with much of society these days, visuals are what breaks through the noise and becomes the focal point of discussion. Compared to OpenAI’s audio mode, at least, the image generation is also far more permissive in complying with potentially sensitive requests. The audio mode of ChatGPT won’t even sing, let alone make instrumental noises, but it seems like that’ll come soon as the general restrictions on behavior loosen.
Users of Gemini have seen the benefit of this for taking in audio or video (or plugging YouTube directly into the model context in AI Studio), but this was not super new relative to models like GPT-4V, which were finetuned to also take in visual content.
Now, we are finally getting access to generation. There are no clear signs of why this took so long, but it points to plenty of directions for the future.
If native multimodal training is table stakes for having a model truly at the frontier of performance and user experience in the next 1-2 years, it’ll make it substantially harder for the next xAI or DeepSeek to pop onto the scene. Training these native multimodal models is a natural step up in technical difficulty from text-only or text-image-input models, requiring substantial knowledge of data acquisition and mixing along with further improvements in training stability.

Shifting norms of model control and behavior
The most impactful change in switching from DALL-E to native image generation is that it allows more control over aligning the generated images with the other stages of behavior and character training within the organization.
Previously, generations with DALL-E had to go through a “prompt rewriting” stage, where GPT-4o rewrote the user prompt to make it more detailed before passing it to DALL-E. This helped the final output be preferred in human preference rankings.
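Roughly, that old pipeline looked something like the sketch below. The function names and interfaces are illustrative assumptions for exposition, not OpenAI’s actual implementation:

```python
# Illustrative sketch of the old two-system pipeline (an assumption for
# exposition, not OpenAI's actual code): a chat model rewrites the user's
# request into a detailed prompt, and a separate image model only ever
# sees that rewritten string.

def rewrite_prompt(chat_model, user_prompt: str) -> str:
    # Expand a terse request into a detailed scene description; the extra
    # detail tends to make the final image preferred in human rankings.
    instruction = (
        "Rewrite the following image request as a single detailed prompt "
        "for an image generator, adding style, lighting, and composition:\n"
        + user_prompt
    )
    return chat_model.generate(instruction)  # hypothetical interface

def generate_image(chat_model, image_model, user_prompt: str):
    detailed_prompt = rewrite_prompt(chat_model, user_prompt)
    # Only text crosses the boundary between the two systems: the image
    # model never sees the chat context, embeddings, or uploaded images.
    return image_model.sample(detailed_prompt)  # hypothetical interface
```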
The catch with this setup was that images didn’t really work as inputs because the embeddings couldn’t cross between systems. Generating with DALL-E (and similar applications like Midjourney) often felt more like rolling the dice and hoping for a wonderful output. The tenor of the standalone AI images often felt separated from reality in a tacky way, while obviously being incredibly impressive.
GPT-4o makes prompting the image generation easier on both technical and UX levels. Mixing images and text context in a single text box is one of the most common habits of modern digital life. We’re used to this.
Then, technically, the best guesses are that GPT-4o does a first pass of autoregressive image generation (generating patches uniformly across the model, like laying down tiles from top left to bottom right), and then hands off to a diffusion model that takes that noisy image in instead of random noise. What’s crucial is that the core of the image is generated by the same model that reads the input images and text.
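A toy sketch of that hypothesized two-stage flow is below. Everything here (the raster patch order, the coarse-image handoff, and the module interfaces) is an assumption pieced together from those public guesses, not a confirmed description of GPT-4o’s internals:

```python
import torch

# Toy sketch of the hypothesized two-stage generation (an assumption, not a
# confirmed description of GPT-4o's internals).

def autoregressive_draft(transformer, text_tokens, grid=32, patch_dim=64):
    """Emit image patches in raster order (top left to bottom right), each
    conditioned on the text and on all previously generated patches."""
    patches = []
    context = text_tokens
    for _ in range(grid * grid):
        patch = transformer.next_patch(context)        # hypothetical interface
        patches.append(patch)
        context = torch.cat([context, patch], dim=-1)  # grow the context
    # Stack the patches into a coarse, "noisy" image tensor.
    return torch.stack(patches).reshape(grid, grid, patch_dim)

def refine(diffusion_model, coarse_image, steps=20):
    """Run reverse diffusion starting from the coarse autoregressive draft
    instead of from pure Gaussian noise."""
    x = coarse_image
    for t in reversed(range(steps)):
        x = diffusion_model.denoise_step(x, t)         # hypothetical interface
    return x

def generate(transformer, diffusion_model, text_tokens):
    return refine(diffusion_model, autoregressive_draft(transformer, text_tokens))
```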
Having one model in control is far easier from a behavior perspective. The gap between the prompt-rewriting model and the generative model made problems, like content violations or the model generating elephants when explicitly asked not to, much messier to pin down.
So now OpenAI has a better system design for implementing its goals for model behavior.