Sycophancy and the art of the model
GPT-4o-simp, LMArena backlash, and people refusing to understand how messy and crucial RLHF is.
ChatGPT’s latest flagship model becoming extremely sycophantic, sparking a reasonable yet huge backlash and a quick rollback, is far from just being a training methodology problem. While it reflects challenges in current training practices, it is grounded in the shifting landscape of language model assistants. This specific GPT-4o episode and related snippets across the industry should help people remember that techniques like reinforcement learning from human feedback (RLHF) and other preference or character tuning methods are absolutely central to current models — and still being optimized rapidly — even when the current in-vogue topic is reasoning.
This preference tuning is one of the unsolvable problems facing chatbots, and it leads directly to trade-offs the AI community will have to face in balancing output quality against engagement with this new interface.
This post will have many sections and topics, as it is likely one of the most important releases and episodes of the year, even if it isn’t about new record benchmark scores.
You may want to read some of the following to understand the facts of the matter, but the core snippets will be called out along the way.
The recent Interconnects post on transparency, which was a direct response to the lax release process from OpenAI and foreshadowed issues like the ones that followed. It is hopefully a wake-up call for folks in industry that OpenAI’s least transparent release in a while sparked an immediate crisis.
OpenAI’s initial response to sycophancy and, more importantly, their full post-mortem (strongly recommend reading in full).
OpenAI’s Model Spec (or specifically the part on sycophancy), and the old and new Interconnects posts on the matter.
A textbook overview of how product, character training, and RLHF relate to recent models.
The Leaderboard Illusion: a research paper led by Cohere Labs studying how industry players exploit a lack of transparency from LMArena to potentially boost their scores. LMArena responded in multiple ways to clarify.
Model personality is a recurring theme on this blog, such as being why I switched to Claude last summer, but OpenAI’s recent moves here helped me move back to ChatGPT.
Honorable mentions: a mini-benchmark on sycophancy I found, Syco-Bench, and discussion of the unauthorized r/ChangeMyView bots.
To set the stage, we need to consider some basic facts about the AI industry today:
Much like how YouTube and TikTok are major competitors for Netflix (they regularly highlight it in their earnings), Character AI, CHAI, Meta AI, and others are major competitors for ChatGPT. Here’s a direct quote from Netflix’s earnings from Q1 2024:
When I look at the short-form viewing on YouTube and TikTok, some of it is adjacent and quite complementary to our viewing … At the same time, some short-form viewing is directly competitive with Netflix.
While more usage time conflicts with OpenAI’s business model by increasing costs, they also know that ChatGPT’s biggest advantage is its extensive use, so they must protect it. One of OpenAI’s best skills is creating viral moments to boost that.
Sycophancy is far from a new problem in language modeling research. The seminal paper is often referred to as Towards Understanding Sycophancy in Language Models, but plenty of other work exists directly studying it or mentioning it.
This recent point from John Schulman points to how tricky and understudied the major issues facing the field with preference learning are:
Whether to collect preferences ("do you prefer response A or B?") from the same person who wrote the prompt, or a different person, is important and understudied.
Sycophancy probably results when you have the same person doing the prompting and labeling, especially when the user does both. (A minimal sketch of this same-person vs. different-person distinction follows this list.)
ChatGPT has been making a series of serious improvements to personality and personal usefulness throughout 2025, from better character training to product features such as memory. They’re driving the industry in new directions of what a chatbot should be, and Meta AI is following suit with their new app announced at LlamaCon that has social features.
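To make Schulman’s point concrete, here is a minimal sketch of what a pairwise preference datapoint might look like; the field names are hypothetical and do not describe any lab’s actual schema. The design choice he is pointing at is whether `prompter_id` and `labeler_id` are allowed to be the same person.

```python
from dataclasses import dataclass

@dataclass
class PreferenceRecord:
    """One pairwise comparison for RLHF. All field names are hypothetical."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str      # "a" or "b"
    prompter_id: str    # who wrote the prompt
    labeler_id: str     # who picked the preferred response

def same_person_rate(records: list[PreferenceRecord]) -> float:
    """Fraction of comparisons where the prompter also did the labeling --
    the setting most likely to reward flattering the asker."""
    if not records:
        return 0.0
    return sum(r.prompter_id == r.labeler_id for r in records) / len(records)
```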
And some basic facts about what happened with this release:
OpenAI updated the model itself and not just the system prompt (which was also updated - source).1 This was a new set of model weights designed to improve many tasks, but most of the “improvements” were in personality.
OpenAI trains their models with many standard post-training techniques, but also on their Model Spec and from user feedback data (likely indirectly).
For more information on what happened, see Ethan Mollick’s general overview.
In this post, I cover:
So, what actually happened in the release process?
Why did the model end up like this?
What did OpenAI do right?
What did OpenAI do wrong?
What comes next?
So, what actually happened in the release process?
In terms of model versions, OpenAI has summarized the events well in their full post-mortem:
On April 25th, we rolled out an update to GPT‑4o in ChatGPT that made the model noticeably more sycophantic. It aimed to please the user, not just as flattery, but also as validating doubts, fueling anger, urging impulsive actions, or reinforcing negative emotions in ways that were not intended. Beyond just being uncomfortable or unsettling, this kind of behavior can raise safety concerns—including around issues like mental health, emotional over-reliance, or risky behavior.
We began rolling that update back on April 28th, and users now have access to an earlier version of GPT‑4o with more balanced responses.
This entire episode lasted about 3.5 days. The crux of the issue is: Why did OpenAI think this was a good model to release?
The short of it is that OpenAI had no measurable signal showing that the model was weird. This GPT-4o model, which I wish they would give an official identifier — for the purposes of this post, it will be referred to as GPT-4o-new — is far from the first weird-vibes release in recent times.
Claude 3.7 is still thought of as worse than Claude 3.5 New by many programmers due to its hallucinations and weird behaviors, but the magnitude of those behavioral issues is minor. Weird models are common in AI development, and it can be very hard to pin down just what is off — you can feel like you’re psyching yourself out, trying to convince yourself that a model is either good or bad. Relying on qualitative judgments feels risky!
This GPT-4o-new episode is the better case study for a slightly borked model.
OpenAI details how they use offline evaluations (public capability evals like GPQA and private, yet similar, evaluations), spot checks and expert testing (vibe tests and reviewing specific prompts for weird behavior), safety evaluations (checks in high-risk safety domains), and A/B tests (measuring engagement time and thumbs-up/down on a small set of real users).
This is fairly extensive and represents on the order of 100s of signals to check before release. OpenAI’s vibe testers sensed something was wrong, but all the other scores looked strong. There were numerous quantitative signs that the model was better, likely spanning capabilities, personality, and user benchmarks.
The core of AI researchers’ abilities is to hillclimb on good benchmarks, so OpenAI was faced with a difficult real-world decision:
should we withhold deploying this update despite positive evaluations and A/B test results, based only on the subjective flags of the expert testers?
They went for it. Trusting the measurable over the gut check is at the core of how researchers and engineers operate, which makes AI decisions like this particularly challenging. With so many measurable components of a model, it’s easy to get lulled into thinking they’re complete enough and to down-weight the unmeasured.
These labs operate in a domain of very incomplete information, and in some ways, it’s impressive that things like this don’t happen more often.
OpenAI, particularly, with their huge user base, is re-discovering many headaches that other technology companies have gone through in their own growth cycles.
Why did the model end up like this?
The most interesting component of the full post-mortem is the hints into OpenAI’s training processes — the evaluation processes highlighted above are standard practice and were what I would’ve written here regardless of OpenAI documenting them.
To get started, OpenAI hinted at how their new product features are changing the behavior of the models.
We have also seen that in some cases, user memory contributes to exacerbating the effects of sycophancy, although we don’t have evidence that it broadly increases it.
User memory consists of multiple things: key facts the model extracts and stores (likely placed in-context), custom instructions, a RAG-like search over past conversations when relevant, and maybe recent thumbs-up/down data. This is saying that for some users, their ChatGPT setup makes sycophancy worse, while for most users the product features don’t change anything. These per-user features are extremely hard to test because OpenAI now needs to create evaluation personas spanning normal use, which is by definition almost impossible to cover completely.
This is OpenAI building the future — models tuned to individual users — at a time when nothing remotely close to it has been shipped.
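To make that speculation concrete, here is a purely hypothetical sketch of how per-user memory might be assembled into context; the structure, field names, and retrieval step are my assumptions, not a description of ChatGPT’s implementation.

```python
from dataclasses import dataclass, field

@dataclass
class UserMemory:
    """Hypothetical per-user state; not ChatGPT's actual schema."""
    stored_facts: list[str] = field(default_factory=list)  # facts the model extracted over time
    custom_instructions: str = ""                           # user-written preferences
    past_conversations: list[str] = field(default_factory=list)

def build_context(memory: UserMemory, query: str, retrieve) -> str:
    """Assemble a per-user prompt prefix: stored facts and custom instructions
    go in directly, while past conversations are searched RAG-style via the
    caller-supplied `retrieve` function (a stand-in for any retriever)."""
    relevant_history = retrieve(memory.past_conversations, query, top_k=3)
    return "\n".join(
        ["Known facts about the user:"] + memory.stored_facts
        + ["User instructions:", memory.custom_instructions]
        + ["Relevant past conversations:"] + relevant_history
    )
```

Every user ends up with a different effective prompt, which is exactly why evaluation personas spanning normal use are so hard to build.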
For a specific training intervention that likely amplified the risk of a failure mode like sycophancy getting baked into the model, rather than just emerging from certain use-cases, OpenAI mentions their new reward signal:
For example, the update introduced an additional reward signal based on user feedback-thumbs-up and thumbs-down data from ChatGPT. This signal is often useful; a thumbs-down usually means something went wrong.
I’ve long wondered who regularly clicks the thumbs up button in AI chat windows. Thumbs down is fairly obvious and done in frustration, but it is not surprising that thumbs up is related to misaligned engagement bait-y behaviors. Some variants of this are likely very obvious, but most of them are likely subtle and contribute to things like sycophancy when optimized for in aggregate.
This goes into the standard RL post-training they mentioned earlier in the post:
To post-train models, we take a pre-trained base model, do supervised fine-tuning on a broad set of ideal responses written by humans or existing models, and then run reinforcement learning with reward signals from a variety of sources.
Where textbook-standard reward models predict the likelihood that a response from a model falls in the “chosen” bucket instead of “rejected,” it seems as if OpenAI has trained a reward model to predict the user's thumbs-up/down signals. This reward model would need to be different because it’s being trained on single datapoints rather than pairwise completions. While unusual, there is no reason for it to be impossible.2
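To make that distinction concrete, here is a minimal sketch of the two objectives; `reward_model` stands in for any scalar-output head on a language model, and nothing below reflects OpenAI’s actual implementation.

```python
import torch.nn.functional as F

def pairwise_rm_loss(reward_model, chosen, rejected):
    """Textbook Bradley-Terry objective: the chosen completion should score
    higher than the rejected one for the same prompt."""
    r_chosen = reward_model(chosen)      # shape: (batch,)
    r_rejected = reward_model(rejected)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def pointwise_rm_loss(reward_model, completion, thumbs_up):
    """Hypothetical thumbs-based objective: each completion carries its own
    binary label (1 = thumbs up, 0 = thumbs down), so the reward model is a
    classifier over single datapoints rather than over pairs."""
    logits = reward_model(completion)    # shape: (batch,)
    return F.binary_cross_entropy_with_logits(logits, thumbs_up.float())
```

The pointwise variant inherits whatever biases drive users to click thumbs up in the first place, which is exactly the kind of signal the next quote is about.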
This reward model likely proved too easy to over-optimize against — OpenAI continues:
But we believe in aggregate, these changes weakened the influence of our primary reward signal, which had been holding sycophancy in check. User feedback in particular can sometimes favor more agreeable responses, likely amplifying the shift we saw.
When presented with multiple rewards, reinforcement learning will always hillclimb on the simplest one.
OpenAI paid the price for increasing their training setup complexity faster than the paired evaluation set. This is partially bog-standard over-optimization, but also OpenAI taking a risk by training directly on user data without sufficient controls in place.
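As a toy illustration of that dynamic, consider mixing reward signals during RL; the function names and weights below are invented for illustration and are not OpenAI’s configuration.

```python
def combined_reward(response, primary_rm, thumbs_rm, w_primary=1.0, w_thumbs=0.3):
    """Hypothetical reward mixing during RLHF. If thumbs_rm is systematically
    easier to please (say, it favors agreeable responses), the policy can
    drift toward it even at a modest weight, unless a dedicated evaluation
    (e.g. an explicit sycophancy check) catches the drift before release."""
    return w_primary * primary_rm(response) + w_thumbs * thumbs_rm(response)
```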
While the focus of today is on RL for reasoning, this’ll help everyone remember that RL from human feedback (RLHF) is a forever problem for these types of models. The reasoning RL with verifiable rewards (RLVR) will "saturate" like pretraining, and RLHF will never fully be solved.
RLHF is where the art of the model is crafted and requires a qualitative eye, deep intuition, and bold stances to achieve the best outcomes. While pushing so hard to reach the frontier of models, it appears that the best models are also the ones that are closest to going too far.
o3 is spectacular because of its amazing new behaviors, but that comes with just a tolerable amount of weird new hallucinations. What if it turned out that, for most interactions, this new ChatGPT was actually a far better assistant — it just doesn’t work at all for personal reflection and debate?
As competition increases and personality becomes more important, more and more labs are trying to tread this line. Remember when even Grok was woke?
What did OpenAI do right?
OpenAI handled the response very well, and it has also set itself up to handle challenging situations like this with better documentation of its modeling goals than any other leading laboratory. I’ve said again and again that every frontier lab should have a Model Spec like the one OpenAI pioneered. This is a major reason why: OpenAI’s Model Spec explicitly says that they don’t want the models to be sycophantic.
I’d rather have a model spec than details of a system prompt. A system prompt is often designed to pass a variety of checks in an artful manner. Amanda Askell summarized this well:
You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.
The Model Spec tells us, roughly, what those tests would be, rather than the input variable (the system prompt). For example, if Gemini had published a Model Spec stating that they want the models to be factual, especially on historical topics, with examples, the backlash over Gemini generating forced diversity in factually inaccurate responses to historical queries would’ve been far more forgivable.
On top of all of this, the post-mortem on OpenAI’s release processes will be referenced as one of the best summaries of how frontier model post-training release decisions are handled. It is a place to start to build out further industry transparency.
What did OpenAI do wrong?
It’s easy to go on and on with this one, but it’s best to be practical. OpenAI is learning that they are not just a research lab anymore, and they need to take any change in the model very, very seriously. By having the most users, OpenAI is the company most at risk of making the general anti-tech and anti-AI sentiment among consumers even stronger. They, hopefully, will now take that responsibility even more seriously.
At the technical level, OpenAI got too deep in the data, which is a classic problem. As Jeff Bezos said on the Lex Fridman Podcast:
When the data and the anecdotes disagree, the anecdotes are usually right.
It’s usually not that the data is being miscollected.
It’s usually that you’re not measuring the right thing.
With more users, more data, and more complexity, it tends to become easier to manipulate the story to fit your needs.
What comes next?
A surprising number of people in my circles who are well-respected members of the AI community think this change was intentional and only rolled back because of the negative reaction. Too many people attribute to malice what was a simple training (and evaluation, i.e., tooling) error. OpenAI has obvious cultural oddities, but their ideology is quite culturally aligned with providing user value in terms of output rather than engagement farming, even if this imposes a ceiling on their business relative to the potential of ads.
There’s pressure to increase user retention, as the industry gets more competitive, but in talking to leading folks doing the behavior training across AI laboratories, that isn’t remotely on their radar. There are far more immediate problems to address, which normally are around making the model correctly permissive across a range of sensitive topics. There are gains in everyday tasks like coding and deep research from the models’ having better personalities, but de-correlating these improvements from potential engagement traps across different user profiles is extremely hard.
This points to a structural limitation in OpenAI’s current approach, and the industry’s in general, rather than any malice. They’re optimizing a single, default model for a wide variety of use-cases and users. A reasonable critique is that they’re blind to it because many people in leading labs use cleverly crafted prompts or model versions that pull them closer to the frontier of performance, rather than closer to the average user.
In the future, models will: