Evaluations: Trust, performance, and price (bonus, announcing RewardBench)
Evaluation is not only getting harder because modern LLMs are getting more complicated, it’s getting harder because it means something different.
My most recent major project, RewardBench — the first evaluation tool for reward models — is out here. More below, but I’ll write up more takeaways soon. The key points on evaluation first.
A lot of smart people I know give off the vibes of having “given up” on evaluating the new LLMs that are rolling off the shelves. In the last year, evals have become as much a marketing piece as a measure in the literal sense of the word — a standard unit used to express the size, amount, or degree of something. There are two key ways in which evaluation is changing.
Evals are now about trust and performance, whereas previously they were just about performance.
Evals used to be cheap, but now evaluation tools are only cheap to big tech and those building the thing — i.e. it’s hard to evaluate a model as a consumer rather than a builder.
These two facts completely change the who, what, and why of evaluating models. Part of why I’m excited about evaluation is that my employer, AI2, is in a good position: they can do both. I’ll get back to why. Organizations like NIST are looked to because they have an abundance of trust and little to offer on performance (academics are similar, with more performance). This is a natural role for government entities. New startups are exciting because of their potential on the performance side of things. Synthetic data, automatic red-teaming, online monitoring of LLM products, feedback, etc., all offer ways to make the AI products we use and love better.
There are increasingly few organizations that can be trusted in a simple manner to be a source of truth on evaluation — they have no reason to cook the books. As every corporation tries to make money on AI, their trust goes down. On the other hand, the organizations that do have the potential to be trusted tend to be anti-correlated with those that have a technical voice leading in how models should and do work. AI2, through the OLMo models, can easily stay relevant in evaluations, and they can foot the bill to do fancy things like human evals, A/B testing, or whatever is in vogue with LLM evaluation.
A common phrase we hear when discussing LLM evaluation is “vibes-based evals.” Vibes-based evaluation is the tricky mechanism by which one decides whether they can trust their performance metrics. The most discussed evaluation tools today are discussed because they’re in locally optimal zones of trust and performance. LMSYS ChatBot Arena is trusted because it’s maintained by a nonprofit (and funded mostly by other technology companies). It’s limited in performance because the inputs are somewhat of a black box. AlpacaEval is trusted because it’s transparent, but it suffers from the bottleneck of using LLM-as-a-judge comparisons against a static baseline model. It’s more accessible than LMSYS (where you need to find someone to foot the compute bill for your model), so anyone can break out on it.
The rising price of evaluation
For a long time evaluation was cheap. You ran the evaluation techniques on the same GPUs that you used to train the models. Now, evaluation is cheap only for big tech companies due to accruing costs and engineering complexity. The costs associated with evals for chat models range from actual human labor (e.g. A/B testing or Scale services) to large amounts of synthetic data (e.g. Synth Labs or Patronus.ai) to tons of OpenAI credits. For any actor other than the likes of OpenAI, Anthropic, and Google, some set of these evaluations will have meaningful costs.
The fact that the leading open fine-tuning labs are priced out of evaluation tools core to RLHF in industry will make it even harder to match the subtle benefits that RLHF can confer on chat models.
For places like AI2, human testing isn’t really doable for every model, and even OpenAI credits can add up. For academics, the cost of OpenAI credits adds up quickly (running one evaluation on AlpacaEval costs between 5 and 15 dollars depending on the version used). While this really isn’t much compared to compute, many academic labs run on a lot of free compute credits, so they need to figure out the same setup for free evaluation credits. There’s a big opportunity to have an API for LLM-as-a-judge that is cheaper and free of any terms of service constraints.
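To put a rough number on it, here is a back-of-the-envelope sketch; every figure in it (token counts, prices) is an assumption for illustration, not a quote from any provider.

```python
# Back-of-the-envelope estimate of a single LLM-as-a-judge run. All numbers are
# assumptions for illustration, not actual provider pricing.
N_PROMPTS = 805                     # roughly an AlpacaEval-sized eval set
INPUT_TOKENS_PER_JUDGMENT = 1500    # instruction + two completions + judge template (guess)
OUTPUT_TOKENS_PER_JUDGMENT = 50     # short verdict from the judge (guess)
PRICE_PER_1K_INPUT = 0.01           # $ per 1K input tokens, illustrative GPT-4-class pricing
PRICE_PER_1K_OUTPUT = 0.03          # $ per 1K output tokens, illustrative

cost = N_PROMPTS * (
    INPUT_TOKENS_PER_JUDGMENT / 1000 * PRICE_PER_1K_INPUT
    + OUTPUT_TOKENS_PER_JUDGMENT / 1000 * PRICE_PER_1K_OUTPUT
)
print(f"~${cost:.0f} per model evaluated")  # lands in the 5-15 dollar ballpark above
```

Multiply that by every checkpoint you want to compare in a training sweep and the credits stop being negligible.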
A lot of evaluation tools are popular by arbitrage of some sort of cost. HuggingFace takes on the compute cost because it benefits from the Open LLM Leaderboard network effects. ChatBotArena pays for your GPU usage in exchange for usage and votes. Paying for many of the most-used evaluation resources in the world simply wasn’t a thing in the past; automatic evals were a negligible cost you could tack onto your training runs.
Depending on how the cost of evaluation evolves, this could be where government action is useful. The government is good at spending on things just for the sake of trust. We’re at the point where truly hidden eval sets are considered by many parties to avoid contamination. That becomes expensive because you have to evaluate and maintain everything if it’s not open-sourced.
This doesn’t even acknowledge how LLM evaluation also touches products and vertical applications, which it didn’t before. There’s an entire other post that should be made delineating a taxonomy for model evaluation vs. system evaluation, but I’m not there yet.
Announcing RewardBench: The first reward model evaluation tool
I’m excited to share something that we've needed since the early open RLHF days: RewardBench, the first benchmark for reward models. In short (the paper is here):
We evaluated 30+ of the currently available RMs (w/ DPO too).
We created new datasets covering chat, safety, code, math, etc. We learned a lot.
For long-time readers of the blog, you’ll know this is a long time coming. I started making a lot of noise about reward models last spring. For anyone who works on AI and used to work on optimal control, the history and normative context of what it means to design a cost or reward or preference function means a lot. This background turned into a long paper on the history and risks of RL and human feedback (or a short YouTube talk). This history is the extended introduction to RewardBench. Reward models, when you talk to folks in industry or read the RLHF papers closely, are the fulcrum point of good RLHF. They’re messy, not well documented, almost nonexistent in the open, and hard to do right.
RewardBench is mostly infrastructure and a starting point for more to come. The code makes it easy to do inference on reward models (with more sophisticated training coming soon); a rough sketch of what that scoring looks like follows the list below. As for details, the paper introduction summarizes the many crazy things we find quite well:
Release a common framework for evaluating the many different architectures of reward models, along with tools for visualization, training, and other analysis. We also release all data used in the evaluation, composed of text-score pairs for all inputs, to enable further data analysis on the properties of reward models.
Illustrate the differences between DPO and classifier-based reward models across a variety of datasets. DPO models, while more plentiful due to the method’s simplicity, fail to generalize to popular preference data test sets and present a higher variance in performance.
Chart the landscape of current state-of-the-art reward models. We showcase the scaling laws, the propensity to refuse (or not), the reasoning capabilities, and more for popular RMs.
Show the limitations of existing preference data test sets for evaluating these models, showcasing common pitfalls of RMs on subtle, but challenging instruction pairs (e.g. intentionally modified rejected responses, which superficially look high quality but answer the wrong prompt).
We hope this benchmark enables more advanced reward model training, scientific understanding of the integration of human preferences in LMs, and ultimately better aligned, open language models.
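To make the mechanics concrete, here is a minimal sketch of how pairwise accuracy is scored for a classifier-style reward model. This is not the RewardBench codebase itself; the model name is just an example of a sequence-classification RM, and real RMs each expect their own chat formatting, which this skips.

```python
# Minimal sketch: score chosen vs. rejected completions with a classifier-style RM
# and report the fraction of pairs where the chosen completion gets the higher score.
# Not the RewardBench implementation; model name and formatting are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example classifier RM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def score(prompt: str, completion: str) -> float:
    """Scalar reward for a prompt-completion pair (real RMs may need a chat template)."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()


def pairwise_accuracy(examples: list[dict]) -> float:
    """examples: dicts with 'prompt', 'chosen', and 'rejected' keys."""
    wins = sum(
        score(ex["prompt"], ex["chosen"]) > score(ex["prompt"], ex["rejected"])
        for ex in examples
    )
    return wins / len(examples)
```

DPO models are scored differently: their implicit reward is beta * (log pi(y|x) - log pi_ref(y|x)) summed over the completion tokens, so evaluating them requires running a reference model as well, which is part of why the two families behave so differently.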
I’ll share more takeaways soon, but everything people have been studying in open RLHF is reflected here: weird refusal behaviors, DPO being strong but with some weird stuff, base models mattering, and much more.
There’s already plenty of good discussion on Twitter. I put together an intro video of the various parts of RewardBench (for developers) here:
And here’s what the leaderboard looks like!
Updates to RLHF evaluation tools
By the way, the ChatBot Arena paper dropped recently.
Whenever building a new evaluation system, you need to figure out where among these niches you land. For example, the new benchmark from AI2 to compete in this open-ended generation landscape is WildBench. WildBench is a sort of AlpacaEval-LMSYS Leaderboard hybrid. Anyone can add a model to it (see the readme; it’s less trivial than AlpacaEval because the maintainers need to design an algorithm for deciding which models to compare to, one of the downsides of Elo), it does model-to-model comparisons, and it uses GPT-4 as a judge.
It also has new things, such as better control over task distributions (AlpacaEval is biased towards generation tasks like emails, and ChatBotArena has no information on what tasks are in it), the ability to add human data, length penalties, and lots of other polish. This seems like the natural evolution of where things like AlpacaEval will go; it’s not clear that AlpacaEval 2, judged by and compared against GPT-4, offers much signal improvement over the original. WildBench is also gated by the same issue: we haven’t overcome the potentially giant LLM-as-a-judge bias problem. There should be a trust ceiling until we have definitive science on what the heck LLM-as-a-judge is doing. In the meantime, incremental upgrades like WildBench are still important.
Another good point about relative rankings, compared to a static comparison, is the ability to fold in more data over time. I’m not 100% sure how it’ll work out, but it’ll be interesting to see if any noisy or unexpected scores get leveled over time by getting better at controlling the prompts in evaluation datasets. If you’ve ever looked closely at those prompts, we have a lot to learn.
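As a toy illustration of why relative rankings can keep absorbing data, here is the generic Elo update (not LMSYS’s or WildBench’s actual aggregation code): each new pairwise outcome nudges two ratings, so early noise gets washed out as comparisons accumulate.

```python
# Generic Elo update for a single A-vs-B comparison. K and the starting rating
# are arbitrary choices; real leaderboards use more careful aggregation than this.
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return r_a + delta, r_b - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for a_won in [True, True, False, True]:  # a stream of judge or voter outcomes
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], a_won
    )
print(ratings)
```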
It seems like an almost sure thing that WildBench will enter discussions along the likes of MT Bench, LMSYS, and AlpacaEval. It won’t have the same allure as being first in the space, but it’ll offer substantial robustness to rankings provided by other evaluations. If WildBench is to fail, it’s because it’s stuck in the middle between LMSYS and AlpacaEval, so the right signal is hard to find. Ease of use is likely the single most important factor when nourishing a new evaluation system.
When you look at the leaderboards, all of the models are very strongly correlated, and they have been for some time. Note that I’m not an author of WildBench; I just want to see it succeed.
I know I’ve been promising you all another “how RLHF actually works” piece, and I’m still working on it. The core theme will be how a lot of RLHF is overfitting to these benchmarks, given they’re the tools we have, rather than matching what OpenAI and Anthropic are up to.
Aside: Length bias problems
While writing this, AlpacaEval got updated to include length bias controls. You can read more on Twitter, but essentially they learn a linear model that takes in the length measurement, in addition to the completion and the instruction, to calculate preferences. The linear model directly adjusts a model’s score up or down depending on whether it overshoots on length. I think this could be gamed a bit, but the correlations and new orderings seem to make sense. We’ll see! They can also add more biases to this “rerank” model.
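Here is a rough sketch of the idea as I read it (this is not the AlpacaEval implementation; the features and numbers are made up): fit a logistic model of the judge’s preference that includes a length-difference term, then report the win rate with that term zeroed out so verbosity alone can’t move the ranking.

```python
# Toy length-controlled win rate: regress judge preferences on a length-difference
# feature, then predict the counterfactual win rate at zero length difference.
# Not the AlpacaEval code; all data and coefficients here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
length_diff = rng.normal(size=n)   # standardized (model length - baseline length)
true_quality = 0.4                 # latent advantage of the model being tested
# The judge prefers the model more often when it is both better and longer (length bias).
y = (true_quality + 0.8 * length_diff + rng.logistic(size=n) > 0).astype(int)

X = np.column_stack([np.ones(n), length_diff])  # [intercept, length term]
clf = LogisticRegression(fit_intercept=False).fit(X, y)

raw_win_rate = y.mean()
length_controlled = clf.predict_proba(np.array([[1.0, 0.0]]))[0, 1]  # length term zeroed out
print(f"raw: {raw_win_rate:.3f}, length-controlled: {length_controlled:.3f}")
```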
This trick makes AlpacaEval the instruction evaluation tool best correlated with ChatBotArena, which is generally accepted as the ground truth today for model abilities (WildBench needs to join this game, it seems). It’s yet another metric that seems to show Claude 3 Opus behind GPT-4, though, which some people online are confused about. LMSYS showed Claude 3 to be approximately equal to GPT-4, so any remaining gap could be the LLM-as-a-judge fudge factor.
For a while, it seemed all the strong open models had horrendous length problems, but that is no longer the case. Open models are getting much better and have similar lengths to models like GPT-4 now (at least the good ones do).
Audio of this post will be available later today on podcast players, for when you’re on the go, and YouTube, which I think is a better experience with the normal use of figures.
Newsletter stuff
Elsewhere from me
RewardBench is the big news for me this week. I’m going to talk about it at DLCT this Friday (10am PST).
Models, datasets, and other tools
StarChat2 is exciting. Medium-sized code models will enable a lot of product explorations.
Apple announced, but didn’t release, a VLM, which isn’t too surprising.
Grok was released via magnet link, which I don’t think policymakers will love.
Starling 7B v2 and a new Starling 34B reward model dropped! I am excited about these (especially the RM).
Housekeeping
Paid subscriber Discord access in email footer.
Referrals → paid sub: Use the Interconnects Leaderboard.
Student discounts in About page.