evaluation

Building on evaluation quicksand

On the state of evaluation for language models.

Oct 16, 2024 • Nathan Lambert

Interviewing Riley Goodside on the science of prompting

Interconnects interview #6. o1, chain of thought, evaluation, and the future of prompting.

Sep 30, 2024 • Nathan Lambert

1:08:38

On Nous Hermes 3 and classifying a "frontier model"

The latest model from one of the most popular fine-tuning labs makes us question how a model should be identified as a “frontier model.”

Aug 16, 2024 • Nathan Lambert

GPT-4o-mini changed ChatBotArena

And how to understand Llama 3.1’s results on the community's favorite benchmark.

Jul 31, 2024 • Nathan Lambert

Evaluations: Trust, performance, and price (bonus, announcing RewardBench)

Evaluation is not only getting harder with modern LLMs, it’s getting harder because it means something different.

Mar 20, 2024 • Nathan Lambert

Big Tech's LLM evals are just marketing

A PSA everyone needs. The importance of a wait-and-see attitude when it comes to new models, big and small, open and closed.

Dec 13, 2023 • Nathan Lambert

Evaluating and uncovering open LLMs

When choosing a model, we're stuck in the middle between classic NLP benchmarks (e.g. MMLU) and qualitative chatbot ranking. Neither are exactly what we…

May 31, 2023 • Nathan Lambert
