Discussion about this post

T Stands For

It's wild to think that new lightbulb products might undergo far more stringent pre-release testing than some frontier models. Does the cost of delaying a release by a few weeks for third-party evaluation (e.g., the o1 testing approach) really outweigh the potential costs of misalignment? Even if one believes extreme harms and misuse are far away, developing that muscle memory now will be key, as these risks will inevitably become more palpable with continued model improvement.

“If you think safety is expensive, try having an accident”

Joshua Saxe

Great post as always! I feel the most meaningful benchmarks for indexing progress toward transformative consumer and enterprise AI are agentic ones, e.g. SWE-bench, GAIA, and VisualWebArena. Ultimately these benchmarks measure systems, not bare models; they are more nuanced and don't sound as impressive to talk about as, say, competitive programming, so they get less press in model launches. Prompt injection is also blocking many AI agent applications, and we need better and more realistic agentic evals to index progress there.

1 more comment...