It's wild to think that new lightbulb products might undergo far more stringent pre-release testing than some frontier models. Does the cost of delaying a release by a few weeks for third-party evaluation (e.g. o1 testing approach) really outweigh the potential costs of misalignment? Even if one believes extreme harms/misuse are far away, developing muscle memory will be key as these risks inevitably become more palpable with continued model improvement.
“If you think safety is expensive, try having an accident”
I agree. I think a lot about Dario Amodei's efforts to create a "race to the top" on safety, and how the current climate has undone all of that. Of course, we can't control what Chinese laboratories do directly, but it felt like the right thing to try.
(I advocate for the same race to the top for something like Model Specs)
Great post as always! I feel the most meaningful benchmarks for indexing progress towards transformative consumer and enterprise AI are agentic ones, e.g. SWEBench, GAIA, VisualWebArena, etc. Ultimately these benchmarks measure systems, not bare models; they're more nuanced and don't sound as impressive to talk about as, say, competitive programming, and so get less press in model launches. Prompt injection is also blocking many AI agent applications, and we need better and more realistic agentic evals to index progress there.