In an era dominated by direct preference optimization and LLM-as-a-judge, why do we still need a model to output only a scalar reward?