In an era dominated by direct preference optimization and LLM-as-a-judge, why do we still need a model that outputs only a scalar reward?
Why reward models are still key to…