5 Comments
Nathan Lambert:
The reward model comes in static form as an output of stage b).

VJAnand:
So the final reward computed is weighted between the human input and what the reward model generates?

Nathan Lambert:
Think of part c) as a feedback loop. Samples keep getting generated, and the bottom part is the reward, à la a classic RL problem.
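For intuition, here is a minimal sketch of that feedback loop in PyTorch, assuming a toy bandit-style setup: `ToyPolicy` and `ToyRewardModel` are hypothetical stand-ins, not code from the post.

```python
import torch
import torch.nn as nn

VOCAB = 16  # toy "response" space

class ToyPolicy(nn.Module):
    # Stand-in for the policy LLM: a learned distribution over responses.
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(VOCAB))

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        action = dist.sample()
        return action, dist.log_prob(action)

class ToyRewardModel(nn.Module):
    # Stand-in for the reward model from stage b): a fixed scoring table.
    def __init__(self):
        super().__init__()
        self.scores = nn.Parameter(torch.randn(VOCAB))

    def forward(self, action):
        return self.scores[action]

policy = ToyPolicy()
reward_model = ToyRewardModel()
reward_model.requires_grad_(False)  # static: it scores samples but is not trained
optimizer = torch.optim.Adam(policy.parameters(), lr=0.1)

for step in range(200):
    action, log_prob = policy.sample()   # policy generates a sample
    with torch.no_grad():
        reward = reward_model(action)    # frozen reward model scores it
    loss = -log_prob * reward            # REINFORCE: reinforce high-reward samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point matching the thread: gradients flow only into the policy; the reward model is read-only throughout the loop.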

VJAnand:
Hi @Nathan - in the RLHF system diagram above, in the 3rd stage where the Policy LLM is being trained, we see the corrected-caption signal passed into the reward model. Is the reward model also getting updated?

Nathan Lambert:
Don't think so, it's usually just an input.
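As a concrete (hypothetical) illustration of "just an input" in PyTorch: the reward model's weights are frozen and it runs in inference mode, so the corrected-caption signal flows through it without updating it.

```python
import torch
import torch.nn as nn

reward_model = nn.Linear(8, 1)        # stand-in for the stage-b) reward model
reward_model.requires_grad_(False)    # freeze: no parameter updates
reward_model.eval()                   # inference mode (dropout etc. disabled)

caption_features = torch.randn(1, 8)  # stand-in for an encoded corrected caption
with torch.no_grad():
    score = reward_model(caption_features)  # the score is read; weights stay fixed
```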