5 Comments
Nathan Lambert:

The reward model comes in a static form as an output of stage b).

VJAnand:

So the final reward is computed as a weighted combination of the human input and what the reward model generates?

Nathan Lambert:

Think of part c) as a feedback loop. The samples keep getting generated, and the bottom part is the reward, à la a classic RL problem.
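
[Editor's note: a minimal toy sketch of this feedback loop, for readers who want it concrete. Everything here is an illustrative assumption, not from the post's diagram: the "policy" is a 1-D Gaussian whose mean is its only parameter, and the reward model is a fixed function that is never updated.]

```python
import random

def reward_model(sample):
    """Static reward model from stage b): prefers samples near 5.0; never updated."""
    return -abs(sample - 5.0)

def policy_generate(policy_bias):
    """Toy 'policy': a Gaussian whose mean (policy_bias) is its only parameter."""
    return random.gauss(policy_bias, 1.0)

policy_bias = 0.0
lr = 0.05
for _ in range(500):
    sample = policy_generate(policy_bias)   # samples keep getting generated
    reward = reward_model(sample)           # the "bottom part": a scalar reward
    baseline = reward_model(policy_bias)    # variance-reducing baseline
    # REINFORCE-style update: grad of log N(sample; bias, 1) w.r.t. bias is (sample - bias)
    policy_bias += lr * (reward - baseline) * (sample - policy_bias)

print(f"learned policy mean: {policy_bias:.2f}")  # drifts toward the RM's preferred 5.0
```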

VJAnand:

Hi @Nathan - in the RLHF systems diagram above, in the 3rd stage where the policy LLM is being trained, we see the signal of the corrected caption passed into the reward model. Is the reward model also being updated?

Nathan Lambert:

Don’t think so; it’s usually just an input.
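
[Editor's note: a hedged PyTorch sketch of what "just an input" means in practice. The `Linear` stand-in and tensor shapes are assumptions for illustration: the point is that the reward model's parameters are frozen and it only runs a forward pass to produce scalar rewards for the policy update.]

```python
import torch

# Stand-in for a trained reward-model head from stage b).
reward_model = torch.nn.Linear(16, 1)
reward_model.eval()
for p in reward_model.parameters():
    p.requires_grad_(False)              # frozen: no gradients flow into the RM

features = torch.randn(4, 16)            # stand-in for policy-sample embeddings
with torch.no_grad():                    # scoring is a pure forward pass
    rewards = reward_model(features).squeeze(-1)

print(rewards)                           # scalars fed back into the policy update
```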
