5 Comments
author

The reward model comes in static form, as an output of stage b).


Hi @Nathan - In the RLHF systems diagram above, in the 3rd stage, where the policy LLM is being trained, we see the corrected-caption signal passed into the reward model. Is the reward model also being updated?
