A sampling of recent happenings in the multimodal space. Expect more this year.
Hi @Nathan - In the RLHF systems diagram above, in the third stage where the Policy LLM is being trained, we see the corrected-caption signal passed into the reward model. Is the reward model also getting updated?
I don't think so; it's usually just an input.
The reward model comes in a static form as an output of stage (b), so the final reward is computed as a weighted combination of the human input and what the reward model generates. Think of part (c) as a feedback loop: samples keep getting generated, and the bottom part is the reward, as in a classic RL problem.
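For intuition, here is a minimal sketch of that stage (c) feedback loop, under the assumptions above: the reward model is frozen and only scores samples, and any human signal is mixed in as a weighted term. The helpers `policy_generate`, `reward_model_score`, and `ppo_update` are hypothetical stand-ins, not code from the article.

```python
import random

# Hypothetical stand-ins for the real components (assumptions, not the article's code).
def policy_generate(prompt):
    """Policy LLM samples a caption/completion for the prompt."""
    return f"caption for {prompt} #{random.randint(0, 9)}"

def reward_model_score(prompt, completion):
    """Frozen reward model: scores the sample, but is NOT updated in this loop."""
    return random.uniform(-1.0, 1.0)

def ppo_update(prompt, completion, reward):
    """Placeholder for the RL policy-gradient step (e.g. PPO) on the policy LLM."""
    print(f"update policy with reward={reward:.2f} for {completion!r}")

# Stage (c) as a feedback loop: keep sampling, keep scoring, keep updating the policy.
prompts = ["image_001", "image_002"]
for step in range(3):
    prompt = random.choice(prompts)
    sample = policy_generate(prompt)

    # The reward model is a static input here; only the policy's weights change.
    rm_reward = reward_model_score(prompt, sample)

    # If a human-provided signal (e.g. from a corrected caption) is available,
    # the final reward can be a weighted mix of the two, as described above.
    human_reward = 0.0   # assumed absent for most samples
    alpha = 0.5          # hypothetical mixing weight
    final_reward = alpha * human_reward + (1 - alpha) * rm_reward

    ppo_update(prompt, sample, final_reward)
```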