5 Comments
Nathan Lambert:
The reward model comes in static form as an output of stage b).

VJAnand:
So the final reward computed is weighted between the human input and what the reward model generates?

Nathan Lambert:
Think of part c) as a feedback loop. Samples keep getting generated, and the bottom part is the reward, à la a classic RL problem.
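For intuition, here is a minimal sketch of that feedback loop in PyTorch, assuming a toy bandit-style setup: `ToyPolicy` and `ToyRewardModel` are hypothetical stand-ins, not code from the post.

```python
import torch
import torch.nn as nn

VOCAB = 16  # toy "response" space

class ToyPolicy(nn.Module):
    # Stand-in for the policy LLM: a learned distribution over responses.
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(VOCAB))

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        action = dist.sample()
        return action, dist.log_prob(action)

class ToyRewardModel(nn.Module):
    # Stand-in for the reward model from stage b): a fixed scoring table.
    def __init__(self):
        super().__init__()
        self.scores = nn.Parameter(torch.randn(VOCAB))

    def forward(self, action):
        return self.scores[action]

policy = ToyPolicy()
reward_model = ToyRewardModel()
reward_model.requires_grad_(False)  # static: it scores samples but is not trained
optimizer = torch.optim.Adam(policy.parameters(), lr=0.1)

for step in range(200):
    action, log_prob = policy.sample()   # policy generates a sample
    with torch.no_grad():
        reward = reward_model(action)    # frozen reward model scores it
    loss = -log_prob * reward            # REINFORCE: reinforce high-reward samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point matching the thread: gradients flow only into the policy; the reward model is read-only throughout the loop.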

VJAnand:
Hi @Nathan - in the RLHF system diagram above, in the 3rd stage where the Policy LLM is being trained, we see the corrected-caption signal passed into the reward model. Is the reward model also getting updated?

Nathan Lambert:
Don't think so, it's usually just an input.
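As a concrete (hypothetical) illustration of "just an input" in PyTorch: the reward model's weights are frozen and it runs in inference mode, so the corrected-caption signal flows through it without updating it.

```python
import torch
import torch.nn as nn

reward_model = nn.Linear(8, 1)        # stand-in for the stage-b) reward model
reward_model.requires_grad_(False)    # freeze: no parameter updates
reward_model.eval()                   # inference mode (dropout etc. disabled)

caption_features = torch.randn(1, 8)  # stand-in for an encoded corrected caption
with torch.no_grad():
    score = reward_model(caption_features)  # the score is read; weights stay fixed
```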