Can RLHF be made even simpler by directly maximizing the expected reward? #236

kindernerd · 2025-01-07T11:27:40Z

GRPO simplifies the advantage to (r - mean) / std. I'm wondering whether RLHF can be made even simpler by directly maximizing the following objective:
$\sum_o \pi_{\theta}(o|q)\,[r_o - E(r_o|q)]$
which can be approximated by sampling or by using N-best lists:
$\sum_{o_i \sim \pi_{old}} \pi_{\theta}(o_i|q)\,[r_{o_i} - \mathrm{mean}(r)]$
This is similar to sequence training (MWER) for end-to-end ASR optimization, proposed by Google in this paper: https://arxiv.org/abs/1712.01818
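
To make the sampled objective concrete, here is a minimal pure-Python sketch of how it could be evaluated for one prompt. It is not taken from any existing implementation. The function names, the `renormalize` flag (an illustrative option for normalizing the probabilities over the sampled list), and the choice of population standard deviation in the GRPO comparison are all assumptions made for this example.

```python
import math
from typing import Sequence


def centered_rewards(rewards: Sequence[float]) -> list[float]:
    """Subtract the group mean from each reward: r_i - mean(r)."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage (r - mean) / std, shown only for comparison.

    Assumption: population std (divide by N); implementations may differ.
    """
    centered = centered_rewards(rewards)
    std = math.sqrt(sum(c * c for c in centered) / len(rewards))
    return [c / (std + eps) for c in centered]


def expected_reward_objective(
    seq_logprobs: Sequence[float],
    rewards: Sequence[float],
    renormalize: bool = False,
) -> float:
    """Evaluate sum_i pi_theta(o_i|q) * (r_i - mean(r)) over sampled outputs.

    seq_logprobs[i] is log pi_theta(o_i|q) for the i-th sampled output.
    renormalize is a hypothetical option: if True, the probabilities are
    rescaled to sum to 1 over the sampled list before weighting.
    """
    probs = [math.exp(lp) for lp in seq_logprobs]
    if renormalize:
        total = sum(probs)
        probs = [p / total for p in probs]
    return sum(p * c for p, c in zip(probs, centered_rewards(rewards)))


# Toy group of four sampled outputs for a single prompt q.
rewards = [1.0, 0.0, 0.0, 1.0]
seq_logprobs = [math.log(p) for p in (0.2, 0.1, 0.1, 0.2)]

objective = expected_reward_objective(seq_logprobs, rewards)
objective_renorm = expected_reward_objective(seq_logprobs, rewards, renormalize=True)
advantages = grpo_advantages(rewards)
```

In a real training loop the log-probabilities would come from the policy model and the objective would be maximized with an autodiff framework; this sketch only evaluates the scalar. Note that, unlike the GRPO advantage, the weights here are the probabilities themselves rather than log-probabilities or probability ratios.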
