feature(nyz&dcy): add LLM/VLM RLHF loss (PPO/GRPO/RLOO) #857

PaParaZz1 · 2025-02-13T06:38:38Z

Description

polish and test original PPO
implement GRPO and RLOO
optimize efficiency in real LLM cases

Related Issue

TODO

Check List

merge the latest version source branch/repo, and resolve all the conflicts
pass style check
pass all the tests

codecov · 2025-02-13T07:09:06Z

Codecov Report

Attention: Patch coverage is 98.22222% with 4 lines in your changes missing coverage. Please review.

Project coverage is 75.52%. Comparing base (64efcb3) to head (17a7a71).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
ding/rl_utils/tests/test_grpo_rlhf.py	94.54%	3 Missing ⚠️
ding/rl_utils/ppo.py	85.71%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #857      +/-   ##
==========================================
+ Coverage   75.44%   75.52%   +0.07%     
==========================================
  Files         689      698       +9     
  Lines       56360    56679     +319     
==========================================
+ Hits        42523    42807     +284     
- Misses      13837    13872      +35

Flag	Coverage Δ
unittests	`75.52% <98.22%> (+0.07%)`	⬆️

Flags with carried forward coverage won't be shown.


PaParaZz1 · 2025-02-19T05:03:32Z

ding/rl_utils/grpo.py

+) -> Tuple[namedtuple, namedtuple]:
+    """Calculate the policy loss for GRPO
+    Args:
+        data (grpo_policy_data): Data containing the following fields:


Polish the comment format.

PaParaZz1 · 2025-02-19T05:03:38Z

ding/rl_utils/grpo.py

+        clip_ratio: float = 0.2,
+        beta: float = 0.1,  # Weight coefficient for KL divergence
+) -> Tuple[namedtuple, namedtuple]:
+    """Calculate the policy loss for GRPO


Add the paper link.
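For orientation, here is a minimal sketch of how a GRPO-style policy loss might use `clip_ratio` and `beta`. It is not the code from `ding/rl_utils/grpo.py`. The function name `grpo_loss_sketch`, the choice to take per-token log-probabilities as inputs, the KL estimator, and the averaging over all valid tokens are all assumptions made for illustration.

```python
import torch


def grpo_loss_sketch(
        logp_new: torch.Tensor,  # [B, L] log-probs of the taken tokens, current policy
        logp_old: torch.Tensor,  # [B, L] same tokens, old (sampling) policy
        logp_ref: torch.Tensor,  # [B, L] same tokens, reference policy
        adv: torch.Tensor,  # [B] one advantage per sequence
        weight: torch.Tensor,  # [B, L] 1.0 for valid tokens, 0.0 for padding
        clip_ratio: float = 0.2,
        beta: float = 0.1,
) -> torch.Tensor:
    # PPO-style clipped surrogate on the importance ratio.
    ratio = torch.exp(logp_new - logp_old)
    adv = adv.unsqueeze(-1)  # broadcast the sequence-level advantage over tokens
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv
    # Non-negative KL estimate: exp(d) - d - 1 with d = logp_ref - logp_new.
    delta = logp_ref - logp_new
    kl = torch.exp(delta) - delta - 1
    per_token = -(torch.min(surr1, surr2) - beta * kl)
    # Average over valid tokens only (assumed normalisation).
    return (per_token * weight).sum() / weight.sum().clamp(min=1)
```

When all three policies agree, the ratio is 1 and the KL term vanishes, so the loss reduces to the negative masked mean of the advantages.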

PaParaZz1 · 2025-02-19T05:05:26Z

ding/rl_utils/rloo.py

+    }
+
+    # Create return namedtuples
+    loss_info = namedtuple('LossInfo', ['policy_loss'])(policy_loss=loss)


If there is only one field, you can return it directly rather than wrapping it in a namedtuple.

PaParaZz1 · 2025-02-19T05:05:46Z

ding/rl_utils/rloo.py

+
+    # Create return namedtuples
+    loss_info = namedtuple('LossInfo', ['policy_loss'])(policy_loss=loss)
+    metric_info = namedtuple('MetricInfo', list(metrics.keys()))(**metrics)


You can define the namedtuple at the beginning of this file.
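One possible shape for the two rloo.py suggestions is sketched here: the namedtuple type is created once at module level, and the single loss value is returned without a wrapper. The names `RLOOMetricInfo` and `rloo_outputs_sketch` and the metric fields `mean_ratio` and `clipfrac` are hypothetical; the PR's actual field names are not shown in this thread.

```python
from collections import namedtuple

# Created once at import time instead of on every call.
# The field names are illustrative, not taken from the PR.
RLOOMetricInfo = namedtuple('RLOOMetricInfo', ['mean_ratio', 'clipfrac'])


def rloo_outputs_sketch(loss, mean_ratio, clipfrac):
    # A single loss value is returned as-is; only the multi-field metrics are wrapped.
    return loss, RLOOMetricInfo(mean_ratio=mean_ratio, clipfrac=clipfrac)
```

The trade-off is that a module-level definition fixes the field list up front, whereas `list(metrics.keys())` in the reviewed snippet builds the type from whatever the dict contains.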

PaParaZz1 · 2025-02-19T05:07:11Z

ding/rl_utils/grpo.py

+        data (grpo_policy_data): Data containing the following fields:
+            - logit_new: Current policy logits [B, L, V]
+            - logit_old: Old policy logits [B, L, V]
+            - logit_ref: Reference policy logits [B, L, V]


PaParaZz1 · 2025-02-19T05:07:26Z

ding/rl_utils/grpo.py

+            - logit_ref: Reference policy logits [B, L, V]
+            - action: Actions taken [B, L]
+            - adv: Advantage values [B]
+            - weight: Attention mask [B, L]


Use the extra Shapes section.

PaParaZz1 · 2025-02-19T05:08:24Z

ding/rl_utils/grpo.py

+            - adv: Advantage values [B]
+            - weight: Attention mask [B, L]
+        clip_ratio (float): PPO clipping ratio, default 0.2
+        beta (float): Weight coefficient for KL divergence, default 0.1


Add a period to the end of each sentence.
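Taken together, the docstring comments on grpo.py could lead to something like the sketch below. It assumes the Overview/Arguments/Shapes layout used in other `ding/rl_utils` docstrings. The function name `grpo_policy_error_sketch` is hypothetical, and reading B, L and V as batch size, sequence length and vocabulary size is also an assumption.

```python
def grpo_policy_error_sketch(data, clip_ratio: float = 0.2, beta: float = 0.1):
    """
    Overview:
        Calculate the policy loss for GRPO.
    Arguments:
        - data (:obj:`namedtuple`): Logits of the new, old and reference policies, plus action, adv and weight.
        - clip_ratio (:obj:`float`): PPO clipping ratio, default 0.2.
        - beta (:obj:`float`): Weight coefficient for KL divergence, default 0.1.
    Shapes:
        - logit_new (:obj:`torch.FloatTensor`): :math:`(B, L, V)`.
        - logit_old (:obj:`torch.FloatTensor`): :math:`(B, L, V)`.
        - logit_ref (:obj:`torch.FloatTensor`): :math:`(B, L, V)`.
        - action (:obj:`torch.LongTensor`): :math:`(B, L)`.
        - adv (:obj:`torch.FloatTensor`): :math:`(B, )`.
        - weight (:obj:`torch.FloatTensor`): :math:`(B, L)`.
    """
    raise NotImplementedError("Docstring layout only; the loss body is omitted.")
```

Moving the shapes into their own section keeps the argument descriptions short, and every sentence now ends with a period.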

PaParaZz1 · 2025-02-19T05:10:15Z

ding/rl_utils/grpo.py

+            - logit_old: Old policy logits [B, L, V]
+            - logit_ref: Reference policy logits [B, L, V]
+            - action: Actions taken [B, L]
+            - adv: Advantage values [B]



PaParaZz1 added the enhancement New feature or request label Feb 13, 2025

PaParaZz1 added 2 commits February 13, 2025 14:43

test(nyz): polish ppo and add rlhf ppo loss test

2a51392

interface(nyz): add naive interface about grpo/rloo

2e49437

PaParaZz1 force-pushed the dev-rlhf-loss branch from 6965fd3 to 2e49437 Compare February 13, 2025 06:43

PaParaZz1 changed the title ~~feature(nyz): add LLM/VLM RLHF loss (PPO/GRPO/RLOO)~~ feature(nyz&dcy): add LLM/VLM RLHF loss (PPO/GRPO/RLOO) Feb 13, 2025

PaParaZz1 added the algo Add new algorithm or improve old one label Feb 13, 2025

Berit-chengyi force-pushed the dev-rlhf-loss branch from d3f6f3f to 9cb6ca3 Compare February 13, 2025 12:52

test&implement(dcy): add unit tests for GRPO and RLOO

8d34eac

- Add test_grpo_rlhf.py for GRPO unit tests
- Add test_rloo_rlhf.py for RLOO unit tests
- Update GRPO implementation
- Update RLOO implementation

Berit-chengyi force-pushed the dev-rlhf-loss branch from 9cb6ca3 to 8d34eac Compare February 13, 2025 13:02

PaParaZz1 mentioned this pull request Feb 13, 2025

Roadmap for DI-engine #548

Open

Berit-chengyi added 2 commits February 14, 2025 14:45

polish(dcy): polish grpo and rloo and test unit

71190d4

(dcy) rloo and grpo

2cbd9fb

Berit-chengyi force-pushed the dev-rlhf-loss branch from eba91a1 to 2cbd9fb Compare February 14, 2025 09:11

(dcy) redesign avd from reward

17a7a71

Berit-chengyi force-pushed the dev-rlhf-loss branch from 7bcd64d to 17a7a71 Compare February 18, 2025 07:17
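The commit title above says the advantage is now derived from the reward, but the thread does not show how. As a reference point only, the textbook RLOO and GRPO advantage computations for a group of samples drawn from one prompt can be sketched as follows. The function names and the use of the population standard deviation are assumptions, not the PR's code.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for K samples of the same prompt (K >= 2)."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt.")
    total = sum(rewards)
    # The baseline for sample i is the mean reward of the other K - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: rewards standardised within the group."""
    k = len(rewards)
    mean = sum(rewards) / k
    # Population standard deviation (an assumed choice; some code uses the sample version).
    std = (sum((r - mean) ** 2 for r in rewards) / k) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For rewards `[1.0, 2.0, 3.0]`, the leave-one-out baselines are 2.5, 2.0 and 1.5, so `rloo_advantages` gives `[-1.5, 0.0, 1.5]`.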

PaParaZz1 commented Feb 19, 2025


(dcy) Polish style: Use selective log-softmax to reduce peak vram consumption

5358b8d

Berit-chengyi force-pushed the dev-rlhf-loss branch from 7a82a7b to 5358b8d Compare February 20, 2025 07:02
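The idea named in commit 5358b8d can be illustrated with a short sketch. It assumes "selective log-softmax" means computing log-probabilities only for the tokens actually taken, rather than building a full `[B, L, V]` log-softmax tensor and indexing into it afterwards. The helper name `selective_log_softmax` is illustrative and the PR's implementation may differ.

```python
import torch


def selective_log_softmax(logits: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    """Log-probabilities of the tokens in `index` ([B, L]) under `logits` ([B, L, V])."""
    # Pick the logit of each chosen token first: result is [B, L].
    selected = torch.gather(logits, dim=-1, index=index.unsqueeze(-1)).squeeze(-1)
    # log_softmax(x)[i] = x[i] - logsumexp(x), so only a [B, L] normaliser is needed.
    return selected - torch.logsumexp(logits, dim=-1)
```

The result matches `torch.log_softmax(logits, dim=-1)` followed by a gather, but the output that has to be kept is `[B, L]` rather than a second `[B, L, V]` tensor, which matters when V is a large vocabulary.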
