# Efficient Agentic LLM

- **Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference**, ICLR 2025. Paper
  - Qining Zhang, Lei Ying
  - Motivation: reward-function construction is the bottleneck that has driven the progression of preference-learning methods (RLHF -> DPO -> GRPO).
  - Design: apply the policy gradient directly, using zeroth-order (ZO) value-function estimation from preference comparisons instead of inferring a reward model (see the sketch below).
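  As a rough illustration of the idea, and not the paper's exact algorithm, the sketch below implements a generic two-point zeroth-order policy-gradient step driven only by pairwise preference comparisons. The names `rollout` and `prefer`, and all hyperparameters, are placeholder assumptions; the win rate between the two perturbed policies is used here as a crude monotone surrogate for the value gap that the paper estimates with more care.

  ```python
  import numpy as np

  def zo_policy_gradient_step(theta, rollout, prefer, mu=0.05, lr=0.1,
                              n_dirs=8, n_pairs=16, rng=None):
      """One zeroth-order policy-gradient step from preference comparisons
      only (no learned reward model). A toy sketch, not the paper's method.

      theta   : np.ndarray, policy parameters.
      rollout : callable(theta, rng) -> trajectory (placeholder assumption).
      prefer  : callable(traj_a, traj_b) -> 1 if traj_a is preferred, else 0
                (stands in for the human-feedback oracle).
      """
      rng = np.random.default_rng() if rng is None else rng
      grad = np.zeros_like(theta)
      for _ in range(n_dirs):
          u = rng.standard_normal(theta.shape)      # random perturbation direction
          # Compare rollouts from the two perturbed policies; the win rate of
          # theta + mu*u is a monotone proxy for V(theta + mu*u) - V(theta - mu*u).
          wins = 0
          for _ in range(n_pairs):
              t_plus = rollout(theta + mu * u, rng)
              t_minus = rollout(theta - mu * u, rng)
              wins += prefer(t_plus, t_minus)
          value_gap = wins / n_pairs - 0.5          # centered win rate in [-0.5, 0.5]
          grad += (value_gap / mu) * u              # two-point ZO gradient estimate
      grad /= n_dirs
      return theta + lr * grad                      # gradient ascent on the value

  # Toy check: 1-D Gaussian policy whose mean should drift toward 2.0,
  # with preferences given by the (hidden) value -(a - 2)^2.
  rng = np.random.default_rng(0)
  rollout = lambda th, r: th[0] + 0.1 * r.standard_normal()
  prefer = lambda a, b: int(-(a - 2.0) ** 2 > -(b - 2.0) ** 2)
  theta = np.zeros(1)
  for _ in range(200):
      theta = zo_policy_gradient_step(theta, rollout, prefer, rng=rng)
  ```

  The design choice the paper's title points at is visible here: the update never materializes a reward estimate; preferences enter only through trajectory comparisons that feed the zeroth-order value-gap estimate.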