# Efficient Agentic LLM

- **Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference**, ICLR 2025. Paper
  - Qining Zhang, Lei Ying
  - Motivation: reward-function construction is the bottleneck that has driven the progression of preference-learning methods (RLHF -> DPO -> GRPO).
  - Design: apply the policy gradient directly, using zeroth-order (ZO) value-function estimation from preference comparisons instead of inferring a reward model (see the sketch below).
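  As a rough illustration of the idea, and not the paper's exact algorithm, the sketch below implements a generic two-point zeroth-order policy-gradient step driven only by pairwise preference comparisons. The names `rollout` and `prefer`, and all hyperparameters, are placeholder assumptions; the win rate between the two perturbed policies is used here as a crude monotone surrogate for the value gap that the paper estimates with more care.

  ```python
  import numpy as np

  def zo_policy_gradient_step(theta, rollout, prefer, mu=0.05, lr=0.1,
                              n_dirs=8, n_pairs=16, rng=None):
      """One zeroth-order policy-gradient step from preference comparisons
      only (no learned reward model). A toy sketch, not the paper's method.

      theta   : np.ndarray, policy parameters.
      rollout : callable(theta, rng) -> trajectory (placeholder assumption).
      prefer  : callable(traj_a, traj_b) -> 1 if traj_a is preferred, else 0
                (stands in for the human-feedback oracle).
      """
      rng = np.random.default_rng() if rng is None else rng
      grad = np.zeros_like(theta)
      for _ in range(n_dirs):
          u = rng.standard_normal(theta.shape)      # random perturbation direction
          # Compare rollouts from the two perturbed policies; the win rate of
          # theta + mu*u is a monotone proxy for V(theta + mu*u) - V(theta - mu*u).
          wins = 0
          for _ in range(n_pairs):
              t_plus = rollout(theta + mu * u, rng)
              t_minus = rollout(theta - mu * u, rng)
              wins += prefer(t_plus, t_minus)
          value_gap = wins / n_pairs - 0.5          # centered win rate in [-0.5, 0.5]
          grad += (value_gap / mu) * u              # two-point ZO gradient estimate
      grad /= n_dirs
      return theta + lr * grad                      # gradient ascent on the value

  # Toy check: 1-D Gaussian policy whose mean should drift toward 2.0,
  # with preferences given by the (hidden) value -(a - 2)^2.
  rng = np.random.default_rng(0)
  rollout = lambda th, r: th[0] + 0.1 * r.standard_normal()
  prefer = lambda a, b: int(-(a - 2.0) ** 2 > -(b - 2.0) ** 2)
  theta = np.zeros(1)
  for _ in range(200):
      theta = zo_policy_gradient_step(theta, rollout, prefer, rng=rng)
  ```

  The design choice the paper's title points at is visible here: the update never materializes a reward estimate; preferences enter only through trajectory comparisons that feed the zeroth-order value-gap estimate.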