It seems that the importance sampling code part is wrong. #22

yhy258 · 2023-05-07T12:36:25Z

Lines 108 to 119 in e200eb8

    fixed_log_prob = normal_log_density(Variable(actions), action_means, action_log_stds, action_stds).data.clone()

    def get_loss(volatile=False):
        if volatile:
            with torch.no_grad():
                action_means, action_log_stds, action_stds = policy_net(Variable(states))
        else:
            action_means, action_log_stds, action_stds = policy_net(Variable(states))

        log_prob = normal_log_density(Variable(actions), action_means, action_log_stds, action_stds)
        action_loss = -Variable(advantages) * torch.exp(log_prob - Variable(fixed_log_prob))
        return action_loss.mean()

The line that computes `fixed_log_prob` and the body of `get_loss` perform exactly the same computation.
Because the two run back to back with the same policy parameters, `fixed_log_prob` and `log_prob` hold exactly the same values, so the importance ratio `torch.exp(log_prob - fixed_log_prob)` evaluates to 1.
Is there a reason the code is written this way?
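
To make the question concrete, here is a minimal framework-free sketch of the quantity the snippet computes. It is not the repository's code: the scalar Gaussian policy, the toy `actions` and `advantages`, and the helper `surrogate_loss` are all assumptions made for illustration. It shows that when `fixed_log_prob` and `log_prob` come from the same parameters the ratio is exactly 1 in value, that the derivative of the loss with respect to the policy mean is still generally non-zero at that point, and that the ratio moves away from 1 only if the loss is re-evaluated after the parameters change (as a TRPO-style line search would do; the quoted lines do not show whether that happens here).

```python
import math

def normal_log_density(x, mean, log_std):
    """Log density of a scalar Gaussian N(mean, exp(log_std)^2) at x."""
    std = math.exp(log_std)
    return -((x - mean) ** 2) / (2 * std ** 2) - 0.5 * math.log(2 * math.pi) - log_std

def surrogate_loss(mean, log_std, actions, advantages, fixed_log_probs):
    """Mean of -advantage * exp(log_prob - fixed_log_prob), mirroring get_loss."""
    terms = [
        -adv * math.exp(normal_log_density(a, mean, log_std) - old_lp)
        for a, adv, old_lp in zip(actions, advantages, fixed_log_probs)
    ]
    return sum(terms) / len(terms)

# Hypothetical toy batch and "old" policy parameters (assumptions, not repo data).
actions = [0.3, -0.1, 0.8]
advantages = [1.0, -0.5, 2.0]
old_mean, log_std = 0.0, 0.0

# Plays the role of fixed_log_prob: log-probs under the old parameters, held constant.
fixed_log_probs = [normal_log_density(a, old_mean, log_std) for a in actions]

# Same parameters on both sides: every ratio is exp(0) = 1.
ratios_at_old = [
    math.exp(normal_log_density(a, old_mean, log_std) - lp)
    for a, lp in zip(actions, fixed_log_probs)
]

# Central finite difference of the loss with respect to the mean, at the old mean.
eps = 1e-6
grad_wrt_mean = (
    surrogate_loss(old_mean + eps, log_std, actions, advantages, fixed_log_probs)
    - surrogate_loss(old_mean - eps, log_std, actions, advantages, fixed_log_probs)
) / (2 * eps)

# Re-evaluating after a parameter change: the ratios are no longer 1.
new_mean = 0.1
ratios_at_new = [
    math.exp(normal_log_density(a, new_mean, log_std) - lp)
    for a, lp in zip(actions, fixed_log_probs)
]

print("ratios at old params:", ratios_at_old)
print("d(loss)/d(mean) at old params:", grad_wrt_mean)
print("ratios at new params:", ratios_at_new)
```

In this toy setting the loss value at the old parameters reduces to the negated mean advantage, so the value alone carries no information about the policy; only its derivative and its value at other parameters do.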

asyua-ye · 2024-01-14T06:37:21Z

`get_kl` also has this problem.
