Question about results in the paper #43
Comments
Hi @lchenat, all the parameters should have been documented in the appendix of the paper. However, it is not guaranteed that you will get exactly the same result, due to differences in random seeds. I'd be happy to assist if you observe significant discrepancies.
I did not find the parameters of DDPG in the appendix of the paper. I ran the following code and the maximal average return in each iteration is no more than 2400:

```python
import re
path = "/home/data/lchenat/rllab-master/data/local/experiment/"
stub(globals())
env = normalize(CartpoleEnv())
policy = DeterministicMLPPolicy(
# delete the previous data
run_experiment_lite(
```
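For reference, a minimal self-contained sketch of what a full script like this might look like, modeled on rllab's bundled DDPG example: the exploration strategy, Q-function, network sizes, and every hyperparameter value below are illustrative assumptions, not the settings used for the paper or in the snippet above (the later comments in this thread discuss the reward scaling and max path length that actually matter).

```python
# Hypothetical reconstruction of a DDPG cartpole script in rllab.
# Module paths follow rllab's bundled examples; all hyperparameter
# values are placeholders, not the paper's settings.
from rllab.algos.ddpg import DDPG
from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.misc.instrument import stub, run_experiment_lite
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction

stub(globals())

env = normalize(CartpoleEnv())

policy = DeterministicMLPPolicy(
    env_spec=env.spec,
    hidden_sizes=(32, 32),  # assumed network size
)
es = OUStrategy(env_spec=env.spec)
qf = ContinuousMLPQFunction(env_spec=env.spec)

algo = DDPG(
    env=env,
    policy=policy,
    es=es,
    qf=qf,
    batch_size=32,
    max_path_length=100,   # placeholder; see the discussion of 500 below
    epoch_length=1000,
    min_pool_size=10000,
    n_epochs=1000,
    discount=0.99,
    scale_reward=1.0,      # placeholder; see the discussion of 0.1 below
    qf_learning_rate=1e-3,
    policy_learning_rate=1e-4,
)

run_experiment_lite(
    algo.train(),
    n_parallel=1,
    snapshot_mode="last",
    seed=1,
)
```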
By the way, is there a function that can calculate the metric defined in the paper (the average over all iterations and all trajectories)? The debug log only provides the average return of each iteration, and the number of trajectories in each iteration is not provided for some algorithms.
Also, as mentioned in the paper (we probably should have been clearer about this), we scaled all the rewards by 0.1 when running DDPG. Refer to https://github.com/openai/rllab/blob/master/rllab/algos/ddpg.py#L112. In general, we found this parameter to be very important, and due to time constraints at the time we weren't able to tune it extensively. You may try some other values on other tasks, which may give you even better results. Re the second question: I think we did a very crude approximation and simply averaged the results over all iterations (treating it as if all iterations had the same number of trajectories). Feel free to submit a pull request that adds additional logging.
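As a rough sketch of that crude approximation, the per-iteration average returns can simply be averaged directly from the experiment log; the file name and column name below assume rllab's default logger layout (a `progress.csv` with an `AverageReturn` column) and may need adjusting for your setup.

```python
import csv

# Hypothetical path to an rllab experiment log; adjust to your own run.
progress_path = "data/local/experiment/experiment_example/progress.csv"

with open(progress_path) as f:
    rows = list(csv.DictReader(f))

# Crude approximation of the paper's metric: average the per-iteration
# AverageReturn, treating every iteration as if it contained the same
# number of trajectories (as described above).
returns = [float(r["AverageReturn"]) for r in rows]
print("approximate metric:", sum(returns) / len(returns))
```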
I have scaled the reward by 0.1, but I still get a return of around 2500. Are there any other parameters that I need to tune?
Oh, you should change the max path length in DDPG to 500. Otherwise, the optimal score is 2500!
Yes, the optimal score increased to 5000 after I changed the max path length to 500, but the average over all iterations is around 3100. Here are the average returns extracted from debug.log: [85.1877, 22.4833, 22.2935, 22.4445, 22.561, 22.3393, 22.8141, 22.2145, 22.2697, 22.3604, 100.441, 177.388, 196.363, 183.331, 223.452, 272.554, 293.124, 407.079, 535.813, 619.828, 695.468, 872.355, 1028.65, 952.744, 645.209, 846.002, 601.686, 607.632, 656.687, 697.427, 715.399, 646.103, 646.78, 621.531, 609.173, 629.381, 598.768, 633.524, 603.093, 692.313, 627.032, 665.51, 671.895, 678.046, 721.31, 670.6, 645.387, 603.164, 594.49, 617.101, 676.009, 634.184, 627.533, 658.008, 700.695, 684.835, 622.859, 596.207, 691.321, 615.621, 612.777, 573.243, 598.272, 611.166, 596.099, 598.044, 551.066, 636.267, 740.511, 599.541, 605.533, 615.751, 710.193, 662.288, 619.205, 661.016, 582.386, 582.968, 601.911, 653.29, 617.729, 651.414, 744.331, 714.654, 658.312, 804.903, 841.202, 925.207, 855.179, 1044.97, 895.128, 936.976, 1066.89, 1406.07, 2131.26, 4021.35, 1814.43, 1877.28, 1512.61, 1993.6, 1686.47, 1991.07, 3476.89, 4138.7, 2385.71, 3379.73, 2648.44, 2970.91, 4008.72, 4683.97, 3603.48, 4999.14, 4999.04, 4998.86, 2328.25, 4534.03, 4999.28, 4999.24, 4998.56, 4283.28, 4998.47, 4998.89, 4998.86, 2223.49, 4999.18, 2702.06, 4998.8, 4998.67, 4999.02, 4998.57, 4999.6, 4998.84, 4998.5, 4998.65, 2449.9, 2153.85, 2034.24, 1275.76, 1394.86, 2258.75, 4557.9, 4998.51, 4998.52, 4998.37, 4998.73, 4998.16, 4997.71, 4997.81, 4583.94, 4998.32, 4998.46, 4998.38, 4998.21, 4804.9, 4997.79, 4998.41, 4998.03, 4998.44, 4998.26, 4998.16, 4998.07, 4998.21, 4997.73, 4998.04, 4997.81, 4998.3, 4998.33, 4998.2, 4998.27, 4998.15, 4998.6, 4998.23, 4998.63, 4998.58, 4998.57, 4999.11, 4999.32, 4999.47, 4999.41, 4790.46, 4999.45, 4999.45, 4999.57, 4999.45, 4781.79, 4999.5, 4999.46, 2834.94, 2667.89, 4999.43, 4879.07, 4999.51, 4999.5, 4256.07, 4999.24, 3749.83, 3140.73, 2184.49, 3293.37, 4276.64, 4570.93, 4549.38, 4448.15, 4999.32, 4608.16, 4999.52, 4999.38, 4999.16, 4999.43, 4790.45, 4999.54, 4724.55, 4999.43, 4627.56, 4999.58, 4999.45, 4272.88, 4999.26, 4999.38, 4784.83, 4731.7, 4696.11, 4427.15, 4165.41, 4906.99, 4422.53, 3953.47, 3692.44, 4123.02, 4571.29, 4450.07, 4999.32, 4859.32, 4999.44, 4498.9, 4895.5, 4999.22, 4589.09, 4998.88, 4733.38, 4775.73, 4999.29, 4999.18, 4640.48, 4610.55, 4935.44, 4999.2, 4883.15, 4852.51, 4900.67, 4835.74, 4500.04, 4738.27, 4531.23, 4530.79, 4999.0, 4999.18, 3974.69, 4797.54, 4998.95, 4000.32, 3699.98, 3424.3, 4998.86, 4003.68, 4878.38, 4915.73, 4763.66, 4998.63, 4688.21, 4998.92, 4926.33, 3244.25, 4507.45, 4998.75, 4998.79, 4998.45, 3060.27, 2583.36, 2717.86, 2005.12, 4911.39, 4998.91, 4998.66, 4660.82, 4789.71, 4998.43, 4998.52, 4884.03, 4541.58, 4998.37]. The average return drops from 5000 down to 2000-3000 from time to time; is that a normal phenomenon in DDPG?
@lchenat The benchmark results were run over 25 million samples, to match the sample complexity used by the other algorithms. This should correspond to roughly 2500 epochs. A good approximation would be to extrapolate the performance of the last few epochs to the same number of samples, and compute the average return using all of these data. I have also observed that DDPG is sometimes unstable, even on cartpole, so what you're getting seems about right. One thing we didn't try was batch normalization, which we could not get working before the paper deadline; it could be a good thing to try. You can also try other reward scalings (e.g. 0.01), which might stabilize learning more.
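A small sketch of that approximation, assuming each epoch consumes a fixed number of samples (the 10000 samples per epoch below is an assumption; substitute whatever your `epoch_length` and batching imply), padding the missing epochs with the mean of the last few observed ones:

```python
# Extrapolate a partial DDPG run to the benchmark's sample budget and
# compute the overall average return, as suggested above.
SAMPLES_PER_EPOCH = 10000            # assumed; adjust to your configuration
TOTAL_SAMPLES = 25000000             # sample budget used in the benchmark
TOTAL_EPOCHS = TOTAL_SAMPLES // SAMPLES_PER_EPOCH  # roughly 2500 epochs


def extrapolated_average(avg_returns, tail=10):
    """Average the observed per-epoch returns, then treat the remaining
    epochs as if they matched the mean of the last `tail` observed epochs."""
    observed = list(avg_returns)
    tail_mean = sum(observed[-tail:]) / min(tail, len(observed))
    padded = observed + [tail_mean] * max(0, TOTAL_EPOCHS - len(observed))
    return sum(padded) / len(padded)


# Example: pass in the list of per-epoch average returns parsed from debug.log.
# print(extrapolated_average(returns))
```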
Hi, I recently tried to reproduce the experimental results in your paper, and I found that some of my results differ somewhat from those reported in the paper. Did you use the default parameters for all algorithms in your experiments?