
Add checkpoint monitoring class #75

Closed
wil3 opened this issue Jun 14, 2020 · 11 comments · Fixed by #77
Labels
enhancement New feature or request

Comments

@wil3
Owner

wil3 commented Jun 14, 2020

Is your feature request related to a problem? Please describe.
RL training produces checkpoints; however, the examples do not include their evaluation.

Describe the solution you'd like
The thesis work used a checkpoint monitor to evaluate new checkpoints as they were created.

Describe alternatives you've considered
We could alternatively do this on demand, but we'll leave that to a new issue/PR.

Additional context
This would be one of several new features to support RL training and evaluation.

@wil3 wil3 added the enhancement label Jun 14, 2020
wil3 added a commit that referenced this issue Jun 14, 2020
wil3 added a commit that referenced this issue Jun 17, 2020
Closes: #75. Disable gravity to match thesis results. Fix bug where
RNG seed was ignored.
@wil3 wil3 mentioned this issue Jun 17, 2020
@wil3 wil3 closed this as completed in #77 Jun 17, 2020
@xabierolaz

Will test this tomorrow and see if I can gradually add my own landing and takeoff scripts.

@wil3
Owner Author

wil3 commented Jun 17, 2020

@xabierolaz this has been merged into master; I was just referencing the changes. If you are looking to merge navigation support into GymFC, it would be a good idea to open a feature request issue so others know you are working on it and so the approach can be discussed. #76 was recently added, however I have no idea if anything will come of it; if multiple people were working on it, it would be completed much faster.

@xabierolaz

@wil3, I have opened a new issue, #79, and mentioned #76 in case we can complete it faster.

@varunag18

varunag18 commented Jun 23, 2020

Hi Wil, I have been going through the two scripts ppo_baselines_train.py and tf_checkpoint_evaluate.py. As per my understanding, ppo_baselines_train.py is the training script which saves the trained model in the checkpoints folder, while tf_checkpoint_evaluate.py is the script that retrieves the contents of the checkpoints folder (.meta file) and creates the trial.csv files.
If my understanding is correct, then there are a couple of points I need clarification on:

1. Why do we have the below code in tf_checkpoint_evaluate.py:

       ac = pi.action(ob, env.sim_time, env.angular_rate_sp, env.imu_angular_velocity_rpy)
       ob, reward, done, _ = env.step(ac)

   It invokes the step function of the GymFC environment, and it is this data that is getting logged. Should we not be retrieving the data from the checkpoints folder for storing in the .csv files?
2. What is the need to run tf_checkpoint_evaluate.py in parallel with ppo_baselines_train.py? Can we not run tf_checkpoint_evaluate.py after the training has completed?
3. tf_checkpoint_evaluate.py is running in an infinite loop for now because of the below code in monitor.py:

       while self.watching:
           self._check_new_checkpoint()
           time.sleep(10)

   I believe there should be a condition to terminate after waiting for a given time duration.

@wil3
Owner Author

wil3 commented Jun 23, 2020

Hi @varunag18 ,

1. Which part of the code do you have a question about? In the first line we use the policy interface so that we have a common interface for testing and benchmarking different policies. You can have a look at the currently available ones here. A policy must have an action and reset function defined. If you look at that policy, it is loading data from the checkpoint folder. The csv file is just a log of the episode; if you use the plotting tools you'll see how it's used. The second line is the Gym interface, which is the same for all OpenAI gyms.
2. There's nothing saying you need to run them in parallel; it's not a requirement, just a suggestion, which is why they are separate processes. Running them in parallel is recommended so you can see the performance in semi real-time and determine whether you need to kill the job, and it's more efficient than running them sequentially. In most cases you want to be training a large number of agents to get the best one.
3. Yep, this is a bug, feel free to submit a bug issue to track it. The monitor callback should probably return true to continue and false otherwise, and the return value here should be assigned to self.watching (see the sketch below).
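
For illustration only, here is a minimal sketch of that fix, assuming a simplified CheckpointMonitor rather than the actual monitor.py: the user-supplied callback returns True to keep watching and False to stop, and its return value drives self.watching.

    import time

    class CheckpointMonitor:
        """Illustrative sketch only; names and structure do not mirror monitor.py."""

        def __init__(self, checkpoint_dir, callback, poll_interval=10):
            self.checkpoint_dir = checkpoint_dir
            self.callback = callback            # called with each new checkpoint path
            self.poll_interval = poll_interval
            self.watching = True

        def watch(self):
            while self.watching:
                for ckpt in self._new_checkpoints():
                    # The callback's return value now controls the loop, so the
                    # evaluation code can terminate the monitor cleanly.
                    self.watching = self.callback(ckpt)
                    if not self.watching:
                        break
                time.sleep(self.poll_interval)

        def _new_checkpoints(self):
            # Placeholder for the directory-scanning logic; returns paths of
            # checkpoints that have not been seen before.
            return []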

@varunag18

varunag18 commented Jun 24, 2020

1. In the first line we use the policy interface so that we have a common interface for testing and benchmarking different policies. You can have a look at the currently available ones here. A policy must have an action and reset function defined. If you look at that policy, it is loading data from the checkpoint folder. The csv file is just a log of the episode; if you use the plotting tools you'll see how it's used. The second line is the Gym interface, which is the same for all OpenAI gyms.

I think I did not understand it clearly then; your response makes it clear to me now. The training script invokes the MlpPolicy class for training the NN, while the evaluation script invokes the PpoBaselinesPolicy class to get the trained data from the checkpoints folder. Am I correct now?

2. In most cases you want to be training a large number of agents to get the best one.

Where exactly are we setting the number of agents?

3. Yep, this is a bug, feel free to submit a bug issue to track it.

Will do this for sure.

@wil3
Owner Author

wil3 commented Jun 24, 2020

Yes, that is correct. Checkpoints are just data files containing the neural network and other functions used during training. Training is specific to the RL algorithm; however, as long as the algorithm produces a TensorFlow graph, we can do the evaluation independent of the training library, using just the TensorFlow checkpoint. This is ideal because it is more scalable. All you need to know is the input and output tensor names to extract the NN subgraph.
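
As a rough sketch of that idea (not the actual tf_checkpoint_evaluate.py), the TF 1.x API can restore the graph from the .meta file and run the policy subgraph by tensor name. The checkpoint path, tensor names, and observation size below are placeholders; substitute the actual values from your graph.

    import numpy as np
    import tensorflow as tf  # TF 1.x API, as used with the baselines-era tooling

    checkpoint = "checkpoints/model.ckpt-2500000"  # hypothetical path

    with tf.Session() as sess:
        saver = tf.train.import_meta_graph(checkpoint + ".meta")
        saver.restore(sess, checkpoint)
        graph = tf.get_default_graph()

        ob_input = graph.get_tensor_by_name("ob:0")           # hypothetical input tensor name
        action_out = graph.get_tensor_by_name("pi/action:0")  # hypothetical output tensor name

        obs_dim = 7  # placeholder; match your environment's observation size
        ob = np.zeros((1, obs_dim), dtype=np.float32)         # dummy observation
        action = sess.run(action_out, feed_dict={ob_input: ob})
        print(action)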

Where exactly are we setting the number of agents?

You don't. When you execute the PPO trainer it trains a single agent. To train more than one, just execute that script N times; the number N depends on your research goals. Wrap the python call in a bash loop script if you want to automate it (a Python equivalent is sketched below).
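
In place of the bash loop, here is a hedged Python sketch of the same idea using subprocess; the script path and flags are placeholders and must be matched to ppo_baselines_train.py's actual command line.

    import subprocess

    N = 5  # number of agents to train; depends on your research goals
    for i in range(N):
        # Launch one independent training run per agent. The flags below are
        # hypothetical; adjust them to the trainer's real CLI.
        subprocess.run(
            ["python", "ppo_baselines_train.py",
             "--seed", str(i),
             "--checkpoint-dir", f"checkpoints/agent_{i}"],
            check=True,
        )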

@varunag18

varunag18 commented Jul 26, 2020

Hi Wil,
I was wondering if it's possible to view the voltage and current values of the motors, as mentioned in Section VI C of your thesis, from which the energy consumed at a given step can be calculated. In fc_env.py, within the _step_sim() method, if I print self.state_message, I get the esc_current and esc_voltage values as 0 in the output. Please guide me on how to add a feature to monitor the energy consumption at each step.

@varunag18

Another question: how do we zero in on the best checkpoint? What are the criteria for doing so? In your thesis you write, "Once training was complete, we select the checkpoint that provided the most stable step responses, which occurred after 2,500,000 steps to use as our flight controller policy." Please elaborate on this.

@xabierolaz

xabierolaz commented Jul 27, 2020

@varunag18
I think he's referring to the fact that we get a checkpoint every 100,000 steps and the model starts to converge (MAE gets closer to 0) after 2 million steps, so picking a checkpoint with the lowest available error would be the best choice.

@wil3
Owner Author

wil3 commented Jul 28, 2020

Hi @varunag18, this issue is currently closed. In the future please open a new issue if you have a new question.

Which chapter are you referring to? ESC voltage and current are baked into the framework (see the message type here), but there currently isn't a model for them; this is not something I explored for my thesis work.

If this is something you are looking to support, you can fork the aircraft-plugin repo, add the model, and pass the value back here.
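
For reference, once such a model reports nonzero values, per-step energy is just electrical power (voltage times current) integrated over the step time. A back-of-the-envelope sketch, not GymFC code; the step time and logged values are placeholders:

    import numpy as np

    dt = 0.001                           # simulation step time in seconds (placeholder)
    esc_voltage = np.full(1000, 11.1)    # hypothetical per-step ESC voltage [V]
    esc_current = np.full(1000, 2.5)     # hypothetical per-step ESC current [A]

    power = esc_voltage * esc_current    # instantaneous electrical power per step [W]
    step_energy = power * dt             # energy consumed at each step [J]
    total_energy = step_energy.sum()     # energy over the whole episode [J]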

As @xabierolaz points out, if you plot the MAE or other error metrics from validation over the course of training, they will begin to converge after a couple million steps. If the reward function were perfect, we'd probably select the longest-trained checkpoint with the highest reward. Unfortunately it isn't perfect, so after convergence it usually takes looking at a number of step-response plots and selecting the agent that produces the best step responses in terms of minimizing error and oscillations. Then, once you think you have a good one, try it out on the drone and confirm there are no visible oscillations.
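
As an illustration of that first, quantitative pass (the step-response plots and the flight test remain the deciding factors), here is a hedged sketch that ranks evaluation trials by the MAE between the angular-rate setpoint and the measured rate. The CSV path pattern and column names are placeholders; match them to the columns tf_checkpoint_evaluate.py actually writes.

    import glob
    import numpy as np
    import pandas as pd

    def trial_mae(csv_path):
        # Mean absolute error between setpoint and measured angular rates
        # for one evaluation episode. Column names are hypothetical.
        df = pd.read_csv(csv_path)
        setpoint = df[["roll_sp", "pitch_sp", "yaw_sp"]].to_numpy()
        measured = df[["roll_rate", "pitch_rate", "yaw_rate"]].to_numpy()
        return np.mean(np.abs(setpoint - measured))

    # Hypothetical layout: one trial CSV per evaluated checkpoint.
    scores = {path: trial_mae(path) for path in glob.glob("checkpoints/*/trial*.csv")}
    best = min(scores, key=scores.get)
    print("Lowest-MAE trial:", best, scores[best])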
