Restarting training job doesn't visualize earlier log files #6782
Comments
How are you running TensorBoard in this situation? Are the log files created in the same directory? Are all steps recomputed in the new file, or are only newer steps added? |
Yes, the logs are created in the same directory. So I have the first log file and then a second log file that is written when the job restarts after the Google spot preemption. |
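To see which event files a restart has left in the directory, a small helper along these lines may be useful. It is only a sketch: the function name `list_event_files` is hypothetical, and the file name pattern (`events.out.tfevents.<unix_seconds>.` followed by further fields) is an assumption based on the file name quoted in this thread.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# Assumed naming pattern, inferred from the file name quoted in this thread:
# events.out.tfevents.<unix_seconds>.<remaining fields>
EVENT_FILE_RE = re.compile(r"events\.out\.tfevents\.(\d+)\.")


def list_event_files(logdir):
    """Return (start_time, path) pairs for event files directly in logdir, oldest first."""
    found = []
    for path in Path(logdir).iterdir():
        match = EVENT_FILE_RE.match(path.name)
        if match and path.is_file():
            # The first numeric field is treated as the creation time in seconds.
            started = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            found.append((started, path))
    return sorted(found)
```

Calling `list_event_files` on the run directory should return one entry per job start, which makes it easy to compare against what TensorBoard displays after it is restarted.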
Have you tried specifying --reload_multifile? |
Yes I am using both |
Can you please provide the output of |
With that command run over the log directory where I found the issue (there are 4 log files there from the same run):
======================================================================
Processing event files... (this can take a few minutes)
======================================================================
Traceback (most recent call last):
File "/usr/local/bin/tensorboard", line 8, in <module>
sys.exit(run_main())
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/main.py", line 41, in run_main
app.run(tensorboard.main, flags_parser=tensorboard.configure)
File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/usr/local/lib/python3.11/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 278, in main
return runner(self.flags) or 0
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/program.py", line 291, in _run_serve_subcommand
efi.inspect(flags.logdir, event_file, flags.tag)
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 451, in inspect
inspection_units = get_inspection_units(logdir, event_file, tag)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 405, in get_inspection_units
field_to_obs=get_field_to_observations_map(generator, tag),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_inspector.py", line 212, in get_field_to_observations_map
for event in generator:
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 244, in Load
for record in super().Load():
File "/usr/local/lib/python3.11/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 178, in Load
yield next(self._iterator)
^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.NotFoundError: /logs/my_job10/log/tensorboard/events.out.tfevents.1710242212.mymodel.74.0; No such file or directory And
|
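The `NotFoundError` above names a file that the directory listing apparently returned but that could not be opened. A rough way to check for that mismatch on the mounted path is sketched below. It uses Python's built-in file API rather than the TensorFlow reader shown in the traceback, so it is only an approximation, and the function name `check_event_files_readable` is hypothetical.

```python
import os


def check_event_files_readable(logdir):
    """Try to open every listed event file; return {path: None or error text}."""
    results = {}
    for name in sorted(os.listdir(logdir)):
        if "tfevents" not in name:
            continue
        path = os.path.join(logdir, name)
        try:
            # Reading a single byte is enough to show the file can be opened.
            with open(path, "rb") as handle:
                handle.read(1)
            results[path] = None
        except OSError as exc:
            # The entry was listed but could not be opened or read.
            results[path] = repr(exc)
    return results
```

Any entry whose value is not `None` was listed by the filesystem but could not be read, which would point at the mount rather than at TensorBoard.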
The output of
followed by event statistics. Can you please attempt a rerun of |
Would you mind running with |
It doesn't solve it. It seems to me that it always restarts the scalar graphs and the images from the last log file only. |
Are you able to copy the log directory locally and confirm that things work as expected? I would like to rule out whether this is something internal to TB or isolated to just reading from GCS. |
I've copied the same folder locally and it seems to work correctly. |
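For anyone repeating this comparison, the local copy can be scripted. This is a minimal sketch using only the standard library; the function name `copy_logdir_locally` and the `tb_logs_` prefix are hypothetical, and it assumes the mounted log directory is readable as an ordinary path.

```python
import shutil
import tempfile
from pathlib import Path


def copy_logdir_locally(mounted_logdir, local_root=None):
    """Copy mounted_logdir into a local directory and return the local path."""
    # Fall back to a fresh temporary directory when no destination is given.
    local_root = Path(local_root or tempfile.mkdtemp(prefix="tb_logs_"))
    target = local_root / Path(mounted_logdir).name
    # dirs_exist_ok lets the copy be repeated into the same target (Python 3.8+).
    shutil.copytree(mounted_logdir, target, dirs_exist_ok=True)
    return target
```

Pointing `tensorboard --logdir` at the returned path and restarting it then shows whether all event files load when the mount is out of the picture.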
Do things work as expected when pointing to GCS directly instead of using the locally mounted path? |
Same bucket passing |
Hmm, this seems like it may potentially be an issue around the mounted path if all other avenues are working as expected. One additional thought: TB has some special filesystem handling that is enabled when fsspec is installed. Can you try installing |
Ok I will try. See: |
TB itself (when using the python filesystem readers) relies on TensorFlow's filesystem API via When using RustBoard for file reading (which is the default, you can try disabling with |
@songjiaxun Any opinion about this on the CSI gcsfuse POSIX limitations? |
@groszewn I've isolated the bug. It happens when I use The first one in the list is always loaded correctly after the TB restart.
Now the problem is that on the bucket it is common to have a lot of subfolders/runs with TB log files, and I cannot load all the logs just by specifying the main root. |
Also, because of the nature of GCS FUSE, we are constrained to use |
I do not believe the team currently has bandwidth to add Unfortunately usage of |
Using symlinks is really going to be a problem, as they are not supported in GCS FUSE, and GCS FUSE is also going to write a substitute file to the GCS bucket. As we are trying to support an official Google Cloud product with TB, can you investigate internally? It seems we don't have any other workaround. |
Also, this ticket is still open, with this last comment asking for feedback: Do we need to open a new one? |
Yes, please open a new ticket with your feature request (which seems to be support for |
Is TB still actively developed? Or is it mainly in maintenance mode? I see relatively low commit/PR activity. |
In any case,
I still don't understand what filtering options we have on filesystems that don't support symlinks, like Google gcs-csi-fuse. |
I have a training job on a spot instance and a TB log on GCS-CSI FUSE.
As it is a spot instance, when the job restarts it creates a new log file.
If I don't restart TB the curves are OK, but when I restart the TB service it reads only the last log file.
Is there a way to work around this?