Background
Our Hobot pipeline is long, with four stages:

low-level -> expert -> agent -> deployment

Each stage can be called a 'task'. For any task, the tasks to its left are its 'upstream' tasks. Generally, a task needs the config files and model weights of its upstream tasks for both training and deployment (deployment additionally requires the task's own configs and model weights).
A desirable scenario
In a brief discussion with @Haichao-Zhang on Friday, we agreed that in order to easily manage and use the training results of upstream tasks, two important properties are desired:
- The model weights are always stored as one ckpt file, regardless of where the task is in the pipeline. For example, an agent's ckpt contains the model weights for low-level, expert, and itself.
- We only need to look at one stage to get all needed configurations. For example, agent training or deployment needs only the expert's job dir, not the low-level's. Similarly, deployment needs only the agent's job dir.
These two properties simplify ckpt and conf management, because we don't want multiple training dirs passed to a downstream task.
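The first property can be sketched with plain pickled dicts. This is only an illustration under assumed names (a hypothetical `save_bundled_ckpt` helper, not ALF's actual ckpt format): each stage's ckpt nests the full upstream ckpt, so a single file always carries the weights of every earlier stage.

```python
import pickle


def save_bundled_ckpt(path, own_state, upstream_ckpt_path=None):
    """Save one ckpt that also bundles all upstream weights.

    Because the upstream ckpt itself already bundles *its* upstream
    weights, nesting it once carries every earlier stage transitively.
    """
    ckpt = {"own": own_state}
    if upstream_ckpt_path is not None:
        with open(upstream_ckpt_path, "rb") as f:
            # Nested dict holding the weights of all earlier stages.
            ckpt["upstream"] = pickle.load(f)
    with open(path, "wb") as f:
        pickle.dump(ckpt, f)
```

For example, an agent ckpt produced this way looks like `{"own": agent_weights, "upstream": {"own": expert_weights, "upstream": {"own": low_level_weights}}}`, so downstream code never needs more than one file.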
Solution
For model weights, it's straightforward to store everything as one ckpt. Conf management, however, is a little trickier. Below is a simple hack for that purpose.
```python
import os
import pathlib


def save_upstream_confs(upstream_task_root_dir: str):
    """When training the current task B, copy all upstream task (C, D, ...) confs
    to './.upstream_confs', and then add them to ``_CONF_FILES``.

    This makes them further copied to 'config_files' under the TB directory of B
    when ALF later writes the config. So when one wants to use the ckpt of B for
    a new downstream task A, they don't need the trained dirs of C, D, ...,
    because their conf files have been included in B.

    To use any cached upstream conf ``x_conf.py``, one only needs to do

    .. code-block:: python

        alf.import_config('./.upstream_confs/x_conf.py')

    This also works if ``x_conf.py`` itself imports some upstream conf
    ``y_conf.py``, provided that inside ``x_conf.py`` it's written as

    .. code-block:: python

        alf.import_config('./.upstream_confs/y_conf.py')

    A general template for saving/using upstream confs:

    .. code-block:: python

        if is_training:
            save_upstream_confs(upstream_task_root_dir)
        # import conf files of the current task
        alf.import_config('x_conf.py')
        alf.import_config('y_conf.py')
        # import conf files of upstream tasks
        alf.import_config('./.upstream_confs/z_conf.py')

    Args:
        upstream_task_root_dir: the root dir of the upstream task
    """
    root_dir = upstream_task_root_dir
    dst = pathlib.Path(__file__).parent / ".upstream_confs"
    os.system(f"mkdir -p {dst}")
    # Copy the upstream task's config files, along with its own cached
    # upstream conf files if they exist.
    if os.path.isdir(f"{root_dir}/config_files/.upstream_confs"):
        os.system(f"cp -r {root_dir}/config_files/.upstream_confs {dst}")
    os.system(f"cp {root_dir}/config_files/*.py {dst}")
    # Use rglob instead of glob.glob('**/*.py', recursive=True): glob's '**'
    # does not descend into the hidden '.upstream_confs' subdirectory.
    for f in dst.rglob("*.py"):
        _add_conf_file(str(f))
```

Generally, we copy all files under `config_files` of an upstream `root_dir`, recursively, to the directory of the current conf file, under a special dir called `.upstream_confs`. We then recursively add all files in this special dir to ALF's `_CONF_FILES`, which ALF copies to `config_files` of the training root dir after one training iteration of the current task.
This satisfies the second property, provided that any conf file `x_conf.py` of the current task imports a conf file `y_conf.py` of the immediate upstream task via

```python
alf.import_config('./.upstream_confs/y_conf.py')
```

This works for both the training and deployment modes of the task. We only call `save_upstream_confs` in the training mode:

```python
if is_training:
    save_upstream_confs(upstream_task_root_dir)
```