# AEP 008: Proper caching policy for a `CalcJob` with a `RemoteData` node as input or output

| AEP number | 008 |
|------------|--------------------------------------------------------------|
| Title      | Proper caching policy for a `CalcJob` with a `RemoteData` node as input or output |
| Authors | [Jusong Yu](mailto:[email protected]) (unkcpz) |
| Champions | |
| Type | S - Standard Track AEP |
| Created | 28-June-2022 |
| Status | WIP |

## Background
There are currently two issues when a `CalcJob` has a `RemoteData` node as its output and that `CalcJob` is subsequently used as a caching source.
The first problem is that the `RemoteData` node is only shallow-copied: a new `RemoteData` node is created for the cached calculation, but it points at exactly the same remote folder on the remote machine.
As a result, running `clean_workdir` on the cached calcjob also cleans the remote folder of the original node, which can then no longer serve as a caching source for subsequent calculations; this is not the expected behaviour.
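
As a concrete illustration of this side effect (a minimal sketch; the pks are hypothetical and would be the original `RemoteData` and its cached clone):

```python
from aiida.orm import load_node

# Hypothetical pks: r1 is the RemoteData created by the original CalcJob,
# r2 is the one attached to the cache-cloned CalcJob node.
r1 = load_node(1234)
r2 = load_node(5678)

# The nodes are distinct ...
assert r1.pk != r2.pk
# ... but they point at exactly the same directory on the remote machine,
assert r1.get_remote_path() == r2.get_remote_path()

# so cleaning the work directory of either calculation (as done by
# `verdi calcjob cleanworkdir`) also removes the files that the other one,
# and any future cache hit, relies on.
```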

The second problem is that the hashes of the `RemoteData` nodes generated by two identical but separate runs of a `CalcJob` are different, so a subsequent calculation that takes the `RemoteData` as input is not properly cached.
This is illustrated in the diagram below (copied from https://github.com/aiidateam/aiida-core/issues/5178#issuecomment-996536222):
![caching_problem](https://user-images.githubusercontent.com/6992332/146514431-c9634668-6a0d-43ca-8829-4a3a69c16d27.png)

We have `W1` that launches a `PwCalculation` (`Pw1`) which creates a `RemoteData` (`R1`), which is used as input for a `PhCalculation` (`Ph1`).
Another `PwCalculation` (`Pw2`) is run outside of a workchain with the same input `D1`. The hash of `Pw1` and `Pw2` are identical, but the hashes of their `RemoteData`, `R1` and `R2` are different. Now the user launches a new workchain `W1'` which uses the exact same inputs as `W1`.
The `PwCalculation` can now be cached from both `Pw1` and `Pw2` since their hashes are identical. Let's say that `Pw2` is chosen (by chance). This produces `RemoteData` (`R2'`) which has the same hash as `R2` since it is a clone.
Now the workchain moves on to running the `PhCalculation`, but it won't find a cache source, because no `PhCalculation` has been run yet with `R2` as an input.
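
The mismatch can be checked directly from a `verdi shell` session (a minimal sketch; the identifiers are hypothetical, and the 1.x-style `get_hash()` accessor is assumed):

```python
from aiida.orm import load_node

# Hypothetical identifiers for the two equivalent PwCalculation runs.
pw1 = load_node('<uuid-of-Pw1>')
pw2 = load_node('<uuid-of-Pw2>')
r1, r2 = pw1.outputs.remote_folder, pw2.outputs.remote_folder

# The calculations hash identically, so either one is a valid cache source ...
assert pw1.get_hash() == pw2.get_hash()

# ... but their RemoteData outputs do not, because the RemoteData hash
# currently includes run-specific data such as the remote path, so the
# downstream PhCalculation cannot be matched against Ph1.
assert r1.get_hash() != r2.get_hash()
```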

Moreover, additional complexity comes into play when we consider a chain of calculations that are all cached.
The integrity of the actual data stored on the remote machine behind a `RemoteData` node is not straightforward to check; if the remote folder has been modified, the node should be disqualified as a caching source.
Such a check requires a transport, i.e. a connection to the remote computer, and an inspection of all remote files, so it would become a bottleneck when computing hashes.
It is hard to find a perfect solution for this last point; we can probably only check the folder hierarchy rather than the contents of all files, as sketched below.
This AEP only aims to solve the first two issues and leaves the last one open for discussion.
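
To make the open question concrete, a hierarchy-only check could look roughly like the following sketch (the helper `hash_remote_hierarchy` is purely hypothetical; only `listdir`/`isdir` from the existing transport interface are assumed):

```python
import hashlib


def hash_remote_hierarchy(transport, path):
    """Hash only the names in the directory tree under ``path``.

    This keeps the integrity check cheap (a single transport connection and
    no file downloads), at the price of not detecting in-place modifications
    of file contents.
    """
    digest = hashlib.sha256()
    for name in sorted(transport.listdir(path)):
        digest.update(name.encode())
        full = f'{path}/{name}'
        if transport.isdir(full):
            digest.update(hash_remote_hierarchy(transport, full).encode())
    return digest.hexdigest()


# Possible usage, given a RemoteData node ``remote``:
# with remote.get_authinfo().get_transport() as transport:
#     print(hash_remote_hierarchy(transport, remote.get_remote_path()))
```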

In summary, the overall goal of this AEP is a new caching policy under which cleaning or modifying a caching source has no side effects on the cloned calculations and nodes.

## Proposed Enhancement
The three issues mentioned above are related to each other.
One advantage of the current caching policy is that, when we run a chain of calculations and the remote folders are intact, a second run with caching enabled will pick up all the cached sources and spend essentially zero compute resources to finish the whole calculation chain.
This is because, when the `RemoteData` is the input of a calculation process, it has the same hash as its source, which allows the subsequent calculation to be taken from the cache.
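
For reference, this behaviour relies on caching being switched on for the calculation class, e.g. with the standard aiida-core controls (a sketch; `PwCalculation` and `inputs` are assumed to be defined elsewhere):

```python
from aiida.engine import run
from aiida.manage.caching import enable_caching

# With caching enabled for its entry point, a rerun whose hash matches an
# existing calculation node is cloned from the cache instead of being
# submitted to the remote computer.
with enable_caching(identifier='aiida.calculations:quantumespresso.pw'):
    run(PwCalculation, **inputs)
```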

The goal of this proposal is a proper caching policy for calculations that have a `RemoteData` node as input or output.
For the shallow copy of `RemoteData`, I propose that when a `CalcJob` node with a `RemoteData` output is cloned, the cloned `RemoteData` actually opens a connection to the remote machine and copies the whole remote folder to a new remote directory.
However, with the current hashing method for `RemoteData`, the cloned `RemoteData` would have a different hash from its source, making it a new node that has never been the input of any other calculation in the chain.
For the hashing of `RemoteData` nodes, I therefore propose that, since in a production environment a `RemoteData` can only be generated by a calculation process, its hash be computed from the hash of that creating calculation process.
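
A minimal sketch of what such a hashing rule could look like is shown below (the subclass and the `_get_objects_to_hash` hook are illustrative only; in practice the change would go into `RemoteData` itself and the exact extension point depends on the aiida-core version):

```python
from aiida.orm import RemoteData


class CreatorHashedRemoteData(RemoteData):
    """Sketch of a RemoteData whose hash derives from its creator.

    Two RemoteData nodes produced by calculations with identical hashes
    would then hash identically themselves, independently of the actual
    remote path they point to.
    """

    def _get_objects_to_hash(self):
        creator = self.creator
        if creator is not None:
            # Base the hash only on the creating calculation's hash, not on
            # run-specific attributes such as the remote path or computer.
            return [creator.get_hash()]
        # Fall back to the current behaviour for manually created RemoteData.
        return super()._get_objects_to_hash()
```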

### The drawbacks of proposed enhancement
There is still the problem that, when more than one calculation process node qualifies as a caching source, one of them is picked at random and cloned.
(TBD)

## Detailed Explanation