Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable multiple revisions for project repos #4659

Draft
wants to merge 59 commits into
base: master
Choose a base branch
from

Conversation

marcodelapierre
Copy link
Member

@marcodelapierre marcodelapierre commented Jan 15, 2024

This PR adds support for handling local copies of multiple revisions of the same pipeline.

Key points:

  • Under NXF_ASSETS, each pipeline is now pulled as <org>/<repo>[:<revision>] ; :<revision> is only appended if the corresponding flag was used on CLI, -r/--revision
  • The key implementation change is adding the revision attribute to the AssetManager class, as both pipeline name and revision are now required to fully identify a pipeline
  • Support has been added and tested for all relevant CLI commands: run, pull, clone, drop, list, view, config, info, inspect, kuberun
  • Unit tests for AssetManagerTest have been updated

Caveats:

  • Jgit does not implement git worktree, so the original idea within Allow the concurrent run of multiple pipeline revisions #2870 could not be applied
  • depth = 1 (shallow clones) was investigated to reduce disk usage, but could not be implemented as it would not allow to checkout branches/tags/commits
  • In the current implementation, pulling without --revision and with --revision <default branch> create two duplicate pulls; this is not optimal, but with very limited known negative impact; [update] this only happens if the default branch is not declared in the manifest and differs from master
  • As a by-product, if a repo has a default branch different from master, pulling and running it is now possible without specifying the branch name explicitly

Closes #2870 .
Also indirectly addresses #3593

Copy link

netlify bot commented Jan 15, 2024

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 0b0b510
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/6650a089cade960008481ae8

@marcodelapierre marcodelapierre self-assigned this Jan 15, 2024
@marcodelapierre marcodelapierre changed the title Multiple revisions 1st: AssetManager, Cmd Clone Pull Run Enable multiple revisions for SCMs Jan 15, 2024
@marcodelapierre marcodelapierre marked this pull request as draft January 15, 2024 12:39
@marcodelapierre marcodelapierre changed the title Enable multiple revisions for SCMs Enable multiple revisions for project repos Jan 15, 2024
@bentsherman
Copy link
Member

You could save disk space by cloning with depth = 0 when a revision is specified

@marcodelapierre
Copy link
Member Author

marcodelapierre commented Jan 16, 2024

You could save disk space by cloning with depth = 0 when a revision is specified

Thanks @bentsherman , great idea, will do, for nextflow pull but not for clone. Why not for any case, including no revision specified? - I just have to check that a subsequent nextflow pull still works (for the purpose of updating the local copy).

@bentsherman
Copy link
Member

yes I suppose we could just clone with depth = 0 in all cases, since no revision just defaults to master. the clone, pull, and run commands all provide a -deep option for this, so users can change it if they want

@marcodelapierre
Copy link
Member Author

Ah! unfortunately we cannot implement the deep functionality, as with a depth of 1 we cannot checkout to other branches/tags/commits.
This would be supported : git clone -b <branch/tag> --deep <deep> <repo> , but only works for branches and tags, not commits. Git still does not support a way to clone to a specific commit straight away.

Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
… operation

Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
…f "master"

Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
@marcodelapierre marcodelapierre marked this pull request as ready for review January 18, 2024 08:11
@pditommaso
Copy link
Member

It's not common but happens. We have this same problem in the Platform as well.

This is why the original ticket was planning the checkout the repo as a bare repository, and then checkout the branch into a path named as the commit it associated the branch

@bentsherman
Copy link
Member

The problem is really with using a branch which can map to any number of revisions. It can be avoided by using a commit or tag.

@pditommaso
Copy link
Member

This does not fit the user expectation

@bentsherman
Copy link
Member

Maybe we should change their expectation. It is well known that you shouldn't use latest tag of a container if you care about reproducibility because it can change at any time. Why not promote the same best practice for the git repos -- "if you run pipelines in a shared environment, avoid using branches in favor of commits / tags because branches can be changed by other users"

Or we could append the commit hash to the directory name if a pipeline is checked out to a particular revision, i.e. .nextflow/assets/<org>-<project>-<branch>-<commit>

Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
@marcodelapierre
Copy link
Member Author

marcodelapierre commented Mar 21, 2024

I agree with Ben, we need to make expectations explicit, and remind that sharing the same NF repo collection between multiple people may result in not running the expected branch commit.

Even Paolo's solution with explicit commit id in the path does not solve the situation Here's the case that breaks it IMO.

  • Person A pulls a pipeline branch, commit 1, which is shared also by person B.
  • Days and commits later
  • Person A updates that branch, new implicit path created with commit 5
  • Days and commits later
  • Person B updates that branch, new implicit path created with commit 10

And now in this new situation, how do people get to use the version "they think they are using"? All of the implicitly requested commits are available, but how do you reliably pick one or the other for usage? I don't see an implicit way to achieve this.

The simplest way (for the implementation and also, crucially, for the users) is to advise them to use tags or commits if they really care about reproducibility.

So my suggestion is: let's keep this implementation, and add a paragraph to the docs to highlight these reproducibility caveats. See commit 8fcbf08

@marcodelapierre
Copy link
Member Author

These are actually relatively important caveats to clarify, because I have heard multiple instances of site (cluster/service) administrators willing to curate a collection of pipelines for their users, including multiple revisions.
In this case, the admin is the person that might change revisions under the hood to all users; in this case, though, it seems quite clear to me that it is up to the admin to communicate with their users about updates to the curated list of pipelines/revisions.

@marcodelapierre marcodelapierre requested a review from a team as a code owner April 3, 2024 14:39
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
…eproducibility

Signed-off-by: Dr Marco De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
Signed-off-by: Dr Marco Claudio De La Pierre <[email protected]>
@marcodelapierre
Copy link
Member Author

I have been studying and making experiments using a local bare repository, as a local intermediate between the remote repo and the various checked out revisions.

In my first round of attempts, I was trying to use the local bare repo as the source of truth for any local checkouts, in substitution for the remote repo. Essentially, the idea was:

  • clone the bare from the remote, only update when needed
  • otherwise, for most other operations use the local bare as the source of truth

This does not work at present, as the general assumption in AssetManager methods is that the "source of truth" has everything available, i.e. both repo information and file contents. However a bare repo lacks the latter.

Hence, I started an attempt of targeted usage of either bare or remote, depending on the required information. This not only is confusing, but currently fails in my tests, again because of assumptions that cross each other around the existence and contents of the various types of repos.

At the end of today:

To bring the effort to a more manageable and productive ground, I have decided to move forward with a radically simplified approach:

  • keep using the remote repo as much as possible
  • only use the local bare repository to map requested revisions into commit IDs

This approach, if workable, tackles the main goal of this round of PR revision: to locally use commits to store and run pipelines, to avoid clashes when updating repositories.

@marcodelapierre
Copy link
Member Author

marcodelapierre commented May 17, 2024

I have started to add localBarePath. Scope is:

  • to dereference branches/tags into commit IDs
  • to source the local git config file

Ok, I have fixed the chicken and egg at AssetManager instantiation, between defining localPath and devining the repository provider. Using localBarePath has been instrumental.

Next:

  • actual dereferentiation to commits
  • effective update of bare repo, on demand and transparent (i.e. only when download or run with latest)

To this end I have left some intermediate notes in the codebase, AssetManager.groovy.

@marcodelapierre
Copy link
Member Author

@pditommaso Any tips on how to get a commitID for a branch or tag, I am a taker.
Otherwise, I will go into it next week.

@marcodelapierre marcodelapierre marked this pull request as draft May 17, 2024 08:24
@pditommaso
Copy link
Member

Not sure how much I can help this week. @bentsherman you want to have look at this

@marcodelapierre
Copy link
Member Author

thanks Paolo. did good progress myself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants