Skip to content

Fix additional issues with atomic replication#7711

Draft
dralley wants to merge 3 commits into
pulp:mainfrom
dralley:atomic-replication
Draft

Fix additional issues with atomic replication#7711
dralley wants to merge 3 commits into
pulp:mainfrom
dralley:atomic-replication

Conversation

@dralley
Copy link
Copy Markdown
Contributor

@dralley dralley commented May 13, 2026

In situations where an installation was replicating >1 external Pulp servers, the result was cascading failures as the replications would race and conflict with each other.

📜 Checklist

  • Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
  • A changelog entry or entries has been added for any significant changes
  • Follows the Pulp policy on AI Usage
  • (For new features) - User documentation and test coverage has been added

See: Pull Request Walkthrough

dralley added 2 commits May 13, 2026 00:28
Locking didn't prevent two UpstreamPulps with overlapping Distributions
from racing and clobbering each other.

Assisted-By: claude-opus-4.6
re pulp#7614
Atomic distribution is now "atomic" distribution. All successful updates
happen simultaneously, rather than all updates happen simultaneously.

last_replication_timestamp is not updated if there are errors.

Assisted-By: claude-opus-4.6
re pulp#7614
With policy ALL, multiple upstreams will very likely conflict in
problematic ways.

Assisted-By
closes pulp#7614
@dralley dralley force-pushed the atomic-replication branch from e157ff7 to 37f4ba4 Compare May 13, 2026 04:57
@dralley
Copy link
Copy Markdown
Contributor Author

dralley commented May 13, 2026

I recommend looking at the commits individually, and probably in 1, 3, 2 order

Comment thread pulpcore/app/replica.py


def distros_lock_uri(domain_id):
return f"pdrn:{domain_id}:distributions"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this was never an uri to begin with? This function is solely used for locking, the domains distribution url space?
Is it the same lock used for updating distributions directly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that similar lock strings were being created in multiple places, so I factored it into a helper, but the value should actually be the same - I believe self.domain.pulp_id == server.pulp_domain_id in this context.

Comment on lines +136 to +140
raise serializers.ValidationError(
"Multiple upstream-pulp configurations exist in this domain. "
"Set the 'policy' to 'labeled' on all of them to avoid conflicts, "
"or remove the extras."
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rather opinionated. Are we sure we can take that "feature" away?

Are there possible usecases for having two conflicting upstream configurations in one domain?

  • Something like Blue-Green upstream where you sync them alternatingly.
  • Having a hot spare upstream in case the main one is out of reach.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure, I will ask them in our call today.

Copy link
Copy Markdown
Contributor Author

@dralley dralley May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like having multiple potentially overlapping Upstreams replicating into the same domain is probably a very bad idea. I don't see an obvious use case for it. With LABELED it would be allowed but at least you would get a better error in cases where they overlap instead of just wreaking havoc

edit: actually there is no error, they are skipped silently by the 2nd replication. That's probably not ideal.

@gerrod3 should we be raising an error if a 2nd UpstreamPulp tries to touch objects created by a different UpstreamPulp with LABELED policy?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, my understanding was that the second upstream would become the new owner of that object as they replace it. This behavior might not be expected for users of the labeled policy. If we are changing the upstream existence policy then we probably should also change this overlap behavior for labeled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description of "labeled" is

Replication only manages objects that it created in a previous run. Objects created by replication
are tagged with an UpstreamPulp label linking them to the specific Upstream Pulp object. Manually
created local objects with the same content types are left untouched, even if they share names with
upstream distributions. If a replicated object has its UpstreamPulp label removed, replication
will no longer manage or delete it.

Which implies no replacing

failed_tasks = task_group.tasks.exclude(pk=task.pk).exclude(state=TASK_STATES.COMPLETED)
has_failures = failed_tasks.exists()

# Best-effort: update all distributions we can, even if some subtasks failed.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that the antithesis to atomicity?
I thought the idea is the domain (maybe scoped by labels...) looks exactly like upstream today, or nothing changes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main idea was just that successful updates should be synchronized rather than randomly applied over the course of an hour+. That was the primary original complaint

@dralley dralley marked this pull request as draft May 13, 2026 18:02
@dralley dralley marked this pull request as draft May 13, 2026 18:02
@dralley dralley requested review from mdellweg and removed request for mdellweg May 13, 2026 18:02
@dralley
Copy link
Copy Markdown
Contributor Author

dralley commented May 13, 2026

After discussion, commits 2 and 3 are likely not suitable for merge in their current state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants