Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(judgements): Fix/manual judgment concurrent execution #4410

Merged

Conversation

kkotula
Copy link
Contributor

@kkotula kkotula commented Mar 9, 2023

This PR aims to address the issue of concurrent changes to execution stages. We have discovered a rather intriguing bug that arises from concurrent read/writes of stage data to storage. This bug is not limited to a particular storage solution, as it occurs with both Redis and SQL alternatives. Therefore, the solution must be implemented on the application code level.

Here's a brief summary of what's happening. Both CompoundExecutionOperator and RunTaskHandler read and write stage data to storage. However, if one thread executes the logic of CompoundExecutionOperator while another thread executes RunTaskHandler, there may be data inconsistency. This inconsistency can result in unexpected behavior.

To address this issue, we need to ensure that the application code is equipped to handle concurrent changes to execution stages. One possible solution is to implement locking mechanisms that prevent multiple threads from accessing the same stage data at the same time. This will ensure that data consistency is maintained, and the application runs as expected.

Below is a diagram that illustrates the problem.
bug diagram

The approach I've taken in order to ensure the reliability of the distributed locking pattern is to leverage externalised storage. By doing so, we can ensure that the locking mechanism is properly implemented and that the system is able to handle multiple requests at once without failing. In this regard, we have implemented two different solutions that are both effective in their own right.

The first solution is the Redis locking mechanism. This solution utilizes the LockManager interface and implementation that is already present in the kork library. The second solution is the SQL-based approach that is based on the spring shedlock. This solution is also effective in ensuring the reliability of the distributed locking pattern.
Below is a diagram that illustrates the solution.
Distribiuted locking

Locking is disabled by default. It can be enabled by setting flags redis.external-lock.enabled=true or sql.external-lock.enabled=true. If none of the flags is set, application will fallback to default NoOp locking that doesn’t do any locking and executes Runnable/Callable instances.

@kkotula kkotula marked this pull request as draft March 9, 2023 11:24
@spinnakerbot
Copy link
Contributor

The following commits need their title changed:

  • 6b242de: Added RunOnLockAcquired interface with 3 implementations: redis, sql and noOp

  • c695a79: Added tests for RunOnLockAcquired implementations

  • e4b101e: Added wrapper class for RunOnLockAcquired interface that retries to acquire lock X number of times.

  • e7110c7: Added @configuration class that instantiates RetriableLock with default noOp locking mechanism

  • 81b966c: Enhanced stage update logic with locking to prevent from concurrent stage updates in multiple threads

  • 41ccf74: Enhanced RunTask handling with locking to prevent from concurrent stage updates in multiple threads

  • 7b40628: Instantiating SQL lock + liquibase scripts

  • 8c66585: Instantiating Redis lock

  • 4573ba5: Reverted deleted methods

Please format your commit title into the form:

<type>(<scope>): <subject>, e.g. fix(kubernetes): address NPE in status check

This allows us to easily generate changelogs & determine semantic version numbers when cutting releases. You can read more about commit conventions here.

@kkotula kkotula force-pushed the fix/manual-judgment-concurrent-execution branch from 4573ba5 to aa8a9ba Compare March 9, 2023 12:13
@armory-abedonik
Copy link
Contributor

Hi Chris, I would like to draw your attention to the fact that both MySQL and PostgreSQL offer support for transactions. As such, it is worth considering whether these options may be more advantageous than utilizing the use of a shared lock.

Copy link
Contributor

@ovidiupopa07 ovidiupopa07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 🚀

@@ -54,6 +54,8 @@ dependencies {
implementation("com.jayway.jsonpath:json-path:2.2.0")
implementation("org.yaml:snakeyaml")
implementation("org.codehaus.groovy:groovy")
implementation("net.javacrumbs.shedlock:shedlock-spring:2.2.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please provide an overview of this dependency, and why it's a good choice, other alternatives, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative I know for distributed services locks:

  • Raft consensus protocols
  • jgroups
  • Quartz (which is ... full CRON scheduler)
  • This lib :) which is pure SQL it seems

I don't know that we really want raft (aka https://ratis.apache.org/) or jgroups (which.. keycloak uses.. and is kinda a geast). Quartz is a bit overkill. I'd be open to another distributed lock lib - ideally though this is one that replaces any OTHER distributed lock needs LIKE Igor's current system for polling locks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there!

@jasonmcintosh
Thanks for sharing alternative options! All of them would get the job done!

I was looking for a simple library that focuses on providing a locking mechanism without any additional scheduling features, which I didn't need. After some research, I found a library that perfectly fits my needs: it is lightweight and easy to use, and it provides a reliable locking mechanism that can be integrated with Redis (maybe it can replace LockingManager?) and other storage options. This library has the potential to become the foundation of a new implementation that could replace the current messy locking situation. In the future, by centralizing the locking options within kork, we can achieve better code organization and reduce the risk of errors. Currently, the locking options are spread across the kork and apps repositories, which makes it hard to manage and maintain them. By using this library, we can simplify the locking process.

If you have any questions or concerns, please don't hesitate to reach out. I am happy to help or explain more!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ONLY other question is the 2.2.0 version. Which I'd expect to match spring maybe? Which seems like it's not quite matching? At that point, it might need a kork change :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumped the version to 4.44. Cannot bump it to 5.2, because it requires Java 17.

@jasonmcintosh
Copy link
Member

SO biggest long term issue for me is ... right now there's a couple different lock manager type implementations crossing kork & igor. They're not great implementations today however. This sounds like a good start (assuming answers to the lock library), eventually extract to kork, and then get implemented consistently across the services. The challenge is of course are a few of the libs which do NOT use SQL (and should... looking at you Igor).

ONLY other concern, and I can see why - is Primary beans cause exclusive locks preventing replacement by alternative implementations. But I'm ok with it for now.

Ignoring THAT... this has some GREAT details, GREAT Explanation of what's going on/why/where. THANK YOU!

@ovidiupopa07 ovidiupopa07 marked this pull request as ready for review March 23, 2023 08:55
@kkotula
Copy link
Contributor Author

kkotula commented Mar 23, 2023

SO biggest long term issue for me is ... right now there's a couple different lock manager type implementations crossing kork & igor. They're not great implementations today however. This sounds like a good start (assuming answers to the lock library), eventually extract to kork, and then get implemented consistently across the services. The challenge is of course are a few of the libs which do NOT use SQL (and should... looking at you Igor).

ONLY other concern, and I can see why - is Primary beans cause exclusive locks preventing replacement by alternative implementations. But I'm ok with it for now.

Ignoring THAT... this has some GREAT details, GREAT Explanation of what's going on/why/where. THANK YOU!

Hi @jasonmcintosh ,

I totally understand where you're coming from. It would definitely be better to have fewer locking mechanisms and store them in the kork library. However, while working on this feature, I noticed that some of the locking options aren't compatible with each other, and it's kind of tricky to merge them all into one solution.

For example, Redis uses keyName to create a lock in the cache, but this library sql library creates a lock per session. They aren't compatible with each other and they accomplish different goals.

@jasonmcintosh jasonmcintosh added the ready to merge Approved and ready for merge label Mar 28, 2023
@mergify mergify bot added the auto merged Merged automatically by a bot label Mar 28, 2023
@mergify mergify bot merged commit 706ffb4 into spinnaker:master Mar 28, 2023
4 checks passed
@jasonmcintosh jasonmcintosh changed the title Fix/manual judgment concurrent execution fix(judgements): Fix/manual judgment concurrent execution Mar 28, 2023
@link108 link108 added the backport-candidate Add to PRs to designate release branch patch candidates. label Mar 29, 2023
@dbyron-sf
Copy link
Contributor

How far back to backport this? At the moment the oldest active branch is release-1.29.x.

@dbyron-sf
Copy link
Contributor

Still curious how far to backport this. But also, I can't quite see what this has to do with manual judgment.

@ovidiupopa07
Copy link
Contributor

@Mergifyio backport release-1.29.x release-1.28.x

@mergify
Copy link
Contributor

mergify bot commented Aug 3, 2023

backport release-1.29.x release-1.28.x

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Aug 3, 2023
* fix(manual_judgment): added distributed locking to ensure sequential updates to execution's stages

* fix(manual_judgement): fixed test

* fix(blue-green-deploy): added docs and reformatted files

* fix(blue-green-deploy): bumped shedlock dependency version to 4.44.0

---------

Co-authored-by: Cameron Motevasselani <[email protected]>
Co-authored-by: ovidiupopa07 <[email protected]>
(cherry picked from commit 706ffb4)

# Conflicts:
#	orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandler.kt
#	orca-queue/src/test/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandlerTest.kt
mergify bot pushed a commit that referenced this pull request Aug 3, 2023
* fix(manual_judgment): added distributed locking to ensure sequential updates to execution's stages

* fix(manual_judgement): fixed test

* fix(blue-green-deploy): added docs and reformatted files

* fix(blue-green-deploy): bumped shedlock dependency version to 4.44.0

---------

Co-authored-by: Cameron Motevasselani <[email protected]>
Co-authored-by: ovidiupopa07 <[email protected]>
(cherry picked from commit 706ffb4)

# Conflicts:
#	orca-core/src/test/java/com/netflix/spinnaker/orca/pipeline/CompoundExecutionOperatorTest.java
#	orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandler.kt
#	orca-queue/src/test/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandlerTest.kt
@kkotula
Copy link
Contributor Author

kkotula commented Aug 3, 2023

Still curious how far to backport this. But also, I can't quite see what this has to do with manual judgment.

Hey @dbyron-sf,

@ovidiupopa07 already opened backport PRs:

  1. fix(judgements): Fix/manual judgment concurrent execution (backport #4410) #4493
  2. fix(judgements): Fix/manual judgment concurrent execution (backport #4410) #4494

There are conflicts though. I fixed them and created two more PRs:

  1. fix(conflict): resolved formatting conflicts in a 1.28 back-port PR #4496
  2. fix(conflict): resolved formatting conflicts in a 1.29 back-port PR #4497

@dbyron-sf
Copy link
Contributor

Do you have permissions to push the fixes directly to the backport PRs? Doesn't it make sense to backport to release-1.30.x as well?

Still curious what this has to do with manual judgment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto merged Merged automatically by a bot backport-candidate Add to PRs to designate release branch patch candidates. ready to merge Approved and ready for merge target-release/1.31
Projects
None yet
8 participants