New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(judgements): Fix/manual judgment concurrent execution #4410
fix(judgements): Fix/manual judgment concurrent execution #4410
Conversation
The following commits need their title changed:
Please format your commit title into the form:
This allows us to easily generate changelogs & determine semantic version numbers when cutting releases. You can read more about commit conventions here. |
…updates to execution's stages
4573ba5
to
aa8a9ba
Compare
Hi Chris, I would like to draw your attention to the fact that both MySQL and PostgreSQL offer support for transactions. As such, it is worth considering whether these options may be more advantageous than utilizing the use of a shared lock. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm 🚀
orca-core/orca-core.gradle
Outdated
@@ -54,6 +54,8 @@ dependencies { | |||
implementation("com.jayway.jsonpath:json-path:2.2.0") | |||
implementation("org.yaml:snakeyaml") | |||
implementation("org.codehaus.groovy:groovy") | |||
implementation("net.javacrumbs.shedlock:shedlock-spring:2.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please provide an overview of this dependency, and why it's a good choice, other alternatives, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative I know for distributed services locks:
- Raft consensus protocols
- jgroups
- Quartz (which is ... full CRON scheduler)
- This lib :) which is pure SQL it seems
I don't know that we really want raft (aka https://ratis.apache.org/) or jgroups (which.. keycloak uses.. and is kinda a geast). Quartz is a bit overkill. I'd be open to another distributed lock lib - ideally though this is one that replaces any OTHER distributed lock needs LIKE Igor's current system for polling locks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there!
@jasonmcintosh
Thanks for sharing alternative options! All of them would get the job done!
I was looking for a simple library that focuses on providing a locking mechanism without any additional scheduling features, which I didn't need. After some research, I found a library that perfectly fits my needs: it is lightweight and easy to use, and it provides a reliable locking mechanism that can be integrated with Redis (maybe it can replace LockingManager?) and other storage options. This library has the potential to become the foundation of a new implementation that could replace the current messy locking situation. In the future, by centralizing the locking options within kork, we can achieve better code organization and reduce the risk of errors. Currently, the locking options are spread across the kork and apps repositories, which makes it hard to manage and maintain them. By using this library, we can simplify the locking process.
If you have any questions or concerns, please don't hesitate to reach out. I am happy to help or explain more!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ONLY other question is the 2.2.0 version. Which I'd expect to match spring maybe? Which seems like it's not quite matching? At that point, it might need a kork change :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumped the version to 4.44. Cannot bump it to 5.2, because it requires Java 17.
orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandler.kt
Outdated
Show resolved
Hide resolved
SO biggest long term issue for me is ... right now there's a couple different lock manager type implementations crossing kork & igor. They're not great implementations today however. This sounds like a good start (assuming answers to the lock library), eventually extract to kork, and then get implemented consistently across the services. The challenge is of course are a few of the libs which do NOT use SQL (and should... looking at you Igor). ONLY other concern, and I can see why - is Primary beans cause exclusive locks preventing replacement by alternative implementations. But I'm ok with it for now. Ignoring THAT... this has some GREAT details, GREAT Explanation of what's going on/why/where. THANK YOU! |
Hi @jasonmcintosh , I totally understand where you're coming from. It would definitely be better to have fewer locking mechanisms and store them in the kork library. However, while working on this feature, I noticed that some of the locking options aren't compatible with each other, and it's kind of tricky to merge them all into one solution. For example, Redis uses keyName to create a lock in the cache, but this library sql library creates a lock per session. They aren't compatible with each other and they accomplish different goals. |
How far back to backport this? At the moment the oldest active branch is release-1.29.x. |
Still curious how far to backport this. But also, I can't quite see what this has to do with manual judgment. |
@Mergifyio backport release-1.29.x release-1.28.x |
✅ Backports have been created
|
* fix(manual_judgment): added distributed locking to ensure sequential updates to execution's stages * fix(manual_judgement): fixed test * fix(blue-green-deploy): added docs and reformatted files * fix(blue-green-deploy): bumped shedlock dependency version to 4.44.0 --------- Co-authored-by: Cameron Motevasselani <[email protected]> Co-authored-by: ovidiupopa07 <[email protected]> (cherry picked from commit 706ffb4) # Conflicts: # orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandler.kt # orca-queue/src/test/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandlerTest.kt
* fix(manual_judgment): added distributed locking to ensure sequential updates to execution's stages * fix(manual_judgement): fixed test * fix(blue-green-deploy): added docs and reformatted files * fix(blue-green-deploy): bumped shedlock dependency version to 4.44.0 --------- Co-authored-by: Cameron Motevasselani <[email protected]> Co-authored-by: ovidiupopa07 <[email protected]> (cherry picked from commit 706ffb4) # Conflicts: # orca-core/src/test/java/com/netflix/spinnaker/orca/pipeline/CompoundExecutionOperatorTest.java # orca-queue/src/main/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandler.kt # orca-queue/src/test/kotlin/com/netflix/spinnaker/orca/q/handler/RunTaskHandlerTest.kt
Hey @dbyron-sf, @ovidiupopa07 already opened backport PRs:
There are conflicts though. I fixed them and created two more PRs: |
Do you have permissions to push the fixes directly to the backport PRs? Doesn't it make sense to backport to release-1.30.x as well? Still curious what this has to do with manual judgment. |
This PR aims to address the issue of concurrent changes to execution stages. We have discovered a rather intriguing bug that arises from concurrent read/writes of stage data to storage. This bug is not limited to a particular storage solution, as it occurs with both Redis and SQL alternatives. Therefore, the solution must be implemented on the application code level.
Here's a brief summary of what's happening. Both CompoundExecutionOperator and RunTaskHandler read and write stage data to storage. However, if one thread executes the logic of CompoundExecutionOperator while another thread executes RunTaskHandler, there may be data inconsistency. This inconsistency can result in unexpected behavior.
To address this issue, we need to ensure that the application code is equipped to handle concurrent changes to execution stages. One possible solution is to implement locking mechanisms that prevent multiple threads from accessing the same stage data at the same time. This will ensure that data consistency is maintained, and the application runs as expected.
Below is a diagram that illustrates the problem.
The approach I've taken in order to ensure the reliability of the distributed locking pattern is to leverage externalised storage. By doing so, we can ensure that the locking mechanism is properly implemented and that the system is able to handle multiple requests at once without failing. In this regard, we have implemented two different solutions that are both effective in their own right.
The first solution is the Redis locking mechanism. This solution utilizes the LockManager interface and implementation that is already present in the kork library. The second solution is the SQL-based approach that is based on the spring shedlock. This solution is also effective in ensuring the reliability of the distributed locking pattern.
Below is a diagram that illustrates the solution.
Locking is disabled by default. It can be enabled by setting flags redis.external-lock.enabled=true or sql.external-lock.enabled=true. If none of the flags is set, application will fallback to default NoOp locking that doesn’t do any locking and executes Runnable/Callable instances.