RFC 20 Read resources only after success #556
> in a constant state of 'processing', even as it succeeds and fails on multiple successive inputs. Carto should wait until an input has resulted in success or failure before updating the resource with new input.
> Carto should wait until an input has resulted in success or failure before updating the resource with new input
This assumes an arbitrary resource will always reach a "success" or "failure" state. If such a state is never reached, the supply chain is now deadlocked.
- I agree that for some objects the concept of success is meaningless. The section on non-reconciling objects is meant to address those resource types. I'm interested in your reaction to the alternative approaches there.
- I may have misinterpreted your comment. Perhaps you mean, some resource becomes deadlocked, that could deadlock the entire chain. To such an objection I would say, misconfiguration of templates or bad values fed into a resource could indeed gum up the works. But I do not think that is a critique of the works.
> Perhaps you mean, some resource becomes deadlocked, that could deadlock the entire chain. To such an objection I would say, misconfiguration of templates or bad values fed into a resource could indeed gum up the works. But I do not think that is a critique of the works.
I'm not talking about a misconfiguration of Cartographer resources, but the behavior of the resource being choreographed.
It becomes Cartographer's problem when a supply chain is no longer able to make progress for a workload until the workload is deleted. If Cartographer is going to be usable in the real world, it will need to gracefully handle other resources that don't always behave perfectly.
From a risk perspective, what is the difference between:
- a resource that is expected to report "success", but misbehaves and never does
- a resource that is expected to report an output, but misbehaves and never does
I think that we currently face the risk of the latter. And I think it is functionally equivalent to the former. So it doesn't seem like this RFC takes on more risk of deadlock.
> Tying inputs to outputs also allows Carto to harden supply chains against tampering. Currently, an enemy could update a resource. Carto will not read said resource without first updating it with the proper spec. But the resource may still produce the enemy output (followed by the correct output). Carto must be able to associate inputs to outputs if it is not to propagate the enemy output.
If a malicious actor has the ability to update the resource, it can also write malicious values to its status; you're sunk. This proposal can't change that.
Yes, if the enemy is a controller that can overwrite the status of an object, we cannot observe or defend against that. But if the enemy is one with only the ability to change an object's spec, e.g. a user with kubectl access, we can defend against them, and there is value in doing so.
> But if the enemy is one with only the ability to change an object's spec
I don't understand how it would be possible to only have write access to a resource's spec and not also have access to its status using bog standard RBAC. Can you explain the API access patterns you're thinking of?
You can't use kubectl to write to an object's status. (With the design docs archived and scheduled for deletion, there doesn't seem to be a canonical place where this is stated, so I'll link to this issue.)
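The mechanism here is the status subresource: with standard RBAC, permission on a resource and on its `/status` subresource are granted separately. A minimal sketch (the `example.com`/`widgets` names are illustrative, not from the RFC):

```yaml
# Role granting spec writes but NOT status writes (illustrative resource names).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spec-only-editor
  namespace: demo
rules:
- apiGroups: ["example.com"]
  resources: ["widgets"]          # covers the main resource: spec + metadata
  verbs: ["get", "list", "update", "patch"]
# No rule for resources: ["widgets/status"], so writes to /status are denied.
```

A subject bound to this role matches the "enemy with kubectl access" in the discussion: able to tamper with the spec, unable to forge the status.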
> 2. The absence of either an ObservedCompletion or an ObservedMatch could be taken as an indication that the object being created is immediately successful. (This seems dangerous and ill-advised)
> This seems dangerous and ill-advised
Is this the current behavior?
It is!
> There are a few limitations to the current setup of observedCompletion and ObservedMatches:
> 1. ObservedCompletion is limited to matching a single path and value. If more than one path must be interrogated, this spec is not sufficient.
> 2. ObservedMatches cannot define a failure state.
kapp has prior art here: https://carvel.dev/kapp/docs/latest/config/#waitrules
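For comparison, kapp's waitRules let a single rule declare both success and failure conditions for a resource type. Roughly (adapted from the linked docs; the matched kind is illustrative):

```yaml
apiVersion: kapp.k14s.io/v1alpha1
kind: Config
waitRules:
- supportsObservedGeneration: true
  conditionMatchers:
  - type: ReconcileFailed       # a condition that signals terminal failure
    status: "True"
    failure: true
  - type: ReconcileSucceeded    # a condition that signals success
    status: "True"
    success: true
  resourceMatchers:
  - apiVersionKindMatcher:
      apiVersion: kappctrl.k14s.io/v1alpha1
      kind: App
```

This addresses both limitations above: multiple condition matchers per rule, and an explicit failure state.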
> Connecting an output of a resource to an input is necessary for establishing provenance. That is to state, "The app currently running on the cluster is a result of commit X," it is necessary to tie a resource output to the input that produced it. Waiting for success/failure before update achieves this.
> The app currently running on the cluster is a result of commit X
I don't believe this is reasonably knowable, as there are many triggers that can result in a templated component updating its output. Taking just a component like kpack, a new image could be produced because of a source change, a buildpack update, or a run image update.
Suppose I have a supply chain that runs unit tests on the source code. If a new commit is detected and the unit tests fail, should the whole supply chain be halted, or should downstream components continue to operate with the last known good values?
- I don't mean to starve the supply chain of the last known good value of a resource. I've added a section detailing a strategy forward on that. Link
- "I don't believe this is reasonably knowable": I am glad that you are pushing on this point. I think that this is knowable but have not yet completely convinced myself. I expect this RFC either to generate a counter-example that proves our hubris or to give us confidence through the stress testing of many eyes. Gaming out scenarios just now pushed me to a similar (but slightly different) strategy. I've written up that thinking and the scenarios: https://github.com/vmware-tanzu/cartographer/blob/rfc-0020-template-success/rfc/rfc-0020-resources-update-after-success.md#proposed-solution I believe this covers the scenario you posited above. I'm very interested in hearing about other corner cases!
> # Draft RFC 20 Update Resources Only After Success/Failure
"Success" and "Failure" are states not typically associated with level-triggered reconciliation. Level reconciliation is fundamental to the nature of Kubernetes, where the system will continuously try to align the current state with the desired state. Faults can come and go based on the same desired state.
I'd argue that:
- Conditions are an indication that k8s understands the importance of indicating whether a current state is "good".
- We're trying to deal with resources that include an additional fudge to level-triggered reconciliation: the reporting of the most recent "good" output. This includes Runnable's outputs and kpack's latestImage. These are based not simply on the spec submitted and the state of the cluster, but also on historical state that is no longer observable.
> We're trying to deal with resources that include an additional fudge to level-triggered reconciliation
Every resource that a supply chain is directly templating needs to be level-triggered. Both the Runnable and the kpack Image resources are level-triggered. Those resources internally manage edge-triggered state, but they themselves are not edge-triggered.
I think what you're struggling with is similar to a Deployment's dual conditions of Available and Progressing. While the resource is progressing (rolling out an update), it may or may not also be available (able to receive traffic). While it is progressing, the Pods that are actually receiving traffic may be from the previous desired state or the new desired state.
There aren't a lot of conventions in this space, nor are they widely applied, which will make it very difficult to have the confidence you're looking for.
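To make the Deployment analogy concrete, a status snapshot mid-rollout can report both conditions as true at once; the replica counts here are illustrative:

```yaml
# Deployment status mid-rollout: old Pods still serve traffic while new ones come up.
status:
  observedGeneration: 4
  conditions:
  - type: Available          # enough replicas are serving traffic
    status: "True"
    reason: MinimumReplicasAvailable
  - type: Progressing        # a new ReplicaSet is still being rolled out
    status: "True"
    reason: ReplicaSetUpdated
```

Neither condition alone says whether the running Pods correspond to the latest spec, which is exactly the attribution gap this RFC is wrestling with.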
I was discussing this with Waciuma but I want to add it here:
With best efforts, a resource might repeatedly go into a "failed" state but eventually be "successful". If we stop at "failed" we run the risk of having no way for the entire supply chain to become eventually consistent. We're up against specific implementations of resources, but our goal should be:
- Given a valid workload and latest source/image commit (where valid means the commit should traverse the supply chain; there's nothing wrong with it),
- the supply chain is eventually consistent and provides final outputs.
This means the supply chain and the resources controlled by it should be resilient to partition errors and cluster misconfigurations that, when rectified, result in the commit becoming a config and image (or deployed app).
That is consistent with the core proposal here, to read only when an object is in a successful state. Given that rule, there is no interference with an object's ability to eventually reconcile successfully.
It is a good argument not to use this strategy to roll back updates. This is really a separate concern; I'm going to split it into a separate RFC.
Yup, it was definitely muddying the waters. When you pull it out, please add a link to my concern above in the "drawbacks" section. I think it's too easy to lose sight of eventual consistency when we dig into specific motivations/user-stories.
@scothis a lot of your comments read like "this is impossible, don't even attempt this". Do you have other recommendations for how we can bring better visibility in tracing how a commit flows through a supply chain?
There are aspects of this RFC as written that I don't think are possible, or will amplify problems when they are inevitably encountered.
My advice for writing a well-behaved controller is often to step away from the controller. Imagine you are an operator that manages these resources by hand. For a specific resource you're managing, what are you looking for from that resource, and what do you do as a result of certain conditions reported? Take a resource like the kpack Image, study all of the possible status conditions it can report and what they mean, and understand how you want to react to each of those categories of outputs. Then pick another resource and do the same; and then a third resource. How similar are the categories of outputs? How similar are the resulting actions? Once you have a degree of consistency, you can start to automate the behavior. Where there isn't consistency, you either can't automate it generically or need to define configuration that allows the distinguishing elements to be handled with a normalized process. It seems like we're trying to define the configuration without a good handle on how the underlying components actually behave and how we want to react to them.
I've mentioned this before, but I'm not sure how to put it into words. I agree with at least some of @scothis's objections, that a proposal like this fights level triggering. We have observed that level-triggered supply chains will not necessarily provide reliable provenance (that is, truth about the causality of level change). I feel like the simplicity of Cartographer is broken by a desire to have an orchestrator behave like a pipeline, and to impose causality and SBOM attestation upon it. But what user is going to find it useful to see that "Resource C failed, but it has a successful output, but we can't tell you which input led to the success"?
I fully appreciate why this is hard, which is also why I don't have better suggestions for what a design should look like. I'd like to separate the concerns of provenance and security from those of observability and debuggability. They are all important, and there will be some overlap between these concerns, but I also see them as distinct. We should all strive to more clearly indicate which concern we are optimizing for, while understanding the impact on the other concerns.
This is a large part of the tension that I feel. Cartographer is intentionally not a pipeline. I'm not sure how many of the good aspects of a pipeline we can gain without picking up the drawbacks as well. To an extent, I'm not sure a dynamic tool like Cartographer will be able to provide certainty over provenance. It may very well be less a concern of Cartographer and more a concern of the resources that Cartographer is choreographing. Observability and debuggability are concerns worth pursuing, even when they fall short of provenance.
Is it possible to not block outputs, but to only add "sources" when known, using the same declarative "matches" API?
Yes. Ultimately the RFC boils down to
We could simply ratify the first statement and then change Carto's behavior to note when that condition occurs. More concretely, the RFC states
We could replace the second with:
How much value would we gain? The RFC points out that tracing helps us defend against enemies, but there's been little discussion of that. I'll propose a scenario: we have a supply chain using resource A. An enemy user gets RBAC permission to update resource A objects. The enemy updates object A with some bad value. Then the enemy flips the spec of object A back to a good value. Cartographer will merely see that the spec of the object is as expected. But at some point object A could output a result derived from the bad value. This scenario can be guarded against by restricting reads to only occur when ready:true. There's an open question as to how valuable this defense is.
@scothis you write:
And above you write:
I agree with both of these statements. And I think the primitives that are consistent are that k8s resources report "I'm done" and "I'm good". I simply want to leverage those two statements to achieve the goal of tracing. Those two statements are sufficient; given both statements I can say the current output is the result of the current input.
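The two primitives can be sketched as a read-gate. This is a hypothetical model, not Cartographer's actual code; the field names mirror common Kubernetes conventions (`metadata.generation`, `status.observedGeneration`, a `Ready` condition), and `Object`/`read_output` are invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Object:
    """Minimal stand-in for the fields of a Kubernetes object we care about."""
    generation: int              # metadata.generation: bumped on every spec update
    observed_generation: int     # status.observedGeneration: set by the controller
    ready: str = "Unknown"       # status of the Ready condition: True/False/Unknown
    output: Optional[str] = None # the value a template would read, e.g. latestImage

def safe_to_read(obj: Object) -> bool:
    """'I'm done': the controller has seen the latest spec.
    'I'm good' (or definitively bad): the Ready condition has settled."""
    done = obj.observed_generation == obj.generation
    settled = obj.ready in ("True", "False")
    return done and settled

def read_output(obj: Object) -> Optional[str]:
    """Read the output only when it is attributable to the current input."""
    if safe_to_read(obj) and obj.ready == "True":
        return obj.output
    return None  # caller keeps using the last known good value instead
```

With this gate, an output observed while `observed_generation` lags `generation` is never attributed to the new spec, which is the tracing guarantee claimed above.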
> ## Summary
>
> Cartographer is currently unable to connect an output of a resource to the inputs of said resource. Cartographer is
In a conversation, @scothis brought up the following scenario:
Consider a supply chain with a sourceTemplate and a buildTemplate (where the sourceTemplate output is an input for the buildTemplate). Imagine the workload changes with two updated values: one that is an input for the sourceTemplate (perhaps a new source repository) and another that is an input for the buildTemplate (perhaps a new build env). While the sourceTemplate is reconciling, the buildTemplate will start reconciling on the new value from the workload. Later the sourceTemplate will finish reconciling and will pass its output to the buildTemplate. Can we disambiguate the image output that is a result of the new workload val and old source output from the image output that is a result of the new workload val and new source?
Yes, we can! Indeed, this is an example of a general case: templates that have multiple inputs.
For the general case, Cartographer is responsible for going to a set of oracles and asking questions. Oracles give true answers. Cartographer wants to match answers to questions. By waiting until the oracle has answered all the questions posed, Cartographer can assure that the final answer is the answer to the final question.
What is necessary to make this assurance:
- Cartographer must know the most recent question it has asked an oracle.
- The oracle must be able to acknowledge that it has heard the question.
- Having taken all questions and considered all observed states of the world, the oracle must have an "at rest" state where it shares an answer. This answer must be to the most recent question and must be correct based on the oracle's most recent observation of the world. If an oracle is in the process of answering a question and a new question is asked, the oracle is not "at rest" until it finishes answering the most recent question.
- The oracle must be able to indicate when it is "at rest".
- If the oracle can fail in answering the question, when "at rest" the oracle must indicate if it has succeeded or failed.
The case of Cartographer stamping an object from multiple inputs is simply asking a question with multiple parts. Ultimately we ask one question, we stamp one object. We know the input(s) that we used to stamp that object; we know the clause(s) of the question we ask. When an answer is provided by the oracle, when the object finishes reconciling, we can associate the answer with the question, the output with the input. There is nothing special about the input updating from the field of the workload.
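The oracle protocol above can be sketched in miniature. Everything here is an illustrative model under the stated assumptions (a generation counter acknowledges the question; a settled Ready condition marks "at rest"); `Resource`, `stamp`, and `trace` are invented names, not Cartographer's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Resource:
    generation: int = 0           # bumped whenever the spec (the question) changes
    observed_generation: int = 0  # the oracle's acknowledgement of the question
    ready: str = "Unknown"        # "True"/"False" once the oracle is at rest
    inputs: Optional[dict] = None # the full multi-part question stamped into the spec
    output: Optional[str] = None  # the answer reported in status

def stamp(res: Resource, inputs: dict) -> None:
    """Ask a new (possibly multi-part) question: update the spec with all inputs."""
    if inputs != res.inputs:
        res.inputs = dict(inputs)
        res.generation += 1  # models the API server bumping metadata.generation

def at_rest(res: Resource) -> bool:
    """The oracle has acknowledged the latest question and settled on an answer."""
    return res.observed_generation == res.generation and res.ready in ("True", "False")

def trace(res: Resource) -> Optional[tuple]:
    """Associate answer with question: (inputs, output), or None while unsettled."""
    if at_rest(res) and res.ready == "True":
        return (res.inputs, res.output)
    return None
```

Because `trace` refuses to read while a newer question is outstanding, an answer can never be attributed to the wrong combination of inputs, covering the multi-input scenario above.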
Now, there are other concerns that an operator may have in the above scenario. Perhaps the two updates to the workload are meant to happen in lockstep. The intention is never to have an image produced with the new build env that doesn't also have the new source repository. That synchronization is out of scope of this RFC. Here we are only concerned with tracing; with giving users the ability to discover that a given output had certain inputs. This RFC would allow the operator to observe/discover the unsynchronized behavior, which is valuable.
Co-authored-by: Scott Andrews <[email protected]>
Co-authored-by: Rasheed Abdul-Aziz <[email protected]>
Changes proposed by this PR
- Introduce RFC 20
- Readable Text

Release Note

PR Checklist
Note: Please do not remove items. Mark items as done [x] or use ~strikethrough~ if you believe they are not relevant.
- [ ] Linked to a relevant issue. Eg: Fixes #123 or Updates #123
- [ ] … wip commits
- [ ] Filled in the Release Note section above
- [ ] Modified the docs to match changes