
RFC 20 Read resources only after success #556

Draft
wants to merge 14 commits into main

Conversation

waciumawanjohi
Contributor

@waciumawanjohi waciumawanjohi commented Jan 26, 2022

Changes proposed by this PR

Introduce RFC 20
Readable Text

Release Note

PR Checklist

Note: Please do not remove items. Mark items as done [x] or use strikethrough if you believe they are not relevant

  • [ ] Linked to a relevant issue. Eg: Fixes #123 or Updates #123
  • [ ] Removed non-atomic or wip commits
  • [ ] Filled in the Release Note section above
  • [ ] Modified the docs to match changes

@waciumawanjohi waciumawanjohi added the rfc Requests For Comment label Jan 26, 2022
@netlify

netlify bot commented Jan 26, 2022

Deploy Preview for elated-stonebraker-105904 canceled.

🔨 Latest commit: d3a0680
🔍 Latest deploy log: https://app.netlify.com/sites/elated-stonebraker-105904/deploys/624cb443f6d615000856362a

Comment on lines 7 to 8
in a constant state of 'processing', even as it succeeds and fails on multiple successive inputs. Carto should wait
until an input has resulted in success or failure before updating the resource with new input.
Contributor

Carto should wait until an input has resulted in success or failure before updating the resource with new input

This assumes an arbitrary resource will always reach a "success" or "failure" state. If such a state is never reached, the supply chain is now deadlocked.

Contributor Author

  1. I agree that for some objects the concept of success is meaningless. The section on non-reconciling objects is meant to address those resource types. I'm interested in your reaction to the alternative approaches there.
  2. I may have misinterpreted your comment. Perhaps you mean that some resource could become deadlocked, and that could deadlock the entire chain. To such an objection I would say: misconfiguration of templates or bad values fed into a resource could indeed gum up the works. But I do not think that is a critique of this proposal.

Contributor

Perhaps you mean that some resource could become deadlocked, and that could deadlock the entire chain. To such an objection I would say: misconfiguration of templates or bad values fed into a resource could indeed gum up the works. But I do not think that is a critique of this proposal.

I'm not talking about a misconfiguration of Cartographer resources, but the behavior of the resource being choreographed.

It becomes Cartographer's problem when a supply chain is no longer able to make progress for a workload until the workload is deleted. If Cartographer is going to be usable in the real world, it will need to gracefully handle other resources that don't always behave perfectly.

Contributor Author

From a risk perspective, what is the difference between:

  • a resource that is expected to report "success", but misbehaves and never does
  • a resource that is expected to report an output, but misbehaves and never does

I think that we currently face the risk of the latter. And I think it is functionally equivalent to the former. So it doesn't seem like this RFC takes on more risk of deadlock.

Comment on lines 16 to 19
Tying inputs to outputs also allows Carto to harden supply chains to tampering. Currently, an enemy could update a
resource. Carto will not read said resource without first updating it with the proper spec. But the resource may still
produce the enemy output (followed by the correct output). Carto must be able to associate inputs to outputs if it is
not to propagate the enemy output.
Contributor

If a malicious actor has the ability to update the resource, they can also write malicious values to its status, and you're sunk. This proposal can't change that.

Contributor Author

Yes, if the enemy is a controller that can overwrite the status of an object, we cannot observe or defend against that. But if the enemy is one with only the ability to change an object's spec, e.g. a user with kubectl access, we can defend against them and there is value in doing so.

Contributor

But if the enemy is one with only the ability to change an object's spec

I don't understand how it would be possible to have write access only to a resource's spec and not also to its status using bog-standard RBAC. Can you explain the API access patterns you're thinking of?

Contributor Author

You can't use kubectl to write to an object's status. (With the design docs archived and scheduled for deletion, there doesn't seem to be a canonical place where this is stated, so I'll link to this issue)

Comment on lines 63 to 64
2. The absence of either an ObservedCompletion or an ObservedMatch could be taken as indication that the object
being created is immediately successful. (This seems dangerous and ill-advised)
Contributor

This seems dangerous and ill-advised

Is this the current behavior?

Contributor Author

It is!

There are a few limitations to the current setup of observedCompletion and ObservedMatches:
1. ObservedCompletion is limited to matching a single path and value. If more than one path must be interrogated,
this spec is not sufficient.
2. ObservedMatches cannot define a failure state.
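To make the limitation concrete, here is a hypothetical sketch of the generalization implied by these two points: a health matcher that evaluates several path/value rules, with separate success and failure rule sets. The `Rule` shape and `evaluate` function are invented for illustration and are not Cartographer's actual API:

```go
package main

import "fmt"

// Rule pairs a flattened status path with the value that must appear there.
// Both Rule and Health are hypothetical, invented to illustrate matching on
// multiple paths with an explicit failure state; they are not Cartographer's API.
type Rule struct {
	Path  string
	Value string
}

type Health int

const (
	Unknown Health = iota // no rule set fully matched yet
	Succeeded
	Failed
)

// evaluate reports Succeeded only when every success rule matches,
// Failed when every failure rule matches, and Unknown otherwise.
// status is a flattened path -> value view of the object's status.
func evaluate(status map[string]string, success, failure []Rule) Health {
	matchesAll := func(rules []Rule) bool {
		if len(rules) == 0 {
			return false
		}
		for _, r := range rules {
			if status[r.Path] != r.Value {
				return false
			}
		}
		return true
	}
	switch {
	case matchesAll(success):
		return Succeeded
	case matchesAll(failure):
		return Failed
	default:
		return Unknown
	}
}

func main() {
	// Success requires two paths to match at once, which a single
	// path/value ObservedCompletion cannot express.
	status := map[string]string{
		"status.phase":  "Complete",
		"status.passed": "true",
	}
	success := []Rule{{"status.phase", "Complete"}, {"status.passed", "true"}}
	failure := []Rule{{"status.phase", "Failed"}}
	fmt.Println(evaluate(status, success, failure)) // prints 1 (Succeeded)
}
```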
Comment on lines 12 to 14
Connecting an output of a resource to an input is necessary for establishing provenance. That is to state, "The app
currently running on the cluster is a result of commit X," it is necessary to tie a resource output to the input that
produced it. Waiting for success/failure before update achieves this.
Contributor

The app currently running on the cluster is a result of commit X

I don't believe this is reasonably knowable, as there are many triggers that can result in a templated component updating its output. Taking just a component like kpack: a new image could be produced because of a source change, a buildpack update, or a run image update.

Suppose I have a supply chain that runs unit tests on the source code. If a new commit is detected and the unit tests fail, should the whole supply chain be halted, or should downstream components continue to operate with the last known good values?

Contributor Author

  1. I don't mean to starve the supply chain of the last known good value of a resource. I've added a section detailing a strategy forward on that. Link
  2. "I don't believe this is reasonably knowable" I am glad that you are pushing on this point. I think that this is knowable but have not yet completely convinced myself. I expect this RFC either to generate a counter-example that proves our hubris or to give us confidence through the stress testing of many eyes. Gaming out scenarios just now pushed me to a similar (but slightly different) strategy. I've written up that thinking and scenarios: https://github.com/vmware-tanzu/cartographer/blob/rfc-0020-template-success/rfc/rfc-0020-resources-update-after-success.md#proposed-solution I believe this covers the scenario you posited above. I'm very interested in hearing about other corner cases!

@@ -0,0 +1,91 @@
# Draft RFC 20 Update Resources Only After Success/Failure
Contributor

"Success" and "Failure" are states not typically associated with level-triggered reconciliation. Level reconciliation is fundamental to the nature of Kubernetes where the system will continuously try to align the current state with the desired state. Faults can come and go based on the same desired state.

Contributor Author

I'd argue that:

  1. Conditions are an indication that k8s understands the importance of indicating whether a current state is "Good".
  2. We're trying to deal with resources that include an additional fudge to level-triggered reconciliation: the reporting of the most recent "good" output. This includes Runnable's outputs and kpack's latestImage. These are based not simply on the spec submitted and the state of the cluster, but also on historical state that is no longer observable.

Contributor

We're trying to deal with resources that include an additional fudge to level-triggered reconciliation

Every resource that a supply chain is directly templating needs to be level-triggered. Both the Runnable and the kpack Image resources are level-triggered. Those resources internally manage edge-triggered state, but they themselves are not edge-triggered.

I think what you're struggling with is similar to a Deployment's dual conditions of Available and Progressing. While the resource is progressing (rolling out an update), it may or may not also be available (able to receive traffic). While it is progressing, the Pods that are actually receiving traffic may be from the previous desired state or the new desired state.

There aren't a lot of conventions in this space, and they are widely applied, which will make it very difficult to have the confidence you're looking for.
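The Deployment analogy can be made concrete. The two conditions vary independently, and Progressing stays True even after a completed rollout (with reason NewReplicaSetAvailable), so the Reason field must be consulted to know whether a rollout has settled. A minimal sketch, using a locally defined stand-in for the Kubernetes `metav1.Condition` shape rather than the real imports:

```go
package main

import "fmt"

// Condition is a local stand-in for the Kubernetes metav1.Condition shape.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// findCondition returns the named condition, or nil if it is absent.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

func main() {
	// Mid-rollout: the Deployment serves traffic (Available=True) while
	// Pods from two desired states coexist (Progressing=True, ReplicaSetUpdated).
	conds := []Condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "True", Reason: "ReplicaSetUpdated"},
	}
	avail := findCondition(conds, "Available")
	prog := findCondition(conds, "Progressing")
	// Progressing=True alone does not mean "still rolling out": a finished
	// rollout also reports True, with reason NewReplicaSetAvailable.
	rolledOut := prog != nil && prog.Status == "True" && prog.Reason == "NewReplicaSetAvailable"
	fmt.Println(avail.Status == "True", rolledOut) // prints: true false
}
```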

Member
@squeedee squeedee Feb 3, 2022

I was discussing this with Waciuma but I want to add it here:

With best efforts, a resource might repeatedly go into a "failed" state but eventually be "successful". If we stop at "failed" we run the risk of having no way for the entire supply chain to become eventually consistent. We're up against specific implementations of resources, but our goal should be:

  • Given a valid workload and latest source/image commit
  • where valid means the commit should traverse the supply chain (there's nothing wrong with it)
  • the supply chain is eventually consistent and provides final outputs.

This means the supply chain and the resources controlled by it should be resilient to partition errors and cluster misconfigurations that, when rectified, result in the commit becoming a config and image (or deployed app).

Contributor Author

That is consistent with the core proposal here, to read only when an object is in a successful state. Given that rule, there is no interference with an object's ability to eventually reconcile successfully.

It is a good argument not to use this strategy to roll back updates. That is really a separate concern; I'm going to split it into a separate RFC.

Member

Yup, it was definitely muddying the waters. When you pull it out, please add a link to my concern above in the "drawbacks" section. I think it's too easy to lose sight of eventual consistency when we dig into specific motivations/user-stories.

@jwntrs jwntrs mentioned this pull request Jan 28, 2022
@jwntrs
Contributor

jwntrs commented Jan 28, 2022

@scothis a lot of your comments read like "this is impossible don't even attempt this". Do you have other recommendations for how we can bring better visibility in tracing how a commit flows through a supply chain?

@scothis
Contributor

scothis commented Jan 31, 2022

@scothis a lot of your comments read like "this is impossible don't even attempt this".

There are aspects of this RFC as written that I don't think are possible, or will amplify problems when they are inevitably encountered.

Do you have other recommendations for how we can bring better visibility in tracing how a commit flows through a supply chain?

My advice for writing a well behaved controller is often to step away from the controller. Imagine you are an operator that manages these resources by hand. For a specific resource you're managing, what are you looking for from that resource, and what do you do as a result of certain conditions reported? Take a resource like the kpack Image: study all of the possible status conditions it can report and what they mean, and understand how you want to react to each of those categories of outputs. Then pick another resource and do the same; and then a third. How similar are the categories of outputs? How similar are the resulting actions? Once you have a degree of consistency, then you can start to automate the behavior. Where there isn't consistency, you either can't automate it generically or need to define configuration that allows the distinguishing elements to be handled with a normalized process.

It seems like we're trying to define the configuration without a good handle on how the underlying components actually behave, and how we want to react to them.

@squeedee
Member

I've mentioned this before, but I'm not sure how to put it into words. I agree with at least some of @scothis 's objections, that a proposal like this fights level triggering. We have observed that level triggered supply chains will not necessarily provide reliable provenance (that is, truth about the causality of level change). I feel like the simplicity of Cartographer is broken by a desire to have an orchestrator behave like a pipeline, and to impose Causality and SBOM attestation upon it. But what user is going to find it useful to see that “Resource C failed, but it has a successful output, but we can’t tell you which input led to the success”?

@scothis
Contributor

scothis commented Jan 31, 2022

We have observed that level triggered supply chains will not necessarily provide reliable provenance (that is, truth about the causality of level change).

I fully appreciate why this is hard, and is also why I don't have better suggestions for what a design should look like.

I'd like to separate the concerns of provenance and security from those of observability and debuggability. They are all important, and there will be some overlap between these concerns, but I also see them as distinct. We should all strive to more clearly indicate which concern we are optimizing for, while understanding the impact on the other concerns.

I feel like the simplicity of Cartographer is broken by a desire to have an orchestrator behave like a pipeline, and to impose Causality and SBOM attestation upon it.

This is a large part of the tension that I feel. Cartographer is intentionally not a pipeline. I'm not sure how many of the good aspects of a pipeline we can gain without picking up the drawbacks as well. To an extent, I'm not sure a dynamic tool like Cartographer will be able to provide certainty over provenance. It may very well be less a concern of Cartographer and more a concern of the resources that Cartographer is choreographing.

Observability and debuggability are concerns worth pursuing, even when they fall short of provenance.

@squeedee
Member

Is it possible to not block outputs, but to only add "sources" when known, using the same declarative "matches" api?

@waciumawanjohi waciumawanjohi force-pushed the rfc-0020-template-success branch from cc6f99e to 7211d3e on January 31, 2022 at 20:51
@waciumawanjohi
Contributor Author

Is it possible to not block outputs, but to only add "sources" when known, using the same declarative "matches" api?

Yes. Ultimately the RFC boils down to

  • There is only one condition that allows us to establish that an output is tied to a particular input.
  • Change Carto's behavior to maximize that condition.

We could simply ratify the first statement and then change Carto's behavior to note when that condition occurs.

More concretely, the RFC states

  • Only when status is ready:true, consider the output to be the result of the current inputs.
  • Don't read/pass forward outputs unless ready:true.

We could replace the second with:

  • Whenever ready:true note that the output is a result of the current input.

How much value would we gain?
In terms of tracing, it's not clear that users would be satisfied with, "sometimes a trace is available". As a user, I suspect I would become very frustrated with such a situation and would then look for a reliable method of tracing.

The RFC points out that tracing helps us defend against enemies, but there's been little discussion of that. I'll propose a scenario: we have a supply chain using resource A. An enemy user gets RBAC permission to update resource A objects. The enemy updates object A with some bad value. Then the enemy flips the spec of object A back to a good value. Cartographer will merely see that the spec of the object is as expected. But at some point object A could output a latestGoodOutput value from the bad input (though the object would still report ready:unknown as it processed the most recent good input).

This scenario can be guarded against by restricting reads to only occur when ready:true. There's an open question as to how valuable this defense is.

@waciumawanjohi
Contributor Author

@scothis you write:

Once you have a degree of consistency then you can start to automate the behavior.

And above you write:

I think what you're struggling with is similar to a Deployments dual conditions of Available and Progressing.

I agree with both of these statements. And I think the primitives that are consistent are that k8s resources report "I'm done" and "I'm good". I simply want to leverage those two statements to achieve the goal of tracing. Those two statements are sufficient; given both statements I can say the current output is the result of the current input.

@waciumawanjohi
Contributor Author

@sclevine
Our office hours conversation highlighted that adding an additional constraint of only updating when ready:true could be useful to avoid starving the supply chain during fast updates. Updated the RFC to point this out here.


## Summary

Cartographer is currently unable to connect an output of a resource to the inputs of said resource. Cartographer is
Contributor Author

In a conversation, @scothis brought up the following scenario:

Consider a supply chain with a sourceTemplate and a buildTemplate (where the sourceTemplate output is an input for the buildTemplate). Imagine the workload changes, with 2 updated values: one value that is an input for the sourceTemplate (perhaps a new source repository) and another that is an input for the buildTemplate (perhaps a new build env). While the sourceTemplate is reconciling, the buildTemplate will start reconciling on the new value from the workload. Later the sourceTemplate will finish reconciling and will pass its output to the buildTemplate. Can we disambiguate the image output that is a result of the new workload val and old source output from the image output that is a result of the new workload val and new source?

Yes, we can! Indeed, this is an example of a general case, templates that have multiple inputs.

For the general case, Cartographer is responsible for going to a set of oracles and asking questions. Oracles give true answers. Cartographer wants to match answers to questions. By waiting until the oracle has answered all the questions posed, Cartographer can assure that the final answer is the answer to the final question.

What is necessary to make this assurance:

  1. Cartographer must know the most recent question it has asked an oracle.
  2. The oracle must be able to acknowledge that it has heard the question.
  3. Having taken all questions and considered all observed states of the world, the oracle must have an "at rest" state where it shares an answer. This answer must be to the most recent question and must be correct based on the oracle's most recent observation of the world. If an oracle is in the process of answering a question and a new question is asked, the oracle is not "at rest" until it finishes answering the most recent question.
  4. The oracle must be able to indicate when it is "at rest".
  5. If the oracle can fail in answering the question, when "at rest" the oracle must indicate if it has succeeded or failed.
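The requirements above map onto well-established Kubernetes conventions, assuming the choreographed resource follows them: the "question" is the object's spec (versioned by `metadata.generation`), the acknowledgement is `status.observedGeneration`, and "at rest" plus success/failure can be read from a Ready (or Succeeded) condition. A minimal Go sketch, using locally defined stand-ins for the real apimachinery types:

```go
package main

import "fmt"

// Object holds minimal stand-ins for metav1.ObjectMeta and status fields.
type Object struct {
	Generation         int64 // bumped by the API server on each spec change
	ObservedGeneration int64 // set by the controller once it has seen that spec
	Conditions         []Condition
}

type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// AtRest reports whether the controller has fully processed the latest
// spec: it has observed the current generation, and the Ready condition
// has settled to True or False (success or failure) rather than Unknown.
func AtRest(o Object) (done bool, succeeded bool) {
	if o.ObservedGeneration != o.Generation {
		return false, false // still answering an older question
	}
	for _, c := range o.Conditions {
		if c.Type == "Ready" {
			switch c.Status {
			case "True":
				return true, true
			case "False":
				return true, false
			}
		}
	}
	return false, false // Ready absent or Unknown: still reconciling
}

func main() {
	obj := Object{
		Generation:         3,
		ObservedGeneration: 3,
		Conditions:         []Condition{{Type: "Ready", Status: "True"}},
	}
	done, ok := AtRest(obj)
	fmt.Println(done, ok) // prints: true true
}
```

Under these assumptions, Carto would read outputs only when `AtRest` returns done=true with succeeded=true, which ties the final answer to the final question asked.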

The case of Cartographer stamping an object from multiple inputs is simply asking a question with multiple parts. Ultimately we ask one question; we stamp one object. We know the input(s) that we used to stamp that object; we know the clause(s) of the question we ask. When an answer is provided by the oracle, when the object finishes reconciling, we can associate the answer with the question, the output with the input. There is nothing special about the input updating from a field of the workload.


Now, there are other concerns that an operator may have in the above scenario. Perhaps the two updates to the workload are meant to happen in lockstep. The intention is never to have an image produced with the new build env that doesn't also have the new source repository. That synchronization is out of scope of this RFC. Here we are only concerned with tracing; with giving users the ability to discover that a given output had certain inputs. This RFC would allow the operator to observe/discover the unsynchronized behavior, which is valuable.

@waciumawanjohi waciumawanjohi changed the title RFC 20 Update resources only after success RFC 20 Read resources only after success Feb 7, 2022