
RFC 20 Read resources only after success #556

Draft
wants to merge 14 commits into main

Conversation

waciumawanjohi
Contributor

@waciumawanjohi waciumawanjohi commented Jan 26, 2022

Changes proposed by this PR

Introduce RFC 20
Readable Text

Release Note

PR Checklist

Note: Please do not remove items. Mark items as done [x] or use strikethrough if you believe they are not relevant

  • [ ] Linked to a relevant issue. Eg: Fixes #123 or Updates #123
  • [ ] Removed non-atomic or wip commits
  • [ ] Filled in the Release Note section above
  • [ ] Modified the docs to match changes

@waciumawanjohi waciumawanjohi added the rfc Requests For Comment label Jan 26, 2022
@netlify

netlify bot commented Jan 26, 2022

Deploy Preview for elated-stonebraker-105904 canceled.

🔨 Latest commit: d3a0680
🔍 Latest deploy log: https://app.netlify.com/sites/elated-stonebraker-105904/deploys/624cb443f6d615000856362a

Comment on lines 7 to 8
in a constant state of 'processing', even as it succeeds and fails on multiple successive inputs. Carto should wait
until an input has resulted in success or failure before updating the resource with new input.
Contributor

Carto should wait until an input has resulted in success or failure before updating the resource with new input

This assumes an arbitrary resource will always reach a "success" or "failure" state. If such a state is never reached, the supply chain is now deadlocked.

Contributor Author

  1. I agree that for some objects the concept of success is meaningless. The section on non-reconciling objects is meant to address those resource types. I'm interested in your reaction to the alternative approaches there.
  2. I may have misinterpreted your comment. Perhaps you mean that some resource could become deadlocked, and that could deadlock the entire chain. To such an objection I would say: misconfiguration of templates or bad values fed into a resource could indeed gum up the works. But I do not think that is a critique of this proposal.

Contributor

Perhaps you mean that some resource could become deadlocked, and that could deadlock the entire chain. To such an objection I would say: misconfiguration of templates or bad values fed into a resource could indeed gum up the works. But I do not think that is a critique of this proposal.

I'm not talking about a misconfiguration of Cartographer resources, but the behavior of the resource being choreographed.

It becomes Cartographer's problem when a supply chain is no longer able to make progress for a workload until the workload is deleted. If Cartographer is going to be usable in the real world, it will need to gracefully handle other resources that don't always behave perfectly.

Contributor Author

From a risk perspective, what is the difference between:

  • a resource that is expected to report "success", but misbehaves and never does
  • a resource that is expected to report an output, but misbehaves and never does

I think that we currently face the risk of the latter. And I think it is functionally equivalent to the former. So it doesn't seem like this RFC takes on more risk of deadlock.

Comment on lines 16 to 19
Tying inputs to outputs also allows Carto to harden supply chains to tampering. Currently, an enemy could update a
resource. Carto will not read said resource without first updating it with the proper spec. But the resource may still
produce the enemy output (followed by the correct output). Carto must be able to associate inputs to outputs if it is
not to propagate the enemy output.
Contributor

If a malicious actor has the ability to update the resource, they can also write malicious values to its status, and you're sunk. This proposal can't change that.

Contributor Author

Yes, if the enemy is a controller that can overwrite the status of an object, we cannot observe or defend against that. But if the enemy is one with only the ability to change an object's spec, e.g. a user with kubectl access, we can defend against them and there is value in doing so.

Contributor

But if the enemy is one with only the ability to change an object's spec

I don't understand how it would be possible to have write access only to a resource's spec and not also to its status using bog-standard RBAC. Can you explain the API access patterns you're thinking of?

Contributor Author

You can't use kubectl to write to an object's status. (With the design docs archived and scheduled for deletion, there doesn't seem to be a canonical place where this is stated, so I'll link to this issue)

Comment on lines 63 to 64
2. The absence of either an ObservedCompletion or an ObservedMatch could be taken as indication that the object
being created is immediately successful. (This seems dangerous and ill-advised)
Contributor

This seems dangerous and ill-advised

Is this the current behavior?

Contributor Author

It is!

There are a few limitations to the current setup of observedCompletion and ObservedMatches:
1. ObservedCompletion is limited to matching a single path and value. If more than one path must be interrogated,
this spec is not sufficient.
2. ObservedMatches cannot define a failure state.
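To make the limitation concrete, here is a hypothetical sketch of the generalization implied by these two points: a health matcher that evaluates several path/value rules, with separate success and failure rule sets. The `Rule` shape and `evaluate` function are invented for illustration and are not Cartographer's actual API:

```go
package main

import "fmt"

// Rule pairs a flattened status path with the value that must appear there.
// Both Rule and Health are hypothetical, invented to illustrate matching on
// multiple paths with an explicit failure state; they are not Cartographer's API.
type Rule struct {
	Path  string
	Value string
}

type Health int

const (
	Unknown Health = iota // no rule set fully matched yet
	Succeeded
	Failed
)

// evaluate reports Succeeded only when every success rule matches,
// Failed when every failure rule matches, and Unknown otherwise.
// status is a flattened path -> value view of the object's status.
func evaluate(status map[string]string, success, failure []Rule) Health {
	matchesAll := func(rules []Rule) bool {
		if len(rules) == 0 {
			return false
		}
		for _, r := range rules {
			if status[r.Path] != r.Value {
				return false
			}
		}
		return true
	}
	switch {
	case matchesAll(success):
		return Succeeded
	case matchesAll(failure):
		return Failed
	default:
		return Unknown
	}
}

func main() {
	// Success requires two paths to match at once, which a single
	// path/value ObservedCompletion cannot express.
	status := map[string]string{
		"status.phase":  "Complete",
		"status.passed": "true",
	}
	success := []Rule{{"status.phase", "Complete"}, {"status.passed", "true"}}
	failure := []Rule{{"status.phase", "Failed"}}
	fmt.Println(evaluate(status, success, failure)) // prints 1 (Succeeded)
}
```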
Comment on lines 12 to 14
Connecting an output of a resource to an input is necessary for establishing provenance. That is to state, "The app
currently running on the cluster is a result of commit X," it is necessary to tie a resource output to the input that
produced it. Waiting for success/failure before update achieves this.
Contributor

The app currently running on the cluster is a result of commit X

I don't believe this is reasonably knowable, as there are many triggers that can result in a templated component updating its output. Taking just a component like kpack: a new image could be produced because of a source change, a buildpack update, or a run image update.

Suppose I have a supply chain that runs unit tests on the source code. If a new commit is detected and the unit tests fail, should the whole supply chain be halted, or should downstream components continue to operate with the last known good values?

Contributor Author

  1. I don't mean to starve the supply chain of the last known good value of a resource. I've added a section detailing a strategy forward on that. Link
  2. "I don't believe this is reasonably knowable" I am glad that you are pushing on this point. I think that this is knowable but have not yet completely convinced myself. I expect this RFC either to generate a counter-example that proves our hubris or to give us confidence through the stress testing of many eyes. Gaming out scenarios just now pushed me to a similar (but slightly different) strategy. I've written up that thinking and scenarios: https://github.com/vmware-tanzu/cartographer/blob/rfc-0020-template-success/rfc/rfc-0020-resources-update-after-success.md#proposed-solution I believe this covers the scenario you posited above. I'm very interested in hearing about other corner cases!

@@ -0,0 +1,91 @@
# Draft RFC 20 Update Resources Only After Success/Failure
Contributor

"Success" and "Failure" are states not typically associated with level-triggered reconciliation. Level reconciliation is fundamental to the nature of Kubernetes where the system will continuously try to align the current state with the desired state. Faults can come and go based on the same desired state.

Contributor Author

I'd argue that:

  1. Conditions are an indication that k8s understands the importance of indicating whether a current state is "Good".
  2. We're trying to deal with resources that include an additional fudge to level-triggered reconciliation: the reporting of the most recent "good" output. This includes Runnable's outputs and kpack's latestImage. These are based not simply on the spec submitted and the state of the cluster, but also on historical state that is no longer observable.

Contributor

We're trying to deal with resources that include an additional fudge to level-triggered reconciliation

Every resource that a supply chain is directly templating needs to be level-triggered. Both the Runnable and the kpack Image resources are level-triggered. Those resources internally manage edge-triggered state, but they themselves are not edge-triggered.

I think what you're struggling with is similar to a Deployment's dual conditions of Available and Progressing. While the resource is progressing (rolling out an update), it may or may not also be available (able to receive traffic). While it is progressing, the Pods that are actually receiving traffic may be from the previous desired state or the new desired state.

There aren't a lot of conventions in this space, and they are widely applied, which will make it very difficult to have the confidence you're looking for.
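The Deployment analogy can be made concrete. The two conditions vary independently, and Progressing stays True even after a completed rollout (with reason NewReplicaSetAvailable), so the Reason field must be consulted to know whether a rollout has settled. A minimal sketch, using a locally defined stand-in for the Kubernetes `metav1.Condition` shape rather than the real imports:

```go
package main

import "fmt"

// Condition is a local stand-in for the Kubernetes metav1.Condition shape.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// findCondition returns the named condition, or nil if it is absent.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

func main() {
	// Mid-rollout: the Deployment serves traffic (Available=True) while
	// Pods from two desired states coexist (Progressing=True, ReplicaSetUpdated).
	conds := []Condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "True", Reason: "ReplicaSetUpdated"},
	}
	avail := findCondition(conds, "Available")
	prog := findCondition(conds, "Progressing")
	// Progressing=True alone does not mean "still rolling out": a finished
	// rollout also reports True, with reason NewReplicaSetAvailable.
	rolledOut := prog != nil && prog.Status == "True" && prog.Reason == "NewReplicaSetAvailable"
	fmt.Println(avail.Status == "True", rolledOut) // prints: true false
}
```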

Member
@squeedee squeedee Feb 3, 2022

I was discussing this with Waciuma but I want to add it here:

With best efforts, a resource might repeatedly go into a "failed" state but eventually be "successful". If we stop at "failed" we run the risk of having no way for the entire supply chain to become eventually consistent. We're up against specific implementations of resources, but our goal should be:

  • Given a valid workload and latest source/image commit
  • where valid means the commit should traverse the supply chain (there's nothing wrong with it)
  • the supply chain is eventually consistent and provides final outputs.

This means the supply chain and the resources controlled by it should be resilient to partition errors and cluster misconfigurations that, when rectified, result in the commit becoming a config and image (or deployed app).

Contributor Author

That is consistent with the core proposal here, to read only when an object is in a successful state. Given that rule, there is no interference with an object's ability to eventually reconcile successfully.

It is a good argument not to use this strategy to roll back updates. That is really a separate concern; I'm going to split it into a separate RFC.

Member

Yup, it was definitely muddying the waters. When you pull it out, please add a link to my concern above in the "drawbacks" section. I think it's too easy to lose sight of eventual consistency when we dig into specific motivations/user-stories.

@jwntrs jwntrs mentioned this pull request Jan 28, 2022
@jwntrs
Contributor

jwntrs commented Jan 28, 2022

@scothis a lot of your comments read like "this is impossible don't even attempt this". Do you have other recommendations for how we can bring better visibility in tracing how a commit flows through a supply chain?

@scothis
Contributor

scothis commented Jan 31, 2022

@scothis a lot of your comments read like "this is impossible don't even attempt this".

There are aspects of this RFC as written that I don't think are possible, or will amplify problems when they are inevitably encountered.

Do you have other recommendations for how we can bring better visibility in tracing how a commit flows through a supply chain?

My advice for writing a well behaved controller is often to step away from the controller. Imagine you are an operator that manages these resources by hand. For a specific resource you're managing, what are you looking for from that resource, and what do you do as a result of certain conditions reported? Take a resource like the kpack Image: study all of the possible status conditions it can report and what they mean, and understand how you want to react to each of those categories of outputs. Then pick another resource and do the same; and then a third. How similar are the categories of outputs? How similar are the resulting actions? Once you have a degree of consistency, then you can start to automate the behavior. Where there isn't consistency, you either can't automate it generically or need to define configuration that allows the distinguishing elements to be handled with a normalized process.

It seems like we're trying to define the configuration without a good handle on how the underlying components actually behave, and how we want to react to them.

@squeedee
Member

I've mentioned this before, but I'm not sure how to put it into words. I agree with at least some of @scothis 's objections, that a proposal like this fights level triggering. We have observed that level triggered supply chains will not necessarily provide reliable provenance (that is, truth about the causality of level change). I feel like the simplicity of Cartographer is broken by a desire to have an orchestrator behave like a pipeline, and to impose Causality and SBOM attestation upon it. But what user is going to find it useful to see that “Resource C failed, but it has a successful output, but we can’t tell you which input led to the success”?

@scothis
Contributor

scothis commented Jan 31, 2022

We have observed that level triggered supply chains will not necessarily provide reliable provenance (that is, truth about the causality of level change).

I fully appreciate why this is hard, and is also why I don't have better suggestions for what a design should look like.

I'd like to separate the concerns of provenance and security from those of observability and debuggability. They are all important, and there will be some overlap between these concerns, but I also see them as distinct. We should all strive to more clearly indicate which concern we are optimizing for, while understanding the impact on the other concerns.

I feel like the simplicity of Cartographer is broken by a desire to have an orchestrator behave like a pipeline, and to impose Causality and SBOM attestation upon it.

This is a large part of the tension that I feel. Cartographer is intentionally not a pipeline. I'm not sure how many of the good aspects of a pipeline we can gain without picking up the drawbacks as well. To an extent, I'm not sure a dynamic tool like Cartographer will be able to provide certainty over provenance. It may very well be less a concern of Cartographer and more a concern of the resources that Cartographer is choreographing.

Observability and debuggability are concerns worth pursuing, even when they fall short of provenance.

@squeedee
Member

Is it possible to not block outputs, but to only add "sources" when known, using the same declarative "matches" api?

@waciumawanjohi waciumawanjohi force-pushed the rfc-0020-template-success branch from cc6f99e to 7211d3e on January 31, 2022 at 20:51
@waciumawanjohi
Contributor Author

Is it possible to not block outputs, but to only add "sources" when known, using the same declarative "matches" api?

Yes. Ultimately the RFC boils down to

  • There is only one condition that allows us to establish that an output is tied to a particular input.
  • Change Carto's behavior to maximize that condition.

We could simply ratify the first statement and then change Carto's behavior to note when that condition occurs.

More concretely, the RFC states

  • Only when status is ready:true, consider the output to be the result of the current inputs.
  • Don't read/pass forward outputs unless ready:true.

We could replace the second with:

  • Whenever ready:true note that the output is a result of the current input.

How much value would we gain?
In terms of tracing, it's not clear that users would be satisfied with, "sometimes a trace is available". As a user, I suspect I would become very frustrated with such a situation and would then look for a reliable method of tracing.

The RFC points out that tracing helps us defend against enemies, but there's been little discussion of that. I'll propose a scenario: we have a supply chain using resource A. An enemy user gets RBAC permission to update resource A objects. The enemy updates object A with some bad value. Then the enemy flips the spec of object A back to a good value. Cartographer will merely see that the spec of the object is as expected. But at some point object A could output a latestGoodOutput value from the bad input (though the object would still report ready:unknown as it processed the most recent good input).

This scenario can be guarded against by restricting reads to only occur when ready:true. There's an open question as to how valuable this defense is.

@waciumawanjohi
Contributor Author

@scothis you write:

Once you have a degree of consistency then you can start to automate the behavior.

And above you write:

I think what you're struggling with is similar to a Deployments dual conditions of Available and Progressing.

I agree with both of these statements. And I think the primitives that are consistent are that k8s resources report "I'm done" and "I'm good". I simply want to leverage those two statements to achieve the goal of tracing. Those two statements are sufficient; given both statements I can say the current output is the result of the current input.

@waciumawanjohi
Contributor Author

@sclevine
Our office hours conversation highlighted that adding an additional constraint of only updating when ready:true could be useful to avoid starving the supply chain during fast updates. Updated the RFC to point this out here.


## Summary

Cartographer is currently unable to connect an output of a resource to the inputs of said resource. Cartographer is
Contributor Author

In a conversation, @scothis brought up the following scenario:

Consider a supply chain with a sourceTemplate and a buildTemplate (where the sourceTemplate output is an input for the buildTemplate). Imagine the workload changes, with 2 updated values: one value that is an input for the sourceTemplate (perhaps a new source repository) and another that is an input for the buildTemplate (perhaps a new build env). While the sourceTemplate is reconciling, the buildTemplate will start reconciling on the new value from the workload. Later the sourceTemplate will finish reconciling and will pass its output to the buildTemplate. Can we disambiguate the image output that is a result of the new workload val and old source output from the image output that is a result of the new workload val and new source?

Yes, we can! Indeed, this is an example of a general case, templates that have multiple inputs.

For the general case, Cartographer is responsible for going to a set of oracles and asking questions. Oracles give true answers. Cartographer wants to match answers to questions. By waiting until the oracle has answered all the questions posed, Cartographer can assure that the final answer is the answer to the final question.

What is necessary to make this assurance:

  1. Cartographer must know the most recent question it has asked an oracle.
  2. The oracle must be able to acknowledge that it has heard the question.
  3. Having taken all questions and considered all observed states of the world, the oracle must have an "at rest" state where it shares an answer. This answer must be to the most recent question and must be correct based on the oracle's most recent observation of the world. If an oracle is in the process of answering a question and a new question is asked, the oracle is not "at rest" until it finishes answering the most recent question.
  4. The oracle must be able to indicate when it is "at rest".
  5. If the oracle can fail in answering the question, when "at rest" the oracle must indicate if it has succeeded or failed.
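The requirements above map onto well-established Kubernetes conventions, assuming the choreographed resource follows them: the "question" is the object's spec (versioned by `metadata.generation`), the acknowledgement is `status.observedGeneration`, and "at rest" plus success/failure can be read from a Ready (or Succeeded) condition. A minimal Go sketch, using locally defined stand-ins for the real apimachinery types:

```go
package main

import "fmt"

// Object holds minimal stand-ins for metav1.ObjectMeta and status fields.
type Object struct {
	Generation         int64 // bumped by the API server on each spec change
	ObservedGeneration int64 // set by the controller once it has seen that spec
	Conditions         []Condition
}

type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// AtRest reports whether the controller has fully processed the latest
// spec: it has observed the current generation, and the Ready condition
// has settled to True or False (success or failure) rather than Unknown.
func AtRest(o Object) (done bool, succeeded bool) {
	if o.ObservedGeneration != o.Generation {
		return false, false // still answering an older question
	}
	for _, c := range o.Conditions {
		if c.Type == "Ready" {
			switch c.Status {
			case "True":
				return true, true
			case "False":
				return true, false
			}
		}
	}
	return false, false // Ready absent or Unknown: still reconciling
}

func main() {
	obj := Object{
		Generation:         3,
		ObservedGeneration: 3,
		Conditions:         []Condition{{Type: "Ready", Status: "True"}},
	}
	done, ok := AtRest(obj)
	fmt.Println(done, ok) // prints: true true
}
```

Under these assumptions, Carto would read outputs only when `AtRest` returns done=true with succeeded=true, which ties the final answer to the final question asked.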

The case of Cartographer stamping an object from multiple inputs is simply asking a question with multiple parts. Ultimately we ask one question; we stamp one object. We know the input(s) that we used to stamp that object; we know the clause(s) of the question we ask. When an answer is provided by the oracle, when the object finishes reconciling, we can associate the answer with the question, the output with the input. There is nothing special about the input updating from a field of the workload.


Now, there are other concerns that an operator may have in the above scenario. Perhaps the two updates to the workload are meant to happen in lockstep. The intention is never to have an image produced with the new build env that doesn't also have the new source repository. That synchronization is out of scope of this RFC. Here we are only concerned with tracing; with giving users the ability to discover that a given output had certain inputs. This RFC would allow the operator to observe/discover the unsynchronized behavior, which is valuable.

@waciumawanjohi waciumawanjohi changed the title RFC 20 Update resources only after success RFC 20 Read resources only after success Feb 7, 2022