
Introduce RFC 18 Workload Report Artifact Provenance #519

Draft · wants to merge 7 commits into main from rfc-0018-workload-report-artifact-provenance

Conversation

@waciumawanjohi (Contributor) commented Jan 12, 2022


name: RFC 18
about: Workload Report Artifact Provenance
Readable


Changes proposed by this PR

Opens RFC 18

Release Note

PR Checklist

Note: Please do not remove items. Mark items as done [x] or use strikethrough if you believe they are not relevant

- [ ] Linked to a relevant issue. Eg: Fixes #123 or Updates #123
- [ ] Removed non-atomic or wip commits
- [ ] Filled in the Release Note section above
- [ ] Modified the docs to match changes


- id: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
- config:
    id: 22fe69c924b55327ce3e8ca079d7295a947a3641b5ab74bdf7541dc680258c81
    config: |
Contributor:

This could get pretty big. RFC 14 proposed that this should be shasum(giant-yaml)

Contributor (author):

I wouldn't object to encoding the yaml in order to shrink it down. But I wouldn't hash the value. This RFC should make the flow of information through the supply chain observable. Hashing the value makes it possible to see that some change has occurred, but not what that change is.

Contributor:

There is also a 1MB limit on the size of a single resource, and the values might be repeated due to the addition of managedFields. I'd be careful about including other large documents directly for that reason. It seems like it would be better to include a resourceVersion or generation field referencing the resource's version, rather than the entire contents.
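For illustration, a reference-style shape might look something like this (configRef is a hypothetical field name, not something the RFC currently defines):

- config:
    id: 22fe69c924b55327ce3e8ca079d7295a947a3641b5ab74bdf7541dc680258c81
    # instead of inlining the full document, point at the object and version that held it
    configRef:
      apiVersion: v1
      kind: ConfigMap
      name: my-config
      namespace: my-namespace
      resourceVersion: "11125545"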

@jwntrs (Contributor) commented Jan 12, 2022

Copying over my comments from RFC14 that are still relevant:

In the office hours this week (Jan 10th, 2022), we talked about how we could extend this type of artifact tracing to track artifacts across clusters.

Assuming you started with a SupplyChain that looked something like the following (from git-writer example):

[source-provider] -> [image-builder] -> [config] -> [git-writer]

But if you made a small change and wrapped the git-writer in the SupplyChain with a ClusterSourceTemplate, it would let you emit the revision/url of the committed config. By doing this you would end up with an artifact representing the commit to the config repo as the last item in the status of your workload.

So your workload status might end up looking like this:

artifacts:
  - source:
      id: 000
      url: https://github.com/my-org/my-source
      revision: abc123
      passed:
      - resource-name: source-provider
        apiVersion: source.toolkit.fluxcd.io/v1beta1
        kind: GitRepository
  - image:
      id: 111
      image: gcri.io/asdfsdfasdf
      passed:
      - resource-name: image-builder
        apiVersion: kpack.io/v1alpha2
        kind: Image
      from:
      - id: 000
  - config:
      id: 222
      config: shasum(of-some-config)
      passed:
      - resource-name: my-config
        apiVersion: v1
        kind: ConfigMap
      from:
      - id: 111
  - source:
      id: 333
      url: https://github.com/my-org/my-config
      revision: xyz789
      passed:
      - resource-name: config-writer
        apiVersion: carto.run/v1alpha1
        kind: Runnable
      from:
      - id: 222

Now let's assume you are following the basic-delivery example on another cluster.

You would end up with a Deliverable with the following status:

artifacts:
  - source:
      id: 333
      url: https://github.com/my-org/my-config
      revision: xyz789
      passed:
      - resource-name: config-provider
        apiVersion: source.toolkit.fluxcd.io/v1beta1
        kind: GitRepository
  - source: some other artifact
    ...

You can see how we could correlate these two completely independent resources, based solely on their artifacts.

@jwntrs (Contributor) commented Jan 12, 2022

Copying over my comments from RFC14 that are still relevant:

@rawlingsj made a really insightful comment in another unrelated thread that we should consider here:

> Customers will likely not be willing to give access from a dev cluster (and developer users) to others such as staging / production. Instead it will probably be easier to send events out of those locked down clusters to a centralized place for users to get visibility into their environments.

My comment here definitely assumes read access to all clusters. So maybe we should start thinking about writing out artifacts as k8s events at the same time as writing them to the workload/deliverable status. That way if someone is already emitting k8s events from their prod clusters, we can integrate into their existing tooling.
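As a rough sketch, an artifact written out as a core/v1 Event might look something like the following (the reason, message, and names are illustrative, not part of this proposal):

apiVersion: v1
kind: Event
metadata:
  name: my-workload.artifact-111
  namespace: dev
type: Normal
reason: ArtifactProduced
message: "image-builder produced artifact 111 (image) from artifact 000 (source)"
involvedObject:
  apiVersion: carto.run/v1alpha1
  kind: Workload
  namespace: dev
  name: my-workload
source:
  component: cartographer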

@scothis (Contributor) left a comment

I'm in favor of capturing a reference to the resources created for a Workload on the status, but I'm skeptical that capturing the output values will give users the information they think they are getting. There's little point in merely duplicating state that could instead be looked up on the child resources directly unless the aggregation provides some value. It's not clear what questions you're trying to answer by providing this metadata.

# namespace of the resource
namespace: <:string>
# resource version of the object
resourceVersion: <:string>
Contributor:

Resource version is highly volatile. Any change (or even no change for an "empty" update/patch) to the resource will create a new version. This value is typically only used for direct comparisons (are these two resources the same?).

Unless you are very careful, it's easy to get into an infinite loop: updating the Workload status will cause the Workload to be reprocessed, which will cause the templated resource to be updated; even if that update is a no-op, the resourceVersion will bump, which then needs to be reflected on the Workload's status...

Member:

What's the underlying User Story/Axiom for wanting the RV?

Contributor (author):

> Updating the Workload status will cause the Workload to be reprocessed, which will cause the templated resource to be updated; even if that update is a no-op, the resourceVersion will bump

When the workload reconciles, Cartographer first checks for changes to a stamped object. If there is no proposed change to the spec, Cartographer will not update the object. This should protect us from the infinite loop scenario posited.

Contributor (author):

> What's the underlying User Story/Axiom for wanting the RV?

Given an artifact, I would like to know the object and spec that created it. The other information can tell us what GVK/Name/Namespace the object was. But that object could have changed in arbitrary ways. The resourceVersion will point to a unique spec for the given object.

Leveraging this knowledge in practice will require additional historical persistence of objects, as etcd is only guaranteed to keep the resourceVersions for 5 minutes.

Contributor:

> Updating the Workload status will cause the Workload to be reprocessed, which will cause the templated resource to be updated; even if that update is a no-op, the resourceVersion will bump

> When the workload reconciles, Cartographer first checks for changes to a stamped object. If there is no proposed change to the spec, Cartographer will not update the object. This should protect us from the infinite loop scenario posited.

I think Scott's point is that resourceVersion can change even if the underlying resource's spec does not. For CRDs, the metadata.generation field tracks spec changes (but not status or metadata-only changes).
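To illustrate the difference with made-up values:

metadata:
  generation: 3             # increments when the spec changes
  resourceVersion: "98765"  # increments on any write, including status-only updates
status:
  observedGeneration: 2     # the controller has not yet reconciled generation 3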

- id: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
- config:
    id: 22fe69c924b55327ce3e8ca079d7295a947a3641b5ab74bdf7541dc680258c81
    config: |
Contributor:

should this example be a map[string]string instead of a string?

Contributor (author):

The config artifact has one field, config, which is a string, similar to the image artifact.
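For example, a config artifact might carry a stamped configuration as one block string (the Knative Service content below is purely illustrative):

- config:
    id: <SHA:string>
    config: |
      apiVersion: serving.knative.dev/v1
      kind: Service
      metadata:
        name: my-app
      spec:
        template:
          spec:
            containers:
              - image: gcr.io/my-org/my-app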

@squeedee (Member) commented:

> But if you made a small change and wrapped the git-writer in the SupplyChain with a ClusterSourceTemplate

@jwntrs I think perhaps this thinking is solid, and should be enforced, that all steps, no matter what, produce an artifact. Even in the case of a kpack deploy, we should have a kpack deploy ID or something of that nature, as an artifact. That way, every Supply Chain Resource has representation in the artifacts list.

@jwntrs (Contributor) commented Jan 13, 2022

@scothis

> I'm in favor of capturing a reference to the resources created for a Workload on the status, but I'm skeptical that capturing the output values will give users the information they think they are getting. There's little point in merely duplicating state that could instead be looked up on the child resources directly unless the aggregation provides some value. It's not clear what questions you're trying to answer by providing this metadata.

The problem that we are trying to address is to make it easy for the user to identify the artifact graph that gets produced by a supply chain (commit-123 produces image-abc). If I were building a tool on top of Cartographer to show this information, and all I had was the k8s object graph, I would need to reimplement a lot of logic that already exists in Cartographer. For instance, you would need to cross-reference each template to know what fields to show from each resource. Cartographer already has this information, so we want to write it out somewhere so it can be consumed more easily.
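For instance, it is the template that declares where the interesting output lives; a rough sketch of a ClusterImageTemplate (names are illustrative) shows the imagePath that an external tool would otherwise have to rediscover for every resource:

apiVersion: carto.run/v1alpha1
kind: ClusterImageTemplate
metadata:
  name: image-builder-template
spec:
  # Cartographer reads the image output from this path on the stamped object
  imagePath: .status.latestImage
  template:
    apiVersion: kpack.io/v1alpha2
    kind: Image
    # ...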

@scothis (Contributor) commented Jan 18, 2022

> The problem that we are trying to address is to make it easy for the user to identify the artifact graph that gets produced by a supply chain (commit-123 produces image-abc)

You'll never actually know that "commit-123 produces image-abc" unless you also have a deep understanding of how the resource being templated behaves and also have knowledge of how the template is implemented. At best you can know that a new commit (commit-123) triggered a process that emitted an image (image-abc).

@jwntrs (Contributor) commented Jan 18, 2022

Some open questions that I've heard raised:

  • Should workload artifacts be wiped out when spec changes? If yes, can we fast forward through the supply chain matching known inputs to outputs and rebuild the artifact tree if nothing has changed?
  • Should we write pending artifacts (e.g. waiting on observedGeneration: 5) and move on? If we don't we might starve the supply chain if we're blocking until a specific observedGeneration appears on the status.
  • Should we write an empty artifact (only has an id) when using ClusterTemplate?

@scothis (Contributor) commented Jan 18, 2022

> Should workload artifacts be wiped out when spec changes? If yes, can we fast forward through the supply chain matching known inputs to outputs and rebuild the artifact tree if nothing has changed?

No. A resource should reflect the current known state of the world on its status to the best of its ability. There will inherently be latency; that's ok and expected. The status should indicate the recency of the status' content (Generation/ObservedGeneration, LastTransitionTime, etc.).

> Should we write pending artifacts (e.g. waiting on observedGeneration: 5) and move on? If we don't we might starve the supply chain if we're blocking until a specific observedGeneration appears on the status.

A controller should almost never wait for an async condition to become true. The general pattern is to reflect the current state and watch "interesting" resources for changes. When one of the watched resources changes, reprocess the whole control loop and pick up the latest "current" values.

@jwntrs added the rfc (Requests For Comment) label Jan 19, 2022
@jwntrs mentioned this pull request Jan 23, 2022
@waciumawanjohi (Contributor, author) commented Jan 26, 2022

> I think perhaps this thinking is solid, and should be enforced, that all steps, no matter what, produce an artifact.

I would support such an RFC. @squeedee Do you think this should be a separate RFC, or folded into this one?


I'm going to fold it into this one!

@waciumawanjohi (Contributor, author) commented:

> The problem that we are trying to address is to make it easy for the user to identify the artifact graph that gets produced by a supply chain (commit-123 produces image-abc)

> You'll never actually know that "commit-123 produces image-abc" unless you also have a deep understanding of how the resource being templated behaves and also have knowledge of how the template is implemented. At best you can know that a new commit (commit-123) triggered a process that emitted an image (image-abc).

@scothis @jwntrs This is an excellent discussion to move to RFC 20, our attempt to ensure that we can tie output N to input M.

@jwntrs (Contributor) commented Jan 26, 2022

@waciumawanjohi any thoughts on this point?

> Should we write an empty artifact (only has an id) when using ClusterTemplate?

@waciumawanjohi (Contributor, author) commented:

> Should workload artifacts be wiped out when spec changes? If yes, can we fast forward through the supply chain matching known inputs to outputs and rebuild the artifact tree if nothing has changed?

> No. A resource should reflect the current known state of the world on its status to the best of its ability. There will inherently be latency; that's ok and expected. The status should indicate the recency of the status' content (Generation/ObservedGeneration, LastTransitionTime, etc.).

Cartographer's current behavior is to continue promoting artifacts that were stamped with a previous workload spec. As long as that is the behavior, we should not wipe out artifacts from the workload status simply because the workload spec changes.

What would be the implication of no longer promoting artifacts built from previous workloads? Let's consider an example: a workload paired with a simple source (flux gitrepo) -> image (kpack) -> deploy (knative) supply chain. The workload has been doing work and there are valid artifacts at each stage. The user then updates the .spec.build.env field on the workload. What are the implications if we don't want old artifacts to be promoted?

  1. The source step is unaffected. Cartographer creates a proposed stamp of the object, sees that the spec is the same, and there's a no-op. Cartographer notes the output of the clusterSourceTemplate and moves to the next resource.
  2. For the imageTemplate, there is a change. Cartographer creates the proposed stamp, sees that the spec is different from what is on the cluster, and submits the new object definition. But this is a kpack image. It continues to expose latestImage. But "we don't want old artifacts to be promoted". Cartographer is unaware that this is an "old" artifact, or that it is connected to a previous workload spec! It would pick up this output and promote it to Knative.

To handle this scenario we would need to do one of two things.

  1. When the workload spec updates, before we update objects, delete them.
  2. Wait for objects to be in a successful state before promoting them. This begins to touch on RFC 20.

What would we gain by adopting one of these strategies? Well one thing that isn't gained is determinism about what is deployed to prod. The old workload spec would have delivered an arbitrary number of revisions through the supply chain. What we gain is clarity about which of those revisions were driven by the old workload spec and which were driven by the new workload spec.

There is a simpler way to achieve that goal: consider the workload version as a field to be captured in the from field of each artifact. Since the workload respects generation and observedGeneration (and we will presumably control the persistence layer when we take on historical record keeping), we should leverage those fields. This will protect us from the changes to the workload's resourceVersion that will occur every time we reconcile the workload.
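One possible shape for that, sketched against the proposed artifact format (the workload entry under from is hypothetical):

from:
  - # id of any artifact(s) that were inputs to the template
    id: <SHA:string>
  - # hypothetical: the workload spec that drove this artifact, identified by generation
    workload:
      generation: <:int64>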

@jwntrs @scothis

@waciumawanjohi (Contributor, author) commented:

> Should we write an empty artifact (only has an id) when using ClusterTemplate?

Yes. As long as there is a ClusterTemplate we can capture information about what objects were created on the cluster and what inputs led to them. I'll add that to the RFC.

@waciumawanjohi (Contributor, author) commented:

> Should we write an empty artifact (only has an id) when using ClusterTemplate?

> Yes. As long as there is a ClusterTemplate we can capture information about what objects were created on the cluster and what inputs led to them. I'll add that to the RFC.

The RFC specifies that artifacts are oneOf(source,image,config) and it's not clear what to call the things created by a ClusterTemplate. I'm going with object, and I'm very open to better suggestions.
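A sketch of what such an object artifact could look like, mirroring the other artifact types (field names follow the proposal; nothing here is final):

- object:
    # the sha256 of the ordered JSON of all other non-from fields
    id: <SHA:string>
    # no exposed fields: a ClusterTemplate does not provide values to later resources
    resource:
      resource-name: <:string>
      kind: <:string>
      apiVersion: <:string>
      name: <:string>
      namespace: <:string>
      resourceVersion: <:string>
    from:
      - # id of any artifact(s) that were inputs to the template
        id: <SHA:string>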

@waciumawanjohi force-pushed the rfc-0018-workload-report-artifact-provenance branch from d9f124d to ef44ba4 on January 26, 2022 23:13
@waciumawanjohi (Contributor, author) commented:

> To handle this scenario we would need to do one of two things...
> Wait for objects to be in a successful state before promoting them. This begins to touch on RFC 20.

RFC 20 now proposes exactly this. Assuming RFC 20 is accepted, we could then either wipe the artifact field for a workload when the spec is updated or note the generation of the workload in the from field of each artifact. From an information point of view they would be equivalent. There may be implications for persistence, but without a clearer picture of what we may do there I don't have a preference between these strategies.

@scothis (Contributor) commented Feb 2, 2022

Borrowing from the intent behind this RFC and 20, I'd like to propose an alternate status shape for Workloads. Not all of the fields are expected to be populatable immediately, but it gives a direction to grow towards while leaving room for additions over time. Other changes to the cartographer model will be required to be able to surface some of these fields.

Key differences:

  • focus on resources rather than artifacts. Resources are the nodes of the graph, the artifacts are an output of a node along an edge to another node.
  • normalize object type, they are so similar the previous oneOf adds more complexity than it adds clarity. Remember that users are not writing these types, they are reading them.
  • keeps extraneous fields out of object references, separates references to template and stamped resources
  • resources are linked by resource names rather than by values that happen to be the same
...
status:
  observedGeneration: # int64
  conditions: # []metav1.Condition
  supplyChainRef:
    apiVersion: # string
    kind: # string
    namespace: # string (empty for cluster scope)
    name: # string
  resources:
  - name: # string (resource name as defined by the supply chain)
    ref: # corev1.ObjectReference (stamped resource)
      apiVersion: # string
      kind: # string
      namespace: # string (empty for cluster scope)
      name: # string
    templateRef: # corev1.ObjectReference (template resource)
      apiVersion: # string
      kind: # string
      namespace: # string (empty for cluster scope)
      name: # string
    inputs: # (I'm not sure how practical this is today given Cartographer's loose relationship between templates)
    - name: # string (resource name)
      value: # optional json.RawExtension
    outputs:
    - name: # string
      value: # json.RawExtension (may be too large to reasonably include)
      lastTransitionTime: # metav1.Time

    # these fields are modeled after metav1.Condition
    type: # enum 'Source', 'Image', or 'Config' (consider dropping if redundant with templateRef.kind)
    status: # enum 'True', 'False', or 'Unknown' (based on metav1.ConditionStatus)
    observedGeneration: # int64 (alternate to resourceVersion)
    lastTransitionTime: # metav1.Time
    reason: # string
    message: # string

@cirocosta (Contributor) commented Feb 2, 2022

Interesting @scothis; trying to make some sense of it by making it a little more concrete with an example supply chain (source-provider -> image-builder):

status:
  resources:
    - name: source-provider
      ref:
        kind: GitRepository ...
      templateRef:
        kind: ClusterSourceTemplate ...
      inputs: []
      outputs:
        - name: ?
          value:  {url: git.io/foo/bar, ref: b4db33f}
          lastTransitionTime: 1
        - name: ?
          value:  {url: git.io/foo/bar, ref: c001b33f}
          lastTransitionTime: 2
      type: Source
      status: True
      lastTransitionTime: 2

    - name: image-builder
      ref:
        kind: Image
      templateRef:
        kind: ClusterImageTemplate
      inputs:
        - name: source-provider
          value: {url: git.io/foo/bar, ref: b4db33f}
        - name: source-provider
          value: {url: git.io/foo/bar, ref: c001b33f}
      outputs:
        - name: ?                                             # (!!)
          value:  {image: registry.io/foo@sha256:010101}
          lastTransitionTime: 3
      type: Image
      status: True
      lastTransitionTime: 3

In the example above, just looking at status.resources.image-builder.outputs, one wouldn't be able to tell if that output came from the first input (b4db33f) or the second (c001b33f), as the outputs are not linked to the inputs. Am I thinking about this right?

thx!

@scothis (Contributor) commented Feb 2, 2022

@cirocosta inputs/outputs were not intended to be a historical record of how a value changed over time, but to capture the "current" output/input for a resource. The definition of current will likely evolve based on other proposals.

So what you sketched as:

      outputs:
        - name: ?
          value:  {url: git.io/foo/bar, ref: b4db33f}
          lastTransitionTime: 1
        - name: ?
          value:  {url: git.io/foo/bar, ref: c001b33f}
          lastTransitionTime: 2

would actually be

      outputs:
        - name: url
          value:  git.io/foo/bar
          lastTransitionTime: 1
        - name: ref
          value:  c001b33f
          lastTransitionTime: 2

Note: the transition times are different because the url didn't change, but the ref did in your example.

@waciumawanjohi (Contributor, author) commented:

@scothis
What you've proposed is achievable. But I wonder what the motivation is for the status proposed here. The original user motivation (written originally in RFC 14) stated: "it is important to know when both inputs have fully traversed the supply chain." This proposal doesn't provide the information necessary to answer that question from the workload status.

@squeedee (Member) commented Feb 2, 2022

> Yes. As long as there is a ClusterTemplate we can capture information about what objects were created on the cluster and what inputs led to them. I'll add that to the RFC.

Why do we want this to be typed? Aren't these two stipulations enough?

  1. A ClusterTemplate's outputRef can refer to any node of any type (string, array, deeply nested object)
  2. A ClusterTemplate's output is never a valid input. For valid inputs, you must use one of the typed Template kinds.

@waciumawanjohi (Contributor, author) commented:

> Why do we want this to be typed?

In the workshopped format, the type is the top level field. Shall we change that to:

status:
  artifacts:
    - # the sha256 of the ordered JSON of all other non-from fields
      id: <SHA:string>
      # optional: oneOf(source,image,config)
      source:
        # exposed fields - in this case url and revision
        uri: <:string>
        revision: <:string>
      # the object which produced this artifact
      resource:
        # name of the resource in the supply chain
        resource-name: <:string>
        # GVK of the resource
        kind: <:string>
        apiVersion: <:string>
        # name of the resource on the cluster
        name: <:string>
        # namespace of the resource
        namespace: <:string>
        # resource version of the object
        resourceVersion: <:string>
      from:
        - # id of any artifact(s) that were inputs to the template
          id: <SHA:string>

@evankanderson (Contributor) left a comment

A few notes, but I'd be super-excited to even start with the list of resources / stages / artifacts managed by the Workload, and add in the "triggered by" separately.

Comment on lines +63 to +75
# the object which produced this artifact
resource:
  # name of the resource in the supply chain
  resource-name: <:string>
  # GVK of the resource
  kind: <:string>
  apiVersion: <:string>
  # name of the resource on the cluster
  name: <:string>
  # namespace of the resource
  namespace: <:string>
  # resource version of the object
  resourceVersion: <:string>

Comment on lines +59 to +62
id: <SHA:string>
# exposed fields - in this case url and revision
uri: <:string>
revision: <:string>
Contributor:

Should these be top-level fields, or possibly under an outputs key?

Comment on lines +76 to +78
from:
  - # id of any artifact(s) that were inputs to the template
    id: <SHA:string>
Contributor:

How does Cartographer track the from information -- does it require that each resource expose an observedGeneration or equivalent in status? If not, Cartographer may observe a resource's status before it has reconciled a spec change and assume that the status goes with the current spec, when it actually applies to an older spec.


namespace: my-namespace
resourceVersion: "11125545"
from:
- id: 23156c7ac2170fe95f85a1ad42522c408e67038f8393eea6cd8551e07457c5d7 # <--- The id noted above, as the source-tester consumes the source artifact from source-provider
Contributor:

If a component like kpack can generate a new build based on external stimulus (i.e. addition of a new base image), what happens to from here?

@scothis mentioned this pull request Feb 22, 2022