
[WIP] Add max-object-count error to nomos vet #1531

Open · wants to merge 2 commits into main from karl-nomos-vet-limit
Conversation

karlkfi
Contributor

@karlkfi karlkfi commented Jan 21, 2025

Max Object Count

  • Max defaults to 5000, but is configurable with --max-object-count
  • 5000 objects usually fit within 1.5 MiB, as long as the managed object metadata isn't too long. In my tests with CronTab, 5k works but 6k is too big. YMMV with different objects.

Example usage:

$ git clone https://github.com/config-sync-examples/crontab-crs
$ cd crontab-crs
$ wget https://raw.githubusercontent.com/GoogleContainerTools/kpt-config-sync/refs/heads/main/e2e/testdata/customresources/changed_schema_crds/old_schema_crd.yaml -O configs/crontab-crd.yaml
$ cd configs

$ nomos vet --source-format unstructured --max-inventory-size 0
Error: KNV1070: Maximum number of objects exceeded. Found 13064, but expected no more than 5000. Reduce the number of objects being synced to this cluster in your source of truth to prevent your ResourceGroup inventory object from exceeding the etcd object size limit. For instructions on how to break up a repository into multiple repositories, see https://cloud.google.com/kubernetes-engine/enterprise/config-sync/docs/how-to/breaking-up-repo

For more information, see https://g.co/cloud/acm-errors#knv1070

$ nomos vet --source-format unstructured --max-inventory-size 0 --max-object-count 20000
✅ No validation issues found.

$ nomos vet --source-format unstructured --max-inventory-size 0 --max-object-count 0 # disabled
✅ No validation issues found.

Max Inventory Size in Bytes

  • Max defaults to 1572864 bytes (aka 1.5 MiB), but is configurable with --max-inventory-size (e.g. 3MiB = 3145728 bytes)

Example usage:

$ git clone https://github.com/karlkfi/crontab-crs
$ cd crontab-crs
$ git checkout karl-test-max-object-count
$ cd configs

$ nomos vet --source-format unstructured --max-object-count 0
Error: KNV1071: Maximum inventory size exceeded. Estimated inventory object size is 1601479 bytes, but the maximum in etcd defaults to 1572864 bytes. To fix, split the resources across multiple repositories. https://cloud.google.com/kubernetes-engine/enterprise/config-sync/docs/how-to/breaking-up-repo

For more information, see https://g.co/cloud/acm-errors#knv1071

$ nomos vet --source-format unstructured --max-object-count 0 --max-inventory-size 3145728 # Max for GKE
✅ No validation issues found.

$ nomos vet --source-format unstructured --max-object-count 0 --max-inventory-size 0 # disabled
✅ No validation issues found.

TODO

  • Vet the interfaces & defaults with customers who frequently have this problem.

Building nomos

To build your own nomos CLI:

$ git clone https://github.com/karlkfi/kpt-config-sync
$ cd kpt-config-sync
$ git checkout karl-nomos-vet-limit
$ make build-cli

The build outputs the nomos binary to ./.output/go/bin/${ARCH}/nomos


@code-review-assist code-review-assist bot left a comment


Hi team,

I'm currently reviewing this pull request and will provide feedback shortly. In the meantime, here's a summary for you and other reviewers to quickly get up to speed:

This pull request adds a new validation check to nomos vet to limit the maximum number of objects processed. The intent is to prevent errors caused by exceeding etcd's object size limits when syncing a large number of resources.

Key Changes:

  • Added a new flag --max-object-count to the nomos vet command. This flag allows users to specify the maximum number of objects allowed. It defaults to 5000 but is configurable.
  • If the number of objects exceeds the specified limit, nomos vet will now return a KNV1070 error, providing guidance on how to mitigate the issue by breaking up the repository.
  • Updated the vet.go file to incorporate the new flag and error handling.
  • Added a new error type (MaxObjectCountError) in the system package to represent the new validation error.
  • Modified the final validation step in pkg/validate/validate.go to use the new MaxObjectCount option.
  • Added a new file pkg/importer/analyzer/validation/system/max_object_count_validator.go to implement the new validation logic.
  • Updated the examples.go file to include an example of the new error.

Specific File Changes:

  • cmd/nomos/vet/vet.go: Added the --max-object-count flag and updated the runVet function to use the new VetOptions struct (lines 20-23, 45-53, 64-79).
  • cmd/nomos/vet/vet_impl.go: Created a new VetOptions struct and modified the runVet function to accept and use the new options (lines 37-45, 52-67, 86-106).
  • cmd/nomoserrors/examples/examples.go: Added an example of the new MaxObjectCountError (lines 321-325).
  • pkg/importer/analyzer/validation/system/max_object_count_validator.go: New file implementing MaxObjectCountError and the DefaultMaxObjectCount constant (lines 1-47).
  • pkg/validate/final/validate/max_object_count_validator.go: New validator function to check for exceeding the maximum object count (lines 1-40).
  • pkg/validate/validate.go: Updated the Hierarchical and Unstructured functions to use the new MaxObjectCount option in final validation (lines 88-94, 158-172, 233-251).
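Based on the summary above, the core count check can be sketched as follows. This is an illustrative sketch only: the function name, signature, and error wording here are assumptions, not the actual implementation in the PR.

```go
package main

import "fmt"

const defaultMaxObjectCount = 5000

// validateMaxObjectCount sketches the KNV1070 check described above:
// fail when the number of objects found in the source exceeds the limit.
func validateMaxObjectCount(found, maxCount int) error {
	if maxCount <= 0 {
		// --max-object-count 0 disables the check.
		return nil
	}
	if found > maxCount {
		return fmt.Errorf("KNV1070: maximum number of objects exceeded: found %d, expected no more than %d", found, maxCount)
	}
	return nil
}

func main() {
	// Mirrors the example usage above: 13064 objects against the default limit.
	fmt.Println(validateMaxObjectCount(13064, defaultMaxObjectCount))
	fmt.Println(validateMaxObjectCount(13064, 20000))
	fmt.Println(validateMaxObjectCount(13064, 0))
}
```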

I'll be back soon with a more detailed review.

Meanwhile, here's a little something to brighten your day:

Code flows like a stream,
Errors hide in shadowed lines,
Reviews bring the light.


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from karlkfi. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@code-review-assist code-review-assist bot left a comment


This pull request adds a --max-object-count flag to nomos vet to limit the number of objects in a configuration. This is a useful addition to prevent exceeding etcd object size limits. The implementation is well-structured and easy to follow. I have a few minor suggestions for improvement.

I've referenced the following elements from the Google Go Style Guide in this review:

  • Named return values should only be used for short functions, or when they help document the meaning of the return values.
  • Error strings should not be capitalized or end with punctuation.

@karlkfi karlkfi force-pushed the karl-nomos-vet-limit branch from 517150c to e619d0a Compare January 21, 2025 20:02
@karlkfi karlkfi requested a review from mikebz January 21, 2025 20:05
@karlkfi karlkfi changed the title Add max-object-count error to nomos vet [WIP] Add max-object-count error to nomos vet Jan 21, 2025
@mikebz
Contributor

mikebz commented Jan 21, 2025

Looking at this, it seems like the current count is just the number of resources. I think this helps with a set of problems (especially when the resource count is large), but could we also validate, via some heuristic function, whether the status for all of these objects is going to exceed 1.5MB?

@karlkfi
Contributor Author

karlkfi commented Jan 21, 2025

could we also validate via some heuristic function that the status for all of these objects is going to exceed 1.5MB?

Maybe.

The problem is that the object size is influenced by a number of factors:

  1. The number of objects (x2 for spec + status references)
  2. The string lengths of the name, namespace, kind, and API group for each managed object
  3. The string lengths of name and namespace of the RSync
  4. Any metadata added to the ResourceGroup by other cluster addons, fleet features, and third party controllers.
  5. Any possible error messages encountered when syncing or reconciling these objects (when status is enabled)
  • Factors 1 and 2 we can calculate in nomos vet from the source.
  • Factor 3 we can guess, or use a hard-coded max length.
  • Factor 4 is difficult to guess, but usually not huge unless giant annotations are added; we could hard-code a guess here too.
  • Factor 5 is unbounded. It's theoretically possible for a single error to be so long that it exceeds 1.5MB, though this is very unlikely. However, if the inventory is already close to the limit, it will likely work fine until there's an error, or too many errors. Errors currently have no max length, and most errors are larger than a single object reference, so the total error message size can easily eclipse the size of the object references in the spec and status, even if only a dozen objects have errors.

So we could make a heuristic, but it would be a guess, and probably not particularly accurate.

As with the max object count, we can tune it over time as needed, to make it more or less likely to trigger a false positive, but the higher we go, the more likely it is to give false negatives.

@karlkfi
Contributor Author

karlkfi commented Jan 21, 2025

Looks like I was wrong about number 5.

While the RG CRD does have fields for conditions on each of the object statuses, the code doesn't populate them. The individual errors from the applier are only reported in the RSync status. So only the following status fields are populated for each object:

  • group
  • kind
  • namespace
  • name
  • status
  • strategy
  • actuation
  • reconcile

All of those fields have implicit max lengths. So we might be able to make a useful heuristic.

However, I also discovered that GKE currently reports 3MiB as the new limit, not just 1.5MiB. I'm not sure when that was changed, but I think it's GKE-specific. It probably doesn't apply to OSS K8s or EKS/AKS.

@mikebz
Contributor

mikebz commented Jan 21, 2025

Thank you for the detailed breakdown. My guess is that, as long as the disclaimer is there, the heuristic function works better than anything a user would come up with: you already know all the things that go into the length. I think this is sort of like memory and disk space on a server; you probably want a warning when the load gets to X% of the etcd row. Right now we don't have any way to warn the user, or even to explain what can happen.

Also it might be worthwhile for ResourceGroup to deal with excessive error values. I assume those are mirrored from the resource statuses, so maybe something over x-many characters can be truncated with "..."

@karlkfi
Contributor Author

karlkfi commented Jan 21, 2025

Also it might be worthwhile for ResourceGroup to deal with excessive error values. I assume those are mirrored from the resource statuses, so maybe something over x-many characters can be truncated with "..."

I was wrong about the error messages. The ResourceGroup doesn't actually store them the way I thought it did. There are fields for them in the CRD, but the inventory client from kpt doesn't populate them. So the inventory size should actually be much easier to predict in nomos vet.

@karlkfi
Contributor Author

karlkfi commented Jan 22, 2025

I added another validator that builds a fake inventory object in memory, serializes it to JSON, and counts the bytes. It's not perfect, but it might be more accurate. Verifying that accuracy is challenging, though.
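The approach described here can be sketched roughly as below. This is a minimal sketch, assuming an illustrative schema: the struct fields, the fake object shape, and the error wording are assumptions, not the real ResourceGroup schema or validator code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// objMeta mimics a per-object reference in the inventory; field names
// are illustrative.
type objMeta struct {
	Group     string `json:"group"`
	Kind      string `json:"kind"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}

const defaultMaxInventoryBytes = 1572864 // 1.5 MiB

// checkInventorySize builds a fake inventory object, marshals it to JSON,
// and compares the byte count against the limit (0 disables the check).
func checkInventorySize(refs []objMeta, maxBytes int) (int, error) {
	fake := map[string]interface{}{
		"apiVersion": "kpt.dev/v1alpha1",
		"kind":       "ResourceGroup",
		"spec":       map[string]interface{}{"resources": refs},
		"status":     map[string]interface{}{"resourceStatuses": refs},
	}
	data, err := json.Marshal(fake)
	if err != nil {
		return 0, err
	}
	if maxBytes > 0 && len(data) > maxBytes {
		return len(data), fmt.Errorf("KNV1071: estimated inventory size %d bytes exceeds max %d bytes", len(data), maxBytes)
	}
	return len(data), nil
}

func main() {
	refs := []objMeta{{"stable.example.com", "CronTab", "default", "my-crontab"}}
	n, err := checkInventorySize(refs, defaultMaxInventoryBytes)
	fmt.Println(n, err)
}
```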

- Use 1.5MiB as the max value (GKE etcd allows up to 3MiB)
- Allow setting and disabling the validation with the
  --max-inventory-size flag (in bytes)
@karlkfi karlkfi force-pushed the karl-nomos-vet-limit branch from efccf6f to 92356be Compare January 22, 2025 01:01
@mikebz
Contributor

mikebz commented Jan 22, 2025

@karlkfi is the description of the PR current? Thinking that instead of a Google doc we might want to just send the PR to customers and let them comment on this.

@karlkfi
Contributor Author

karlkfi commented Jan 22, 2025

is the description of the PR current?

Yes.

Thinking that instead of a Google doc we might want to just send the PR to customers and let them comment on this.

That's fine.

@karlkfi
Contributor Author

karlkfi commented Jan 22, 2025

For the record, I also wrote a bash script that can count objects in a directory, but it doesn't handle hydration or cluster selectors the way nomos vet does. So this vet solution is superior when working with "dry" sources, but the script works fine on pre-hydrated "wet" sources. https://gist.github.com/karlkfi/22d3da7603ceac4239c157a180454ea2
