-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add max-object-count error to nomos vet #1531
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi team,
I'm currently reviewing this pull request and will provide feedback shortly. In the meantime, here's a summary for you and other reviewers to quickly get up to speed:
This pull request adds a new validation check to nomos vet
to limit the maximum number of objects processed. The intent is to prevent errors caused by exceeding etcd's object size limits when syncing a large number of resources.
Key Changes:
- Added a new flag
--max-object-count
to thenomos vet
command. This flag allows users to specify the maximum number of objects allowed. It defaults to 5000 but is configurable. - If the number of objects exceeds the specified limit,
nomos vet
will now return aKNV1070
error, providing guidance on how to mitigate the issue by breaking up the repository. - Updated the
vet.go
file to incorporate the new flag and error handling. - Added a new error type (
MaxObjectCountError
) in thesystem
package to represent the new validation error. - Modified the final validation step in
pkg/validate/validate.go
to use the newMaxObjectCount
option. - Added a new file
pkg/importer/analyzer/validation/system/max_object_count_validator.go
to implement the new validation logic. - Updated the
examples.go
file to include an example of the new error.
Specific File Changes:
File | Changes |
---|---|
cmd/nomos/vet/vet.go |
Added the --max-object-count flag and updated the runVet function to use the new VetOptions struct. Lines 20-23, 45-53, 64-79. |
cmd/nomos/vet/vet_impl.go |
Created a new VetOptions struct and modified the runVet function to accept and use the new options. Lines 37-45, 52-67, 86-106. |
cmd/nomoserrors/examples/examples.go |
Added an example of the new MaxObjectCountError . Line 321-325 |
pkg/importer/analyzer/validation/system/max_object_count_validator.go |
Added a new file to implement the MaxObjectCountError and DefaultMaxObjectCount constant. Lines 1-47 |
pkg/validate/final/validate/max_object_count_validator.go |
Added a new validator function to check for exceeding the maximum object count. Lines 1-40 |
pkg/validate/validate.go |
Updated the Hierarchical and Unstructured functions to use the new MaxObjectCount option in final validation. Lines 88-94, 158-172, 233-251 |
I'll be back soon with a more detailed review.
Meanwhile, here's a little something to brighten your day:
Code flows like a stream,
Errors hide in shadowed lines,
Reviews bring the light.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request adds a --max-object-count
flag to nomos vet
to limit the number of objects in a configuration. This is a useful addition to prevent exceeding etcd object size limits. The implementation is well-structured and easy to follow. I have a few minor suggestions for improvement.
I've referenced the following elements from the Google Go Style Guide in this review:
- Named return values should only be used for short functions, or when they help document the meaning of the return values.
- Error strings should not be capitalized or end with punctuation.
-- Max defaults to 5000, but is configurable with --max-object-count
517150c
to
e619d0a
Compare
Looking at this it seems like the current count is just a number of resources. I think this helps with a set of problems (esp when the resource count is large), but could we also validate via some heuristic function that the status for all of these objects is going to exceed 1.5MB? |
Maybe. The problem is that the object size is influenced by a number of factors:
So we could make a heuristic, but it would be a guess, and probably not particularly accurate. As with the max object count, we can tune it over time as needed, to make it more or less likely to trigger a false positive, but the higher we go, the more likely it is to give false negatives. |
Looks like I was wrong about number 5. While the RG CRD does have fields for conditions on each of the object statuses, the code doesn't populate them. The individual errors from the applier are only reported in the RSync status. So only the following status fields are populated for each object:
All of those fields have implicit max lengths. So we might be able to make a useful heuristic. However, I also discovered that GKE currently reports 3MiB as the new limit, not just 1.5MiB. I'm not sure when that was changed, but I think it's GKE-specific. It probably doesn't apply to OSS K8s or EKS/AKS. |
thank you for the detailed breakdown. My guess is that as long as the disclaimer is there the heuristic function works better than anything a user would come up with. You already know all the things that go into the length. I think this is sort of like memory and disk space on a server, you want to probably have a warning when the load gets to X% of the etcd row. I think right now we don't have any way to warn the user or to even explain what can happen. Also it might be worthwhile for ResourceGroup to deal with excessive error values. I assume those are mirrored from the resource statuses, so maybe something over x-many characters can be truncated with "..." |
I was wrong about the error messages. The ResourceGroup doesn't actually store them like I thought they did. There's fields for it in the CRD, but the inventory client from kpt doesn't populate them. So the inventory size should actually be much easier to predict in nomos vet. |
I added another validator that builds a fake inventory object in memory, serializes to json, and counts the bytes. It's not perfect, but it might be more accurate. Verifying that it's accurate is challenging though. |
- Use 1.5MiB as the max value (GKE etcd allows up to 3MiB) - Allow setting and disabling the validation with the --max-inventory-size flag (in bytes)
efccf6f
to
92356be
Compare
@karlkfi is the description of the PR current? Thinking that instead of a Google doc we might want to just send the PR to customers and let them comment on this. |
Yes.
That's fine. |
For the record, I also wrote a bash script that can count objects in a directory, but it doesn't handle hydration or cluster selectors the way |
Max Object Count
Example usage:
Max Inventory Size in Bytes
Example usage:
TODO
Building nomos
To build your own nomos CLI:
Builder outputs to
./.output/go/bin/${ARCH}/nomos