-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix state store #4441
base: main
Are you sure you want to change the base?
Fix state store #4441
Conversation
* use golden file for store migration test * remove withFile function * migrate take in a storage.Store instead of the storage's path. It's needed so we can set the encrypted store vault's path
* simplify fleetapi.Actions.UnmarshalJSON * add test to ensure the state store is correctly loaded from disk * skip state store migration tests, they will be fixes on a follow-up PR as part of #3912
It modifies the state store API to match the current needs.
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
This pull request is now in conflicts. Could you fix it? 🙏
|
This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
NOTE: |
- take bugfix/3912-borken-stat-store as the correct one for any store change - the code ill likely be broken after this merge
fe0f59c
to
f61bbdf
Compare
@@ -21,7 +21,7 @@ var ( | |||
) | |||
|
|||
type fleetActionWithAgents struct { | |||
ActionID string `json:"action_id"` // Note the action_id here, since the signed action uses action_id for id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aleksmaus is it the ES schema or the API schema? The API schema is id
and now the whole agent also used id
instead of action_id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I remember, it is ES field name, and I think the signed JSON payload uses action_id as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understood correctly, for those actions,there is an action inside the action. Is that correct?
Here you're deserialising a whole action from the data
field.
Is it just a copy of the action or something else?
Another question. Is there any reason for copying the action model to this package instead of using the models from fleetapi
. Part of this PR consists on unifying action serialisation/deserialisation on the fleetapi
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is some old example I found of the signed action:
{
"@timestamp": "2023-06-12T21:02:08.783Z",
"expiration": "2023-07-12T21:02:08.783Z",
"agents": [
"f19a072f-49c9-4a1d-94f7-9c1c519eab19"
],
"action_id": "164a6819-5c58-40f7-a33c-821c98ab0a8c",
"data": {
"version": "8.9.0"
},
"type": "UPGRADE",
"traceparent": "00-baf54d4e5b51487c5c7dd520eef3ba6f-6959cd1c34a1bfd9-01",
"signed": {
"data": "eyJAdGltZXN0YW1wIjoiMjAyMy0wNi0xMlQyMTowMjowOC43ODNaIiwiZXhwaXJhdGlvbiI6IjIwMjMtMDctMTJUMjE6MDI6MDguNzgzWiIsImFnZW50cyI6WyJmMTlhMDcyZi00OWM5LTRhMWQtOTRmNy05YzFjNTE5ZWFiMTkiXSwiYWN0aW9uX2lkIjoiMTY0YTY4MTktNWM1OC00MGY3LWEzM2MtODIxYzk4YWIwYThjIiwiZGF0YSI6eyJ2ZXJzaW9uIjoiOC45LjAifSwidHlwZSI6IlVQR1JBREUiLCJ0cmFjZXBhcmVudCI6IjAwLWJhZjU0ZDRlNWI1MTQ4N2M1YzdkZDUyMGVlZjNiYTZmLTY5NTljZDFjMzRhMWJmZDktMDEifQ==",
"signature": "MEQCIGxsrI742xKL6OSIfT6abkBfH+plFiTSO26RHNV1pHE8AiBxXgt3LlIowCiLZ/ayFqQWk0ge+c/J1L58csqg6R9nyA=="
}
}
The data
field contains the base64 encoded action JSON that is signed by Kibana. This is the source of truth for the signed actions for the tamper protection for Endpoint for example. The agent initially had signature validation, but it was commented out for the initial release due to risks concerns raised by PM. So the Agent just passes the action through to the beats/Endpoint leaving it up to them to parse and validate the signed payload if it is present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another question. Is there any reason for copying the action model to this package instead of using the models from
fleetapi
. Part of this PR consists on unifying action serialisation/deserialisation on thefleetapi
package.
As far as I remember (my memories are vague), the schema came after the initial implementation and there were some incompatibilities with serialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is Kibana only by design for end-to-end signing. Only Kibana has the key(s) to sign the policy and actions. The goal was to prevent any tampering in the middle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would need some good end to end test coverage with Endpoint here in order to make sure the schema changes will not break Endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would need some good end to end test coverage with Endpoint here in order to make sure the schema changes will not break Endpoint.
You mean future schema changes or the ones in this PR?
The changes here should not have affected the actions schema outside the state/action store. The changes are to match fleet-server schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested, the signed actions aren't affected. They still working
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
return nil | ||
} | ||
|
||
if action.Type != fleetapi.ActionTypePolicyChange { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we ignore everything that isn't a policy change?
I went to test this manually, and realized that none of our integration tests are actually testing the migration code. We can conclude this because:
That actions still work with the changes in this PR can be tested with an agent installed for the first time. I'm not sure how you'd test the migration logic here prior to merging without relying entirely on unit tests. |
Thinking a bit more about how to handle the failure cases, I think the current implementation works with the condition that every error during the migration causes the agent to exit. The key learning for me is that the migration only happens on upgrade, and the action state store only exists in the versioned home directory. This means that if we get the agent to exit on failure, it will roll back to the unmodified version of the state store in the original agent, and the failed or corrupted store in the new agent will be removed. We can then try again or deliver a bug fix with a new patch version. Ideally there would be a test proving it behaves like this, that errors cause a rollback and the original agent is unmodified. This is complicated by the observation above that that we can't test the migration logic in its current form in integration tests at all because until this code is in a snapshot build. Perhaps a unit test can prove this. The other risk to mitigate with this that the mechanical changes to the on disk format have broken or degraded existing actions in some subtle way. The prior art for this PR is #398 where we introduced state encryption, which went on to break upgrades #578. I see similar discussion around |
Adding @michel-laterman since he has the most context on the Fleet Server action implementation. @michel-laterman the goal here is to make the action store in the agent have the same on disk format as fleet server (JSON with the same schema). |
…astic-agent into bugfix/3912-borken-stat-store
we can do an upgrade to the PR build if we use a custom binary source pointing to a local http server which will serve the PR build. The issue I see is to force the agent to add actions to the store and then check it after the migration.
For the migration itself, there are already unit test in place for the migration, including tests for the function called by agent: https://github.com/elastic/elastic-agent/pull/4441/files#diff-c4620db6b6efe5946032ce0788dcb60462c88220840108d12e616a7e2f7057abR541-R602 also there are more test for the function which initiates the migration. The only difference between this one and the above is the creation of the encrypted store: Or do you meant to test more than that? If so, what do you have in mind? |
I was doing some experimentation, we can even assert the stored action after upgrade. Anyway the agent would not be quite healthy if the state store data is lost. It's just the queue I don't see how to check after the upgrade as the upgrade action is removed from the queue as soon as the upgrade starts. We do have a elastic-agent/internal/pkg/agent/application/managed_mode.go Lines 163 to 165 in 4fa6bb6
The agent ends up connected to fleet, healthy on fleet, but forever waiting configuration. The good thing is that a policy change fixes that. So even losing the store, as long as it isn't corrupted, it's possible to remotely fix it. I was testing the current release, 8.13, it's the status if the store is erased and the agent restarted
|
…astic-agent into bugfix/3912-borken-stat-store
You still won't be able to get past the binary signature check. The Fleet upgrade action doesn't support skip verify. This is a common pain point but I haven't thought of a way to get around it that doesn't weaken security. |
We could maybe try to verify this transitively. For example if we:
Then we will have proved that an error during state store migration causes a rollback if it happens during an upgrade, without actually having to test the state store migration in an upgrade by itself. You will still be left with figuring out how to get the agent to do a state store migration when not upgrading, but I don't think there is any actual dependency on an upgrade. The migration happens if the agent sees the old action store file(s) exist when it starts, it doesn't care how it got there. Can you just artificially set this up? Usually it would happen because the upgrade copied the store files, but you could just do that part manually. |
Exactly, the store migration happens every time the agent starts up. So as long as the new agent reads the old store, the migration will happen. Thus it should not be an issue to set it up. It'd be the following:
I'll manually test it now. @cmacknz, do you want it to be an integration test or just the manual test is fine for you? |
We need an automated test for this. In 8.6 we actually left the unit tests for the encrypted store in place but forget to keep the calls to the logic for it in the agent code, so we need proof that the agent binary itself can do the migration. #2249 (comment) for reference. |
Specifically, we need proof that the agent's entrypoint from the run command can do this. If you can do this with a unit test that's fine, it doesn't have to be an integration test, but that would be fine too. |
At the end, after discussing with the team we agreed the more complete solution is to be able to upgrade to a agent built from the PR. This issue was created to add such test: |
Quality Gate passedThe SonarQube Quality Gate passed, but some issues were introduced. 4 New issues |
This pull request is now in conflicts. Could you fix it? 🙏
|
What does this PR do?
Fixes the agent's state store. It's the following PRs combined and a few needed changes to fix the conflicts with
main
.Why is it important?
Currently, the agent loses data when loading actions from the state store. With the fix, it won't loose data anymore and now the action schema and serialisation are the same for the state store and the fleetapi
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files./changelog/fragments
using the changelog tool[ ] I have added an integration test or an E2E testHow to test this PR locally
reproduce the issue decribed on #3912. Now it should not fail.
Related issues
error parsing version ""
if agent restarts before the upgrade starts #3912Questions to ask yourself