
Fix state store #4441

Draft: wants to merge 31 commits into main from bugfix/3912-borken-stat-store

Conversation

@AndersonQ AndersonQ commented Mar 20, 2024

What does this PR do?

Fixes the agent's state store. It's the following PRs combined and a few needed changes to fix the conflicts with main.

Why is it important?

Currently, the agent loses data when loading actions from the state store. With the fix, it won't lose data anymore, and the action schema and serialisation are now the same for the state store and the fleetapi.
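
For context, unifying the schema means one JSON shape and one deserialisation path for actions. Below is a minimal sketch of type-dispatched deserialisation in the spirit of fleetapi.Actions.UnmarshalJSON; the types and field names are illustrative, not the actual elastic-agent code:

package main

import (
	"encoding/json"
	"fmt"
)

// Action is a minimal stand-in for the fleetapi action interface.
type Action interface{ ID() string }

// ActionUpgrade and ActionUnknown are illustrative concrete actions; the
// field names mirror the fleet-server schema (id, type, data) but are not
// the actual elastic-agent types.
type ActionUpgrade struct {
	ActionID string `json:"id"`
	Type     string `json:"type"`
	Data     struct {
		Version string `json:"version"`
	} `json:"data"`
}

func (a ActionUpgrade) ID() string { return a.ActionID }

type ActionUnknown struct {
	ActionID string `json:"id"`
	Type     string `json:"type"`
}

func (a ActionUnknown) ID() string { return a.ActionID }

// Actions deserialises a heterogeneous list by peeking at each element's
// "type" and then unmarshalling the matching concrete action. Both the
// state store and the fleetapi client can share this single implementation.
type Actions []Action

func (as *Actions) UnmarshalJSON(data []byte) error {
	var raws []json.RawMessage
	if err := json.Unmarshal(data, &raws); err != nil {
		return err
	}
	for _, raw := range raws {
		var head struct {
			Type string `json:"type"`
		}
		if err := json.Unmarshal(raw, &head); err != nil {
			return err
		}
		switch head.Type {
		case "UPGRADE":
			var a ActionUpgrade
			if err := json.Unmarshal(raw, &a); err != nil {
				return err
			}
			*as = append(*as, a)
		default:
			var a ActionUnknown
			if err := json.Unmarshal(raw, &a); err != nil {
				return err
			}
			*as = append(*as, a)
		}
	}
	return nil
}

func main() {
	var as Actions
	in := `[{"id":"a1","type":"UPGRADE","data":{"version":"8.9.0"}}]`
	if err := json.Unmarshal([]byte(in), &as); err != nil {
		panic(err)
	}
	fmt.Printf("%#v\n", as[0]) // decodes as ActionUpgrade
}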

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

How to test this PR locally

Reproduce the issue described in #3912. With this fix, it should no longer fail.

Related issues

  • #3912

* use golden file for store migration test
* remove withFile function
* make migrate take a storage.Store instead of the storage's path; it's needed so we can set the encrypted store vault's path (see the sketch below)
* simplify fleetapi.Actions.UnmarshalJSON
* add test to ensure the state store is correctly loaded from disk
* skip state store migration tests; they will be fixed in a follow-up PR as part of #3912
It modifies the state store API to match the current needs.
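
As a sketch of the migrate signature change listed above (interface and names are approximated from the description, not copied from the code):

package store

import (
	"fmt"
	"io"
)

// Store approximates the storage.Store behaviour the migration now receives:
// the caller constructs it (for example an encrypted store with a specific
// vault path) and migrate only loads and saves through it. Sketch only.
type Store interface {
	Load() (io.ReadCloser, error)
	Save(io.Reader) error
}

// Before (roughly): migrate(log Logger, storePath string) error
// After: the caller owns store construction, including the vault path.
func migrate(store Store) error {
	r, err := store.Load()
	if err != nil {
		return fmt.Errorf("loading state store: %w", err)
	}
	defer r.Close()
	// decode the old format here, re-encode using the fleetapi action
	// schema, then persist via store.Save (elided in this sketch)
	return nil
}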
@AndersonQ AndersonQ added the labels bug (Something isn't working) and Team:Elastic-Agent (Label for the Agent team) on Mar 20, 2024
@AndersonQ AndersonQ self-assigned this Mar 20, 2024
@AndersonQ AndersonQ requested a review from a team as a code owner March 20, 2024 06:10
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

mergify bot (Contributor) commented Mar 20, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b bugfix/3912-borken-stat-store upstream/bugfix/3912-borken-stat-store
git merge upstream/main
git push upstream bugfix/3912-borken-stat-store

mergify bot (Contributor) commented Mar 20, 2024

This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

 - take bugfix/3912-borken-stat-store as the correct one for any store change
 - the code will likely be broken after this merge
@AndersonQ AndersonQ force-pushed the bugfix/3912-borken-stat-store branch from fe0f59c to f61bbdf on April 2, 2024 12:05
@@ -21,7 +21,7 @@ var (
)

type fleetActionWithAgents struct {
ActionID string `json:"action_id"` // Note the action_id here, since the signed action uses action_id for id
Member Author

@aleksmaus is it the ES schema or the API schema? The API schema is id, and now the whole agent also uses id instead of action_id.
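
For illustration, one hedged way a struct could accept both names during deserialisation (a hypothetical type, not the actual fleetapi code):

package fleetapi

import "encoding/json"

// actionID is a hypothetical example: it reads "id" (the API field) but
// falls back to "action_id" (the ES document field) so both shapes decode.
type actionID struct {
	ID string
}

func (a *actionID) UnmarshalJSON(data []byte) error {
	var aux struct {
		ID       string `json:"id"`
		ActionID string `json:"action_id"`
	}
	if err := json.Unmarshal(data, &aux); err != nil {
		return err
	}
	a.ID = aux.ID
	if a.ID == "" {
		a.ID = aux.ActionID
	}
	return nil
}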

Member

As far as I remember, it is the ES field name, and I think the signed JSON payload uses action_id as well.

Member Author

So if I understood correctly, for those actions there is an action inside the action. Is that correct?
Here you're deserialising a whole action from the data field.
Is it just a copy of the action or something else?

Another question: is there any reason for copying the action model to this package instead of using the models from fleetapi? Part of this PR consists of unifying action serialisation/deserialisation in the fleetapi package.

Member

Here is some old example I found of the signed action:

{
    "@timestamp": "2023-06-12T21:02:08.783Z",
    "expiration": "2023-07-12T21:02:08.783Z",
    "agents": [
        "f19a072f-49c9-4a1d-94f7-9c1c519eab19"
    ],
    "action_id": "164a6819-5c58-40f7-a33c-821c98ab0a8c",
    "data": {
        "version": "8.9.0"
    },
    "type": "UPGRADE",
    "traceparent": "00-baf54d4e5b51487c5c7dd520eef3ba6f-6959cd1c34a1bfd9-01",
    "signed": {
        "data": "eyJAdGltZXN0YW1wIjoiMjAyMy0wNi0xMlQyMTowMjowOC43ODNaIiwiZXhwaXJhdGlvbiI6IjIwMjMtMDctMTJUMjE6MDI6MDguNzgzWiIsImFnZW50cyI6WyJmMTlhMDcyZi00OWM5LTRhMWQtOTRmNy05YzFjNTE5ZWFiMTkiXSwiYWN0aW9uX2lkIjoiMTY0YTY4MTktNWM1OC00MGY3LWEzM2MtODIxYzk4YWIwYThjIiwiZGF0YSI6eyJ2ZXJzaW9uIjoiOC45LjAifSwidHlwZSI6IlVQR1JBREUiLCJ0cmFjZXBhcmVudCI6IjAwLWJhZjU0ZDRlNWI1MTQ4N2M1YzdkZDUyMGVlZjNiYTZmLTY5NTljZDFjMzRhMWJmZDktMDEifQ==",
        "signature": "MEQCIGxsrI742xKL6OSIfT6abkBfH+plFiTSO26RHNV1pHE8AiBxXgt3LlIowCiLZ/ayFqQWk0ge+c/J1L58csqg6R9nyA=="
    }
} 

The data field contains the base64-encoded action JSON that is signed by Kibana. This is the source of truth for the signed actions, for example for Endpoint's tamper protection. The agent initially had signature validation, but it was commented out for the initial release due to risk concerns raised by PM. So the agent just passes the action through to Beats/Endpoint, leaving it up to them to parse and validate the signed payload if it is present.
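
For reference, a sketch of what validating such a payload could look like. The ECDSA-over-SHA-256 and PKIX-key assumptions are mine (they are consistent with the shape of the example signature above) and are not confirmed from the agent code:

package signing

import (
	"crypto/ecdsa"
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// verifySignedAction decodes signed.data, checks signed.signature against
// it, and returns the action JSON that Kibana signed. Sketch only.
func verifySignedAction(dataB64, sigB64 string, pubKeyDER []byte) (map[string]any, error) {
	data, err := base64.StdEncoding.DecodeString(dataB64)
	if err != nil {
		return nil, fmt.Errorf("decoding signed.data: %w", err)
	}
	sig, err := base64.StdEncoding.DecodeString(sigB64)
	if err != nil {
		return nil, fmt.Errorf("decoding signed.signature: %w", err)
	}
	pub, err := x509.ParsePKIXPublicKey(pubKeyDER)
	if err != nil {
		return nil, fmt.Errorf("parsing public key: %w", err)
	}
	ecPub, ok := pub.(*ecdsa.PublicKey)
	if !ok {
		return nil, fmt.Errorf("expected an ECDSA public key, got %T", pub)
	}
	// The signature in the example is DER-encoded, so VerifyASN1 applies.
	sum := sha256.Sum256(data)
	if !ecdsa.VerifyASN1(ecPub, sum[:], sig) {
		return nil, fmt.Errorf("signature verification failed")
	}
	var action map[string]any
	if err := json.Unmarshal(data, &action); err != nil {
		return nil, fmt.Errorf("unmarshalling signed action: %w", err)
	}
	return action, nil
}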

Member

Another question: is there any reason for copying the action model to this package instead of using the models from fleetapi? Part of this PR consists of unifying action serialisation/deserialisation in the fleetapi package.

As far as I remember (my memories are vague), the schema came after the initial implementation and there were some incompatibilities with serialization.

Member

Yes, it is Kibana-only by design, for end-to-end signing. Only Kibana has the key(s) to sign the policy and actions. The goal was to prevent any tampering in the middle.

Member

We would need some good end-to-end test coverage with Endpoint here in order to make sure the schema changes will not break Endpoint.

Member Author

We would need some good end-to-end test coverage with Endpoint here in order to make sure the schema changes will not break Endpoint.

You mean future schema changes or the ones in this PR?

The changes here should not have affected the actions schema outside the state/action store. The changes are to match the fleet-server schema.

Member Author

I tested it; the signed actions aren't affected. They're still working.

Member

Thank you!

@AndersonQ AndersonQ requested review from blakerouse and removed request for michalpristas April 2, 2024 12:37
return nil
}

if action.Type != fleetapi.ActionTypePolicyChange {
Member

Why do we ignore everything that isn't a policy change?

@cmacknz (Member) commented Apr 4, 2024

I went to test this manually, and realized that none of our integration tests are actually testing the migration code. We can conclude this because:

  1. Actions and the state store require an agent to be enrolled in Fleet to exist.
  2. To trigger the migration, you must upgrade from any previously released version to the version in this PR.
  3. You can't upgrade to the version in this PR with a Fleet managed agent because it isn't GPG signed.

That actions still work with the changes in this PR can be tested with an agent installed for the first time. I'm not sure how you'd test the migration logic here prior to merging without relying entirely on unit tests.

@cmacknz (Member) commented Apr 4, 2024

Thinking a bit more about how to handle the failure cases, I think the current implementation works with the condition that every error during the migration causes the agent to exit.

The key learning for me is that the migration only happens on upgrade, and the action state store only exists in the versioned home directory.

This means that if we get the agent to exit on failure, it will roll back to the unmodified version of the state store in the original agent, and the failed or corrupted store in the new agent will be removed. We can then try again or deliver a bug fix with a new patch version.

Ideally there would be a test proving it behaves like this: that errors cause a rollback and the original agent is unmodified. This is complicated by the observation above that we can't test the migration logic in its current form in integration tests at all until this code is in a snapshot build. Perhaps a unit test can prove this.

The other risk to mitigate is that the mechanical changes to the on-disk format have broken or degraded existing actions in some subtle way.

The prior art for this PR is #398, where we introduced state encryption, which went on to break upgrades in #578. I see similar discussion around action_id vs id in that PR (https://github.com/elastic/elastic-agent/pull/578/files#r901834206). I wonder if there is an explicit test we can build for this interaction with Fleet Server.
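
A sketch of the fail-fast property described above (function and path names are hypothetical; the real entrypoint wiring differs):

package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os"
)

// migrateStateStore stands in for the real migration; illustrative only.
func migrateStateStore(log *slog.Logger, path string) error {
	return errors.New("pretend the store was corrupted")
}

// runAgent mirrors the property discussed above: any migration error aborts
// startup instead of continuing with a possibly corrupted store.
func runAgent(log *slog.Logger, stateStorePath string) error {
	if err := migrateStateStore(log, stateStorePath); err != nil {
		return fmt.Errorf("state store migration failed: %w", err)
	}
	// normal startup would continue here
	return nil
}

func main() {
	log := slog.Default()
	// "data/elastic-agent-x.y.z/state.enc" is an illustrative path: the store
	// lives in the versioned home, so the previous version's copy is untouched.
	if err := runAgent(log, "data/elastic-agent-x.y.z/state.enc"); err != nil {
		log.Error("agent exiting", "error", err)
		os.Exit(1) // a non-zero exit during upgrade lets the watcher roll back
	}
}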

@cmacknz (Member) commented Apr 4, 2024

Adding @michel-laterman since he has the most context on the Fleet Server action implementation. @michel-laterman the goal here is to make the action store in the agent have the same on-disk format as Fleet Server (JSON with the same schema).

@AndersonQ (Member Author)

Ideally there would be a test proving it behaves like this: that errors cause a rollback and the original agent is unmodified. This is complicated by the observation above that we can't test the migration logic in its current form in integration tests at all until this code is in a snapshot build. Perhaps a unit test can prove this.

We can do an upgrade to the PR build if we use a custom binary source pointing to a local HTTP server which will serve the PR build.

The issue I see is forcing the agent to add actions to the store and then checking them after the migration.
The queue is for upgrade actions, which will get cleaned in the upgrade process. But also, if the migration breaks, we'll get an error, so perhaps a feasible test is to ensure there are no errors related to the migration during the upgrade.

Perhaps a unit test can prove this

For the migration itself, there are already unit tests in place, including tests for the function called by the agent: https://github.com/elastic/elastic-agent/pull/4441/files#diff-c4620db6b6efe5946032ce0788dcb60462c88220840108d12e616a7e2f7057abR541-R602

There are also more tests for the function which initiates the migration. The only difference between this one and the above is the creation of the encrypted store:
https://github.com/elastic/elastic-agent/pull/4441/files#diff-c4620db6b6efe5946032ce0788dcb60462c88220840108d12e616a7e2f7057abR334-R539

Or did you mean to test more than that? If so, what do you have in mind?

@AndersonQ (Member Author) commented Apr 5, 2024

I was doing some experimentation; we can even assert the stored action after the upgrade. Anyway, the agent would not be quite healthy if the state store data is lost. It's just the queue that I don't see how to check after the upgrade, as the upgrade action is removed from the queue as soon as the upgrade starts.

We do have a TODO to address the agent starting without the policy (action policy change) persisted in the state store:

// TODO(ph) We will need an improvement on fleet, if there is an error while dispatching a
// persisted action on disk we should be able to ask Fleet to get the latest configuration.
// But at the moment this is not possible because the policy change was acked.

The agent ends up connected to Fleet, healthy on Fleet, but forever waiting for configuration. The good thing is that a policy change fixes that. So even if the store is lost, as long as it isn't corrupted, it's possible to fix it remotely.

I was testing the current release, 8.13; this is the status if the store is erased and the agent restarted:

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (STARTING) Waiting for initial configuration and composable variables
   └─ info
      ├─ id: f1d81fd9-3cbc-49f8-a96c-63a5fd00468b
      ├─ version: 8.13.1
      └─ commit: 379dfc0203e9f5629740f6ac2bd7bf3f47861d56

@AndersonQ AndersonQ requested a review from cmacknz April 5, 2024 13:51
@cmacknz (Member) commented Apr 5, 2024

We can do an upgrade to the PR build if we use a custom binary source pointing to a local HTTP server which will serve the PR build.

You still won't be able to get past the binary signature check. The Fleet upgrade action doesn't support skip verify. This is a common pain point but I haven't thought of a way to get around it that doesn't weaken security.

@cmacknz (Member) commented Apr 5, 2024

We could maybe try to verify this transitively. For example if we:

  1. Prove that the agent exiting with an error during an upgrade always leads to a rollback.
  2. Prove that the state store migration failing for any reason always causes the agent to exit with an error.

Then we will have proved that an error during state store migration causes a rollback if it happens during an upgrade, without actually having to test the state store migration in an upgrade by itself.

You will still be left with figuring out how to get the agent to do a state store migration when not upgrading, but I don't think there is any actual dependency on an upgrade. The migration happens if the agent sees the old action store file(s) exist when it starts; it doesn't care how it got there. Can you just artificially set this up? Usually it would happen because the upgrade copied the store files, but you could just do that part manually.

@AndersonQ (Member Author)

The migration happens if the agent sees the old action store file(s) exist when it starts; it doesn't care how it got there

Exactly, the store migration happens every time the agent starts up. So as long as the new agent reads the old store, the migration will happen. Thus it should not be an issue to set it up. It'd be the following:

  • start an older agent
  • copy the state.enc and fleet.enc
  • install a new agent (or just move everything to the new agent's folder; that might be easier)
  • stop the new agent
  • override the new agent's state.enc and fleet.enc
  • start the new agent
  • confirm it's healthy, connected to fleet and all the components are running
  • check state.enc to ensure it's in JSON

I'll manually test it now. @cmacknz, do you want it to be an integration test, or is the manual test fine for you?
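
A rough Go sketch of automating the steps above; the agent start/stop and health assertions are hypothetical stand-ins, since the real integration-framework calls differ:

package statestore_test

import (
	"io"
	"os"
	"path/filepath"
	"testing"
)

// Hypothetical stand-ins for the real integration framework helpers.
func stopAgent(t *testing.T)      { t.Log("stop the new agent (framework call elided)") }
func startAgent(t *testing.T)     { t.Log("start the new agent (framework call elided)") }
func requireHealthy(t *testing.T) { t.Log("assert healthy and connected to Fleet (elided)") }

func copyFile(t *testing.T, src, dst string) {
	t.Helper()
	in, err := os.Open(src)
	if err != nil {
		t.Fatalf("opening %s: %v", src, err)
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		t.Fatalf("creating %s: %v", dst, err)
	}
	defer out.Close()
	if _, err := io.Copy(out, in); err != nil {
		t.Fatalf("copying %s to %s: %v", src, dst, err)
	}
}

func TestStateStoreMigration(t *testing.T) {
	// Illustrative paths: the versioned homes of the old and new agents.
	oldHome := filepath.Join("data", "elastic-agent-old")
	newHome := filepath.Join("data", "elastic-agent-new")

	stopAgent(t)
	for _, name := range []string{"state.enc", "fleet.enc"} {
		copyFile(t, filepath.Join(oldHome, name), filepath.Join(newHome, name))
	}
	startAgent(t)
	requireHealthy(t)
	// Finally, decrypt state.enc and assert it is JSON in the new schema (elided).
}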

@cmacknz (Member) commented Apr 9, 2024

@cmacknz, do you want it to be an integration test or just the manual test is fine for you?

We need an automated test for this. In 8.6 we actually left the unit tests for the encrypted store in place but forgot to keep the calls to its logic in the agent code, so we need proof that the agent binary itself can do the migration. See #2249 (comment) for reference.

@cmacknz (Member) commented Apr 9, 2024

so we need proof that the agent binary itself can do the migration

Specifically, we need proof that the agent's entrypoint from the run command can do this. If you can do this with a unit test that's fine; it doesn't have to be an integration test, but that would be fine too.

@AndersonQ (Member Author)

In the end, after discussing with the team, we agreed the more complete solution is to be able to upgrade to an agent built from the PR. This issue was created to add such a test:


Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

4 New issues
0 Security Hotspots
62.4% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@AndersonQ AndersonQ marked this pull request as draft April 12, 2024 13:19
mergify bot (Contributor) commented May 7, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b bugfix/3912-borken-stat-store upstream/bugfix/3912-borken-stat-store
git merge upstream/main
git push upstream bugfix/3912-borken-stat-store

Labels: backport-skip, bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Scheduled upgrade fails with error parsing version "" if agent restarts before the upgrade starts
4 participants