
Fix state store #4441

Draft: wants to merge 31 commits into main from bugfix/3912-borken-stat-store

Conversation

@AndersonQ AndersonQ commented Mar 20, 2024

What does this PR do?

Fixes the agent's state store. It's the following PRs combined and a few needed changes to fix the conflicts with main.

Why is it important?

Currently, the agent loses data when loading actions from the state store. With the fix, it won't lose data anymore, and the action schema and serialisation are now the same for the state store and the fleetapi.
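
For context, unifying the schema means one JSON shape and one deserialisation path for actions. Below is a minimal sketch of type-dispatched deserialisation in the spirit of fleetapi.Actions.UnmarshalJSON; the types and field names are illustrative, not the actual elastic-agent code:

package main

import (
	"encoding/json"
	"fmt"
)

// Action is a minimal stand-in for the fleetapi action interface.
type Action interface{ ID() string }

// ActionUpgrade and ActionUnknown are illustrative concrete actions; the
// field names mirror the fleet-server schema (id, type, data) but are not
// the actual elastic-agent types.
type ActionUpgrade struct {
	ActionID string `json:"id"`
	Type     string `json:"type"`
	Data     struct {
		Version string `json:"version"`
	} `json:"data"`
}

func (a ActionUpgrade) ID() string { return a.ActionID }

type ActionUnknown struct {
	ActionID string `json:"id"`
	Type     string `json:"type"`
}

func (a ActionUnknown) ID() string { return a.ActionID }

// Actions deserialises a heterogeneous list by peeking at each element's
// "type" and then unmarshalling the matching concrete action. Both the
// state store and the fleetapi client can share this single implementation.
type Actions []Action

func (as *Actions) UnmarshalJSON(data []byte) error {
	var raws []json.RawMessage
	if err := json.Unmarshal(data, &raws); err != nil {
		return err
	}
	for _, raw := range raws {
		var head struct {
			Type string `json:"type"`
		}
		if err := json.Unmarshal(raw, &head); err != nil {
			return err
		}
		switch head.Type {
		case "UPGRADE":
			var a ActionUpgrade
			if err := json.Unmarshal(raw, &a); err != nil {
				return err
			}
			*as = append(*as, a)
		default:
			var a ActionUnknown
			if err := json.Unmarshal(raw, &a); err != nil {
				return err
			}
			*as = append(*as, a)
		}
	}
	return nil
}

func main() {
	var as Actions
	in := `[{"id":"a1","type":"UPGRADE","data":{"version":"8.9.0"}}]`
	if err := json.Unmarshal([]byte(in), &as); err != nil {
		panic(err)
	}
	fmt.Printf("%#v\n", as[0]) // decodes as ActionUpgrade
}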

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

How to test this PR locally

Reproduce the issue described in #3912. With this fix, it should no longer fail.

Related issues

  • #3912

* use golden file for store migration test
* remove withFile function
* make migrate take a storage.Store instead of the storage's path; it's needed so we can set the encrypted store vault's path (see the sketch below)
* simplify fleetapi.Actions.UnmarshalJSON
* add test to ensure the state store is correctly loaded from disk
* skip state store migration tests; they will be fixed in a follow-up PR as part of #3912
It modifies the state store API to match the current needs.
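
As a sketch of the migrate signature change listed above (interface and names are approximated from the description, not copied from the code):

package store

import (
	"fmt"
	"io"
)

// Store approximates the storage.Store behaviour the migration now receives:
// the caller constructs it (for example an encrypted store with a specific
// vault path) and migrate only loads and saves through it. Sketch only.
type Store interface {
	Load() (io.ReadCloser, error)
	Save(io.Reader) error
}

// Before (roughly): migrate(log Logger, storePath string) error
// After: the caller owns store construction, including the vault path.
func migrate(store Store) error {
	r, err := store.Load()
	if err != nil {
		return fmt.Errorf("loading state store: %w", err)
	}
	defer r.Close()
	// decode the old format here, re-encode using the fleetapi action
	// schema, then persist via store.Save (elided in this sketch)
	return nil
}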
@AndersonQ AndersonQ added the labels bug (Something isn't working) and Team:Elastic-Agent (Label for the Agent team) on Mar 20, 2024
@AndersonQ AndersonQ self-assigned this Mar 20, 2024
@AndersonQ AndersonQ requested a review from a team as a code owner March 20, 2024 06:10
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

mergify bot (Contributor) commented Mar 20, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b bugfix/3912-borken-stat-store upstream/bugfix/3912-borken-stat-store
git merge upstream/main
git push upstream bugfix/3912-borken-stat-store

mergify bot (Contributor) commented Mar 20, 2024

This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

 - take bugfix/3912-borken-stat-store as the correct one for any store change
 - the code will likely be broken after this merge
@AndersonQ AndersonQ force-pushed the bugfix/3912-borken-stat-store branch from fe0f59c to f61bbdf on April 2, 2024 12:05
@@ -21,7 +21,7 @@ var (
)

type fleetActionWithAgents struct {
ActionID string `json:"action_id"` // Note the action_id here, since the signed action uses action_id for id
Member Author

@aleksmaus is it the ES schema or the API schema? The API schema is id, and now the whole agent also uses id instead of action_id.
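
For illustration, one hedged way a struct could accept both names during deserialisation (a hypothetical type, not the actual fleetapi code):

package fleetapi

import "encoding/json"

// actionID is a hypothetical example: it reads "id" (the API field) but
// falls back to "action_id" (the ES document field) so both shapes decode.
type actionID struct {
	ID string
}

func (a *actionID) UnmarshalJSON(data []byte) error {
	var aux struct {
		ID       string `json:"id"`
		ActionID string `json:"action_id"`
	}
	if err := json.Unmarshal(data, &aux); err != nil {
		return err
	}
	a.ID = aux.ID
	if a.ID == "" {
		a.ID = aux.ActionID
	}
	return nil
}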

Member

As far as I remember, it is the ES field name, and I think the signed JSON payload uses action_id as well.

Member Author

So if I understood correctly, for those actions there is an action inside the action. Is that correct?
Here you're deserialising a whole action from the data field.
Is it just a copy of the action or something else?

Another question: is there any reason for copying the action model to this package instead of using the models from fleetapi? Part of this PR consists of unifying action serialisation/deserialisation in the fleetapi package.

Member

Here is some old example I found of the signed action:

{
    "@timestamp": "2023-06-12T21:02:08.783Z",
    "expiration": "2023-07-12T21:02:08.783Z",
    "agents": [
        "f19a072f-49c9-4a1d-94f7-9c1c519eab19"
    ],
    "action_id": "164a6819-5c58-40f7-a33c-821c98ab0a8c",
    "data": {
        "version": "8.9.0"
    },
    "type": "UPGRADE",
    "traceparent": "00-baf54d4e5b51487c5c7dd520eef3ba6f-6959cd1c34a1bfd9-01",
    "signed": {
        "data": "eyJAdGltZXN0YW1wIjoiMjAyMy0wNi0xMlQyMTowMjowOC43ODNaIiwiZXhwaXJhdGlvbiI6IjIwMjMtMDctMTJUMjE6MDI6MDguNzgzWiIsImFnZW50cyI6WyJmMTlhMDcyZi00OWM5LTRhMWQtOTRmNy05YzFjNTE5ZWFiMTkiXSwiYWN0aW9uX2lkIjoiMTY0YTY4MTktNWM1OC00MGY3LWEzM2MtODIxYzk4YWIwYThjIiwiZGF0YSI6eyJ2ZXJzaW9uIjoiOC45LjAifSwidHlwZSI6IlVQR1JBREUiLCJ0cmFjZXBhcmVudCI6IjAwLWJhZjU0ZDRlNWI1MTQ4N2M1YzdkZDUyMGVlZjNiYTZmLTY5NTljZDFjMzRhMWJmZDktMDEifQ==",
        "signature": "MEQCIGxsrI742xKL6OSIfT6abkBfH+plFiTSO26RHNV1pHE8AiBxXgt3LlIowCiLZ/ayFqQWk0ge+c/J1L58csqg6R9nyA=="
    }
} 

The data field contains the base64-encoded action JSON that is signed by Kibana. This is the source of truth for the signed actions, for example for Endpoint's tamper protection. The agent initially had signature validation, but it was commented out for the initial release due to risk concerns raised by PM. So the agent just passes the action through to Beats/Endpoint, leaving it up to them to parse and validate the signed payload if it is present.
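
For reference, a sketch of what validating such a payload could look like. The ECDSA-over-SHA-256 and PKIX-key assumptions are mine (they are consistent with the shape of the example signature above) and are not confirmed from the agent code:

package signing

import (
	"crypto/ecdsa"
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// verifySignedAction decodes signed.data, checks signed.signature against
// it, and returns the action JSON that Kibana signed. Sketch only.
func verifySignedAction(dataB64, sigB64 string, pubKeyDER []byte) (map[string]any, error) {
	data, err := base64.StdEncoding.DecodeString(dataB64)
	if err != nil {
		return nil, fmt.Errorf("decoding signed.data: %w", err)
	}
	sig, err := base64.StdEncoding.DecodeString(sigB64)
	if err != nil {
		return nil, fmt.Errorf("decoding signed.signature: %w", err)
	}
	pub, err := x509.ParsePKIXPublicKey(pubKeyDER)
	if err != nil {
		return nil, fmt.Errorf("parsing public key: %w", err)
	}
	ecPub, ok := pub.(*ecdsa.PublicKey)
	if !ok {
		return nil, fmt.Errorf("expected an ECDSA public key, got %T", pub)
	}
	// The signature in the example is DER-encoded, so VerifyASN1 applies.
	sum := sha256.Sum256(data)
	if !ecdsa.VerifyASN1(ecPub, sum[:], sig) {
		return nil, fmt.Errorf("signature verification failed")
	}
	var action map[string]any
	if err := json.Unmarshal(data, &action); err != nil {
		return nil, fmt.Errorf("unmarshalling signed action: %w", err)
	}
	return action, nil
}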

Member

Another question: is there any reason for copying the action model to this package instead of using the models from fleetapi? Part of this PR consists of unifying action serialisation/deserialisation in the fleetapi package.

As far as I remember (my memories are vague), the schema came after the initial implementation and there were some incompatibilities with serialization.

Member

Yes, it is Kibana-only by design, for end-to-end signing. Only Kibana has the key(s) to sign the policy and actions. The goal was to prevent any tampering in the middle.

Member

We would need some good end-to-end test coverage with Endpoint here in order to make sure the schema changes will not break Endpoint.

Member Author

We would need some good end-to-end test coverage with Endpoint here in order to make sure the schema changes will not break Endpoint.

You mean future schema changes or the ones in this PR?

The changes here should not have affected the actions schema outside the state/action store. The changes are to match the fleet-server schema.

Member Author

I tested it; the signed actions aren't affected. They're still working.

Member

Thank you!

@AndersonQ AndersonQ requested review from blakerouse and removed request for michalpristas April 2, 2024 12:37
return nil
}

if action.Type != fleetapi.ActionTypePolicyChange {
Member

Why do we ignore everything that isn't a policy change?

@cmacknz (Member) commented Apr 4, 2024

I went to test this manually, and realized that none of our integration tests are actually testing the migration code. We can conclude this because:

  1. Actions and the state store require an agent to be enrolled in Fleet to exist.
  2. To trigger the migration, you must upgrade from any previously released version to the version in this PR.
  3. You can't upgrade to the version in this PR with a Fleet managed agent because it isn't GPG signed.

That actions still work with the changes in this PR can be tested with an agent installed for the first time. I'm not sure how you'd test the migration logic here prior to merging without relying entirely on unit tests.

@cmacknz (Member) commented Apr 4, 2024

Thinking a bit more about how to handle the failure cases, I think the current implementation works with the condition that every error during the migration causes the agent to exit.

The key learning for me is that the migration only happens on upgrade, and the action state store only exists in the versioned home directory.

This means that if we get the agent to exit on failure, it will roll back to the unmodified version of the state store in the original agent, and the failed or corrupted store in the new agent will be removed. We can then try again or deliver a bug fix with a new patch version.

Ideally there would be a test proving it behaves like this: that errors cause a rollback and the original agent is unmodified. This is complicated by the observation above that we can't test the migration logic in its current form in integration tests at all until this code is in a snapshot build. Perhaps a unit test can prove this.

The other risk to mitigate is that the mechanical changes to the on-disk format have broken or degraded existing actions in some subtle way.

The prior art for this PR is #398, where we introduced state encryption, which went on to break upgrades in #578. I see similar discussion around action_id vs id in that PR (https://github.com/elastic/elastic-agent/pull/578/files#r901834206). I wonder if there is an explicit test we can build for this interaction with Fleet Server.
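
A sketch of the fail-fast property described above (function and path names are hypothetical; the real entrypoint wiring differs):

package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os"
)

// migrateStateStore stands in for the real migration; illustrative only.
func migrateStateStore(log *slog.Logger, path string) error {
	return errors.New("pretend the store was corrupted")
}

// runAgent mirrors the property discussed above: any migration error aborts
// startup instead of continuing with a possibly corrupted store.
func runAgent(log *slog.Logger, stateStorePath string) error {
	if err := migrateStateStore(log, stateStorePath); err != nil {
		return fmt.Errorf("state store migration failed: %w", err)
	}
	// normal startup would continue here
	return nil
}

func main() {
	log := slog.Default()
	// "data/elastic-agent-x.y.z/state.enc" is an illustrative path: the store
	// lives in the versioned home, so the previous version's copy is untouched.
	if err := runAgent(log, "data/elastic-agent-x.y.z/state.enc"); err != nil {
		log.Error("agent exiting", "error", err)
		os.Exit(1) // a non-zero exit during upgrade lets the watcher roll back
	}
}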

@cmacknz (Member) commented Apr 4, 2024

Adding @michel-laterman since he has the most context on the Fleet Server action implementation. @michel-laterman the goal here is to make the action store in the agent have the same on-disk format as Fleet Server (JSON with the same schema).

@AndersonQ (Member Author)

Ideally there would be a test proving it behaves like this: that errors cause a rollback and the original agent is unmodified. This is complicated by the observation above that we can't test the migration logic in its current form in integration tests at all until this code is in a snapshot build. Perhaps a unit test can prove this.

We can do an upgrade to the PR build if we use a custom binary source pointing to a local HTTP server which will serve the PR build.

The issue I see is forcing the agent to add actions to the store and then checking them after the migration.
The queue is for upgrade actions, which will get cleaned in the upgrade process. But also, if the migration breaks, we'll get an error, so perhaps a feasible test is to ensure there are no errors related to the migration during the upgrade.

Perhaps a unit test can prove this

For the migration itself, there are already unit tests in place, including tests for the function called by the agent: https://github.com/elastic/elastic-agent/pull/4441/files#diff-c4620db6b6efe5946032ce0788dcb60462c88220840108d12e616a7e2f7057abR541-R602

There are also more tests for the function which initiates the migration. The only difference between this one and the above is the creation of the encrypted store:
https://github.com/elastic/elastic-agent/pull/4441/files#diff-c4620db6b6efe5946032ce0788dcb60462c88220840108d12e616a7e2f7057abR334-R539

Or did you mean to test more than that? If so, what do you have in mind?

@AndersonQ (Member Author) commented Apr 5, 2024

I was doing some experimentation; we can even assert the stored action after the upgrade. Anyway, the agent would not be quite healthy if the state store data is lost. It's just the queue that I don't see how to check after the upgrade, as the upgrade action is removed from the queue as soon as the upgrade starts.

We do have a TODO to address the agent starting without the policy (action policy change) persisted in the state store:

// TODO(ph) We will need an improvement on fleet, if there is an error while dispatching a
// persisted action on disk we should be able to ask Fleet to get the latest configuration.
// But at the moment this is not possible because the policy change was acked.

The agent ends up connected to Fleet, healthy on Fleet, but forever waiting for configuration. The good thing is that a policy change fixes that. So even if the store is lost, as long as it isn't corrupted, it's possible to fix it remotely.

I was testing the current release, 8.13; this is the status if the store is erased and the agent restarted:

┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (STARTING) Waiting for initial configuration and composable variables
   └─ info
      ├─ id: f1d81fd9-3cbc-49f8-a96c-63a5fd00468b
      ├─ version: 8.13.1
      └─ commit: 379dfc0203e9f5629740f6ac2bd7bf3f47861d56

@AndersonQ AndersonQ requested a review from cmacknz April 5, 2024 13:51
@cmacknz (Member) commented Apr 5, 2024

We can do an upgrade to the PR build if we use a custom binary source pointing to a local HTTP server which will serve the PR build.

You still won't be able to get past the binary signature check. The Fleet upgrade action doesn't support skip verify. This is a common pain point but I haven't thought of a way to get around it that doesn't weaken security.

@cmacknz (Member) commented Apr 5, 2024

We could maybe try to verify this transitively. For example if we:

  1. Prove that the agent exiting with an error during an upgrade always leads to a rollback.
  2. Prove that the state store migration failing for any reason always causes the agent to exit with an error.

Then we will have proved that an error during state store migration causes a rollback if it happens during an upgrade, without actually having to test the state store migration in an upgrade by itself.

You will still be left with figuring out how to get the agent to do a state store migration when not upgrading, but I don't think there is any actual dependency on an upgrade. The migration happens if the agent sees the old action store file(s) exist when it starts; it doesn't care how it got there. Can you just artificially set this up? Usually it would happen because the upgrade copied the store files, but you could just do that part manually.

@AndersonQ (Member Author)

The migration happens if the agent sees the old action store file(s) exist when it starts; it doesn't care how it got there

Exactly, the store migration happens every time the agent starts up. So as long as the new agent reads the old store, the migration will happen. Thus it should not be an issue to set it up. It'd be the following:

  • start an older agent
  • copy the state.enc and fleet.enc
  • install a new agent (or just move everything to the new agent's folder; that might be easier)
  • stop the new agent
  • override the new agent's state.enc and fleet.enc
  • start the new agent
  • confirm it's healthy, connected to fleet and all the components are running
  • check state.enc to ensure it's in JSON

I'll manually test it now. @cmacknz, do you want it to be an integration test, or is the manual test fine for you?
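
A rough Go sketch of automating the steps above; the agent start/stop and health assertions are hypothetical stand-ins, since the real integration-framework calls differ:

package statestore_test

import (
	"io"
	"os"
	"path/filepath"
	"testing"
)

// Hypothetical stand-ins for the real integration framework helpers.
func stopAgent(t *testing.T)      { t.Log("stop the new agent (framework call elided)") }
func startAgent(t *testing.T)     { t.Log("start the new agent (framework call elided)") }
func requireHealthy(t *testing.T) { t.Log("assert healthy and connected to Fleet (elided)") }

func copyFile(t *testing.T, src, dst string) {
	t.Helper()
	in, err := os.Open(src)
	if err != nil {
		t.Fatalf("opening %s: %v", src, err)
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		t.Fatalf("creating %s: %v", dst, err)
	}
	defer out.Close()
	if _, err := io.Copy(out, in); err != nil {
		t.Fatalf("copying %s to %s: %v", src, dst, err)
	}
}

func TestStateStoreMigration(t *testing.T) {
	// Illustrative paths: the versioned homes of the old and new agents.
	oldHome := filepath.Join("data", "elastic-agent-old")
	newHome := filepath.Join("data", "elastic-agent-new")

	stopAgent(t)
	for _, name := range []string{"state.enc", "fleet.enc"} {
		copyFile(t, filepath.Join(oldHome, name), filepath.Join(newHome, name))
	}
	startAgent(t)
	requireHealthy(t)
	// Finally, decrypt state.enc and assert it is JSON in the new schema (elided).
}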

@cmacknz (Member) commented Apr 9, 2024

@cmacknz, do you want it to be an integration test or just the manual test is fine for you?

We need an automated test for this. In 8.6 we actually left the unit tests for the encrypted store in place but forgot to keep the calls to its logic in the agent code, so we need proof that the agent binary itself can do the migration. See #2249 (comment) for reference.

@cmacknz (Member) commented Apr 9, 2024

so we need proof that the agent binary itself can do the migration

Specifically, we need proof that the agent's entrypoint from the run command can do this. If you can do this with a unit test that's fine; it doesn't have to be an integration test, but that would be fine too.

@AndersonQ (Member Author)

In the end, after discussing with the team, we agreed the more complete solution is to be able to upgrade to an agent built from the PR. This issue was created to add such a test:


Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

4 New issues
0 Security Hotspots
62.4% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@AndersonQ AndersonQ marked this pull request as draft April 12, 2024 13:19
mergify bot (Contributor) commented May 7, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b bugfix/3912-borken-stat-store upstream/bugfix/3912-borken-stat-store
git merge upstream/main
git push upstream bugfix/3912-borken-stat-store

Labels: backport-skip, bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Scheduled upgrade fails with error parsing version "" if agent restarts before the upgrade starts
4 participants