Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agent/beats grpc comms over domain socket/named pipe #4249

Open
wants to merge 41 commits into
base: main
Choose a base branch
from

Conversation

aleksmaus
Copy link
Member

@aleksmaus aleksmaus commented Feb 12, 2024

What does this PR do?

Implements configurable GRPC allowing the agent to use domain socket/named pipe for comms with beats.

The core of the support for this is already merged in the agent client lib elastic/elastic-agent-client#91 and is tagged as v7.8.0. Beats just work automagically since they already picked up this tag
https://github.com/elastic/beats/blob/main/go.mod#L72

Changes:

  • Made the option configurable with the newLocal GRPC bool flag. Agent still uses IP socket by default
  • Refactored the code that created the domain socket/named pipe socket, so it could be shared for GRPC comms
  • Propagate the server domain socket/named pipe address to the beats.

I tested on all 3 platforms: darwin, linux and winderz.
Will reach out to Endpoint team, since they likely would need to adjust comms on Endpoint side.

This is the first cut, can change based on the feedback.

Open for feedback/opinions on:

  1. Agent on the new GRPC configuration option, currently it's a boolean flag.
  2. The comms socket is derived from the control domain socket/named pipe path if GRPC is set to Local. Wanted to have it consistent in the same location as control domain socket path, with the same length limits fallback.

Why is it important?

Addresses #4248

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Related issues

@aleksmaus aleksmaus added the enhancement New feature or request label Feb 12, 2024
@aleksmaus aleksmaus requested a review from a team as a code owner February 12, 2024 23:51
Copy link
Contributor

mergify bot commented Feb 12, 2024

This pull request does not have a backport label. Could you fix it @aleksmaus? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Feb 13, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@aleksmaus
Copy link
Member Author

aleksmaus commented Feb 13, 2024

Talked to @jrmolin, one issue left is the connection info discovery is still as defined in the endpoint config an IP port
https://github.com/elastic/elastic-agent/blob/main/specs/endpoint-security.spec.yml#L24

We could potentially allow some base path in the endpoint spect for domain socket/named pipe for connection info API.
If the agent starts with “local” GRPC mode then it would use that location to create the domain socket server for connection info API, otherwise it will use the port.
The question now, that the endpoint doesn’t know what mode the agent is started in, so it would have to check both IP and domain socket/pipe for the connection info server.

Thinking what other options we could use within the current implementation.

@blakerouse @cmacknz

Thoughts?

@cmacknz
Copy link
Member

cmacknz commented Feb 14, 2024

The question now, that the endpoint doesn’t know what mode the agent is started in, so it would have to check both IP and domain socket/pipe for the connection info server.

It seems simpler to just align this change in both endpoint and agent rather than introduce complexity to deal with this that we would remove as soon as endpoint supports domain sockets. The endpoint version is always aligned with the agent version so we ideally wouldn't have to release a version where endpoint doesn't know what to expect.

Talked to @jrmolin, one issue left is the connection info discovery is still as defined in the endpoint config an IP port
https://github.com/elastic/elastic-agent/blob/main/specs/endpoint-security.spec.yml#L24

We could potentially allow some base path in the endpoint spect for domain socket/named pipe for connection info API.

We could use a fixed base path for the unix socket connection instead of the a fixed port. The socket/pipe would be owned by root/admin so it is already more secure than an arbitrary port.

@aleksmaus
Copy link
Member Author

aleksmaus commented Feb 14, 2024

It seems simpler to just align this change in both endpoint and agent rather than introduce complexity to deal with this that we would remove as soon as endpoint supports domain sockets. The endpoint version is always aligned with the agent version so we ideally wouldn't have to release a version where endpoint doesn't know what to expect.

I made this mode configurable, there might be the cases where you might want to use IP socket, I suspect, like possibly running agent and beats in the separate pods. So the Endpoint would NOT know in advance if IP or domain socket is used for RPC.

@cmacknz
Copy link
Member

cmacknz commented Feb 14, 2024

Looking at the code we can use the agent.grpc.local as a feature flag initially that defaults to false and is only flipped to true once all of the inputs are confirmed to work with unix sockets/named pipes.

The Go based inputs all need to be updated to https://github.com/elastic/elastic-agent-client/releases/tag/v7.8.0 first which shouldn't be that hard to coordinate, endpoint is slightly more complicated.

Thinking more about this it would actually be best if endpoint could handle either a TCP port or a domain socket/pipe. This way we can keep the configuration and if we hit an unexpected problem we have the ability to easily go back.

So I'd say the path forward is:

  1. We define how we are going to specify the domain socket location for endpoint to connect to.
  2. Endpoint looks for this location, and if it doesn't exist it falls back to checking the TCP port. Maybe it just looks for both at the same time knowing agent will only support one of them.
  3. We confirm all inputs are updated to at least https://github.com/elastic/elastic-agent-client/releases/tag/v7.8.0 and support domain sockets/pipes.
  4. We can flip the agent.grpc.local flag to true by default.

@cmacknz
Copy link
Member

cmacknz commented Feb 14, 2024

I made this mode configurable, there might be the cases where you might want to use IP socket, I suspect, like possibly running agent and beats in the separate pods. So the Endpoint would have to know in advance if IP or domain socket is used for RPC.

Yes it is also a way to run the agent twice on the same machine since you can more easily change the port than the unix socket location.

@blakerouse
Copy link
Contributor

@aleksmaus @cmacknz

Endpoint will not install when the Elastic Agent is not installed in the default location: https://github.com/elastic/elastic-agent/blob/main/specs/endpoint-security.spec.yml#L21 This is because Endpoint protects the Elastic Agent directory and its hard coded to that directory.

In that case the Elastic Agent should just create a domain unix socket inside of its default installation directory for Endpoint to connect to and get the connection information. Windows is even easier it should just use a defined npipe.

I also agree that Endpoint should check the unix socket/npipe first and then fallback checking for the network port for the information.

I also prefer that this just become the default.

@intxgo
Copy link
Contributor

intxgo commented Mar 6, 2024

Taking into account all the basic hardening we've added to Endpoint recently (policy signing, Tamper Protection), we ended with a flow:

  1. there's a clean machine
  2. user installs Agent
  3. Agent installs Endpoint (with no config)
  4. Endpoint gets it's first config from Agent
  5. Endpoint is "hardened" and from that point it's hard to mock with it...

so for this question "how to tell Endpoint how it should talk to Agent" it's a matter of the bootstrap phase 1-4. Maybe the npipe/socket should be a plain Endpoint command line install parameter? I assume once Endpoint is configured with a given comms method it won't change.

@aleksmaus
Copy link
Member Author

I assume once Endpoint is configured with a given comms method it won't change.

This is configurable, so it's possible to change, don't know why would anybody do that, but possible.

@intxgo
Copy link
Contributor

intxgo commented Mar 6, 2024

This is configurable, so it's possible to change, don't know why would anybody do that, but possible.

ok. Then it's not so straightforward. Maybe we could define more constrains, like it's possible to change the comms method but it will require Endpoint re-install? The latter would ensure that sufficient prerequisites are met to actually be able to uninstall Endpoint.

@cmacknz
Copy link
Member

cmacknz commented Mar 6, 2024

ok. Then it's not so straightforward. Maybe we could define more constrains, like it's possible to change the comms method but it will require Endpoint re-install?

Changing the comms method will require every component to restart at minimum I think. I don't love the complexity this brings however I also don't like the idea of changing this without giving users who experience unexpected problems with it an escape hatch. We could require an agent restart but there's no way to initiate this from Fleet right now. Perhaps the agent automatically re-execs itself whenever this changes in the configuration.

One thing that isn't touched in the PR yet is that the bootstrap process for endpoint today also involves local TCP. We should change this to also use a unix/named pipe at a fixed location instead of a fixed TCP port so that you have to be root to connect to it.

conn, err := listener.Accept()
if err != nil {
log.Errorf("failed accept conn info connection: %v", err)
break
}
log.Debugf("client connected, sending connection info")
err = comm.WriteStartUpInfo(conn)

The process today is:

  1. Endpoint connects to this connection information server. Agent responds with the gRPC address and the TLS certificates for it.
  2. Endpoint then connects to the gRPC server and is given its initial configuration.

We want both the bootstrapping and the control protocol gRPC address to be unix/named pipes.

As Blake suggests for the bootstrapping we can just used a fixed location for the bootstrapping pipe. That can then communicate the address of the gRPC pipe.

We could also try to eliminate the need to bootstrap endpoint entirely since we can use a fixed pipe address for gRPC, but we'd still need a way to give endpoint the certs for mTLS.

@aleksmaus
Copy link
Member Author

aleksmaus commented Mar 7, 2024

We want both the bootstrapping and the control protocol gRPC address to be unix/named pipes.

Yes, that's one of the remaining things for this feature to complete it and is on my TODO list.

@aleksmaus
Copy link
Member Author

Addressed code review blocker, made the connection info socket name configurable through the service spec, consistent with connection info port "cport".

if isLocal {
if socket == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should error, and not start. The field should be required for a service. We don't have a default when CPort is missing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the changes and getting this to completion. Ready to have this in the next 8.14.

@gabriellandau
Copy link

The C++ grpc library that Endpoint uses doesn't support domain sockets or named pipes. There's a longstanding FR, but it's not implemented. We don't think we can land full gRPC-over-docket/npipe in 8.14, but we do think we think we can land the updated bootstrap.

Would it be possible for Agent to support the new comms for bootstrap, while maintaining localhost gRPC TCP for Endpoint in 8.14?

@aleksmaus
Copy link
Member Author

Would it be possible for Agent to support the new comms for bootstrap, while maintaining localhost gRPC TCP for Endpoint in 8.14?

This would require some agent code rework, don't know the scope of changes yet, need to dig more.
I assume it could be possible to have two "GRPC servers" in 8.14: TCP one for Endpoint only, and the local domain sockets/names pipes for the rest of the agent/beats comms and "connection info" endpoint.
Kind of odd but that's one way to work around the current limitation on Endpoint side.
Thoughts? @blakerouse @cmacknz

Copy link

Quality Gate passed Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

2 New issues
0 Security Hotspots
54.7% 54.7% Coverage on New Code
0.0% 0.0% Duplication on New Code

See analysis details on SonarQube

@cmacknz
Copy link
Member

cmacknz commented Apr 10, 2024

Kind of odd but that's one way to work around the current limitation on Endpoint side.
Thoughts? @blakerouse @cmacknz

There are security benefits to using unix sockets/named pipes only for the bootstrap portion and leaving the rest unchanged. So there is value in doing this.

I wouldn't bother using TCP only for gRPC communication with endpoint, we need to switch all the components over at once. I don't think we want to deal with two gRPC servers.

I would think of the endpoint bootstrap improvement as a separate change from the change to the control protocol server. They are two different parts of agent.

@aleksmaus
Copy link
Member Author

Added the commit that uses TCP gRPC for comms and forces domain socket/named pipe for connection info server.

@blakerouse I could not rely on the gRPC Port config (-1) in this case, since we were asked to keep gRPC on TCP socket. And only use domain socket for connection info server. With this change the connection info server is always forced to use domain socket.
In order to make it configurable I would have to introduce a new configuration option for connection info server only. And then it could become confusing when we just start using Port -1 for setting everything to use domain sockets.
@blakerouse @cmacknz
Let me know if you still need to make connection info server configurable (able to switch between domain socket and TCP socket) for 8.14.

@cmacknz
Copy link
Member

cmacknz commented Apr 11, 2024

Let me know if you still need to make connection info server configurable (able to switch between domain socket and TCP socket) for 8.14.

We have been using unix sockets and named pipes without issue for collecting monitoring information from the Beats that run as part of agent for a long time (it is also used for the control socket) so the risk overall with this change is low. The scope of changing just the bootstrap process is also lower. Generally the only problems we've seen are hitting the arbitrary OS limits on the lengths of the names for the pipes.

I also think the security benefits of this change are not as strong if we add a configuration flag to undo it. So for now I'd say we don't need a configuration flag for this as long as we can align it with endpoint merging the change at the same time. I could be argued into changing this opinion.

@gabriellandau
Copy link

I also think the security benefits of this change are not as strong if we add a configuration flag to undo it. So for now I'd say we don't need a configuration flag for this as long as we can align it with endpoint merging the change at the same time.

I agree.

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 29, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip enhancement New feature or request Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
8 participants