Agent/beats grpc comms over domain socket/named pipe #4249

aleksmaus · 2024-02-12T23:51:22Z

What does this PR do?

Implements configurable GRPC allowing the agent to use domain socket/named pipe for comms with beats.

The core of the support for this is already merged in the agent client lib (elastic/elastic-agent-client#91) and is tagged as v7.8.0. Beats just work automagically, since they already picked up this tag:
https://github.com/elastic/beats/blob/main/go.mod#L72

Changes:

Made the option configurable with the new Local GRPC bool flag. Agent still uses IP socket by default.
Refactored the code that created the domain socket/named pipe socket, so it could be shared for GRPC comms.
Propagated the server domain socket/named pipe address to the beats.
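
For illustration, a minimal sketch of how a boolean flag could select between a TCP listener and a Unix domain socket listener. The `grpcConfig` type, its field names, and both helpers are hypothetical and are not taken from the agent code; Windows named pipes need a separate listener implementation and are left out here.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// grpcConfig is a hypothetical stand-in for the agent's gRPC settings.
type grpcConfig struct {
	Local      bool   // true: listen on a Unix domain socket instead of TCP
	Address    string // TCP host, used when Local is false
	Port       uint16 // TCP port, used when Local is false
	SocketPath string // socket file, used when Local is true
}

// endpoint returns the network and address the server should listen on.
func (c grpcConfig) endpoint() (network, address string) {
	if c.Local {
		return "unix", c.SocketPath
	}
	return "tcp", fmt.Sprintf("%s:%d", c.Address, c.Port)
}

// listen opens the listener selected by the configuration.
func listen(c grpcConfig) (net.Listener, error) {
	network, address := c.endpoint()
	if network == "unix" {
		// A stale socket file left by a previous run would make Listen fail.
		if err := os.Remove(address); err != nil && !os.IsNotExist(err) {
			return nil, err
		}
	}
	return net.Listen(network, address)
}
```

The returned `net.Listener` is what a gRPC server would be asked to serve on, so the rest of the server setup would not need to know which transport was chosen.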

I tested on all 3 platforms: darwin, linux and winderz.
Will reach out to Endpoint team, since they likely would need to adjust comms on Endpoint side.

This is the first cut, can change based on the feedback.

Open for feedback/opinions on:

Agent on the new GRPC configuration option, currently it's a boolean flag.
The comms socket is derived from the control domain socket/named pipe path if GRPC is set to Local. I wanted it to live in the same location as the control domain socket path, with the same fallback for path length limits.
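
A rough sketch of what deriving the comms socket from the control socket path, with a length-limit fallback, might look like. The file name `grpc.sock`, the `/tmp` fallback directory, and the function name are assumptions for illustration only; the limit reflects the `sun_path` size of 104 bytes on macOS and 108 bytes on Linux, including the terminating NUL.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"path"
)

// maxSocketPathLen is a conservative limit that fits sun_path on both
// macOS (104 bytes) and Linux (108 bytes), leaving room for the NUL.
const maxSocketPathLen = 103

// fallbackDir is a hypothetical short directory used when the derived
// path would be too long.
const fallbackDir = "/tmp"

// grpcSocketPath places the gRPC socket next to the control socket. If
// the result exceeds the limit, it falls back to a short path whose file
// name is a hash of the long one, so the result stays stable per install.
func grpcSocketPath(controlSocket string) string {
	candidate := path.Join(path.Dir(controlSocket), "grpc.sock")
	if len(candidate) <= maxSocketPathLen {
		return candidate
	}
	sum := sha256.Sum256([]byte(candidate))
	return path.Join(fallbackDir, fmt.Sprintf("%x.sock", sum[:8]))
}
```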

Why is it important?

Addresses #4248

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool
I have added an integration test or an E2E test

Related issues

mergify · 2024-02-12T23:51:59Z

This pull request does not have a backport label. Could you fix it @aleksmaus? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

elasticmachine · 2024-02-13T07:37:08Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

aleksmaus · 2024-02-13T18:41:24Z

Talked to @jrmolin. One remaining issue: connection info discovery still uses an IP port, as defined in the endpoint spec:
https://github.com/elastic/elastic-agent/blob/main/specs/endpoint-security.spec.yml#L24

We could potentially allow a base path in the endpoint spec for the domain socket/named pipe used by the connection info API.
If the agent starts in “local” GRPC mode, it would use that location to create the domain socket server for the connection info API; otherwise it would use the port.
The problem is that the endpoint doesn’t know which mode the agent was started in, so it would have to check both the IP port and the domain socket/pipe for the connection info server.

Thinking what other options we could use within the current implementation.

@blakerouse @cmacknz

Thoughts?

cmacknz · 2024-02-14T21:30:21Z

> The problem is that the endpoint doesn’t know which mode the agent was started in, so it would have to check both the IP port and the domain socket/pipe for the connection info server.

It seems simpler to just align this change in both endpoint and agent rather than introduce complexity to deal with this that we would remove as soon as endpoint supports domain sockets. The endpoint version is always aligned with the agent version so we ideally wouldn't have to release a version where endpoint doesn't know what to expect.

> Talked to @jrmolin. One remaining issue: connection info discovery still uses an IP port, as defined in the endpoint spec:
> https://github.com/elastic/elastic-agent/blob/main/specs/endpoint-security.spec.yml#L24
>
> We could potentially allow a base path in the endpoint spec for the domain socket/named pipe used by the connection info API.

We could use a fixed base path for the unix socket connection instead of a fixed port. The socket/pipe would be owned by root/admin, so it is already more secure than an arbitrary port.

aleksmaus · 2024-02-14T21:37:54Z

> It seems simpler to just align this change in both endpoint and agent rather than introduce complexity to deal with this that we would remove as soon as endpoint supports domain sockets. The endpoint version is always aligned with the agent version so we ideally wouldn't have to release a version where endpoint doesn't know what to expect.

I made this mode configurable because there may be cases where you want to use an IP socket, for example when running agent and beats in separate pods. So the Endpoint would NOT know in advance whether an IP or domain socket is used for RPC.

cmacknz · 2024-02-14T21:39:34Z

Looking at the code we can use the agent.grpc.local as a feature flag initially that defaults to false and is only flipped to true once all of the inputs are confirmed to work with unix sockets/named pipes.

The Go based inputs all need to be updated to https://github.com/elastic/elastic-agent-client/releases/tag/v7.8.0 first which shouldn't be that hard to coordinate, endpoint is slightly more complicated.

Thinking more about this it would actually be best if endpoint could handle either a TCP port or a domain socket/pipe. This way we can keep the configuration and if we hit an unexpected problem we have the ability to easily go back.

So I'd say the path forward is:

1. We define how we are going to specify the domain socket location for endpoint to connect to.
2. Endpoint looks for this location, and if it doesn't exist it falls back to checking the TCP port. Maybe it just looks for both at the same time knowing agent will only support one of them.
3. We confirm all inputs are updated to at least https://github.com/elastic/elastic-agent-client/releases/tag/v7.8.0 and support domain sockets/pipes.
4. We can flip the agent.grpc.local flag to true by default.
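
The "look for the socket, then fall back to the TCP port" step could be sketched as below. Endpoint itself is not written in Go, so this only illustrates the ordering; the `target` type and both function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// target is one way to reach the agent.
type target struct {
	Network string // "unix" or "tcp"
	Address string
}

// dialOrder lists the targets in the order they should be tried:
// the domain socket first, then the TCP address.
func dialOrder(socketPath, tcpAddr string) []target {
	var targets []target
	if socketPath != "" {
		targets = append(targets, target{Network: "unix", Address: socketPath})
	}
	if tcpAddr != "" {
		targets = append(targets, target{Network: "tcp", Address: tcpAddr})
	}
	return targets
}

// dialFirstAvailable tries each target in order and returns the first
// connection that succeeds, together with the target that was used.
func dialFirstAvailable(targets []target, timeout time.Duration) (net.Conn, target, error) {
	lastErr := errors.New("no targets to dial")
	for _, t := range targets {
		conn, err := net.DialTimeout(t.Network, t.Address, timeout)
		if err == nil {
			return conn, t, nil
		}
		lastErr = fmt.Errorf("dial %s %s: %w", t.Network, t.Address, err)
	}
	return nil, target{}, lastErr
}
```

Because the agent only serves one of the two at a time, trying them in sequence is enough; the returned `target` tells the caller which mode the agent is actually running in.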

cmacknz · 2024-02-14T21:40:12Z

> I made this mode configurable because there may be cases where you want to use an IP socket, for example when running agent and beats in separate pods. So the Endpoint would have to know in advance whether an IP or domain socket is used for RPC.

Yes it is also a way to run the agent twice on the same machine since you can more easily change the port than the unix socket location.

blakerouse · 2024-03-06T15:08:34Z

@aleksmaus @cmacknz

Endpoint will not install when the Elastic Agent is not installed in the default location: https://github.com/elastic/elastic-agent/blob/main/specs/endpoint-security.spec.yml#L21 This is because Endpoint protects the Elastic Agent directory, and it's hard-coded to that directory.

In that case the Elastic Agent should just create a Unix domain socket inside its default installation directory for Endpoint to connect to and get the connection information. Windows is even easier: it should just use a defined npipe.

I also agree that Endpoint should check the unix socket/npipe first and then fall back to checking the network port for the information.

I also prefer that this just become the default.

intxgo · 2024-03-06T18:03:40Z

Taking into account all the basic hardening we've added to Endpoint recently (policy signing, Tamper Protection), we ended with a flow:

1. there's a clean machine
2. user installs Agent
3. Agent installs Endpoint (with no config)
4. Endpoint gets its first config from Agent
5. Endpoint is "hardened" and from that point it's hard to muck with it...

so for this question "how to tell Endpoint how it should talk to Agent" it's a matter of the bootstrap phase 1-4. Maybe the npipe/socket should be a plain Endpoint command line install parameter? I assume once Endpoint is configured with a given comms method it won't change.

aleksmaus · 2024-03-06T19:00:10Z

> I assume once Endpoint is configured with a given comms method it won't change.

This is configurable, so it's possible to change. I don't know why anybody would do that, but it's possible.

intxgo · 2024-03-06T19:06:41Z

> This is configurable, so it's possible to change. I don't know why anybody would do that, but it's possible.

OK. Then it's not so straightforward. Maybe we could define more constraints, e.g. it's possible to change the comms method but it will require an Endpoint re-install? The latter would ensure that sufficient prerequisites are met to actually be able to uninstall Endpoint.

cmacknz · 2024-03-06T19:48:49Z

> OK. Then it's not so straightforward. Maybe we could define more constraints, e.g. it's possible to change the comms method but it will require an Endpoint re-install?

Changing the comms method will require every component to restart at minimum I think. I don't love the complexity this brings however I also don't like the idea of changing this without giving users who experience unexpected problems with it an escape hatch. We could require an agent restart but there's no way to initiate this from Fleet right now. Perhaps the agent automatically re-execs itself whenever this changes in the configuration.

One thing that isn't touched in the PR yet is that the bootstrap process for endpoint today also involves local TCP. We should change this to also use a unix/named pipe at a fixed location instead of a fixed TCP port so that you have to be root to connect to it.

elastic-agent/pkg/component/runtime/conn_info_server.go

Lines 43 to 49 in d075097

    
    conn, err := listener.Accept()
    if err != nil {
        log.Errorf("failed accept conn info connection: %v", err)
        break
    }
    log.Debugf("client connected, sending connection info")
    err = comm.WriteStartUpInfo(conn)

The process today is:

1. Endpoint connects to this connection information server. Agent responds with the gRPC address and the TLS certificates for it.
2. Endpoint then connects to the gRPC server and is given its initial configuration.

We want both the bootstrapping and the control protocol gRPC address to be unix/named pipes.

As Blake suggests, for the bootstrapping we can just use a fixed location for the bootstrapping pipe. That can then communicate the address of the gRPC pipe.

We could also try to eliminate the need to bootstrap endpoint entirely since we can use a fixed pipe address for gRPC, but we'd still need a way to give endpoint the certs for mTLS.
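
As a rough sketch of the bootstrap flow described above moved onto a Unix domain socket: a connection info server listens at a fixed path, restricted to the owning user, and hands each client the gRPC address plus certificate material. The `connInfo` type, its fields, the JSON encoding, and all function names are assumptions for illustration; the real server writes its own start-up info format via `comm.WriteStartUpInfo`.

```go
package main

import (
	"encoding/json"
	"net"
	"os"
)

// connInfo is a hypothetical payload: where the gRPC server lives and
// the certificate material a client needs to connect to it.
type connInfo struct {
	GRPCAddress string `json:"grpc_address"`
	CACertPEM   string `json:"ca_cert_pem"`
}

// listenConnInfo opens the bootstrap socket and restricts it to its owner.
func listenConnInfo(socketPath string) (net.Listener, error) {
	// Remove a stale socket file from a previous run, if any.
	if err := os.Remove(socketPath); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	lis, err := net.Listen("unix", socketPath)
	if err != nil {
		return nil, err
	}
	if err := os.Chmod(socketPath, 0o600); err != nil {
		lis.Close()
		return nil, err
	}
	return lis, nil
}

// serveConnInfo sends the connection info to every client that connects
// and returns once the listener is closed.
func serveConnInfo(lis net.Listener, info connInfo) {
	for {
		conn, err := lis.Accept()
		if err != nil {
			return
		}
		_ = json.NewEncoder(conn).Encode(info) // best effort; client may retry
		conn.Close()
	}
}

// fetchConnInfo is the client side: connect, read one payload, disconnect.
func fetchConnInfo(socketPath string) (connInfo, error) {
	var info connInfo
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return info, err
	}
	defer conn.Close()
	err = json.NewDecoder(conn).Decode(&info)
	return info, err
}
```

With the socket owned by root and created with mode 0600, an unprivileged local process can no longer connect to the bootstrap server the way it could to a fixed localhost TCP port.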

aleksmaus · 2024-03-07T01:12:37Z

> We want both the bootstrapping and the control protocol gRPC address to be unix/named pipes.

Yes, that's one of the remaining things for this feature to complete it and is on my TODO list.

gabriellandau · 2024-04-09T18:12:25Z

The C++ grpc library that Endpoint uses doesn't support domain sockets or named pipes. There's a longstanding FR, but it's not implemented. We don't think we can land full gRPC-over-socket/npipe in 8.14, but we do think we can land the updated bootstrap.

Would it be possible for Agent to support the new comms for bootstrap, while maintaining localhost gRPC TCP for Endpoint in 8.14?

aleksmaus · 2024-04-10T13:35:59Z

> Would it be possible for Agent to support the new comms for bootstrap, while maintaining localhost gRPC TCP for Endpoint in 8.14?

This would require some agent code rework, don't know the scope of changes yet, need to dig more.
I assume it could be possible to have two "GRPC servers" in 8.14: a TCP one for Endpoint only, and the local domain sockets/named pipes for the rest of the agent/beats comms and the "connection info" endpoint.
Kind of odd but that's one way to work around the current limitation on Endpoint side.
Thoughts? @blakerouse @cmacknz

elastic-sonarqube · 2024-04-10T15:55:54Z

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

2 New issues
0 Security Hotspots
54.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

cmacknz · 2024-04-10T19:04:28Z

> Kind of odd but that's one way to work around the current limitation on Endpoint side.
> Thoughts? @blakerouse @cmacknz

There are security benefits to using unix sockets/named pipes only for the bootstrap portion and leaving the rest unchanged. So there is value in doing this.

I wouldn't bother using TCP only for gRPC communication with endpoint; we need to switch all the components over at once. I don't think we want to deal with two gRPC servers.

I would think of the endpoint bootstrap improvement as a separate change from the change to the control protocol server. They are two different parts of agent.

aleksmaus · 2024-04-11T15:29:25Z

Added the commit that uses TCP gRPC for comms and forces domain socket/named pipe for connection info server.

@blakerouse I could not rely on the gRPC Port config (-1) in this case, since we were asked to keep gRPC on a TCP socket and only use a domain socket for the connection info server. With this change the connection info server is always forced to use a domain socket.
To make it configurable, I would have to introduce a new configuration option for the connection info server only. That could become confusing once we start using Port -1 to switch everything to domain sockets.
@blakerouse @cmacknz
Let me know if you still need the connection info server to be configurable (able to switch between domain socket and TCP socket) for 8.14.
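
To make the split concrete, here is a small sketch of the behavior described in this comment: the gRPC control endpoint follows the port setting (with -1 meaning "local"), while the connection info endpoint is always local. The type and function names are hypothetical, and only the Unix socket case is shown.

```go
package main

import "fmt"

// localPort is the sentinel port value from the discussion that means
// "use a domain socket/named pipe instead of TCP".
const localPort = -1

// grpcConfig is a hypothetical stand-in for the agent's gRPC settings.
type grpcConfig struct {
	Address string
	Port    int
}

// endpoint is a network/address pair suitable for a listener.
type endpoint struct {
	Network string
	Address string
}

// controlEndpoint picks the gRPC control protocol endpoint from the port.
func controlEndpoint(c grpcConfig, socketPath string) endpoint {
	if c.Port == localPort {
		return endpoint{Network: "unix", Address: socketPath}
	}
	return endpoint{Network: "tcp", Address: fmt.Sprintf("%s:%d", c.Address, c.Port)}
}

// connInfoEndpoint is always local, regardless of the gRPC port setting.
func connInfoEndpoint(socketPath string) endpoint {
	return endpoint{Network: "unix", Address: socketPath}
}
```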

cmacknz · 2024-04-11T19:21:21Z

> Let me know if you still need the connection info server to be configurable (able to switch between domain socket and TCP socket) for 8.14.

We have been using unix sockets and named pipes without issue for collecting monitoring information from the Beats that run as part of agent for a long time (it is also used for the control socket) so the risk overall with this change is low. The scope of changing just the bootstrap process is also lower. Generally the only problems we've seen are hitting the arbitrary OS limits on the lengths of the names for the pipes.

I also think the security benefits of this change are not as strong if we add a configuration flag to undo it. So for now I'd say we don't need a configuration flag for this as long as we can align it with endpoint merging the change at the same time. I could be argued into changing this opinion.

gabriellandau · 2024-04-11T19:33:31Z

> I also think the security benefits of this change are not as strong if we add a configuration flag to undo it. So for now I'd say we don't need a configuration flag for this as long as we can align it with endpoint merging the change at the same time.

I agree.

elasticmachine · 2024-04-29T19:50:42Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

ycombinator · 2024-06-07T14:14:43Z

👋 @aleksmaus, just wondering what your plans are for this PR, given the last activity on it was just over a month ago? Do you want to keep iterating on it or should we close it out?

aleksmaus · 2024-06-07T20:11:09Z

This PR is done and was waiting on the Endpoint PR to merge in order to pass the integration tests. The Endpoint PR was merged today:
https://github.com/elastic/endpoint-dev/pull/14386#event-13077664540
but it missed the build window. So in theory it should be available by next week; then we can kick off the integration tests again, pass the CI validation and merge.

elastic-sonarqube · 2024-06-10T13:40:39Z

Quality Gate passed

Issues
2 New issues
1 Fixed issue
0 Accepted issues

Measures
0 Security Hotspots
42.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

Agent/beats grpc comms over domain socket/named pipe

885dee9

aleksmaus added the enhancement New feature or request label Feb 12, 2024

aleksmaus requested review from cmacknz and blakerouse February 12, 2024 23:51

aleksmaus requested a review from a team as a code owner February 12, 2024 23:51

mergify bot assigned aleksmaus Feb 12, 2024

mergify bot added the backport-skip label Feb 12, 2024

aleksmaus added 2 commits February 12, 2024 18:52

Add changelog fragment

980a55e

Merge branch 'main' into feature/local_rpc

a1cb851

pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Feb 13, 2024

intxgo reviewed Feb 13, 2024

View reviewed changes

internal/pkg/agent/application/application.go Outdated Show resolved Hide resolved

pierrehilbert and others added 3 commits February 13, 2024 14:27

Merge branch 'main' into feature/local_rpc

eb49f9e

Fix log message typo

ec75a93

Co-authored-by: Leszek Kubik

Merge branch 'main' into feature/local_rpc

c334184

aleksmaus added 2 commits February 13, 2024 13:48

Merge branch 'main' into feature/local_rpc

e0c58ac

Merge branch 'main' into feature/local_rpc

175ba0e

Merge branch 'main' into feature/local_rpc

e54e07c

Merge branch 'main' into feature/local_rpc

0a5d26f

Use TCP gRPC for comms and local (domain socket/named pipe) for connection info server.

0b8ca7b

aleksmaus added 2 commits April 15, 2024 09:00

Merge branch 'main' into feature/local_rpc

c313535

Merge branch 'main' into feature/local_rpc

8d0c5e5

ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 29, 2024

aleksmaus and others added 7 commits May 6, 2024 09:34

Merge branch 'main' into feature/local_rpc

3d78cc1

Merge branch 'main' into feature/local_rpc

e83d6b6

Merge branch 'main' into feature/local_rpc

e0fa46a

Merge branch 'main' into feature/local_rpc

ace82ff

Merge branch 'main' into feature/local_rpc

68792b0

Merge branch 'main' into feature/local_rpc

d5900c4

Merge branch 'main' into feature/local_rpc

e50ff66

pierrehilbert and others added 3 commits June 8, 2024 18:01

Merge branch 'main' into feature/local_rpc

2286461

Merge branch 'main' into feature/local_rpc

a0aac67

Rollback the unit test changes, because the local gRPC configuration is disabled

37f8aa2

aleksmaus merged commit efd9ee6 into elastic:main Jun 10, 2024
14 checks passed

aleksmaus mentioned this pull request Jun 10, 2024

Agent/beats gRPC over domain sockets/named pipes #4899

Open

cmacknz mentioned this pull request Jan 23, 2025

Change the default gRPC port to 0 when in a container #6585

Merged

mergify bot mentioned this pull request Jan 24, 2025

[8.x](backport #6585) Change the default gRPC port to 0 when in a container #6597

Merged
