- Feature Name: Client connection limits
- Status: implemented
- Start Date: 2022-02-07
- Authors: John Spray
- Issue:

# Executive Summary

Add per-client state to track and limit the number of open connections.

Similar to KIP-308, adapted to Redpanda's thread-per-core model.

## What is being proposed

New configuration properties:
- kafka_connections_max (optional integer, unset by default)
- kafka_connections_max_per_ip (optional integer, unset by default)
- kafka_connections_max_overrides (list of strings, default empty)

New Prometheus metric:
- vectorized_kafka_rpc_connections_rejected

When an accepted connection would exceed one of the configured bounds,
it is dropped before anything is read from or written to the socket.
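
As a minimal sketch of the intended behaviour (the names below are illustrative
assumptions, not the actual Redpanda symbols): a connection that would push a
client over its limit increments the rejected-connections counter and is closed
without any Kafka protocol traffic.

```c++
#include <cstdint>

// Illustrative only: the accept-time decision. The real check lives in the
// Kafka server's accept path and uses Redpanda's metrics machinery for the
// vectorized_kafka_rpc_connections_rejected counter.
struct client_quota {
    uint32_t total;  // configured limit for this client
    uint32_t in_use; // connections currently open
};

uint64_t connections_rejected = 0;

// Returns true if the connection may proceed; false means the caller should
// shut the socket down before reading or writing anything. A limit of zero
// rejects every connection from the client (the deny-listing case).
bool admit_connection(client_quota& q) {
    if (q.in_use >= q.total) {
        ++connections_rejected;
        return false;
    }
    ++q.in_use;
    return true;
}
```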

## Motivation

* Avoid hitting system resource limits (e.g. open file handles) related to
  the number of open sockets
* Prevent rogue clients from consuming unbounded memory (we allocate
  some userspace buffer space for every connection, and the kernel
  allocates some buffers too).
* Help users notice if a client is acting strangely: an application
  author opening large numbers of connections will tend to notice it
  themselves when they hit the limit, rather than the operator
  of the cluster noticing when cluster performance is impacted.
* Provide a client allow-listing/deny-listing mechanism (e.g.
  kafka_connections_max_per_ip is set to zero and
  kafka_connections_max_overrides specifies the
  permitted clients)
* Continuity of experience for Kafka users accustomed to
  max.connections.per.ip

## How

### Accepting a connection

If both limit properties (kafka_connections_max and kafka_connections_max_per_ip)
are unset, no action is taken: this preserves the existing unlimited behaviour.

A sharded service stores per-client tokens for open connections; clients are
mapped to shards by a hash of their IP.

In kafka::protocol::apply, a token is acquired at the start and relinquished at
the end, probably via an RAII structure, although this function already provides
a convenient, localized place to take and release the resource explicitly.
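
A minimal sketch of the RAII idea, using a hypothetical guard type (not the
actual Redpanda class): the token is returned whenever `apply` exits, whether
normally or via an exception.

```c++
#include <functional>
#include <utility>

// Illustrative only: owns one connection token and hands it back to the
// quota service when destroyed, on every exit path from apply().
class conn_token_guard {
public:
    explicit conn_token_guard(std::function<void()> release)
      : _release(std::move(release)) {}

    conn_token_guard(const conn_token_guard&) = delete;
    conn_token_guard& operator=(const conn_token_guard&) = delete;

    ~conn_token_guard() {
        if (_release) {
            _release(); // return the token
        }
    }

private:
    std::function<void()> _release;
};
```

In practice the release path may involve a cross-shard call and therefore be
asynchronous, so an explicit take/release pair at the top and bottom of `apply`
may end up simpler than a destructor.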

On first connection from a client, the shard servicing the connection makes a
cross-shard call to the core that owns the state for this client's IP hash.
This initializes the client's state:

```c++
struct client_conn_quota {
    // The maximum number of connections for this client. This is set by
    // applying the combination of per_ip and per_ip_override settings.
    uint32_t total;

    // How many connections are currently open. This may be at
    // most `total`, although it may exceed it transiently if
    // configuration properties are decreased during operation.
    uint32_t in_use;
};
```

The connection quota state is stored in an absl::btree_map: the overall number
of unique clients may be high (so a flat store is risky), but the entries are
small (so a std::map would generate a lot of tiny allocations).
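
Putting the struct and the map together, a rough sketch of the owning shard's
state (the type and function names are assumptions for illustration; the real
code would use Seastar's sharded-service and cross-shard invoke machinery):

```c++
#include <absl/container/btree_map.h>

#include <cstdint>
#include <functional>
#include <string>

struct client_conn_quota {
    uint32_t total;  // effective limit for this client IP
    uint32_t in_use; // connections currently open
};

// Illustrative only: the state held by the shard that owns a client IP
// (keyed here by the string form of the address for simplicity).
class quota_shard {
public:
    // Find or create the entry for a client, applying the effective limit
    // derived from kafka_connections_max_per_ip and any matching
    // kafka_connections_max_overrides entry.
    client_conn_quota&
    get_or_create(const std::string& ip, uint32_t effective_limit) {
        auto [it, inserted]
          = _quotas.try_emplace(ip, client_conn_quota{effective_limit, 0});
        return it->second;
    }

private:
    absl::btree_map<std::string, client_conn_quota> _quotas;
};

// Which shard owns a client's state: any stable hash of the address works,
// as long as every shard computes the same answer.
unsigned owner_shard(const std::string& ip, unsigned shard_count) {
    return static_cast<unsigned>(std::hash<std::string>{}(ip) % shard_count);
}
```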

### Optimization: caching tokens

In the case where the client is not close to exceeding its limits, we can avoid
the overhead of a cross-core call by caching some per-client tokens on each
shard.

This cache can be pre-primed on the first client connection, if the client's
total token allowance is greater than the core count: that way, subsequent
connections from the same client will usually be accepted without the need for
a cross-core call.
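
Assuming tokens are spread evenly across cores, the pre-priming arithmetic
might look like this (the helper is hypothetical):

```c++
#include <cstdint>

// Illustrative only: how many tokens each shard can cache up front for a
// newly seen client. With no headroom over the core count, nothing is
// cached and every new connection goes through the owning shard.
uint32_t cached_tokens_per_shard(uint32_t client_total, uint32_t shard_count) {
    if (client_total <= shard_count) {
        return 0;
    }
    // The remainder (client_total % shard_count) can stay with the owning shard.
    return client_total / shard_count;
}
```

For example, with kafka_connections_max_per_ip set to 256 on an 8-core broker,
each shard could cache 32 tokens.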

### Handling a configuration change

Changes to the configuration properties affect all outstanding
client_conn_quota objects: the `total` attribute must be recalculated.

If `total` is decreased, then it is also necessary to call out to all other
shards and revoke any cached tokens that are held in excess of the new limit.
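
A sketch of the recalculation on the owning shard, with hypothetical names; the
revocation itself would be a cross-shard broadcast (e.g. over the sharded
service) that trims each shard's cached tokens.

```c++
#include <cstdint>

struct client_conn_quota {
    uint32_t total;
    uint32_t in_use;
};

// Illustrative only: apply a newly computed limit to one client's entry.
// Returns the reduction in allowance; up to this many cached tokens have
// to be revoked from the other shards' caches. `in_use` is left alone and
// may transiently exceed the new total, as noted above.
uint32_t apply_new_limit(client_conn_quota& quota, uint32_t new_total) {
    const uint32_t old_total = quota.total;
    quota.total = new_total;
    return new_total < old_total ? old_total - new_total : 0;
}
```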

## Impact

When this feature is enabled, a per-connection latency overhead is added, as
the shard handling the connection must dispatch a call to the shard that owns
the client IP state. A cross-core call can take up to a few microseconds,
although the impact on P99 latency can be worse if the target core is very busy.

On a system where the connection count limit is significantly larger than the
core count, this overhead will be avoided in most cases, as each core will
store some locally cached tokens allowing it to accept connections locally.

# Guide-level explanation

## How do we teach this?

* Presume the reader knows what a TCP connection is
* Explain how Kafka clients map to TCP connections: a producer or consumer
  usually opens a connection and uses it over a long period of time, but
  a program that constructs many Kafka clients in parallel might open more
  connections. The relationship between clients and connections is not 1:1,
  as a client may e.g. be both a producer and a consumer, and may open
  additional connections for reading metadata.
* Explain why it's bad to have an unbounded number of connections from a
  client:
  * Each connection has some memory overhead that might exhaust redpanda memory
  * Each connection counts against OS-level limits (`ulimit`)
  * It's probably not what the user intended for efficient operation, as the
    Kafka protocol generally re-uses a smaller number of connections.
* Introduce the new configuration properties, including the syntax of the
  `overrides` property
  * Mention the caveat that decreasing the limits does not terminate any
    currently open connections.

## Interaction with other features

* Dependency on centralized config for live updates

## Drawbacks

### Latency overhead on first message in a new connection

Mitigating factors:
* Sensible workloads generally operate on established connections
* No overhead if the feature is not enabled.

## Alternatives Considered

### Do nothing

An application may exhaust server resources by opening a very large number of
connections (e.g. by instantiating many Kafka clients). This can be mitigated
in environments where the application design is tightly controlled and the
server cluster is sized to match, but in real-life environments this is
unlikely to be the case, and it is relatively easy for a naively-written
application to open unbounded numbers of Kafka client connections. For example,
consider a web service that opens a Kafka connection for each request it serves,
and does not have a limit on its own concurrent request count.

### Total connection count, rather than per-client-IP

To protect redpanda from resource exhaustion, it is not necessary to
discriminate between clients: we could just have a single overall connection
count.

This prevents redpanda from having resources exhausted, but does not help
preserve availability of the system: one misbehaving client can crowd out
other clients.

### Shard-local limits

We could avoid some complexity and latency by setting a per-core connection
limit, rather than a total connection limit across all cores. This would
enable all cores to apply their limits using local state only.

This is reasonable at small core counts, but tends to fall down at larger core
counts: consider a 128-core system, where a per-core limit high enough to
tolerate reasonable applications (e.g. 10 connections) would result in a total
limit of 1280, much higher than the user wanted.

### Per-user limits

In many environments, the client IP is not a meaningful discriminator between
workloads -- workloads are more likely to be identifiable by their service
account username.

Strictly speaking, it is not possible to limit connections by username, because
we have to accept a connection and receive some bytes to learn the username, but
it could still be useful to drop connections immediately after authentication
if a user has too many connections already.

Adding per-user limits in future is not ruled out, but the scope of this RFC is
limited to the more basic IP-based limits, to get parity with KIP-308.

### Kernel-space (eBPF/BPF/iptables) enforcement

Where the connection allowance for a particular source IP has been exhausted,
or was set to zero, we could more efficiently drop connections by configuring
the kernel to drop them for us, rather than doing it in userspace. This kind of
efficiency is important in filtering systems that guard against DDoS situations.

The connection count limits in this RFC are primarily meant to protect against
misbehaving (but benign) applications rather than active attack, so the overhead
of enforcing in userspace is acceptable. Avoiding kernel mechanisms also helps
with portability (including to older enterprise Linux distros) and avoids
potential permissions issues in locked-down environments that might forbid
userspace processes from using some kernel-space mechanisms.

All that said, extending our userspace enforcement with lower-level mechanisms
in future could be a worthwhile optimization. In Kubernetes environments,
we would need to consider whether to implement this type of enforcement within
Redpanda pods, or within the Kubernetes network stack where external IPs are
exposed.