- Feature Name: Client connection limits
- Status: implemented
- Start Date: 2022-02-07
- Authors: John Spray
- Issue:

# Executive Summary

Add per-client state to track and limit the number of open connections.

Similar to KIP-308, adapted to Redpanda's thread-per-core model.

## What is being proposed

New configuration properties:
- kafka_connections_max (optional integer, unset by default)
- kafka_connections_max_per_ip (optional integer, unset by default)
- kafka_connections_max_overrides (list of strings, default empty)

New Prometheus metric:
- vectorized_kafka_rpc_connections_rejected

When an accepted connection would exceed one of the configured bounds,
it is dropped before anything is read from or written to the socket.
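
As a minimal sketch of the intended behaviour (the names below are illustrative
assumptions, not the actual Redpanda symbols): a connection that would push a
client over its limit increments the rejected-connections counter and is closed
without any Kafka protocol traffic.

```c++
#include <cstdint>

// Illustrative only: the accept-time decision. The real check lives in the
// Kafka server's accept path and uses Redpanda's metrics machinery for the
// vectorized_kafka_rpc_connections_rejected counter.
struct client_quota {
    uint32_t total;  // configured limit for this client
    uint32_t in_use; // connections currently open
};

uint64_t connections_rejected = 0;

// Returns true if the connection may proceed; false means the caller should
// shut the socket down before reading or writing anything. A limit of zero
// rejects every connection from the client (the deny-listing case).
bool admit_connection(client_quota& q) {
    if (q.in_use >= q.total) {
        ++connections_rejected;
        return false;
    }
    ++q.in_use;
    return true;
}
```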

## Motivation

* Avoid hitting system resource limits (e.g. open file handles) related to
  the number of open sockets
* Prevent rogue clients from consuming unbounded memory (we allocate
  some userspace buffer space for every connection, and the kernel
  allocates some buffers too).
* Help users notice if a client is acting strangely: an application
  author opening large numbers of connections will tend to notice it
  themselves when they hit the limit, rather than the operator
  of the cluster noticing when cluster performance is impacted.
* Provide a client allow-listing/deny-listing mechanism (e.g.
  kafka_connections_max_per_ip is set to zero and
  kafka_connections_max_overrides specifies the
  permitted clients)
* Continuity of experience for Kafka users accustomed to
  max.connections.per.ip

## How

### Accepting a connection

If both limit properties (kafka_connections_max and kafka_connections_max_per_ip)
are unset, no action is taken: this preserves the existing unlimited behaviour.

A sharded service stores per-client tokens for open connections; clients are
mapped to shards by a hash of their IP.

In kafka::protocol::apply, a token is acquired at the start and relinquished at
the end, probably via an RAII structure, although this function already provides
a convenient, localized place to take and release the resource explicitly.
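
A minimal sketch of the RAII idea, using a hypothetical guard type (not the
actual Redpanda class): the token is returned whenever `apply` exits, whether
normally or via an exception.

```c++
#include <functional>
#include <utility>

// Illustrative only: owns one connection token and hands it back to the
// quota service when destroyed, on every exit path from apply().
class conn_token_guard {
public:
    explicit conn_token_guard(std::function<void()> release)
      : _release(std::move(release)) {}

    conn_token_guard(const conn_token_guard&) = delete;
    conn_token_guard& operator=(const conn_token_guard&) = delete;

    ~conn_token_guard() {
        if (_release) {
            _release(); // return the token
        }
    }

private:
    std::function<void()> _release;
};
```

In practice the release path may involve a cross-shard call and therefore be
asynchronous, so an explicit take/release pair at the top and bottom of `apply`
may end up simpler than a destructor.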

On first connection from a client, the shard servicing the connection makes a
cross-shard call to the core that owns the state for this client's IP hash.
This initializes the client's state:

```c++
struct client_conn_quota {
    // The maximum number of connections for this client. This is set by
    // applying the combination of per_ip and per_ip_override settings.
    uint32_t total;

    // How many connections are currently open. This may be at
    // most `total`, although it may exceed it transiently if
    // configuration properties are decreased during operation.
    uint32_t in_use;
};
```

The connection quota state is stored in an absl::btree_map: the overall number
of unique clients may be high (so a flat store is risky), but the entries are
small (so a std::map would generate a lot of tiny allocations).
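
Putting the struct and the map together, a rough sketch of the owning shard's
state (the type and function names are assumptions for illustration; the real
code would use Seastar's sharded-service and cross-shard invoke machinery):

```c++
#include <absl/container/btree_map.h>

#include <cstdint>
#include <functional>
#include <string>

struct client_conn_quota {
    uint32_t total;  // effective limit for this client IP
    uint32_t in_use; // connections currently open
};

// Illustrative only: the state held by the shard that owns a client IP
// (keyed here by the string form of the address for simplicity).
class quota_shard {
public:
    // Find or create the entry for a client, applying the effective limit
    // derived from kafka_connections_max_per_ip and any matching
    // kafka_connections_max_overrides entry.
    client_conn_quota&
    get_or_create(const std::string& ip, uint32_t effective_limit) {
        auto [it, inserted]
          = _quotas.try_emplace(ip, client_conn_quota{effective_limit, 0});
        return it->second;
    }

private:
    absl::btree_map<std::string, client_conn_quota> _quotas;
};

// Which shard owns a client's state: any stable hash of the address works,
// as long as every shard computes the same answer.
unsigned owner_shard(const std::string& ip, unsigned shard_count) {
    return static_cast<unsigned>(std::hash<std::string>{}(ip) % shard_count);
}
```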

### Optimization: caching tokens

In the case where the client is not close to exceeding its limits, we can avoid
the overhead of a cross-core call by caching some per-client tokens on each
shard.

This cache can be pre-primed on the first client connection, if the client's
total token allowance is greater than the core count: that way, subsequent
connections from the same client will usually be accepted without the need for
a cross-core call.
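
Assuming tokens are spread evenly across cores, the pre-priming arithmetic
might look like this (the helper is hypothetical):

```c++
#include <cstdint>

// Illustrative only: how many tokens each shard can cache up front for a
// newly seen client. With no headroom over the core count, nothing is
// cached and every new connection goes through the owning shard.
uint32_t cached_tokens_per_shard(uint32_t client_total, uint32_t shard_count) {
    if (client_total <= shard_count) {
        return 0;
    }
    // The remainder (client_total % shard_count) can stay with the owning shard.
    return client_total / shard_count;
}
```

For example, with kafka_connections_max_per_ip set to 256 on an 8-core broker,
each shard could cache 32 tokens.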

### Handling a configuration change

Changes to the configuration properties affect all outstanding
client_conn_quota objects: the `total` attribute must be recalculated.

If `total` is decreased, then it is also necessary to call out to all other
shards and revoke any cached tokens that are held in excess of the new limit.
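
A sketch of the recalculation on the owning shard, with hypothetical names; the
revocation itself would be a cross-shard broadcast (e.g. over the sharded
service) that trims each shard's cached tokens.

```c++
#include <cstdint>

struct client_conn_quota {
    uint32_t total;
    uint32_t in_use;
};

// Illustrative only: apply a newly computed limit to one client's entry.
// Returns the reduction in allowance; up to this many cached tokens have
// to be revoked from the other shards' caches. `in_use` is left alone and
// may transiently exceed the new total, as noted above.
uint32_t apply_new_limit(client_conn_quota& quota, uint32_t new_total) {
    const uint32_t old_total = quota.total;
    quota.total = new_total;
    return new_total < old_total ? old_total - new_total : 0;
}
```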

## Impact

When this feature is enabled, a per-connection latency overhead is added, as
the shard handling the connection must dispatch a call to the shard that owns
the client IP state. A cross-core call can take up to a few microseconds,
although the impact on P99 latency can be worse if the target core is very busy.

On a system where the connection count limit is significantly larger than the
core count, this overhead will be avoided in most cases, as each core will
store some locally cached tokens allowing it to accept connections locally.

# Guide-level explanation

## How do we teach this?

* Presume the reader knows what a TCP connection is
* Explain how Kafka clients map to TCP connections: a producer or consumer
  usually opens a connection and uses it over a long period of time, but
  a program that constructs many Kafka clients in parallel might open more
  connections. The relationship between clients and connections is not 1:1,
  as a client may e.g. be both a producer and a consumer, and may open
  additional connections for reading metadata.
* Explain why it's bad to have an unbounded number of connections from a
  client:
  * Each connection has some memory overhead that might exhaust redpanda memory
  * Each connection counts against OS-level limits (`ulimit`)
  * It's probably not what the user intended for efficient operation, as the
    Kafka protocol generally re-uses a smaller number of connections.
* Introduce the new configuration properties, including the syntax of the
  `overrides` property
  * Mention the caveat that decreasing the limits does not terminate any
    currently open connections.

## Interaction with other features

* Dependency on centralized config for live updates

## Drawbacks

### Latency overhead on first message in a new connection

Mitigating factors:
* Sensible workloads generally operate on established connections
* No overhead if the feature is not enabled.

## Alternatives Considered

### Do nothing

An application may exhaust server resources by opening a very large number of
connections (e.g. by instantiating many Kafka clients). This can be mitigated
in environments where the application design is tightly controlled and the
server cluster is sized to match, but in real-life environments this is
unlikely to be the case, and it is relatively easy for a naively-written
application to open unbounded numbers of Kafka client connections. For example,
consider a web service that opens a Kafka connection for each request it serves,
and does not have a limit on its own concurrent request count.

### Total connection count, rather than per-client-IP

To protect redpanda from resource exhaustion, it is not necessary to
discriminate between clients: we could just have a single overall connection
count.

This prevents redpanda from having resources exhausted, but does not help
preserve availability of the system: one misbehaving client can crowd out
other clients.

### Shard-local limits

We could avoid some complexity and latency by setting a per-core connection
limit, rather than a total connection limit across all cores. This would
enable all cores to apply their limits using local state only.

This is reasonable at small core counts, but tends to fall down at larger core
counts: consider a 128-core system, where a per-core limit high enough to
tolerate reasonable applications (e.g. 10 connections) would result in a total
limit of 1280, much higher than the user wanted.

### Per-user limits

In many environments, the client IP is not a meaningful discriminator between
workloads -- workloads are more likely to be identifiable by their service
account username.

Strictly speaking, it is not possible to limit connections by username, because
we have to accept a connection and receive some bytes to learn the username, but
it could still be useful to drop connections immediately after authentication
if a user has too many connections already.

Adding per-user limits in future is not ruled out, but the scope of this RFC is
limited to the more basic IP-based limits, to get parity with KIP-308.

### Kernel-space (eBPF/BPF/iptables) enforcement

Where the connection allowance for a particular source IP has been exhausted,
or was set to zero, we could more efficiently drop connections by configuring
the kernel to drop them for us, rather than doing it in userspace. This kind of
efficiency is important in filtering systems that guard against DDoS situations.

The connection count limits in this RFC are primarily meant to protect against
misbehaving (but benign) applications rather than active attack, so the overhead
of enforcing in userspace is acceptable. Avoiding kernel mechanisms also helps
with portability (including to older enterprise Linux distros) and avoids
potential permissions issues in locked-down environments that might forbid
userspace processes from using some kernel-space mechanisms.

All that said, extending our userspace enforcement with lower-level mechanisms
in future could be a worthwhile optimization. In Kubernetes environments,
we would need to consider whether to implement this type of enforcement within
Redpanda pods, or within the Kubernetes network stack where external IPs are
exposed.