This issue summarizes the root cause of a problem already reported by several users and in several issues. The point is not to replace the original issues but to discuss a solution, which apparently requires an architectural change in the operation of the p2p communication layer.
## Context
Two nodes in a CometBFT network are connected using a single, authenticated, encrypted connection. The baseline connection is a raw TCP connection, which is wrapped in a secret connection responsible for authenticating the endpoints and encrypting the traffic.

The (secret) connection is managed by a multiplex connection (`MConnection`) that implements the abstraction of multiple channels between the two nodes. Ideally, channels should work as independent connections between peers, not interfering with each other.
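As a rough illustration of this layering, the following self-contained Go sketch wraps a TCP-like transport in an encrypting layer and a multiplexing layer. All type and method names here are illustrative stand-ins, not the actual CometBFT identifiers.

```go
package main

import "fmt"

// transport is the minimal surface each layer exposes to the one above it.
type transport interface {
	write(p []byte) int
}

// tcpConn stands in for the raw TCP connection at the bottom of the stack.
type tcpConn struct{ sent int }

func (c *tcpConn) write(p []byte) int { c.sent += len(p); return len(p) }

// secretConn authenticates the peer and encrypts traffic; here it just
// delegates, standing in for the encryption step.
type secretConn struct{ inner transport }

func (c *secretConn) write(p []byte) int { return c.inner.write(p) }

// mConn multiplexes logical channels: each write is tagged with its
// channel ID before going down the stack.
type mConn struct{ inner transport }

func (c *mConn) writeOn(chID byte, p []byte) int {
	return c.inner.write(append([]byte{chID}, p...))
}

func main() {
	tcp := &tcpConn{}
	m := &mConn{inner: &secretConn{inner: tcp}}
	m.writeOn(0x20, []byte("vote"))
	fmt.Println(tcp.sent) // 5: 1 channel-ID byte + 4 payload bytes
}
```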
## Sender side
On the sender side of a multiplex connection, although some issues have been raised, the behavior matches the abstraction. Each channel has a send queue, with a configurable length, and a priority. The multiplex connection logic selects from the existing channels, based on their priorities, the next packet to be sent through the (real) connection. Messages are split into packets with a maximum size, so priorities apply to packets (data length), not to the number of messages (there is some criticism of this choice, e.g., #2954). But in any case, interference between channels is reduced: a channel with multiple outstanding messages to send does not prevent messages of other channels from being sent.
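A minimal sketch of this kind of priority-based selection, loosely modeled on `MConnection` (which, roughly, picks the channel with the lowest ratio of recently sent bytes to priority). Names and fields are illustrative, not the actual implementation.

```go
package main

import "fmt"

type sendChannel struct {
	id           byte
	priority     int64 // higher means a larger share of bandwidth
	recentlySent int64 // bytes sent recently (decayed over time in the real code)
	queued       int   // packets waiting in this channel's send queue
}

// nextChannel returns the channel that should send the next packet,
// or nil if no channel has queued data.
func nextChannel(chs []*sendChannel) *sendChannel {
	var best *sendChannel
	var bestRatio float64
	for _, ch := range chs {
		if ch.queued == 0 {
			continue
		}
		ratio := float64(ch.recentlySent) / float64(ch.priority)
		if best == nil || ratio < bestRatio {
			best, bestRatio = ch, ratio
		}
	}
	return best
}

func main() {
	chs := []*sendChannel{
		{id: 0x20, priority: 5, recentlySent: 100, queued: 3},
		{id: 0x21, priority: 10, recentlySent: 100, queued: 1},
	}
	// Channel 0x21 has the same recent traffic but twice the priority,
	// so it gets the next slot.
	fmt.Printf("next: %#x\n", nextChannel(chs).id)
}
```

Note that the scheduler operates on packets, so a long message on a low-priority channel is interleaved with, not ahead of, traffic from higher-priority channels.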
## Receiver side
On the receiver side, however, the behavior is different. There is a single receive buffer per channel, to which received packets belonging to that channel are appended. When a receive buffer contains a full message, the multiplex connection delivers the message using the `onReceive` callback. The call to `onReceive` is synchronous, meaning that no other raw message (packet or control message) is received or processed until the callback returns. This would not be a problem if the `onReceive` implementation were non-blocking and fast. This assumption, however, has not held up as the code base evolved and grew in complexity, which turns out to be a problem, first described in #2685.
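The head-of-line blocking described above can be sketched as follows; the types and the `readLoop`/`onReceive` names are illustrative simplifications of the real code.

```go
package main

import "fmt"

type packet struct {
	channelID byte
	eof       bool // true when this packet completes a message
	data      []byte
}

type receiver struct {
	buffers   map[byte][]byte
	onReceive func(chID byte, msg []byte) // blocking: the loop stalls here
}

// readLoop processes packets in arrival order. While onReceive runs,
// no further packet, from ANY channel, is processed: this is the
// head-of-line blocking at the heart of the issue.
func (r *receiver) readLoop(pkts []packet) {
	for _, p := range pkts {
		r.buffers[p.channelID] = append(r.buffers[p.channelID], p.data...)
		if p.eof {
			msg := r.buffers[p.channelID]
			r.buffers[p.channelID] = nil
			r.onReceive(p.channelID, msg) // synchronous call
		}
	}
}

func main() {
	var delivered []string
	r := &receiver{
		buffers: make(map[byte][]byte),
		onReceive: func(chID byte, msg []byte) {
			delivered = append(delivered, fmt.Sprintf("%#x:%s", chID, msg))
		},
	}
	r.readLoop([]packet{
		{channelID: 0x20, data: []byte("he")},
		{channelID: 0x21, data: []byte("ok"), eof: true}, // interleaved channel
		{channelID: 0x20, data: []byte("llo"), eof: true},
	})
	fmt.Println(delivered) // [0x21:ok 0x20:hello]
}
```

If the `onReceive` call for channel `0x21` were slow, the completion of channel `0x20`'s message (and everything after it) would wait on it.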
## Receiving callback
The `onReceive` implementation provided to the multiplex connection is defined in the `Peer` instance that abstracts the connection with the other node (the peer). Although neither optimal nor really modular, this implementation has one main goal: to identify the `Reactor` responsible for the incoming message's channel and to deliver the message to that reactor by calling its `Receive(Envelope)` method. Again, this call is blocking, meaning that reactors are expected to process the message swiftly; otherwise the whole connection with the peer is blocked and, in the worst case, dropped (#2533).
Forwarding the received message to the right reactor was originally a relatively fast operation: there is a static map of channel IDs to reactors, so no contention is involved, and the reactor's `Receive()` method was invoked directly.
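A sketch of this dispatch step, with `Reactor` and `Envelope` reduced to the minimum needed for illustration (real CometBFT reactors carry more state, and the real envelope holds a decoded proto message):

```go
package main

import "fmt"

type envelope struct {
	channelID byte
	message   []byte
}

type reactor interface {
	receive(e envelope) // blocking from the caller's perspective
}

// logReactor is a toy reactor that just records what it was given.
type logReactor struct {
	name string
	got  []envelope
}

func (r *logReactor) receive(e envelope) { r.got = append(r.got, e) }

// dispatch looks up the reactor for the message's channel and delivers
// the message. The map is built once at startup and never mutated,
// so no locking is needed.
func dispatch(byCh map[byte]reactor, e envelope) bool {
	r, ok := byCh[e.channelID]
	if !ok {
		return false // unknown channel: the real code treats this as an error
	}
	r.receive(e)
	return true
}

func main() {
	consensus := &logReactor{name: "consensus"}
	mempool := &logReactor{name: "mempool"}
	byCh := map[byte]reactor{0x20: consensus, 0x30: mempool}

	dispatch(byCh, envelope{channelID: 0x30, message: []byte("tx")})
	fmt.Println(len(mempool.got), len(consensus.got)) // 1 0
}
```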
The processing became more complex when it was decided that reactors should not process raw messages (bytes), but `proto.Message`s instead. This decision was made for release `v0.35.x` and retracted, but eventually incorporated into `v0.34.x` via tendermint/tendermint#9622. The problem is that unmarshalling the raw message into a proto message is a costly operation that was introduced into the critical path of message delivery. While it is true that the reactors had to unmarshal the raw message in any case, they could (and should) do so in parallel, without blocking the main receive routine of the multiplex connection. In addition, the receiving callback also computes metrics, based on the message type, which has a relevant cost (e.g., #2840).
In summary, the `onReceive` callback was supposed to be fast and non-blocking, but now includes expensive operations.
## Channel isolation
Even without the added complexity detailed in the previous section, we still have the problem that while a message is waiting to be delivered to a reactor, other messages on the inbound connection (i.e., from the same peer) cannot be processed and delivered. This is fine if the blocked messages are from the same channel, or even from a different channel handled by the same reactor. The reactor, in this case, is applying back pressure on the channel, i.e., preventing new messages from being received when they arrive faster than it can process them.
The problem is messages belonging to other channels that are supposed to be delivered to different reactors. Here we have one reactor blocking the operation of another reactor, which violates the assumption of channel isolation.

In the end, what we have on the receiver side is that messages from all channels converge into a single stream, processed in order, with the back pressure produced by one reactor also affecting the other reactors; this is the issue.
## Solutions
This is an incomplete list of generic approaches to solve this problem, in arbitrary order:

- Change the multiplex connection to have a receive queue per channel, as on the sender side.
- Change the `onReceive` callback to handle messages from different channels and/or destined to different reactors in parallel.
- Remove the `onReceive` callback and offer a `Receive()` method used by clients to consume received messages.
- Any other generic approach, to be defined.
Notice that all (defined) approaches are similar in that they need queues/channels of pending messages, plus routines consuming them to unmarshal the raw messages, collect metrics, and deliver them to the destination reactor. Since channels/buffers are always finite, at some point the connection will block when it receives a message for a channel, or destined to a reactor, whose receive channel/buffer is full. So what do we do in this case? Do we block, or do we allow messages to be dropped?
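The common core of these approaches, and the block-vs-drop choice, can be sketched with a bounded per-channel queue drained by a dedicated routine. The queue size, names, and policies below are illustrative assumptions, not a proposal for the actual API.

```go
package main

import "fmt"

// rxQueue is a bounded receive queue for one p2p channel. A worker
// routine (not shown) would drain q.ch, unmarshal each message, update
// metrics, and call the reactor's Receive, off the main read loop.
type rxQueue struct {
	ch      chan []byte
	dropped int
}

func newRxQueue(size int) *rxQueue { return &rxQueue{ch: make(chan []byte, size)} }

// enqueueBlock applies back pressure: the read loop stalls until the
// consumer frees a slot (today's behavior, but scoped to one channel).
func (q *rxQueue) enqueueBlock(msg []byte) { q.ch <- msg }

// enqueueDrop never blocks the read loop: when the queue is full the
// message is discarded and counted, trading loss for channel isolation.
func (q *rxQueue) enqueueDrop(msg []byte) bool {
	select {
	case q.ch <- msg:
		return true
	default:
		q.dropped++
		return false
	}
}

func main() {
	q := newRxQueue(2)
	for i := 0; i < 3; i++ {
		q.enqueueDrop([]byte("msg"))
	}
	fmt.Println(len(q.ch), q.dropped) // 2 1
}
```

Either policy confines a slow reactor's effect to its own channel's queue; the remaining question is which failure mode (stalling that channel vs. dropping its messages) each channel should get.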