-
-
Notifications
You must be signed in to change notification settings - Fork 264
Description
Dear Travis,
Hi! First of all, we are active users of your library — it works great and is superior compared to other Go Kafka libraries we have experience with. Thanks a lot for your awesome work.
Over the last several months, we have been investigating occasional message loss that occurred in our system during message production.
The top-level symptoms were: we see a message successfully produced with an offset assigned and no error returned, but then this message cannot be found in the Kafka topic. This usually happened when there were issues with Kafka — such as broker restarts, partition leadership changes due to network problems, etc.
It seems we eventually found the issue and have a reproducer for it. The full reproducer, with detailed steps to reproduce, can be found in this repo:
https://github.com/FZambia/message_loss_reproducer
It uses several replicas for partition, puts the system under CPU pressure to get REQUEST_TIMED_OUT with a higher probability. The reproducer automatically verifies message loss by consuming all messages and comparing against what was ACKed. In case loss is detected it reports sth like this:
╔══════════════════════════════════════════════════════════════╗
║ REPRODUCED: MESSAGE LOSS DETECTED ║
╠══════════════════════════════════════════════════════════════╣
║ Messages ACKed: 847 ║
║ Messages found: 846 ║
║ Messages LOST: 1 ║
╠══════════════════════════════════════════════════════════════╣
║ Lost messages: ║
║ - a1b2c3d4-message-42 (ACKed at offset 156) ║
╚══════════════════════════════════════════════════════════════╝
Versions
- Kafka 3.6
- franz-go v1.20.6
The problem
The problem reproduces when all of the following happen:
- Producing with idempotency enabled (default in franz-go)
- Producer call context gets canceled
- Kafka already wrote the message to log, but returned an error (like
REQUEST_TIMED_OUT)
What happens step by step (our assumption)
1. Producer sends Message A with (PID=1, epoch=0, seq=0)
2. Kafka writes Message A at offset 0
3. Kafka responds with REQUEST_TIMED_OUT (or similar error)
4. Context is canceled, franz-go treats produce as failed
5. BUT: franz-go does NOT invalidate current idempotency invariant (bug!?)
6. Producer sends Message B with (PID=1, epoch=0, seq=0) ← same invariant!
7. Kafka sees duplicate (PID, epoch, seq) → treats as retry of Message A
8. Kafka returns success with offset 0 (Message A's offset)
9. franz-go reports Message B successfully produced at offset 0
Result:
┌─────────┬─────────────┬──────────────────┐
│ Offset │ Actual Data │ What client sees │
├─────────┼─────────────┼──────────────────┤
│ 0 │ Message A │ Message B (!) │
└─────────┴─────────────┴──────────────────┘
Message B is LOST - it was never written to Kafka,
but the client received a successful ACK with an offset belonging to another message.
Other observations
We believe there are several places in franz-go where canceled context is processed. For example, injecting RETRIABLE error in Kafka response and non-RETRIABLE - both result into message loss, but the code path in the franz-go seems to be different.
Our current fix
For now we started to use context.WithoutCancel and set kgo.ProduceRequestTimeout to be less than customer's HTTP request timeout (source of context cancellations in our case). Will appreciate if you could validate this workaround 🙏