Skip to content

Message loss during idempotent Produce when caller Context is canceled #1217

@FZambia

Description

@FZambia

Dear Travis,

Hi! First of all, we are active users of your library — it works great and is superior compared to other Go Kafka libraries we have experience with. Thanks a lot for your awesome work.

Over the last several months, we have been investigating occasional message loss that occurred in our system during message production.

The top-level symptoms were: we see a message successfully produced with an offset assigned and no error returned, but then this message cannot be found in the Kafka topic. This usually happened when there were issues with Kafka — such as broker restarts, partition leadership changes due to network problems, etc.

It seems we eventually found the issue and have a reproducer for it. The full reproducer, with detailed steps to reproduce, can be found in this repo:

https://github.com/FZambia/message_loss_reproducer

It uses several replicas for partition, puts the system under CPU pressure to get REQUEST_TIMED_OUT with a higher probability. The reproducer automatically verifies message loss by consuming all messages and comparing against what was ACKed. In case loss is detected it reports sth like this:

╔══════════════════════════════════════════════════════════════╗
║  REPRODUCED: MESSAGE LOSS DETECTED                           ║
╠══════════════════════════════════════════════════════════════╣
║  Messages ACKed: 847                                         ║
║  Messages found: 846                                         ║
║  Messages LOST:  1                                           ║
╠══════════════════════════════════════════════════════════════╣
║  Lost messages:                                              ║
║    - a1b2c3d4-message-42 (ACKed at offset 156)               ║
╚══════════════════════════════════════════════════════════════╝

Versions

  • Kafka 3.6
  • franz-go v1.20.6

The problem

The problem reproduces when all of the following happen:

  • Producing with idempotency enabled (default in franz-go)
  • Producer call context gets canceled
  • Kafka already wrote the message to log, but returned an error (like REQUEST_TIMED_OUT)

What happens step by step (our assumption)

1. Producer sends Message A with (PID=1, epoch=0, seq=0)
2. Kafka writes Message A at offset 0
3. Kafka responds with REQUEST_TIMED_OUT (or similar error)
4. Context is canceled, franz-go treats produce as failed
5. BUT: franz-go does NOT invalidate current idempotency invariant (bug!?)

6. Producer sends Message B with (PID=1, epoch=0, seq=0)  ← same invariant!
7. Kafka sees duplicate (PID, epoch, seq) → treats as retry of Message A
8. Kafka returns success with offset 0 (Message A's offset)
9. franz-go reports Message B successfully produced at offset 0

Result:
┌─────────┬─────────────┬──────────────────┐
│ Offset  │ Actual Data │ What client sees │
├─────────┼─────────────┼──────────────────┤
│ 0       │ Message A   │ Message B (!)    │
└─────────┴─────────────┴──────────────────┘

Message B is LOST - it was never written to Kafka,
but the client received a successful ACK with an offset belonging to another message.

Other observations

We believe there are several places in franz-go where canceled context is processed. For example, injecting RETRIABLE error in Kafka response and non-RETRIABLE - both result into message loss, but the code path in the franz-go seems to be different.

Our current fix

For now we started to use context.WithoutCancel and set kgo.ProduceRequestTimeout to be less than customer's HTTP request timeout (source of context cancellations in our case). Will appreciate if you could validate this workaround 🙏

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions