Lost bulk events because of "No space left on device" in Kafka controller (and/or ClickHouse shard) + health endpoint? #1180

Open
stouch opened this issue Feb 11, 2025 · 7 comments

stouch commented Feb 11, 2025

Summary

We host Jitsu using jitsu-chart (a Helm install configuration for Jitsu).

Yesterday, for several hours, our Jitsu track API endpoint appears to have been receiving events that were never pushed to our external database, because the Kafka controller pod (and/or the ClickHouse pod) had no space left on its volume.

I have since increased the disk volume and restarted ClickHouse and the Kafka controller. Everything looks fine now, but I would like to recover the lost events. I currently see a lot of logs in the Kafka controller, but I don't know how to tell whether yesterday's lost events are actually being processed (I don't see any of them in my destination DB yet). Will they be resent?

I would also like to know whether there is some kind of /health API endpoint in Jitsu, so we can detect this kind of issue when ClickHouse and/or the controller / console / ingest entrypoint have problems.

System configuration and versions

https://github.com/stafftastic/jitsu-chart @ latest

Artifacts (logs, etc)

I see this kind of log in Kafka:

[2025-02-11 15:20:14,986] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3000 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,133] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3001 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,298] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3002 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,433] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3003 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,558] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3004 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,690] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3005 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
vklimontovich (Contributor) commented Feb 11, 2025

If Kafka is down, ingest should write data to a local disk and replay it later (@absorbb, is this correct?). However, ingest would need a persistent disk for that, and I'm not sure whether that's configured in stafftastic's chart.

On the other hand, if Kafka wasn't returning an error to the Jitsu producer, there's no way to restore the data.

absorbb (Contributor) commented Feb 11, 2025

ClickHouse is not critical to the system.

However, if Kafka is down, issues will arise since the retry mechanism depends on Kafka.
There is a backup log available: config.go#L40, but it is neither enabled in jitsu-chart nor documented on our site.

Currently, the only fallback mechanism enabled by default is the Kafka producer's in-memory queue (default size: 100,000). In your case, however, it was likely lost on restart or overflowed.

Unfortunately, there is no way to recover events in this scenario.

Our services expose a /health endpoint (/healthcheck for the console), but none of them currently verify Kafka's state. While Kafka can be monitored separately, it would be beneficial to integrate Kafka health checks directly into ingest.
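
For reference, these endpoints can be checked with plain HTTP requests from a monitoring job or probe. A minimal sketch, assuming placeholder service names and ports that you would replace with whatever your jitsu-chart release actually exposes:

# Placeholders: service names and ports depend on your Helm release
curl -fsS http://<ingest-service>:<ingest-port>/health || echo "ingest unhealthy"
curl -fsS http://<console-service>:<console-port>/healthcheck || echo "console unhealthy"

Until Kafka checks are integrated into /health, Kafka availability and disk usage still have to be monitored separately.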

Action Items:

  • Document the backup log settings.
  • Document the producer queue size.
  • Add Kafka health checks to the ingest /health endpoint.

absorbb self-assigned this and unassigned vklimontovich on Feb 11, 2025
stouch (Author) commented Feb 13, 2025

OK, I understand why my disks keep filling up: the Kafka data directory /bitnami/kafka/data is being filled with millions of records. When I run du -sh ./* there, I see several topics of > 3/4 GB accumulated in just a few days.

These topics contain events that I already have in my destination databases, so why are they still stored here?

I will be out of disk space AGAIN within a dozen hours, so I'll have to increase the disks again.

Is this normal?

[Screenshot: du -sh output listing topic directories under /bitnami/kafka/data with their sizes]

The screenshot is filtered (grep) to rows with sizes in "M" (megabytes), but I also have topics in the "G" (gigabyte) range.
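
A hedged sketch of the check described above, run from outside the pod (the pod name is a placeholder, and it assumes GNU sort is available in the Bitnami image):

# List topic directories by size, largest last
kubectl exec -it <kafka-pod> -- sh -c 'du -sh /bitnami/kafka/data/* | sort -h | tail -n 20'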

vklimontovich (Contributor) commented:
That's how Kafka works; Jitsu can't delete data from a topic. Cleanup is controlled by the topic retention policy. By default, Jitsu creates topics with roughly a 7-day retention. You can either change that default (it should be an env var; grep through the bulker and rotor code), or change the retention after the topic is created.
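
For example, you can inspect the per-topic overrides (including retention.ms, if it was set explicitly) with the stock Kafka tooling. A sketch, assuming it is run from inside the Kafka pod and using a placeholder topic name:

# Shows only explicit per-topic overrides; broker-level defaults are not listed here
bin/kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type topics --entity-name <your-topic>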

echozio (Contributor) commented Feb 13, 2025

@vklimontovich wrote:

> If Kafka is down, ingest should write data to a local disk and replay it later (@absorbb, is this correct?). However, ingest would need a persistent disk for that, and I'm not sure whether that's configured in stafftastic's chart.

Only Kafka, ClickHouse, MongoDB, and PostgreSQL have any kind of persistence in our chart. If ingest should have persistence, we'll update the chart accordingly. Do you have any additional info about this, such as where the data is stored and whether it requires additional housekeeping? I'm unable to find anything in the docs or in the reference docker compose file.

stouch (Author) commented Feb 13, 2025

My Kafka will probably be full by tonight. Can I delete the topics manually via the CLI, or is there a command I can run to clean up? If not, I'll increase the disks.

echozio (Contributor) commented Feb 13, 2025

@stouch You should be able to alter the retention period by running

bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --topic destination-messages --add-config retention.ms=

adding your desired retention period in milliseconds at the end, which should trigger purging of messages older than that.
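
For example (an illustrative value only; pick whatever window fits your destination sync lag), a one-day retention would look like:

# 1 day = 24 * 60 * 60 * 1000 ms = 86400000
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --topic destination-messages --add-config retention.ms=86400000

Note that Kafka deletes whole log segments on its periodic retention check, so disk space is reclaimed in chunks shortly afterwards rather than instantly.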
