Lost bulk events because of "No space left on device" in Kafka controller (and/or ClickHouse shard) + health endpoint? #1180

Open
stouch opened this issue Feb 11, 2025 · 7 comments

stouch commented Feb 11, 2025

Summary

We host Jitsu using jitsu-chart (a Helm install configuration for Jitsu).

Yesterday, for several hours, our Jitsu track API endpoint appears to have been receiving events that were never pushed to our external database, because the Kafka controller pod (and/or the ClickHouse pod) had no space left on its volume.

I have since increased the disk volume and restarted ClickHouse and the Kafka controller. Everything looks fine now, but I would like to recover the lost events. I currently see a lot of logs in the Kafka controller, but I don't know how to tell whether yesterday's lost events are actually being processed (I don't see any of them in my destination DB yet). Will they be resent?

I would also like to know whether there is some kind of /health API endpoint in Jitsu, so we can detect this kind of issue when ClickHouse and/or the controller / console / ingest entrypoint have problems.

System configuration and versions

https://github.com/stafftastic/jitsu-chart @ latest

Artifacts (logs, etc)

I see this kind of log in Kafka:

[2025-02-11 15:20:14,986] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3000 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,133] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3001 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,298] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3002 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,433] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3003 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,558] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3004 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
[2025-02-11 15:20:15,690] INFO [TransactionCoordinator id=0] Initialized transactionalId in.id.cm00rd78400061c0alg6fdykx-uegh-a43t-rSIaHk.m.retry.t._all__failed_8725bd2d-5bbe-43a3-926e-9aa651e5ff51 with producerId 1001 and producer epoch 3005 on partition __transaction_state-2 (kafka.coordinator.transaction.TransactionCoordinator)
vklimontovich (Contributor) commented Feb 11, 2025

If Kafka is down, ingest should write data to a local disk and replay it later (@absorbb, is this correct?). However, ingest would need a persistent disk for that, and I'm not sure whether that's configured in stafftastic's chart.

On the other hand, if Kafka wasn't returning an error to the Jitsu producer, there's no way to restore the data.

absorbb (Contributor) commented Feb 11, 2025

ClickHouse is not critical to the system.

However, if Kafka is down, issues will arise since the retry mechanism depends on Kafka.
There is a backup log available: config.go#L40, but it is neither enabled in jitsu-chart nor documented on our site.

Currently, the only fallback mechanism enabled by default is the Kafka producer's in-memory queue (default size: 100,000). In your case, however, it was likely lost on restart or overflowed.

Unfortunately, there is no way to recover events in this scenario.

Our services expose a /health endpoint (/healthcheck for the console), but none of them currently verify Kafka's state. While Kafka can be monitored separately, it would be beneficial to integrate Kafka health checks directly into ingest.
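
For reference, these endpoints can be checked with plain HTTP requests from a monitoring job or probe. A minimal sketch, assuming placeholder service names and ports that you would replace with whatever your jitsu-chart release actually exposes:

# Placeholders: service names and ports depend on your Helm release
curl -fsS http://<ingest-service>:<ingest-port>/health || echo "ingest unhealthy"
curl -fsS http://<console-service>:<console-port>/healthcheck || echo "console unhealthy"

Until Kafka checks are integrated into /health, Kafka availability and disk usage still have to be monitored separately.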

Action Items:

  • Document the backup log settings.
  • Document the producer queue size.
  • Add Kafka health checks to the ingest /health endpoint.

absorbb self-assigned this and unassigned vklimontovich on Feb 11, 2025
stouch (Author) commented Feb 13, 2025

OK, I understand why my disks keep filling up: the Kafka data directory /bitnami/kafka/data is being filled with millions of records. When I run du -sh ./* there, I see several topics of > 3/4 GB accumulated in just a few days.

These topics contain events that I already have in my destination databases, so why are they still stored here?

I will be out of disk space AGAIN within a dozen hours, so I'll have to increase the disks again.

Is this normal?

[Screenshot: du -sh output listing topic directories under /bitnami/kafka/data with their sizes]

The screenshot is filtered (grep) to rows with sizes in "M" (megabytes), but I also have topics in the "G" (gigabyte) range.
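
A hedged sketch of the check described above, run from outside the pod (the pod name is a placeholder, and it assumes GNU sort is available in the Bitnami image):

# List topic directories by size, largest last
kubectl exec -it <kafka-pod> -- sh -c 'du -sh /bitnami/kafka/data/* | sort -h | tail -n 20'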

vklimontovich (Contributor) commented:
That's how Kafka works; Jitsu can't delete data from a topic. Cleanup is controlled by the topic retention policy. By default, Jitsu creates topics with roughly a 7-day retention. You can either change that default (it should be an env var; grep through the bulker and rotor code), or change the retention after the topic is created.
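
For example, you can inspect the per-topic overrides (including retention.ms, if it was set explicitly) with the stock Kafka tooling. A sketch, assuming it is run from inside the Kafka pod and using a placeholder topic name:

# Shows only explicit per-topic overrides; broker-level defaults are not listed here
bin/kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type topics --entity-name <your-topic>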

echozio (Contributor) commented Feb 13, 2025

@vklimontovich wrote:

> If Kafka is down, ingest should write data to a local disk and replay it later (@absorbb, is this correct?). However, ingest would need a persistent disk for that, and I'm not sure whether that's configured in stafftastic's chart.

Only Kafka, ClickHouse, MongoDB, and PostgreSQL have any kind of persistence in our chart. If ingest should have persistence, we'll update the chart accordingly. Do you have any additional info about this, such as where the data is stored and whether it requires additional housekeeping? I'm unable to find anything in the docs or in the reference docker compose file.

stouch (Author) commented Feb 13, 2025

My Kafka will probably be full by tonight. Can I delete the topics manually via the CLI, or is there a command I can run to clean up? If not, I'll increase the disks.

echozio (Contributor) commented Feb 13, 2025

@stouch You should be able to alter the retention period by running

bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --topic destination-messages --add-config retention.ms=

adding your desired retention period in milliseconds at the end, which should trigger purging of messages older than that.
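
For example (an illustrative value only; pick whatever window fits your destination sync lag), a one-day retention would look like:

# 1 day = 24 * 60 * 60 * 1000 ms = 86400000
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --topic destination-messages --add-config retention.ms=86400000

Note that Kafka deletes whole log segments on its periodic retention check, so disk space is reclaimed in chunks shortly afterwards rather than instantly.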
