Lost bulk events because of "No space left on device" in Kafka controller (and/or ClickHouse shard) + health endpoint? #1180
If Kafka is down, ingest should be writing data to a local disk and replaying it later (@absorbb, is this correct?). However, that means ingest would need a persistent disk, and I'm not sure whether or how that's configured in stafftastic's chart. On the other hand, if Kafka wasn't returning errors to the Jitsu producer, there's no way to restore the data.
ClickHouse is not critical to the system. However, if Kafka is down, issues will arise, since the retry mechanism depends on Kafka. Currently, the only fallback mechanism enabled by default is the Kafka producer's in-memory queue (default size: 100,000 events). In your case, however, it was likely lost due to a restart or became overfilled. Unfortunately, there is no way to recover events in this scenario. Our services expose a health check endpoint. Action items:
That's how Kafka works; Jitsu can't delete any data from the topic. It's controlled by the topic retention policy. By default, Jitsu creates topics with roughly 7-day retention. You can either change that (it should be an env var; grep through the bulker and rotor code, as sketched below) or change the retention after a topic is created.
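A quick way to locate that setting, assuming you have the bulker and rotor sources checked out locally (the directory paths below are placeholders):

```sh
# Case-insensitive search for the retention setting across both codebases;
# ./bulker and ./rotor are assumed local checkout paths.
grep -rni "retention" ./bulker ./rotor
```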
Only Kafka, ClickHouse, MongoDB, and PostgreSQL have any kind of persistence in our chart. If ingest should have persistence, we'll update the chart accordingly. Do you have any additional info about this, like where the data is stored and whether it requires additional housekeeping? I'm unable to find anything in the docs or the reference docker-compose file.
My Kafka will probably be full by tonight. Can I manually delete the topics via the CLI? Or can I run a cleanup command? If not, I'll increase the disks.
@stouch You should be able to alter the retention period by running the command sketched below, adding your desired retention period in milliseconds at the end, which should begin purging messages older than that.
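A minimal sketch, assuming a reasonably recent Kafka with the bundled kafka-configs.sh tool; the broker address and topic name are placeholders for your deployment:

```sh
# Lower retention.ms on an existing topic so that older messages get purged;
# localhost:9092 and <topic-name> are placeholders for your cluster.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name <topic-name> \
  --add-config retention.ms=3600000   # 1 hour, in milliseconds
```

Once you no longer need the shortened retention, you can raise retention.ms again the same way, or remove the override with `--delete-config retention.ms` to fall back to the broker default.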
Summary
We host Jitsu using jitsu-chart (a Helm install configuration for Jitsu).
Yesterday, for several hours, our Jitsu track API endpoint seems to have received events while the events were not pushed to our external database, because the Kafka controller pod (and/or the ClickHouse pod) had no space left on its volume.
I just increased the volume disk size and restarted ClickHouse and the Kafka controller. Now everything looks fine, but I would like to get the lost events back. I currently see a lot of logs in the Kafka controller, but I don't know how to determine whether yesterday's lost events are actually being processed (I don't see any of them yet in my destination database). Are they going to be resent?
I would also like to know whether there is some kind of /health API endpoint in Jitsu, so we can detect such issues when ClickHouse and/or the controller / console / ingest entrypoints have problems.
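If such an endpoint exists (a comment above mentions the services expose a health check endpoint), a simple external probe might look like the sketch below; the host and the /health path here are assumptions that would need to be confirmed against the Jitsu docs:

```sh
# Hypothetical liveness check; <jitsu-ingest-host> and the /health path
# are assumptions, not confirmed Jitsu endpoints.
curl -fsS http://<jitsu-ingest-host>/health && echo "ingest healthy"
```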
System configuration and versions
https://github.com/stafftastic/jitsu-chart @ latest
Artifacts (logs, etc)
I see this kind of log in Kafka: