
Add support for partition based scaling on the kafka scaler #6558

Open
patelvp wants to merge 1 commit into main from add-partition-based-scaling-kafka-scaler

Conversation


@patelvp patelvp commented Feb 20, 2025

Issue: #2581

Scaling Kafka consumers should be done in factors of the partition count on the topic. This ensures that the partitions are evenly spread across all consumers. If the partitions are not evenly spread, we run the risk of some partitions being consumed faster than others. This PR adds a new property on the Kafka scaler, ensureEvenDistributionOfPartitions. When this property is set to true, the scaler ensures that the number of pods always evenly divides the number of partitions on the topic.
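To illustrate the idea, a minimal sketch (not the PR's actual code; only the FindFactors name comes from the review discussion below): restricting the replica count to a factor of the partition count guarantees that every pod owns the same number of partitions.

package main

import (
	"fmt"
	"sort"
)

// FindFactors returns all factors of n in ascending order.
// Sketch only; the PR's actual implementation may differ.
func FindFactors(n int64) []int64 {
	if n <= 0 {
		return nil
	}
	var factors []int64
	for i := int64(1); i*i <= n; i++ {
		if n%i == 0 {
			factors = append(factors, i)
			if i != n/i {
				factors = append(factors, n/i)
			}
		}
	}
	sort.Slice(factors, func(a, b int) bool { return factors[a] < factors[b] })
	return factors
}

func main() {
	// A topic with 12 partitions can be consumed evenly by
	// 1, 2, 3, 4, 6, or 12 pods (e.g. 3 pods -> 4 partitions each).
	fmt.Println(FindFactors(12)) // [1 2 3 4 6 12]
}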

Checklist

Fixes #2581

QA:
Tested this locally by pushing an image to a local kind cluster. The Kafka consumers and producers run outside the cluster, to allow granular and quick control over production and consumption rates.
Plotted the pod count and Kafka partition lag in Grafana.
(Screenshot, 2025-02-19: Grafana plot of pod count and Kafka partition lag.)

@patelvp patelvp requested a review from a team as a code owner February 20, 2025 06:26
@patelvp patelvp force-pushed the add-partition-based-scaling-kafka-scaler branch from a548ac9 to c46e703 on February 20, 2025 17:37
@JorTurFer (Member)

@dttung2905 @zroubalik, you are the Kafka experts xD
Does this make sense?

@dttung2905 dttung2905 (Contributor) left a comment

Thank you very much for this PR. Personally, I like the direction this PR is heading. Just one small comment for my understanding.

Comment on lines +1017 to +1022
for _, factor := range factors {
	// factors comes back from FindFactors in ascending order, so this
	// returns the smallest factor whose capacity covers the total lag.
	if factor*lagThreshold >= totalLag {
		return factor
	}
}
// Fallback: cap at one pod per partition.
return totalTopicPartitions
@dttung2905 (Contributor):

Just trying to understand and confirm the logic here. Are we trying to get the smallest number of pods that satisfies the condition factor*lagThreshold >= totalLag? I ask because the factors array is sorted in ascending order by FindFactors.

A follow-up question: what if it were sorted in descending order, to provide a more aggressive strategy to reduce lag 🤔?

@patelvp (Author):

> Just trying to understand and confirm the logic here. Are we trying to get the smallest number of pods that satisfies the condition factor*lagThreshold >= totalLag? I ask because the factors array is sorted in ascending order by FindFactors.

Yep, just trying to find the smallest number of pods that will satisfy the condition.

> What if it were sorted in descending order, to provide a more aggressive strategy to reduce lag?

In that case we would eventually either have to flip the conditional factor*lagThreshold >= totalLag, which ends up giving the same number of pods as the proposed logic, or we would risk running far more pods than we ideally want. Consider a topic with 100 partitions: a descending scan would immediately scale to 100 pods even for a lag of 100 and a lagThreshold of 10.
Not sure if I am missing anything in what you mean by an aggressive strategy?
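A small sketch of that arithmetic (replicasFor is a hypothetical name, not the PR's function, and it assumes the FindFactors sketch shown earlier):

// Hypothetical helper applying the quoted loop to the ascending factors.
// Assumes FindFactors from the earlier sketch is in the same package.
func replicasFor(totalLag, lagThreshold, totalTopicPartitions int64) int64 {
	for _, factor := range FindFactors(totalTopicPartitions) {
		if factor*lagThreshold >= totalLag {
			return factor // smallest factor whose capacity covers the lag
		}
	}
	return totalTopicPartitions
}

// replicasFor(100, 10, 100) == 10: the factors of 100 ascend
// 1, 2, 4, 5, 10, 20, 25, 50, 100, and 10 is the first with
// 10*10 >= 100. A descending scan would return 100 immediately.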

Development
Successfully merging this pull request may close these issues: Kafka Partitions Per Pod Scaling
3 participants