
KAFKA-15045: (KIP-924 pt. 5) Add rack information to ApplicationState #15972

Merged: 10 commits into apache:trunk on May 22, 2024

Conversation

apourchet (Contributor):

This rack information is required to compute rack-aware assignments, which many of the current assignors do.

The internal ClientMetadata class was also edited to pass around this rack information.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

this.changelogTopicPartitions = unmodifiableSet(changelogTopicPartitions);
}

public static DefaultTaskInfo of(final TaskId taskId,
Contributor:

since this is an internal API, you can just have a normal public constructor. The static constructor thing is only for public classes where we want to make a "nice looking" fluent API
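To illustrate the suggestion, here is a minimal sketch (hypothetical names, simplified fields) contrasting the plain constructor that suffices for an internal class with the static-factory style reserved for public fluent APIs:

```java
import java.util.Set;

final class TaskInfoSketch {
    final String taskId;
    final Set<String> stateStoreNames;

    // Internal API: a normal public constructor is enough.
    TaskInfoSketch(final String taskId, final Set<String> stateStoreNames) {
        this.taskId = taskId;
        this.stateStoreNames = stateStoreNames;
    }

    // Static factory style, mainly worthwhile for public, fluent APIs.
    static TaskInfoSketch of(final String taskId, final Set<String> stateStoreNames) {
        return new TaskInfoSketch(taskId, stateStoreNames);
    }
}
```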

final Map<TaskId, Set<TopicPartition>> inputPartitionsForTask,
final Map<TaskId, Set<TopicPartition>> changelogPartitionsForTask) {

final Set<TopicPartition> inputPartitions = inputPartitionsForTask.get(taskId);
Contributor:

if this is the only place where we use the inputPartitionsForTask/changelogPartitionsForTask map, let's just pass in the inputPartitions & changelogPartitions sets directly

Comment on lines 66 to 71
inputPartitions.forEach(partition -> {
racksForPartition.computeIfAbsent(partition, k -> new HashSet<>());
final String consumer = previousOwnerForPartition.apply(partition);
final Optional<String> rack = rackForConsumer.get(consumer);
rack.ifPresent(s -> racksForPartition.get(partition).add(s));
});
Contributor:

Ah, this is not computing the right rack id -- this would be the rack id of the KafkaStreams node that had this partition assigned during the last rebalance. What we want is the rack.id of the broker node(s) that host this partition. This is going to be a bit complex so let's chat online (ditto for the changelogPartitions rack info as well)
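To make the distinction concrete, here is a rough sketch of the intended computation: the rack ids come from the broker nodes that replicate each partition, not from the client that previously owned it. `Node` and the string partition keys below are simplified stand-ins for Kafka's `Node`/`TopicPartition` types:

```java
import java.util.*;

final class BrokerRackSketch {
    record Node(int id, String rack) { }

    // Map each partition to the racks of the broker nodes replicating it.
    static Map<String, Set<String>> racksForPartition(final Map<String, List<Node>> replicasByPartition) {
        final Map<String, Set<String>> racks = new HashMap<>();
        replicasByPartition.forEach((partition, replicas) -> {
            final Set<String> partitionRacks = new HashSet<>();
            for (final Node node : replicas) {
                if (node.rack() != null) {
                    partitionRacks.add(node.rack());
                }
            }
            racks.put(partition, partitionRacks);
        });
        return racks;
    }
}
```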

final Set<String> stateStoreNames = new HashSet<>();
return new DefaultTaskInfo(
taskId,
isStateful, // All standby tasks are stateful.
Contributor:

Ah, another thing to note here is that this class corresponds to a "logical task", not a "physical" one. I just made up those terms but hopefully this will make sense: a "physical" task can be active or standby and represents an actual task that was assigned to a client and will be running on that client, whereas a "logical" task is just the metadata corresponding to that task id. A "task id" logically represents a combination of subtopology (grouping of processors) and partition number.

So a logical task doesn't have a concept of active vs standby because it's just metadata; this class is basically telling the assignor which tasks exist in this application. The assignor then has to create a set of physical tasks to actually be assigned: basically one active task and however many standby tasks for each "logical task".

Hope that didn't make things more confusing...anyways this comment isn't incorrect, but it doesn't exactly apply in this context. The "isStateful" is just metadata related to whether it has state stores in this subtopology (I'll tell you how to get this info later, or you can even compute it based on stateStoresNames#isEmpty)
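The `stateStoreNames#isEmpty` suggestion boils down to a one-liner; a sketch with a hypothetical helper name:

```java
import java.util.Set;

final class StatefulnessSketch {
    // A "logical" task is stateful iff its subtopology has at least one state store.
    static boolean isStateful(final Set<String> stateStoreNames) {
        return !stateStoreNames.isEmpty();
    }
}
```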

* @return the set of changelog topics, which includes both source changelog topics and non
* source changelog topics.
*/
public Set<InternalTopicConfig> stateChangelogTopics() {
Contributor:

nit: I'd just call this changelogTopics which I think helps make it obvious that this is the super-set of the #sourceChangelogTopics and #nonSourceChangelogTopics APIs

(you can rename the field itself as well but don't have to, that's up to you)

Comment on lines 496 to 497
final Set<String> changelogTopics = entry.getValue().stateChangelogTopics()
.stream().map(t -> t.name).collect(Collectors.toSet());
Contributor:

if all we need is the topic names from the #stateChangelogTopics API then let's just have it return that directly. You should be able to just return the #keySet of that stateChangelogTopics map to get a Set with the topic names right away

});
final Set<TopicPartition> changelogTopicPartitions = new HashSet<>();
changelogPartitionsForTask.forEach((taskId, partitions) -> {
logicalTaskIds.add(taskId);
Contributor:

I suppose this doesn't hurt anything since logicalTasks is a Set, but the taskIds returned by the partition grouper should be the same for the source and changelog topics. So you can remove this line

(alternatively you can create the logicalTaskIds map up front by copying the keyset of one of the partitionsForTask maps but that's just an implementation detail, up to you. However I would probably consider adding a check to make sure these two maps return the same set of tasks. Doesn't need to scan the entire thing, maybe just a simple

if (sourcePartitionsForTask.size() != changelogPartitionsForTask.size()) {
    log.error("Partition grouper returned {} tasks for source topics but {} tasks for changelog topics",
              sourcePartitionsForTask.size(), changelogPartitionsForTask.size());
    throw new TaskAssignmentException("Partition grouper returned a different number of tasks for source and changelog topics");
}

Contributor:

Note that we'll also want to deduplicate the source-changelog partitions for the rack id computation. We should include them in the source topics/remove them from the changelog topics passed into the #getRacksForTopicPartitions call. Of course we still need the changelogTopicPartitions as well, so we'll want a third set of nonSourceChangelogTopicPartitions that's specifically for the rack id computation.

Contributor:

To be more precise, I'm imagining something like this:

final Set<TopicPartition> sourceTopicPartitions = new HashSet<>();
final Set<TopicPartition> changelogTopicPartitions = new HashSet<>();
final Set<TopicPartition> nonSourceChangelogTopicPartitions = new HashSet<>();

for (final Map.Entry<TaskId, Set<TopicPartition>> entry : sourcePartitionsForTask.entrySet()) {
    final TaskId taskId = entry.getKey();
    final Set<TopicPartition> taskSourcePartitions = entry.getValue();
    final Set<TopicPartition> taskChangelogPartitions = changelogPartitionsForTask.get(taskId);
    final Set<TopicPartition> taskNonSourceChangelogPartitions = new HashSet<>(taskChangelogPartitions);
    taskNonSourceChangelogPartitions.removeAll(taskSourcePartitions);

    logicalTaskIds.add(taskId);
    sourceTopicPartitions.addAll(taskSourcePartitions);
    changelogTopicPartitions.addAll(taskChangelogPartitions);
    nonSourceChangelogTopicPartitions.addAll(taskNonSourceChangelogPartitions);
}

Then we pass the nonSourceChangelogPartitions into the #getRacksForTopicPartition instead of the changelogPartitions set.

Contributor:

Sorry for the wall of text 😅 It might not seem like a huge deal but if it's an app with only source-changelog partitions, then doing this will save the assignor from having to make any DescribeTopics request since there are no non-source changelogs.

And yes, apps with only source changelogs do exist, they're pretty common for certain kinds of table-based processing (and especially apps that make heavy use of IQ). And saving a remote fetch is actually a pretty big deal, doing them in the middle of an assignment makes the rebalance vulnerable to timing out, especially when brokers are under heavy load or the app is experiencing rebalancing issues to begin with

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RackUtils {
Contributor:

nit: make this final and add private constructor so it's clear this is just a static utils class
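The shape being asked for, sketched with a hypothetical helper method:

```java
final class RackUtilsSketch {
    // Private constructor: this is a static utility class, never instantiated.
    private RackUtilsSketch() {
        throw new UnsupportedOperationException("utility class");
    }

    // Example static helper: fall back to a placeholder when a rack is unknown.
    static String rackOrUnknown(final String rack) {
        return rack == null ? "unknown" : rack;
    }
}
```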

cluster, topicsWithUpToDateMetadata);

final Map<String, List<TopicPartitionInfo>> freshTopicPartitionInfo =
describeTopics(internalTopicManager, topicsToDescribe);
Contributor:

It's not a huge deal but if we have time left at the end it might make sense to condense this into a single call where we describe all the topics in one go rather than making a separate request for the source topics and changelogs

But it really isn't a big deal because in general, after the first rebalance, all the source topics should have been created and we really will do only one call since only the changelogs will be unknown

on that note, can you check to make sure this skips the actual DescribeTopics request if the set of topics to describe is empty? Like does it end up making a call with the admin client? If it does, then we should guard this with an if (!topicsToDescribe.isEmpty()) check (or we can just add the check anyway to be safe)
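A sketch of the suggested guard, with `describeTopics` as a hypothetical stand-in for the admin-client round trip:

```java
import java.util.*;
import java.util.function.Function;

final class DescribeGuardSketch {
    // describeTopics stands in for the remote DescribeTopics call.
    static Map<String, List<String>> describeIfNeeded(
            final Set<String> topicsToDescribe,
            final Function<Set<String>, Map<String, List<String>>> describeTopics) {
        if (topicsToDescribe.isEmpty()) {
            return Collections.emptyMap();  // skip the remote call entirely
        }
        return describeTopics.apply(topicsToDescribe);
    }
}
```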

Contributor (Author):

I agree, I wrote it this way to mimic the exact pattern of use that the RackAwareAssigner uses. Once this is all wired and tested though we can make optimizations and changes like this one (and the lazy rack info one).

Comment on lines +83 to +84
LOG.error("TopicPartition {} doesn't exist in cluster", topicPartition);
continue;
Contributor:

It's not an error for a topic to not be included in the Cluster, even source topics might not exist here if they had to be created by the assignor during the rebalance, since the Cluster metadata represents the state of the cluster when this rebalance/assignment first began.

Since the point of this method seems to be to collect topics with missing metadata that we'll need to look up via a DescribeTopics request, the ones for which cluster.partition(topicPartition) returns null are exactly the ones that should be returned by this method.

In fact I'd go ahead and remove everything past this line as well, this method should focus only on differentiating topics with missing metadata from ones we already have the info for. If the Cluster has metadata for this partition but the replicas set is missing/empty, then there's something wrong with this partition, and calling DescribeTopics probably won't help

Let's rename this to #topicsWithMissingMetadata while we're at it
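The intended filter can be sketched as follows; `partitionInfoLookup` is a hypothetical stand-in for `cluster::partition`, where a `null` result means "no metadata yet", not an error:

```java
import java.util.*;
import java.util.function.Function;

final class MissingMetadataSketch {
    static Set<String> topicsWithMissingMetadata(final Set<String> topics,
                                                 final Function<String, Object> partitionInfoLookup) {
        final Set<String> missing = new HashSet<>();
        for (final String topic : topics) {
            if (partitionInfoLookup.apply(topic) == null) {
                missing.add(topic);  // needs a DescribeTopics lookup, not an error
            }
        }
        return missing;
    }
}
```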

@apourchet (Contributor, Author) commented May 20, 2024:

This isn't how the current rack aware code works:

                final PartitionInfo partitionInfo = fullMetadata.partition(topicPartition);
                if (partitionInfo == null) {
                    log.error("TopicPartition {} doesn't exist in cluster", topicPartition);
                    return false;
                }
                final Node[] replica = partitionInfo.replicas();
                if (replica == null || replica.length == 0) {
                    topicsToDescribe.add(topicPartition.topic());
                    continue;
                }
                for (final Node node : replica) {
                    if (node.hasRack()) {
                        racksForPartition.computeIfAbsent(topicPartition, k -> new HashSet<>()).add(node.rack());
                    } else {
                        log.warn("Node {} for topic partition {} doesn't have rack", node, topicPartition);
                        return false;
                    }
                }
            }

Above is the logic of populateTopicsToDescribe, which evidently uses the (replica == null || replica.length == 0) condition to decide to fetch further topic information, yet treats a null result from cluster.partition(topicPartition) as an error, which causes the RackAwareAssignor to be turned off entirely.

final List<Node> replicas = partitionInfo.replicas();
if (replicas == null || replicas.isEmpty()) {
LOG.error("No replicas found for topic partition {}: {}", topic, partition);
return;
Contributor:

nit: can you factor the lambda out into a separate method? I was really confused by this empty return for a while until I realized it wasn't returning from the getRacksForTopicPartition method, just the lambda inside this loop
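A sketch of that refactor (hypothetical names): once the loop body lives in a named private method, the early `return` obviously skips one partition rather than appearing to exit the enclosing method:

```java
import java.util.*;

final class LambdaRefactorSketch {
    static Map<String, Set<String>> racksFor(final Map<String, List<String>> replicaRacksByPartition) {
        final Map<String, Set<String>> result = new HashMap<>();
        replicaRacksByPartition.forEach((partition, racks) -> addRacks(result, partition, racks));
        return result;
    }

    private static void addRacks(final Map<String, Set<String>> result,
                                 final String partition,
                                 final List<String> replicaRacks) {
        if (replicaRacks == null || replicaRacks.isEmpty()) {
            return;  // clearly returns from this helper only, skipping the partition
        }
        result.put(partition, new HashSet<>(replicaRacks));
    }
}
```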

Comment on lines +519 to +521
final Map<TopicPartition, Set<String>> racksForSourcePartitions = RackUtils.getRacksForTopicPartition(
cluster, internalTopicManager, sourceTopicPartitions, false);
final Map<TopicPartition, Set<String>> racksForChangelogPartitions = RackUtils.getRacksForTopicPartition(
Contributor:

Since the rack info is nontrivial to compute and always makes a remote call (which can take a long time and even time out or otherwise fail) and not every assignor/app will actually use it I'm thinking maybe we should try to initialize it lazily, only if/when the user actually requests the rack info

I'm totally happy to push that into a followup PR to keep the scope well-defined for now, so don't worry about it. We'd still need everything you implemented here and would just be moving it around and/or subbing in function pointers instead of passing around data structures directly, so it shouldn't have any impact on how this PR goes. Just wanted to make a note so I don't forget
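One way the lazy idea could look, as a rough sketch with a memoizing supplier (all names hypothetical): the expensive lookup runs only on first access, and never if no assignor asks for rack info:

```java
import java.util.*;
import java.util.function.Supplier;

final class LazyRackInfoSketch {
    private final Supplier<Map<String, Set<String>>> loader;
    private Map<String, Set<String>> cached;  // computed at most once

    LazyRackInfoSketch(final Supplier<Map<String, Set<String>>> loader) {
        this.loader = loader;
    }

    Map<String, Set<String>> racksForPartition() {
        if (cached == null) {
            cached = loader.get();  // the expensive DescribeTopics work happens here, lazily
        }
        return cached;
    }
}
```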

* source changelog topics.
*/
public Set<String> changelogTopics() {
return Collections.unmodifiableSet(new HashSet<>(stateChangelogTopics.keySet()));
Contributor:

I think you can skip the new HashSet step, that's pretty much redundant with the unmodifiableSet and since we don't plan on modifying the returned set, it's better to just wrap the keySet directly to save a bunch of unnecessary copying
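A minimal sketch of the copy-free version (simplified, hypothetical class):

```java
import java.util.*;

final class ChangelogTopicsSketch {
    private final Map<String, Integer> stateChangelogTopics = new HashMap<>();

    ChangelogTopicsSketch(final Set<String> names) {
        names.forEach(name -> stateChangelogTopics.put(name, 0));
    }

    Set<String> changelogTopics() {
        // No defensive HashSet copy: the unmodifiable view already blocks mutation.
        return Collections.unmodifiableSet(stateChangelogTopics.keySet());
    }
}
```

One caveat on the design choice: the direct wrap is a live view of the map's keys, whereas the copy was a snapshot; that only matters if the underlying map can change after the getter is called.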

@ableegoldman (Contributor) left a comment:

I'm still a bit wary of the original rack id computation logic but we can revisit that. This LGTM

@ableegoldman ableegoldman merged commit ef2c5e4 into apache:trunk May 22, 2024
1 check failed
@dajac (Contributor) commented May 23, 2024:

Hey @ableegoldman @apourchet, I see new failures in trunk that seem to be related to this PR. The last build of this PR had 100+ failures: https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15972/8/tests. Could you please take a look?

@ableegoldman (Contributor):

Yep, just noticed this. Sorry about that. We're taking a look
