From b7e9b81e8b32313f00d38257ba731e73d17224cb Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Thu, 29 Aug 2024 12:27:28 -0400
Subject: [PATCH 01/16] path-walk: introduce an object walk by path

In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on batches of tree and blob objects grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.

These batches are collected in 'struct type_and_oid_list' objects, which
store an object type and an oid_array of objects.

The data structures are documented in 'struct path_walk_context', but in
summary the most important are:

  * 'paths_to_lists' is a strmap that connects a path to a
    type_and_oid_list for that path. To avoid conflicts in path names,
    we make sure that tree paths end in "/" (except the root path, which
    is an empty string) and blob paths do not end in "/".

  * 'path_stack' is a string list that is added to in an append-only
    way. This stores the stack of our depth-first search on the heap
    instead of using recursion.

  * 'path_stack_pushed' is a strset that stores path names that were
    already added to 'path_stack', to avoid repeating paths in the
    stack. Mostly, this saves us from quadratic lookups from doing
    unsorted checks into the string_list.

The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
push_to_stack() method. Call this instead of inserting into these
structures directly.
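
As an illustration of that coupling only, here is a minimal standalone
sketch. It does not use Git's string_list or strset; every 'sketch_'
name is hypothetical, and a small fixed-size open-addressing table
stands in for the set. It assumes far fewer than SKETCH_SLOTS distinct
paths.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_SLOTS 128	/* power of two; hypothetical fixed capacity */

struct sketch_ctx {
	char *stack[SKETCH_SLOTS];	/* stand-in for 'path_stack' */
	size_t stack_nr;
	char *pushed[SKETCH_SLOTS];	/* stand-in for 'path_stack_pushed' */
};

static char *sketch_dup(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);
	if (!copy)
		abort();
	memcpy(copy, s, len);
	return copy;
}

/* Find the slot holding 'path', or the empty slot where it belongs. */
static size_t sketch_slot(const struct sketch_ctx *ctx, const char *path)
{
	size_t h = 5381;
	for (const char *p = path; *p; p++)
		h = h * 33 + (unsigned char)*p;
	h &= SKETCH_SLOTS - 1;
	while (ctx->pushed[h] && strcmp(ctx->pushed[h], path))
		h = (h + 1) & (SKETCH_SLOTS - 1);
	return h;
}

/* Returns 1 if the path was pushed, 0 if it had been pushed before. */
static int sketch_push(struct sketch_ctx *ctx, const char *path)
{
	size_t slot = sketch_slot(ctx, path);
	if (ctx->pushed[slot])
		return 0;
	/* Keep the table at most half full so probing always terminates. */
	assert(ctx->stack_nr < SKETCH_SLOTS / 2);
	ctx->pushed[slot] = sketch_dup(path);
	ctx->stack[ctx->stack_nr++] = sketch_dup(path);
	return 1;
}

/* Pops the most recent path; the caller frees it. NULL when empty. */
static char *sketch_pop(struct sketch_ctx *ctx)
{
	return ctx->stack_nr ? ctx->stack[--ctx->stack_nr] : NULL;
}
```

The set keeps its own copy of each path, so a path that was popped and
freed is still remembered and is not pushed a second time.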

The walk_objects_by_path() method initializes these structures and
starts walking commits from the given rev_info struct. The commits are
used to find the list of root trees which populate the start of our
depth-first search.

The core of our depth-first search is in a while loop that continues
while we have not indicated an early exit and our 'path_stack' still has
entries in it. The loop body pops a path off of the stack and "visits"
the path via the walk_path() method.

The walk_path() method gets the list of OIDs from the 'paths_to_lists'
strmap and executes the callback method on that list with the given path
and type. If the OIDs correspond to tree objects, then iterate over all
trees in the list and run add_children() to add the child objects to
their own lists, adding new entries to the stack if necessary.
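
The loop and the visit step described above can be sketched on a toy
object model. This is not the code in path-walk.c: all 'toy_' names are
hypothetical, objects are small integers instead of OIDs, and a linear
scan over a fixed array stands in for the strmap. Creating a batch and
pushing its path happen together, which plays the role of
push_to_stack().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum toy_type { TOY_BLOB, TOY_TREE };
struct toy_entry { const char *name; int id; };
struct toy_object {
	enum toy_type type;
	int seen;
	struct toy_entry entries[4];	/* tree entries; a NULL name ends them */
};

#define TOY_MAX 16
struct toy_batch {
	char path[64];
	enum toy_type type;
	int ids[TOY_MAX], nr;
};

typedef int (*toy_path_fn)(const char *path, const int *ids, int nr,
			   enum toy_type type, void *data);

struct toy_walk {
	struct toy_object *objects;
	struct toy_batch batches[TOY_MAX];	/* stand-in for 'paths_to_lists' */
	int batch_nr;
	int stack[TOY_MAX], stack_nr;		/* stand-in for 'path_stack' */
};

/* Find the batch for 'path', creating and pushing it on first use. */
static struct toy_batch *toy_batch_for(struct toy_walk *w, const char *path,
				       enum toy_type type)
{
	struct toy_batch *b;
	for (int i = 0; i < w->batch_nr; i++)
		if (!strcmp(w->batches[i].path, path))
			return &w->batches[i];
	assert(w->batch_nr < TOY_MAX);
	b = &w->batches[w->batch_nr];
	snprintf(b->path, sizeof(b->path), "%s", path);
	b->type = type;
	b->nr = 0;
	w->stack[w->stack_nr++] = w->batch_nr++;
	return b;
}

/* Queue the children of one tree; tree paths get a trailing "/". */
static void toy_add_children(struct toy_walk *w, const char *base, int tree_id)
{
	struct toy_object *tree = &w->objects[tree_id];
	for (int i = 0; i < 4 && tree->entries[i].name; i++) {
		struct toy_object *o = &w->objects[tree->entries[i].id];
		struct toy_batch *b;
		char path[64];
		snprintf(path, sizeof(path), "%s%s%s", base,
			 tree->entries[i].name, o->type == TOY_TREE ? "/" : "");
		b = toy_batch_for(w, path, o->type);
		if (o->seen)
			continue;	/* only the first discovery counts */
		o->seen = 1;
		assert(b->nr < TOY_MAX);
		b->ids[b->nr++] = tree->entries[i].id;
	}
}

/* Walk from the root trees; stops early on a nonzero callback result. */
static int toy_walk_by_path(struct toy_walk *w, const int *roots, int roots_nr,
			    toy_path_fn fn, void *data)
{
	int ret = 0;
	struct toy_batch *root = toy_batch_for(w, "", TOY_TREE);
	for (int i = 0; i < roots_nr; i++) {
		if (w->objects[roots[i]].seen)
			continue;
		w->objects[roots[i]].seen = 1;
		root->ids[root->nr++] = roots[i];
	}
	while (!ret && w->stack_nr) {
		struct toy_batch *b = &w->batches[w->stack[--w->stack_nr]];
		if (!b->nr)
			continue;
		ret = fn(b->path, b->ids, b->nr, b->type, data);
		if (b->type == TOY_TREE)
			for (int i = 0; i < b->nr; i++)
				toy_add_children(w, b->path, b->ids[i]);
	}
	return ret;
}

/* Example callback: tally batches and objects. */
struct toy_counts { int batches, objects; };
static int toy_count(const char *path, const int *ids, int nr,
		     enum toy_type type, void *data)
{
	struct toy_counts *c = data;
	(void)path; (void)ids; (void)type;
	c->batches++;
	c->objects += nr;
	return 0;
}
```

Because the stack is last-in, first-out, the most recently discovered
path is visited next, which is what makes the walk depth-first.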

In testing, this depth-first search approach was the one that used the
least memory while iterating over the object lists. There is still a
chance that repositories with too-wide path patterns could cause memory
pressure issues. Limiting the stack size could be done in the future by
limiting how many objects are being considered in-progress, or by
visiting blob paths earlier than trees.

There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/technical/api-path-walk.txt |  45 ++++
 Makefile                                  |   1 +
 path-walk.c                               | 263 ++++++++++++++++++++++
 path-walk.h                               |  43 ++++
 4 files changed, 352 insertions(+)
 create mode 100644 Documentation/technical/api-path-walk.txt
 create mode 100644 path-walk.c
 create mode 100644 path-walk.h

diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
new file mode 100644
index 00000000000000..c550c77ca30754
--- /dev/null
+++ b/Documentation/technical/api-path-walk.txt
@@ -0,0 +1,45 @@
+Path-Walk API
+=============
+
+The path-walk API is used to walk reachable objects, but to visit objects
+in batches based on a common path they appear in, or by type.
+
+For example, all reachable commits are visited in a group. All tags are
+visited in a group. Then, all root trees are visited. At some point, all
+blobs reachable via a path `my/dir/to/A` are visited. When there are
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
+Basics
+------
+
+To use the path-walk API, include `path-walk.h` and call
+`walk_objects_by_path()` with a customized `path_walk_info` struct. The
+struct is used to set all of the options for how the walk should proceed.
+Let's dig into the different options and their use.
+
+`path_fn` and `path_fn_data`::
+	The most important option is the `path_fn` option, which is a
+	function pointer to the callback that can execute logic on the
+	object IDs for objects grouped by type and path. This function
+	also receives a `data` value that corresponds to the
+	`path_fn_data` member, for providing custom data structures to
+	this callback function.
+
+`revs`::
+	To configure the exact details of the reachable set of objects,
+	use the `revs` member and initialize it using the revision
+	machinery in `revision.h`. Initialize `revs` using calls such as
+	`setup_revisions()` or `parse_revision_opt()`. Do not call
+	`prepare_revision_walk()`, as that will be called within
+	`walk_objects_by_path()`.
++
+It is also important that you do not specify the `--objects` flag for the
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
+
+Examples
+--------
+
+See example usages in future changes.
diff --git a/Makefile b/Makefile
index 7344a7f7257af6..d0d8d6888e3c47 100644
--- a/Makefile
+++ b/Makefile
@@ -1094,6 +1094,7 @@ LIB_OBJS += parse-options.o
 LIB_OBJS += patch-delta.o
 LIB_OBJS += patch-ids.o
 LIB_OBJS += path.o
+LIB_OBJS += path-walk.o
 LIB_OBJS += pathspec.o
 LIB_OBJS += pkt-line.o
 LIB_OBJS += preload-index.o
diff --git a/path-walk.c b/path-walk.c
new file mode 100644
index 00000000000000..24cf04c1e7dde6
--- /dev/null
+++ b/path-walk.c
@@ -0,0 +1,263 @@
+/*
+ * path-walk.c: implementation for path-based walks of the object graph.
+ */
+#include "git-compat-util.h"
+#include "path-walk.h"
+#include "blob.h"
+#include "commit.h"
+#include "dir.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "object.h"
+#include "oid-array.h"
+#include "revision.h"
+#include "string-list.h"
+#include "strmap.h"
+#include "trace2.h"
+#include "tree.h"
+#include "tree-walk.h"
+
+struct type_and_oid_list
+{
+	enum object_type type;
+	struct oid_array oids;
+};
+
+#define TYPE_AND_OID_LIST_INIT { \
+	.type = OBJ_NONE, 	 \
+	.oids = OID_ARRAY_INIT	 \
+}
+
+struct path_walk_context {
+	/**
+	 * Repeats of data in 'struct path_walk_info' for
+	 * access with fewer characters.
+	 */
+	struct repository *repo;
+	struct rev_info *revs;
+	struct path_walk_info *info;
+
+	/**
+	 * Map a path to a 'struct type_and_oid_list'
+	 * containing the objects discovered at that
+	 * path.
+	 */
+	struct strmap paths_to_lists;
+
+	/**
+	 * Store the current list of paths in a stack, to
+	 * facilitate depth-first-search without recursion.
+	 *
+	 * Use path_stack_pushed to indicate whether a path
+	 * was previously added to path_stack.
+	 */
+	struct string_list path_stack;
+	struct strset path_stack_pushed;
+};
+
+static void push_to_stack(struct path_walk_context *ctx,
+			  const char *path)
+{
+	if (strset_contains(&ctx->path_stack_pushed, path))
+		return;
+
+	strset_add(&ctx->path_stack_pushed, path);
+	string_list_append(&ctx->path_stack, path);
+}
+
+static int add_children(struct path_walk_context *ctx,
+			const char *base_path,
+			struct object_id *oid)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+	struct strbuf path = STRBUF_INIT;
+	size_t base_len;
+	struct tree *tree = lookup_tree(ctx->repo, oid);
+
+	if (!tree) {
+		error(_("failed to walk children of tree %s: not found"),
+		      oid_to_hex(oid));
+		return -1;
+	} else if (parse_tree_gently(tree, 1)) {
+		die("bad tree object %s", oid_to_hex(oid));
+	}
+
+	strbuf_addstr(&path, base_path);
+	base_len = path.len;
+
+	parse_tree(tree);
+	init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		struct type_and_oid_list *list;
+		struct object *o;
+		/* Not actually true, but we will ignore submodules later. */
+		enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB;
+
+		/* Skip submodules. */
+		if (S_ISGITLINK(entry.mode))
+			continue;
+
+		if (type == OBJ_TREE) {
+			struct tree *child = lookup_tree(ctx->repo, &entry.oid);
+			o = child ? &child->object : NULL;
+		} else if (type == OBJ_BLOB) {
+			struct blob *child = lookup_blob(ctx->repo, &entry.oid);
+			o = child ? &child->object : NULL;
+		} else {
+			/* Wrong type? */
+			continue;
+		}
+
+		if (!o) /* report error?*/
+			continue;
+
+		strbuf_setlen(&path, base_len);
+		strbuf_add(&path, entry.path, entry.pathlen);
+
+		/*
+		 * Trees will end with "/" for concatenation and distinction
+		 * from blobs at the same path.
+		 */
+		if (type == OBJ_TREE)
+			strbuf_addch(&path, '/');
+
+		if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
+			CALLOC_ARRAY(list, 1);
+			list->type = type;
+			strmap_put(&ctx->paths_to_lists, path.buf, list);
+		}
+		push_to_stack(ctx, path.buf);
+
+		/* Skip this object if already seen. */
+		if (o->flags & SEEN)
+			continue;
+		o->flags |= SEEN;
+		oid_array_append(&list->oids, &entry.oid);
+	}
+
+	free_tree_buffer(tree);
+	strbuf_release(&path);
+	return 0;
+}
+
+/*
+ * For each path in paths_to_explore, walk the trees another level
+ * and add any found blobs to the batch (but only if they exist and
+ * haven't been added yet).
+ */
+static int walk_path(struct path_walk_context *ctx,
+		     const char *path)
+{
+	struct type_and_oid_list *list;
+	int ret = 0;
+
+	list = strmap_get(&ctx->paths_to_lists, path);
+
+	if (!list->oids.nr)
+		return 0;
+
+	/* Evaluate function pointer on this data. */
+	ret = ctx->info->path_fn(path, &list->oids, list->type,
+				 ctx->info->path_fn_data);
+
+	/* Expand data for children. */
+	if (list->type == OBJ_TREE) {
+		for (size_t i = 0; i < list->oids.nr; i++) {
+			ret |= add_children(ctx,
+					    path,
+					    &list->oids.oid[i]);
+		}
+	}
+
+	oid_array_clear(&list->oids);
+	strmap_remove(&ctx->paths_to_lists, path, 1);
+	return ret;
+}
+
+static void clear_strmap(struct strmap *map)
+{
+	struct hashmap_iter iter;
+	struct strmap_entry *e;
+
+	hashmap_for_each_entry(&map->map, &iter, e, ent) {
+		struct type_and_oid_list *list = e->value;
+		oid_array_clear(&list->oids);
+	}
+	strmap_clear(map, 1);
+	strmap_init(map);
+}
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info)
+{
+	const char *root_path = "";
+	int ret = 0;
+	size_t commits_nr = 0, paths_nr = 0;
+	struct commit *c;
+	struct type_and_oid_list *root_tree_list;
+	struct path_walk_context ctx = {
+		.repo = info->revs->repo,
+		.revs = info->revs,
+		.info = info,
+		.path_stack = STRING_LIST_INIT_DUP,
+		.path_stack_pushed = STRSET_INIT,
+		.paths_to_lists = STRMAP_INIT
+	};
+
+	trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+
+	/* Insert a single list for the root tree into the paths. */
+	CALLOC_ARRAY(root_tree_list, 1);
+	root_tree_list->type = OBJ_TREE;
+	strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+	push_to_stack(&ctx, root_path);
+
+	if (prepare_revision_walk(info->revs))
+		die(_("failed to setup revision walk"));
+
+	while ((c = get_revision(info->revs))) {
+		struct object_id *oid = get_commit_tree_oid(c);
+		struct tree *t;
+		commits_nr++;
+
+		oid = get_commit_tree_oid(c);
+		t = lookup_tree(info->revs->repo, oid);
+
+		if (!t) {
+			warning("could not find tree %s", oid_to_hex(oid));
+			continue;
+		}
+
+		if (t->object.flags & SEEN)
+			continue;
+		t->object.flags |= SEEN;
+		oid_array_append(&root_tree_list->oids, oid);
+	}
+
+	trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
+	trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+
+	trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+	while (!ret && ctx.path_stack.nr) {
+		char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+		ctx.path_stack.nr--;
+		paths_nr++;
+
+		ret = walk_path(&ctx, path);
+
+		free(path);
+	}
+	trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
+	trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
+	clear_strmap(&ctx.paths_to_lists);
+	strset_clear(&ctx.path_stack_pushed);
+	string_list_clear(&ctx.path_stack, 0);
+	return ret;
+}
diff --git a/path-walk.h b/path-walk.h
new file mode 100644
index 00000000000000..c9e94a98bc8f6b
--- /dev/null
+++ b/path-walk.h
@@ -0,0 +1,43 @@
+/*
+ * path-walk.h : Methods and structures for walking the object graph in batches
+ * by the paths that can reach those objects.
+ */
+#include "object.h" /* Required for 'enum object_type'. */
+
+struct rev_info;
+struct oid_array;
+
+/**
+ * The type of a function pointer for the method that is called on a list of
+ * objects reachable at a given path.
+ */
+typedef int (*path_fn)(const char *path,
+		       struct oid_array *oids,
+		       enum object_type type,
+		       void *data);
+
+struct path_walk_info {
+	/**
+	 * revs provides the definitions for the commit walk, including
+	 * which commits are UNINTERESTING or not.
+	 */
+	struct rev_info *revs;
+
+	/**
+	 * The caller wishes to execute custom logic on objects reachable at a
+	 * given path. Every reachable object will be visited exactly once, and
+	 * the first path to see an object wins. This may not be a stable choice.
+	 */
+	path_fn path_fn;
+	void *path_fn_data;
+};
+
+#define PATH_WALK_INFO_INIT { 0 }
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info);

From cf2ed61b324b019f61afb1b66163b1284e675db2 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Mon, 28 Oct 2024 21:23:27 -0400
Subject: [PATCH 02/16] test-lib-functions: add test_cmp_sorted

This test helper reduces repeated logic in t6601-path-walk.sh and may be
useful elsewhere, too.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 t/test-lib-functions.sh | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index fde9bf54fc35fc..16b70aebd60749 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1267,6 +1267,16 @@ test_cmp () {
 	eval "$GIT_TEST_CMP" '"$@"'
 }
 
+# test_cmp_sorted runs test_cmp on sorted versions of the two
+# input files. Uses "$1.sorted" and "$2.sorted" as temp files.
+
+test_cmp_sorted () {
+	sort <"$1" >"$1.sorted" &&
+	sort <"$2" >"$2.sorted" &&
+	test_cmp "$1.sorted" "$2.sorted" &&
+	rm "$1.sorted" "$2.sorted"
+}
+
 # Check that the given config key has the expected value.
 #
 #    test_cmp_config [-C <dir>] <expected-value>

From a3c754d93cc3ecfcb4d2764930873af0b611eda9 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Wed, 18 Sep 2024 10:03:03 -0400
Subject: [PATCH 03/16] t6601: add helper for testing path-walk API

Add some tests based on the current behavior, covering different sets of
branches, ranges, and the --boundary option. This sets a baseline for the
behavior that we can extend as new options are introduced.

Store and output a 'batch_nr' value so we can demonstrate that the paths are
grouped together in a batch and not following some other ordering. This
allows us to test the depth-first behavior of the path-walk API. However, we
purposefully do not test the order of the objects in the batch, so the
output is compared to the expected output through a sort.
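
To see why a sorted comparison still checks batch membership, consider
this standalone sketch. It is not part of the test helper; the
'sketch_' names and the sample lines in the note below are
hypothetical. Each output line starts with its batch number, so
reordering lines within a batch is tolerated after sorting, while
moving an object to a different batch changes the line itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int sketch_cmp(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Returns 1 when both arrays hold the same lines, ignoring order.
 * Both arrays are sorted in place as a side effect.
 */
static int sketch_same_lines(const char **expect, const char **actual,
			     size_t nr)
{
	assert(expect && actual);
	qsort(expect, nr, sizeof(*expect), sketch_cmp);
	qsort(actual, nr, sizeof(*actual), sketch_cmp);
	for (size_t i = 0; i < nr; i++)
		if (strcmp(expect[i], actual[i]))
			return 0;
	return 1;
}
```

With lines such as "0:TREE::aaa" and "0:TREE::bbb", swapping the two
still compares equal, but "1:TREE::bbb" in place of "0:TREE::bbb" does
not.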

It is important to mention that the behavior of the API will change soon as
we start to handle UNINTERESTING objects differently, but these tests will
demonstrate the change in behavior.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/technical/api-path-walk.txt |   3 +-
 Makefile                                  |   1 +
 t/helper/test-path-walk.c                 |  90 ++++++++++++++++
 t/helper/test-tool.c                      |   1 +
 t/helper/test-tool.h                      |   1 +
 t/t6601-path-walk.sh                      | 120 ++++++++++++++++++++++
 6 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 t/helper/test-path-walk.c
 create mode 100755 t/t6601-path-walk.sh

diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index c550c77ca30754..662162ec70b38b 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -42,4 +42,5 @@ commits.
 Examples
 --------
 
-See example usages in future changes.
+See example usages in:
+	`t/helper/test-path-walk.c`
diff --git a/Makefile b/Makefile
index d0d8d6888e3c47..50413d964920e7 100644
--- a/Makefile
+++ b/Makefile
@@ -818,6 +818,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
 TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-path-walk.o
 TEST_BUILTINS_OBJS += test-pcre2-config.o
 TEST_BUILTINS_OBJS += test-pkt-line.o
 TEST_BUILTINS_OBJS += test-proc-receive.o
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
new file mode 100644
index 00000000000000..aa468871079ab9
--- /dev/null
+++ b/t/helper/test-path-walk.c
@@ -0,0 +1,90 @@
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "test-tool.h"
+#include "environment.h"
+#include "hex.h"
+#include "object-name.h"
+#include "object.h"
+#include "pretty.h"
+#include "revision.h"
+#include "setup.h"
+#include "parse-options.h"
+#include "path-walk.h"
+#include "oid-array.h"
+
+static const char * const path_walk_usage[] = {
+	N_("test-tool path-walk <options> -- <revision-options>"),
+	NULL
+};
+
+struct path_walk_test_data {
+	uintmax_t batch_nr;
+	uintmax_t tree_nr;
+	uintmax_t blob_nr;
+};
+
+static int emit_block(const char *path, struct oid_array *oids,
+		      enum object_type type, void *data)
+{
+	struct path_walk_test_data *tdata = data;
+	const char *typestr;
+
+	switch (type) {
+	case OBJ_TREE:
+		typestr = "TREE";
+		tdata->tree_nr += oids->nr;
+		break;
+
+	case OBJ_BLOB:
+		typestr = "BLOB";
+		tdata->blob_nr += oids->nr;
+		break;
+
+	default:
+		BUG("we do not understand this type");
+	}
+
+	for (size_t i = 0; i < oids->nr; i++)
+		printf("%"PRIuMAX":%s:%s:%s\n",
+		       tdata->batch_nr, typestr, path,
+		       oid_to_hex(&oids->oid[i]));
+
+	tdata->batch_nr++;
+	return 0;
+}
+
+int cmd__path_walk(int argc, const char **argv)
+{
+	int res;
+	struct rev_info revs = REV_INFO_INIT;
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	struct path_walk_test_data data = { 0 };
+	struct option options[] = {
+		OPT_END(),
+	};
+
+	setup_git_directory();
+	revs.repo = the_repository;
+
+	argc = parse_options(argc, argv, NULL,
+			     options, path_walk_usage,
+			     PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
+
+	if (argc > 1)
+		setup_revisions(argc, argv, &revs, NULL);
+	else
+		usage(path_walk_usage[0]);
+
+	info.revs = &revs;
+	info.path_fn = emit_block;
+	info.path_fn_data = &data;
+
+	res = walk_objects_by_path(&info);
+
+	printf("trees:%" PRIuMAX "\n"
+	       "blobs:%" PRIuMAX "\n",
+	       data.tree_nr, data.blob_nr);
+
+	release_revisions(&revs);
+	return res;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 1ebb69a5dc4c17..43676e7b93a43f 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -52,6 +52,7 @@ static struct test_cmd cmds[] = {
 	{ "parse-subcommand", cmd__parse_subcommand },
 	{ "partial-clone", cmd__partial_clone },
 	{ "path-utils", cmd__path_utils },
+	{ "path-walk", cmd__path_walk },
 	{ "pcre2-config", cmd__pcre2_config },
 	{ "pkt-line", cmd__pkt_line },
 	{ "proc-receive", cmd__proc_receive },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da37f..9cfc5da6e57b00 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -45,6 +45,7 @@ int cmd__parse_pathspec_file(int argc, const char** argv);
 int cmd__parse_subcommand(int argc, const char **argv);
 int cmd__partial_clone(int argc, const char **argv);
 int cmd__path_utils(int argc, const char **argv);
+int cmd__path_walk(int argc, const char **argv);
 int cmd__pcre2_config(int argc, const char **argv);
 int cmd__pkt_line(int argc, const char **argv);
 int cmd__proc_receive(int argc, const char **argv);
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
new file mode 100755
index 00000000000000..a3da55933f4405
--- /dev/null
+++ b/t/t6601-path-walk.sh
@@ -0,0 +1,120 @@
+#!/bin/sh
+
+TEST_PASSES_SANITIZE_LEAK=true
+
+test_description='direct path-walk API tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repository' '
+	git checkout -b base &&
+
+	mkdir left &&
+	mkdir right &&
+	echo a >a &&
+	echo b >left/b &&
+	echo c >right/c &&
+	git add . &&
+	git commit -m "first" &&
+
+	echo d >right/d &&
+	git add right &&
+	git commit -m "second" &&
+
+	echo bb >left/b &&
+	git commit -a -m "third" &&
+
+	git checkout -b topic HEAD~1 &&
+	echo cc >right/c &&
+	git commit -a -m "topic"
+'
+
+test_expect_success 'all' '
+	test-tool path-walk -- --all >out &&
+
+	cat >expect <<-EOF &&
+	0:TREE::$(git rev-parse topic^{tree})
+	0:TREE::$(git rev-parse base^{tree})
+	0:TREE::$(git rev-parse base~1^{tree})
+	0:TREE::$(git rev-parse base~2^{tree})
+	1:TREE:right/:$(git rev-parse topic:right)
+	1:TREE:right/:$(git rev-parse base~1:right)
+	1:TREE:right/:$(git rev-parse base~2:right)
+	2:BLOB:right/d:$(git rev-parse base~1:right/d)
+	3:BLOB:right/c:$(git rev-parse base~2:right/c)
+	3:BLOB:right/c:$(git rev-parse topic:right/c)
+	4:TREE:left/:$(git rev-parse base:left)
+	4:TREE:left/:$(git rev-parse base~2:left)
+	5:BLOB:left/b:$(git rev-parse base~2:left/b)
+	5:BLOB:left/b:$(git rev-parse base:left/b)
+	6:BLOB:a:$(git rev-parse base~2:a)
+	blobs:6
+	trees:9
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+test_expect_success 'topic only' '
+	test-tool path-walk -- topic >out &&
+
+	cat >expect <<-EOF &&
+	0:TREE::$(git rev-parse topic^{tree})
+	0:TREE::$(git rev-parse base~1^{tree})
+	0:TREE::$(git rev-parse base~2^{tree})
+	1:TREE:right/:$(git rev-parse topic:right)
+	1:TREE:right/:$(git rev-parse base~1:right)
+	1:TREE:right/:$(git rev-parse base~2:right)
+	2:BLOB:right/d:$(git rev-parse base~1:right/d)
+	3:BLOB:right/c:$(git rev-parse base~2:right/c)
+	3:BLOB:right/c:$(git rev-parse topic:right/c)
+	4:TREE:left/:$(git rev-parse base~2:left)
+	5:BLOB:left/b:$(git rev-parse base~2:left/b)
+	6:BLOB:a:$(git rev-parse base~2:a)
+	blobs:5
+	trees:7
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base' '
+	test-tool path-walk -- topic --not base >out &&
+
+	cat >expect <<-EOF &&
+	0:TREE::$(git rev-parse topic^{tree})
+	1:TREE:right/:$(git rev-parse topic:right)
+	2:BLOB:right/d:$(git rev-parse topic:right/d)
+	3:BLOB:right/c:$(git rev-parse topic:right/c)
+	4:TREE:left/:$(git rev-parse topic:left)
+	5:BLOB:left/b:$(git rev-parse topic:left/b)
+	6:BLOB:a:$(git rev-parse topic:a)
+	blobs:4
+	trees:3
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, boundary' '
+	test-tool path-walk -- --boundary topic --not base >out &&
+
+	cat >expect <<-EOF &&
+	0:TREE::$(git rev-parse topic^{tree})
+	0:TREE::$(git rev-parse base~1^{tree})
+	1:TREE:right/:$(git rev-parse topic:right)
+	1:TREE:right/:$(git rev-parse base~1:right)
+	2:BLOB:right/d:$(git rev-parse base~1:right/d)
+	3:BLOB:right/c:$(git rev-parse base~1:right/c)
+	3:BLOB:right/c:$(git rev-parse topic:right/c)
+	4:TREE:left/:$(git rev-parse base~1:left)
+	5:BLOB:left/b:$(git rev-parse base~1:left/b)
+	6:BLOB:a:$(git rev-parse base~1:a)
+	blobs:5
+	trees:5
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+test_done

From 83b746f569df932452d8d102dc29dda0859c4ed2 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <derrickstolee@github.com>
Date: Sun, 1 Sep 2024 14:25:33 -0400
Subject: [PATCH 04/16] path-walk: allow consumer to specify object types

We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.

This adds the ability to ask for the commits in a list, as well. We
re-use the empty string for this set of objects because these are passed
directly to the callback function instead of being part of the
'path_stack'.

Future changes will add the ability to visit annotated tags.
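
The decisions this change makes can be summarized in a small standalone
sketch. The 'sketch_' names are hypothetical and the flags only mirror
the 'commits', 'trees', and 'blobs' members; this is a restatement of
the rules in this patch under those assumptions, not the code itself.

```c
#include <assert.h>

enum sketch_type { SKETCH_COMMIT, SKETCH_TREE, SKETCH_BLOB };

struct sketch_filter { int commits, trees, blobs; };
#define SKETCH_FILTER_INIT { .commits = 1, .trees = 1, .blobs = 1 }

/* Should the callback run for a batch of this type? */
static int sketch_wants_callback(const struct sketch_filter *f,
				 enum sketch_type type)
{
	switch (type) {
	case SKETCH_COMMIT:
		return f->commits;
	case SKETCH_TREE:
		return f->trees;
	case SKETCH_BLOB:
		return f->blobs;
	}
	assert(!"unknown type");
	return 0;
}

/*
 * Should a tree entry of this type be queued while expanding a tree?
 * Trees are always queued, since they may lead to wanted blobs even
 * when the callback is not run on the trees themselves.
 */
static int sketch_wants_entry(const struct sketch_filter *f,
			      enum sketch_type type)
{
	return type == SKETCH_BLOB ? f->blobs : 1;
}

/* Should root trees be collected during the commit walk at all? */
static int sketch_needs_trees(const struct sketch_filter *f)
{
	return f->trees || f->blobs;
}
```

Disabling only 'trees' therefore still descends through trees to reach
blobs, while disabling both 'trees' and 'blobs' reduces the walk to the
commit list.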

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/technical/api-path-walk.txt |   9 ++
 path-walk.c                               |  33 ++++-
 path-walk.h                               |  14 +-
 t/helper/test-path-walk.c                 |  18 ++-
 t/t6601-path-walk.sh                      | 149 +++++++++++++++-------
 5 files changed, 173 insertions(+), 50 deletions(-)

diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 662162ec70b38b..dce553b6114e1c 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,6 +39,15 @@ It is also important that you do not specify the `--objects` flag for the
 the objects will be walked in a separate way based on those starting
 commits.
 
+`commits`, `blobs`, `trees`::
+	By default, these members are enabled and signal that the path-walk
+	API should call the `path_fn` on objects of these types. Specialized
+	applications could disable some options to make it simpler to walk
+	the objects or to have fewer calls to `path_fn`.
++
+While it is possible to walk only commits in this way, consumers would be
+better off using the revision walk API instead.
+
 Examples
 --------
 
diff --git a/path-walk.c b/path-walk.c
index 24cf04c1e7dde6..2ca0840236701d 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -98,6 +98,10 @@ static int add_children(struct path_walk_context *ctx,
 		if (S_ISGITLINK(entry.mode))
 			continue;
 
+		/* If the caller doesn't want blobs, then don't bother. */
+		if (!ctx->info->blobs && type == OBJ_BLOB)
+			continue;
+
 		if (type == OBJ_TREE) {
 			struct tree *child = lookup_tree(ctx->repo, &entry.oid);
 			o = child ? &child->object : NULL;
@@ -157,9 +161,11 @@ static int walk_path(struct path_walk_context *ctx,
 	if (!list->oids.nr)
 		return 0;
 
-	/* Evaluate function pointer on this data. */
-	ret = ctx->info->path_fn(path, &list->oids, list->type,
-				 ctx->info->path_fn_data);
+	/* Evaluate function pointer on this data, if requested. */
+	if ((list->type == OBJ_TREE && ctx->info->trees) ||
+	    (list->type == OBJ_BLOB && ctx->info->blobs))
+		ret = ctx->info->path_fn(path, &list->oids, list->type,
+					ctx->info->path_fn_data);
 
 	/* Expand data for children. */
 	if (list->type == OBJ_TREE) {
@@ -201,6 +207,7 @@ int walk_objects_by_path(struct path_walk_info *info)
 	size_t commits_nr = 0, paths_nr = 0;
 	struct commit *c;
 	struct type_and_oid_list *root_tree_list;
+	struct type_and_oid_list *commit_list;
 	struct path_walk_context ctx = {
 		.repo = info->revs->repo,
 		.revs = info->revs,
@@ -212,6 +219,9 @@ int walk_objects_by_path(struct path_walk_info *info)
 
 	trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
 
+	CALLOC_ARRAY(commit_list, 1);
+	commit_list->type = OBJ_COMMIT;
+
 	/* Insert a single list for the root tree into the paths. */
 	CALLOC_ARRAY(root_tree_list, 1);
 	root_tree_list->type = OBJ_TREE;
@@ -222,10 +232,18 @@ int walk_objects_by_path(struct path_walk_info *info)
 		die(_("failed to setup revision walk"));
 
 	while ((c = get_revision(info->revs))) {
-		struct object_id *oid = get_commit_tree_oid(c);
+		struct object_id *oid;
 		struct tree *t;
 		commits_nr++;
 
+		if (info->commits)
+			oid_array_append(&commit_list->oids,
+					 &c->object.oid);
+
+		/* If we only care about commits, then skip trees. */
+		if (!info->trees && !info->blobs)
+			continue;
+
 		oid = get_commit_tree_oid(c);
 		t = lookup_tree(info->revs->repo, oid);
 
@@ -243,6 +261,13 @@ int walk_objects_by_path(struct path_walk_info *info)
 	trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
 	trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
 
+	/* Track all commits. */
+	if (info->commits && commit_list->oids.nr)
+		ret = info->path_fn("", &commit_list->oids, OBJ_COMMIT,
+				    info->path_fn_data);
+	oid_array_clear(&commit_list->oids);
+	free(commit_list);
+
 	trace2_region_enter("path-walk", "path-walk", info->revs->repo);
 	while (!ret && ctx.path_stack.nr) {
 		char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
diff --git a/path-walk.h b/path-walk.h
index c9e94a98bc8f6b..2d2afc29b47d58 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -30,9 +30,21 @@ struct path_walk_info {
 	 */
 	path_fn path_fn;
 	void *path_fn_data;
+
+	/**
+	 * Initialize which object types the path_fn should be called on. This
+	 * could also limit the walk to skip blobs if not set.
+	 */
+	int commits;
+	int trees;
+	int blobs;
 };
 
-#define PATH_WALK_INFO_INIT { 0 }
+#define PATH_WALK_INFO_INIT {   \
+	.blobs = 1,		\
+	.trees = 1,		\
+	.commits = 1,		\
+}
 
 /**
  * Given the configuration of 'info', walk the commits based on 'info->revs' and
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index aa468871079ab9..2b7e6e98d1807f 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -19,6 +19,8 @@ static const char * const path_walk_usage[] = {
 
 struct path_walk_test_data {
 	uintmax_t batch_nr;
+
+	uintmax_t commit_nr;
 	uintmax_t tree_nr;
 	uintmax_t blob_nr;
 };
@@ -30,6 +32,11 @@ static int emit_block(const char *path, struct oid_array *oids,
 	const char *typestr;
 
 	switch (type) {
+	case OBJ_COMMIT:
+		typestr = "COMMIT";
+		tdata->commit_nr += oids->nr;
+		break;
+
 	case OBJ_TREE:
 		typestr = "TREE";
 		tdata->tree_nr += oids->nr;
@@ -60,6 +67,12 @@ int cmd__path_walk(int argc, const char **argv)
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	struct path_walk_test_data data = { 0 };
 	struct option options[] = {
+		OPT_BOOL(0, "blobs", &info.blobs,
+			 N_("toggle inclusion of blob objects")),
+		OPT_BOOL(0, "commits", &info.commits,
+			 N_("toggle inclusion of commit objects")),
+		OPT_BOOL(0, "trees", &info.trees,
+			 N_("toggle inclusion of tree objects")),
 		OPT_END(),
 	};
 
@@ -81,9 +94,10 @@ int cmd__path_walk(int argc, const char **argv)
 
 	res = walk_objects_by_path(&info);
 
-	printf("trees:%" PRIuMAX "\n"
+	printf("commits:%" PRIuMAX "\n"
+	       "trees:%" PRIuMAX "\n"
 	       "blobs:%" PRIuMAX "\n",
-	       data.tree_nr, data.blob_nr);
+	       data.commit_nr, data.tree_nr, data.blob_nr);
 
 	release_revisions(&revs);
 	return res;
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index a3da55933f4405..dcd3c03a2e8229 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -33,22 +33,27 @@ test_expect_success 'all' '
 	test-tool path-walk -- --all >out &&
 
 	cat >expect <<-EOF &&
-	0:TREE::$(git rev-parse topic^{tree})
-	0:TREE::$(git rev-parse base^{tree})
-	0:TREE::$(git rev-parse base~1^{tree})
-	0:TREE::$(git rev-parse base~2^{tree})
-	1:TREE:right/:$(git rev-parse topic:right)
-	1:TREE:right/:$(git rev-parse base~1:right)
-	1:TREE:right/:$(git rev-parse base~2:right)
-	2:BLOB:right/d:$(git rev-parse base~1:right/d)
-	3:BLOB:right/c:$(git rev-parse base~2:right/c)
-	3:BLOB:right/c:$(git rev-parse topic:right/c)
-	4:TREE:left/:$(git rev-parse base:left)
-	4:TREE:left/:$(git rev-parse base~2:left)
-	5:BLOB:left/b:$(git rev-parse base~2:left/b)
-	5:BLOB:left/b:$(git rev-parse base:left/b)
-	6:BLOB:a:$(git rev-parse base~2:a)
+	0:COMMIT::$(git rev-parse topic)
+	0:COMMIT::$(git rev-parse base)
+	0:COMMIT::$(git rev-parse base~1)
+	0:COMMIT::$(git rev-parse base~2)
+	1:TREE::$(git rev-parse topic^{tree})
+	1:TREE::$(git rev-parse base^{tree})
+	1:TREE::$(git rev-parse base~1^{tree})
+	1:TREE::$(git rev-parse base~2^{tree})
+	2:TREE:right/:$(git rev-parse topic:right)
+	2:TREE:right/:$(git rev-parse base~1:right)
+	2:TREE:right/:$(git rev-parse base~2:right)
+	3:BLOB:right/d:$(git rev-parse base~1:right/d)
+	4:BLOB:right/c:$(git rev-parse base~2:right/c)
+	4:BLOB:right/c:$(git rev-parse topic:right/c)
+	5:TREE:left/:$(git rev-parse base:left)
+	5:TREE:left/:$(git rev-parse base~2:left)
+	6:BLOB:left/b:$(git rev-parse base~2:left/b)
+	6:BLOB:left/b:$(git rev-parse base:left/b)
+	7:BLOB:a:$(git rev-parse base~2:a)
 	blobs:6
+	commits:4
 	trees:9
 	EOF
 
@@ -59,19 +64,23 @@ test_expect_success 'topic only' '
 	test-tool path-walk -- topic >out &&
 
 	cat >expect <<-EOF &&
-	0:TREE::$(git rev-parse topic^{tree})
-	0:TREE::$(git rev-parse base~1^{tree})
-	0:TREE::$(git rev-parse base~2^{tree})
-	1:TREE:right/:$(git rev-parse topic:right)
-	1:TREE:right/:$(git rev-parse base~1:right)
-	1:TREE:right/:$(git rev-parse base~2:right)
-	2:BLOB:right/d:$(git rev-parse base~1:right/d)
-	3:BLOB:right/c:$(git rev-parse base~2:right/c)
-	3:BLOB:right/c:$(git rev-parse topic:right/c)
-	4:TREE:left/:$(git rev-parse base~2:left)
-	5:BLOB:left/b:$(git rev-parse base~2:left/b)
-	6:BLOB:a:$(git rev-parse base~2:a)
+	0:COMMIT::$(git rev-parse topic)
+	0:COMMIT::$(git rev-parse base~1)
+	0:COMMIT::$(git rev-parse base~2)
+	1:TREE::$(git rev-parse topic^{tree})
+	1:TREE::$(git rev-parse base~1^{tree})
+	1:TREE::$(git rev-parse base~2^{tree})
+	2:TREE:right/:$(git rev-parse topic:right)
+	2:TREE:right/:$(git rev-parse base~1:right)
+	2:TREE:right/:$(git rev-parse base~2:right)
+	3:BLOB:right/d:$(git rev-parse base~1:right/d)
+	4:BLOB:right/c:$(git rev-parse base~2:right/c)
+	4:BLOB:right/c:$(git rev-parse topic:right/c)
+	5:TREE:left/:$(git rev-parse base~2:left)
+	6:BLOB:left/b:$(git rev-parse base~2:left/b)
+	7:BLOB:a:$(git rev-parse base~2:a)
 	blobs:5
+	commits:3
 	trees:7
 	EOF
 
@@ -82,15 +91,66 @@ test_expect_success 'topic, not base' '
 	test-tool path-walk -- topic --not base >out &&
 
 	cat >expect <<-EOF &&
+	0:COMMIT::$(git rev-parse topic)
+	1:TREE::$(git rev-parse topic^{tree})
+	2:TREE:right/:$(git rev-parse topic:right)
+	3:BLOB:right/d:$(git rev-parse topic:right/d)
+	4:BLOB:right/c:$(git rev-parse topic:right/c)
+	5:TREE:left/:$(git rev-parse topic:left)
+	6:BLOB:left/b:$(git rev-parse topic:left/b)
+	7:BLOB:a:$(git rev-parse topic:a)
+	blobs:4
+	commits:1
+	trees:3
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only blobs' '
+	test-tool path-walk --no-trees --no-commits \
+		-- topic --not base >out &&
+
+	cat >expect <<-EOF &&
+	commits:0
+	trees:0
+	0:BLOB:right/d:$(git rev-parse topic:right/d)
+	1:BLOB:right/c:$(git rev-parse topic:right/c)
+	2:BLOB:left/b:$(git rev-parse topic:left/b)
+	3:BLOB:a:$(git rev-parse topic:a)
+	blobs:4
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+# No, this doesn't make a lot of sense for the path-walk API,
+# but it is possible to do.
+test_expect_success 'topic, not base, only commits' '
+	test-tool path-walk --no-blobs --no-trees \
+		-- topic --not base >out &&
+
+	cat >expect <<-EOF &&
+	0:COMMIT::$(git rev-parse topic)
+	commits:1
+	trees:0
+	blobs:0
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only trees' '
+	test-tool path-walk --no-blobs --no-commits \
+		-- topic --not base >out &&
+
+	cat >expect <<-EOF &&
+	commits:0
 	0:TREE::$(git rev-parse topic^{tree})
 	1:TREE:right/:$(git rev-parse topic:right)
-	2:BLOB:right/d:$(git rev-parse topic:right/d)
-	3:BLOB:right/c:$(git rev-parse topic:right/c)
-	4:TREE:left/:$(git rev-parse topic:left)
-	5:BLOB:left/b:$(git rev-parse topic:left/b)
-	6:BLOB:a:$(git rev-parse topic:a)
-	blobs:4
+	2:TREE:left/:$(git rev-parse topic:left)
 	trees:3
+	blobs:0
 	EOF
 
 	test_cmp_sorted expect out
@@ -100,17 +160,20 @@ test_expect_success 'topic, not base, boundary' '
 	test-tool path-walk -- --boundary topic --not base >out &&
 
 	cat >expect <<-EOF &&
-	0:TREE::$(git rev-parse topic^{tree})
-	0:TREE::$(git rev-parse base~1^{tree})
-	1:TREE:right/:$(git rev-parse topic:right)
-	1:TREE:right/:$(git rev-parse base~1:right)
-	2:BLOB:right/d:$(git rev-parse base~1:right/d)
-	3:BLOB:right/c:$(git rev-parse base~1:right/c)
-	3:BLOB:right/c:$(git rev-parse topic:right/c)
-	4:TREE:left/:$(git rev-parse base~1:left)
-	5:BLOB:left/b:$(git rev-parse base~1:left/b)
-	6:BLOB:a:$(git rev-parse base~1:a)
+	0:COMMIT::$(git rev-parse topic)
+	0:COMMIT::$(git rev-parse base~1)
+	1:TREE::$(git rev-parse topic^{tree})
+	1:TREE::$(git rev-parse base~1^{tree})
+	2:TREE:right/:$(git rev-parse topic:right)
+	2:TREE:right/:$(git rev-parse base~1:right)
+	3:BLOB:right/d:$(git rev-parse base~1:right/d)
+	4:BLOB:right/c:$(git rev-parse base~1:right/c)
+	4:BLOB:right/c:$(git rev-parse topic:right/c)
+	5:TREE:left/:$(git rev-parse base~1:left)
+	6:BLOB:left/b:$(git rev-parse base~1:left/b)
+	7:BLOB:a:$(git rev-parse base~1:a)
 	blobs:5
+	commits:2
 	trees:5
 	EOF
 

From 97765aa04c2fe4a254a93814a6e0de14fa4f9149 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Mon, 9 Sep 2024 12:31:21 -0400
Subject: [PATCH 05/16] path-walk: visit tags and cached objects

The rev_info that is specified for a path-walk traversal may specify
visiting tag refs (both lightweight and annotated) and also may specify
indexed objects (blobs and trees). Update the path-walk API to walk
these objects as well.

When walking tags, we need to peel the annotated objects until reaching
a non-tag object. If we reach a commit, then we can add it to the
pending objects to make sure we visit it in the commit walk portion. If we
reach a tree, then we will assume that it is a root tree. If we reach a
blob, then we have no good path name and so add it to a new list of
"tagged blobs".
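
As a rough illustration of the peeling rule above, here is a standalone
toy model. It does not use Git's object structs; every 'toy_' name is
hypothetical and exists only for this sketch. It follows a chain of tags
to the first non-tag object and reports which bucket that object would
land in.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for object types; not Git's enum object_type. */
enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };

struct toy_object {
	enum toy_type type;
	const struct toy_object *tagged; /* only meaningful for TOY_TAG */
};

/* Where a peeled object would go in this simplified model. */
enum toy_bucket {
	TOY_PENDING_COMMIT, /* handed to the commit walk */
	TOY_ROOT_TREE,      /* assumed to be a root tree */
	TOY_TAGGED_BLOB,    /* no path name is known */
	TOY_UNRESOLVED      /* chain ended without a target */
};

/* Follow annotated tags until a non-tag object, counting tags seen. */
static const struct toy_object *toy_peel(const struct toy_object *obj,
					 size_t *tags_seen)
{
	while (obj && obj->type == TOY_TAG) {
		(*tags_seen)++;
		obj = obj->tagged;
	}
	return obj;
}

static enum toy_bucket toy_classify(const struct toy_object *obj,
				    size_t *tags_seen)
{
	const struct toy_object *peeled;

	assert(tags_seen);
	peeled = toy_peel(obj, tags_seen);
	if (!peeled)
		return TOY_UNRESOLVED;

	switch (peeled->type) {
	case TOY_COMMIT:
		return TOY_PENDING_COMMIT;
	case TOY_TREE:
		return TOY_ROOT_TREE;
	case TOY_BLOB:
		return TOY_TAGGED_BLOB;
	default:
		return TOY_UNRESOLVED;
	}
}
```

The real setup_pending_objects() additionally checks the SEEN flag so
that a tag or target reachable from several refs is recorded only once;
the toy model leaves that out.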

When the rev_info includes the "--indexed-objects" flag, then the
pending set includes blobs and trees found in the cache entries and
cache-tree. The cache entries are usually blobs, though they could be
trees in the case of a sparse index. The cache-tree stores
previously-hashed tree objects but these are cleared out when staging
objects below those paths. We add tests that demonstrate this.

The indexed objects come with a non-NULL 'path' value in the pending
item. This allows us to prepopulate the 'paths_to_lists' strmap with
lists for these paths.
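
The keys used for that prepopulation follow the convention from earlier
in the series: a tree path ends in "/" unless it is the root, which is
the empty string, while a blob path is used unchanged. A minimal sketch
of the tree case, using a hypothetical helper 'toy_tree_key' and a
caller-provided buffer instead of Git's xstrfmt():

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the strmap key for a tree at 'path'
 * into 'buf'. The root tree maps to "", any other tree gets a
 * trailing "/". Returns 0 on success, -1 if 'buf' is too small.
 */
static int toy_tree_key(const char *path, char *buf, size_t len)
{
	int n;

	assert(path && buf);
	if (!*path)
		n = snprintf(buf, len, "%s", "");
	else
		n = snprintf(buf, len, "%s/", path);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Because a blob key never ends in "/", a file named "a" and a directory
named "a" get the distinct keys "a" and "a/", which is what the file/dir
collision in the test repository exercises.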

The tricky thing about this walk is that we will want to combine the
indexed objects walk with the commit walk, especially in the future case
of walking objects during a command like 'git repack'.

Whenever possible, we want the objects from the index to be grouped with
similar objects in history. We don't want to miss any paths that appear
only in the index and not in the commit history.

Thus, we need to be careful to let the path stack be populated initially
with only the root tree path (and possibly tags and tagged blobs) and go
through the normal depth-first search. Afterwards, if there are other
paths that are remaining in the paths_to_lists strmap, we should then
iterate through the stack and visit those objects recursively.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/technical/api-path-walk.txt |   2 +-
 path-walk.c                               | 174 +++++++++++++++++++-
 path-walk.h                               |   2 +
 t/helper/test-path-walk.c                 |  18 ++-
 t/t6601-path-walk.sh                      | 186 +++++++++++++++++++---
 5 files changed, 356 insertions(+), 26 deletions(-)

diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index dce553b6114e1c..6022c381b7c3f0 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,7 +39,7 @@ It is also important that you do not specify the `--objects` flag for the
 the objects will be walked in a separate way based on those starting
 commits.
 
-`commits`, `blobs`, `trees`::
+`commits`, `blobs`, `trees`, `tags`::
 	By default, these members are enabled and signal that the path-walk
 	API should call the `path_fn` on objects of these types. Specialized
 	applications could disable some options to make it simpler to walk
diff --git a/path-walk.c b/path-walk.c
index 2ca0840236701d..a1f539dcd46204 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -13,10 +13,13 @@
 #include "revision.h"
 #include "string-list.h"
 #include "strmap.h"
+#include "tag.h"
 #include "trace2.h"
 #include "tree.h"
 #include "tree-walk.h"
 
+static const char *root_path = "";
+
 struct type_and_oid_list
 {
 	enum object_type type;
@@ -158,12 +161,16 @@ static int walk_path(struct path_walk_context *ctx,
 
 	list = strmap_get(&ctx->paths_to_lists, path);
 
+	if (!list)
+		BUG("provided path '%s' that had no associated list", path);
+
 	if (!list->oids.nr)
 		return 0;
 
 	/* Evaluate function pointer on this data, if requested. */
 	if ((list->type == OBJ_TREE && ctx->info->trees) ||
-	    (list->type == OBJ_BLOB && ctx->info->blobs))
+	    (list->type == OBJ_BLOB && ctx->info->blobs) ||
+	    (list->type == OBJ_TAG && ctx->info->tags))
 		ret = ctx->info->path_fn(path, &list->oids, list->type,
 					ctx->info->path_fn_data);
 
@@ -194,6 +201,134 @@ static void clear_strmap(struct strmap *map)
 	strmap_init(map);
 }
 
+static void setup_pending_objects(struct path_walk_info *info,
+				  struct path_walk_context *ctx)
+{
+	struct type_and_oid_list *tags = NULL;
+	struct type_and_oid_list *tagged_blobs = NULL;
+	struct type_and_oid_list *root_tree_list = NULL;
+
+	if (info->tags)
+		CALLOC_ARRAY(tags, 1);
+	if (info->blobs)
+		CALLOC_ARRAY(tagged_blobs, 1);
+	if (info->trees)
+		root_tree_list = strmap_get(&ctx->paths_to_lists, root_path);
+
+	/*
+	 * Pending objects include:
+	 * * Commits at branch tips.
+	 * * Annotated tags at tag tips.
+	 * * Any kind of object at lightweight tag tips.
+	 * * Trees and blobs in the index (with an associated path).
+	 */
+	for (size_t i = 0; i < info->revs->pending.nr; i++) {
+		struct object_array_entry *pending = info->revs->pending.objects + i;
+		struct object *obj = pending->item;
+
+		/* Commits will be picked up by revision walk. */
+		if (obj->type == OBJ_COMMIT)
+			continue;
+
+		/* Navigate annotated tag object chains. */
+		while (obj->type == OBJ_TAG) {
+			struct tag *tag = lookup_tag(info->revs->repo, &obj->oid);
+			if (!tag)
+				break;
+			if (tag->object.flags & SEEN)
+				break;
+			tag->object.flags |= SEEN;
+
+			if (tags)
+				oid_array_append(&tags->oids, &obj->oid);
+			obj = tag->tagged;
+		}
+
+		if (obj->type == OBJ_TAG)
+			continue;
+
+		/* We are now at a non-tag object. */
+		if (obj->flags & SEEN)
+			continue;
+		obj->flags |= SEEN;
+
+		switch (obj->type) {
+		case OBJ_TREE:
+			if (!info->trees)
+				continue;
+			if (pending->path) {
+				struct type_and_oid_list *list;
+				char *path = *pending->path ? xstrfmt("%s/", pending->path)
+							    : xstrdup("");
+				if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+					CALLOC_ARRAY(list, 1);
+					list->type = OBJ_TREE;
+					strmap_put(&ctx->paths_to_lists, path, list);
+				}
+				oid_array_append(&list->oids, &obj->oid);
+				free(path);
+			} else {
+				/* assume a root tree, such as a lightweight tag. */
+				oid_array_append(&root_tree_list->oids, &obj->oid);
+			}
+			break;
+
+		case OBJ_BLOB:
+			if (!info->blobs)
+				continue;
+			if (pending->path) {
+				struct type_and_oid_list *list;
+				char *path = pending->path;
+				if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+					CALLOC_ARRAY(list, 1);
+					list->type = OBJ_BLOB;
+					strmap_put(&ctx->paths_to_lists, path, list);
+				}
+				oid_array_append(&list->oids, &obj->oid);
+			} else {
+				/* no path: a tagged blob, such as a lightweight tag. */
+				oid_array_append(&tagged_blobs->oids, &obj->oid);
+			}
+			break;
+
+		case OBJ_COMMIT:
+			/* Make sure it is in the object walk */
+			if (obj != pending->item)
+				add_pending_object(info->revs, obj, "");
+			break;
+
+		default:
+			BUG("should not see any other type here");
+		}
+	}
+
+	/*
+	 * Add tag objects and tagged blobs if they exist.
+	 */
+	if (tagged_blobs) {
+		if (tagged_blobs->oids.nr) {
+			const char *tagged_blob_path = "/tagged-blobs";
+			tagged_blobs->type = OBJ_BLOB;
+			push_to_stack(ctx, tagged_blob_path);
+			strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
+		} else {
+			oid_array_clear(&tagged_blobs->oids);
+			free(tagged_blobs);
+		}
+	}
+	if (tags) {
+		if (tags->oids.nr) {
+			const char *tag_path = "/tags";
+			tags->type = OBJ_TAG;
+			push_to_stack(ctx, tag_path);
+			strmap_put(&ctx->paths_to_lists, tag_path, tags);
+		} else {
+			oid_array_clear(&tags->oids);
+			free(tags);
+		}
+	}
+}
+
 /**
  * Given the configuration of 'info', walk the commits based on 'info->revs' and
  * call 'info->path_fn' on each discovered path.
@@ -202,7 +337,6 @@ static void clear_strmap(struct strmap *map)
  */
 int walk_objects_by_path(struct path_walk_info *info)
 {
-	const char *root_path = "";
 	int ret = 0;
 	size_t commits_nr = 0, paths_nr = 0;
 	struct commit *c;
@@ -222,15 +356,31 @@ int walk_objects_by_path(struct path_walk_info *info)
 	CALLOC_ARRAY(commit_list, 1);
 	commit_list->type = OBJ_COMMIT;
 
+	if (info->tags)
+		info->revs->tag_objects = 1;
+
 	/* Insert a single list for the root tree into the paths. */
 	CALLOC_ARRAY(root_tree_list, 1);
 	root_tree_list->type = OBJ_TREE;
 	strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
 	push_to_stack(&ctx, root_path);
 
+	/*
+	 * Set these values before preparing the walk to catch
+	 * lightweight tags pointing to non-commits and indexed objects.
+	 */
+	info->revs->blob_objects = info->blobs;
+	info->revs->tree_objects = info->trees;
+
 	if (prepare_revision_walk(info->revs))
 		die(_("failed to setup revision walk"));
 
+	info->revs->blob_objects = info->revs->tree_objects = 0;
+
+	trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
+	setup_pending_objects(info, &ctx);
+	trace2_region_leave("path-walk", "pending-walk", info->revs->repo);
+
 	while ((c = get_revision(info->revs))) {
 		struct object_id *oid;
 		struct tree *t;
@@ -278,6 +428,26 @@ int walk_objects_by_path(struct path_walk_info *info)
 
 		free(path);
 	}
+
+	/* Are there paths remaining? Likely they are from indexed objects. */
+	if (!strmap_empty(&ctx.paths_to_lists)) {
+		struct hashmap_iter iter;
+		struct strmap_entry *entry;
+
+		strmap_for_each_entry(&ctx.paths_to_lists, &iter, entry)
+			push_to_stack(&ctx, entry->key);
+
+		while (!ret && ctx.path_stack.nr) {
+			char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+			ctx.path_stack.nr--;
+			paths_nr++;
+
+			ret = walk_path(&ctx, path);
+
+			free(path);
+		}
+	}
+
 	trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
 	trace2_region_leave("path-walk", "path-walk", info->revs->repo);
 
diff --git a/path-walk.h b/path-walk.h
index 2d2afc29b47d58..ca839f873e4dea 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -38,12 +38,14 @@ struct path_walk_info {
 	int commits;
 	int trees;
 	int blobs;
+	int tags;
 };
 
 #define PATH_WALK_INFO_INIT {   \
 	.blobs = 1,		\
 	.trees = 1,		\
 	.commits = 1,		\
+	.tags = 1,		\
 }
 
 /**
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 2b7e6e98d1807f..265bd0b443bde8 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -23,6 +23,7 @@ struct path_walk_test_data {
 	uintmax_t commit_nr;
 	uintmax_t tree_nr;
 	uintmax_t blob_nr;
+	uintmax_t tag_nr;
 };
 
 static int emit_block(const char *path, struct oid_array *oids,
@@ -47,10 +48,20 @@ static int emit_block(const char *path, struct oid_array *oids,
 		tdata->blob_nr += oids->nr;
 		break;
 
+	case OBJ_TAG:
+		typestr = "TAG";
+		tdata->tag_nr += oids->nr;
+		break;
+
 	default:
 		BUG("we do not understand this type");
 	}
 
+	/* This should never be output during tests. */
+	if (!oids->nr)
+		printf("%"PRIuMAX":%s:%s:EMPTY\n",
+		       tdata->batch_nr, typestr, path);
+
 	for (size_t i = 0; i < oids->nr; i++)
 		printf("%"PRIuMAX":%s:%s:%s\n",
 		       tdata->batch_nr, typestr, path,
@@ -71,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
 			 N_("toggle inclusion of blob objects")),
 		OPT_BOOL(0, "commits", &info.commits,
 			 N_("toggle inclusion of commit objects")),
+		OPT_BOOL(0, "tags", &info.tags,
+			 N_("toggle inclusion of tag objects")),
 		OPT_BOOL(0, "trees", &info.trees,
 			 N_("toggle inclusion of tree objects")),
 		OPT_END(),
@@ -96,8 +109,9 @@ int cmd__path_walk(int argc, const char **argv)
 
 	printf("commits:%" PRIuMAX "\n"
 	       "trees:%" PRIuMAX "\n"
-	       "blobs:%" PRIuMAX "\n",
-	       data.commit_nr, data.tree_nr, data.blob_nr);
+	       "blobs:%" PRIuMAX "\n"
+	       "tags:%" PRIuMAX "\n",
+	       data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
 
 	release_revisions(&revs);
 	return res;
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index dcd3c03a2e8229..bf43ab0e22a73e 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -9,29 +9,142 @@ test_description='direct path-walk API tests'
 test_expect_success 'setup test repository' '
 	git checkout -b base &&
 
+	# Make some objects that will only be reachable
+	# via non-commit tags.
+	mkdir child &&
+	echo file >child/file &&
+	git add child &&
+	git commit -m "will abandon" &&
+	git tag -a -m "tree" tree-tag HEAD^{tree} &&
+	echo file2 >file2 &&
+	git add file2 &&
+	git commit --amend -m "will abandon" &&
+	git tag tree-tag2 HEAD^{tree} &&
+
+	echo blob >file &&
+	blob_oid=$(git hash-object -t blob -w --stdin <file) &&
+	git tag -a -m "blob" blob-tag "$blob_oid" &&
+	echo blob2 >file2 &&
+	blob2_oid=$(git hash-object -t blob -w --stdin <file2) &&
+	git tag blob-tag2 "$blob2_oid" &&
+
+	rm -fr child file file2 &&
+
 	mkdir left &&
 	mkdir right &&
 	echo a >a &&
 	echo b >left/b &&
 	echo c >right/c &&
 	git add . &&
-	git commit -m "first" &&
+	git commit --amend -m "first" &&
+	git tag -m "first" first HEAD &&
 
 	echo d >right/d &&
 	git add right &&
 	git commit -m "second" &&
+	git tag -a -m "second (under)" second.1 HEAD &&
+	git tag -a -m "second (top)" second.2 second.1 &&
 
+	# Set up file/dir collision in history.
+	rm a &&
+	mkdir a &&
+	echo a >a/a &&
 	echo bb >left/b &&
-	git commit -a -m "third" &&
+	git add a left &&
+	git commit -m "third" &&
+	git tag -a -m "third" third &&
 
 	git checkout -b topic HEAD~1 &&
 	echo cc >right/c &&
-	git commit -a -m "topic"
+	git commit -a -m "topic" &&
+	git tag -a -m "fourth" fourth
 '
 
 test_expect_success 'all' '
 	test-tool path-walk -- --all >out &&
 
+	cat >expect <<-EOF &&
+	0:COMMIT::$(git rev-parse topic)
+	0:COMMIT::$(git rev-parse base)
+	0:COMMIT::$(git rev-parse base~1)
+	0:COMMIT::$(git rev-parse base~2)
+	1:TAG:/tags:$(git rev-parse refs/tags/first)
+	1:TAG:/tags:$(git rev-parse refs/tags/second.1)
+	1:TAG:/tags:$(git rev-parse refs/tags/second.2)
+	1:TAG:/tags:$(git rev-parse refs/tags/third)
+	1:TAG:/tags:$(git rev-parse refs/tags/fourth)
+	1:TAG:/tags:$(git rev-parse refs/tags/tree-tag)
+	1:TAG:/tags:$(git rev-parse refs/tags/blob-tag)
+	2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
+	2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+	3:TREE::$(git rev-parse topic^{tree})
+	3:TREE::$(git rev-parse base^{tree})
+	3:TREE::$(git rev-parse base~1^{tree})
+	3:TREE::$(git rev-parse base~2^{tree})
+	3:TREE::$(git rev-parse refs/tags/tree-tag^{})
+	3:TREE::$(git rev-parse refs/tags/tree-tag2^{})
+	4:BLOB:a:$(git rev-parse base~2:a)
+	5:TREE:right/:$(git rev-parse topic:right)
+	5:TREE:right/:$(git rev-parse base~1:right)
+	5:TREE:right/:$(git rev-parse base~2:right)
+	6:BLOB:right/d:$(git rev-parse base~1:right/d)
+	7:BLOB:right/c:$(git rev-parse base~2:right/c)
+	7:BLOB:right/c:$(git rev-parse topic:right/c)
+	8:TREE:left/:$(git rev-parse base:left)
+	8:TREE:left/:$(git rev-parse base~2:left)
+	9:BLOB:left/b:$(git rev-parse base~2:left/b)
+	9:BLOB:left/b:$(git rev-parse base:left/b)
+	10:TREE:a/:$(git rev-parse base:a)
+	11:BLOB:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
+	12:TREE:child/:$(git rev-parse refs/tags/tree-tag:child)
+	13:BLOB:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+	blobs:10
+	commits:4
+	tags:7
+	trees:13
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+test_expect_success 'indexed objects' '
+	test_when_finished git reset --hard &&
+
+	# stage change into index, adding a blob but
+	# also invalidating the cache-tree for the root
+	# and the "left" directory.
+	echo bogus >left/c &&
+	git add left &&
+
+	test-tool path-walk -- --indexed-objects >out &&
+
+	cat >expect <<-EOF &&
+	0:BLOB:a:$(git rev-parse HEAD:a)
+	1:BLOB:left/b:$(git rev-parse HEAD:left/b)
+	2:BLOB:left/c:$(git rev-parse :left/c)
+	3:BLOB:right/c:$(git rev-parse HEAD:right/c)
+	4:BLOB:right/d:$(git rev-parse HEAD:right/d)
+	5:TREE:right/:$(git rev-parse topic:right)
+	blobs:5
+	commits:0
+	tags:0
+	trees:1
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+test_expect_success 'branches and indexed objects mix well' '
+	test_when_finished git reset --hard &&
+
+	# stage change into index, adding a blob but
+	# also invalidating the cache-tree for the root
+	# and the "right" directory.
+	echo fake >right/d &&
+	git add right &&
+
+	test-tool path-walk -- --indexed-objects --branches >out &&
+
 	cat >expect <<-EOF &&
 	0:COMMIT::$(git rev-parse topic)
 	0:COMMIT::$(git rev-parse base)
@@ -41,20 +154,23 @@ test_expect_success 'all' '
 	1:TREE::$(git rev-parse base^{tree})
 	1:TREE::$(git rev-parse base~1^{tree})
 	1:TREE::$(git rev-parse base~2^{tree})
-	2:TREE:right/:$(git rev-parse topic:right)
-	2:TREE:right/:$(git rev-parse base~1:right)
-	2:TREE:right/:$(git rev-parse base~2:right)
-	3:BLOB:right/d:$(git rev-parse base~1:right/d)
-	4:BLOB:right/c:$(git rev-parse base~2:right/c)
-	4:BLOB:right/c:$(git rev-parse topic:right/c)
-	5:TREE:left/:$(git rev-parse base:left)
-	5:TREE:left/:$(git rev-parse base~2:left)
-	6:BLOB:left/b:$(git rev-parse base~2:left/b)
-	6:BLOB:left/b:$(git rev-parse base:left/b)
-	7:BLOB:a:$(git rev-parse base~2:a)
-	blobs:6
+	2:BLOB:a:$(git rev-parse base~2:a)
+	3:TREE:right/:$(git rev-parse topic:right)
+	3:TREE:right/:$(git rev-parse base~1:right)
+	3:TREE:right/:$(git rev-parse base~2:right)
+	4:BLOB:right/d:$(git rev-parse base~1:right/d)
+	4:BLOB:right/d:$(git rev-parse :right/d)
+	5:BLOB:right/c:$(git rev-parse base~2:right/c)
+	5:BLOB:right/c:$(git rev-parse topic:right/c)
+	6:TREE:left/:$(git rev-parse base:left)
+	6:TREE:left/:$(git rev-parse base~2:left)
+	7:BLOB:left/b:$(git rev-parse base:left/b)
+	7:BLOB:left/b:$(git rev-parse base~2:left/b)
+	8:TREE:a/:$(git rev-parse refs/tags/third:a)
+	blobs:7
 	commits:4
-	trees:9
+	tags:0
+	trees:10
 	EOF
 
 	test_cmp_sorted expect out
@@ -81,6 +197,7 @@ test_expect_success 'topic only' '
 	7:BLOB:a:$(git rev-parse base~2:a)
 	blobs:5
 	commits:3
+	tags:0
 	trees:7
 	EOF
 
@@ -101,6 +218,7 @@ test_expect_success 'topic, not base' '
 	7:BLOB:a:$(git rev-parse topic:a)
 	blobs:4
 	commits:1
+	tags:0
 	trees:3
 	EOF
 
@@ -112,13 +230,14 @@ test_expect_success 'topic, not base, only blobs' '
 		-- topic --not base >out &&
 
 	cat >expect <<-EOF &&
-	commits:0
-	trees:0
 	0:BLOB:right/d:$(git rev-parse topic:right/d)
 	1:BLOB:right/c:$(git rev-parse topic:right/c)
 	2:BLOB:left/b:$(git rev-parse topic:left/b)
 	3:BLOB:a:$(git rev-parse topic:a)
 	blobs:4
+	commits:0
+	tags:0
+	trees:0
 	EOF
 
 	test_cmp_sorted expect out
@@ -133,8 +252,9 @@ test_expect_success 'topic, not base, only commits' '
 	cat >expect <<-EOF &&
 	0:COMMIT::$(git rev-parse topic)
 	commits:1
-	trees:0
 	blobs:0
+	tags:0
+	trees:0
 	EOF
 
 	test_cmp_sorted expect out
@@ -145,12 +265,13 @@ test_expect_success 'topic, not base, only trees' '
 		-- topic --not base >out &&
 
 	cat >expect <<-EOF &&
-	commits:0
 	0:TREE::$(git rev-parse topic^{tree})
 	1:TREE:right/:$(git rev-parse topic:right)
 	2:TREE:left/:$(git rev-parse topic:left)
-	trees:3
+	commits:0
 	blobs:0
+	tags:0
+	trees:3
 	EOF
 
 	test_cmp_sorted expect out
@@ -174,10 +295,33 @@ test_expect_success 'topic, not base, boundary' '
 	7:BLOB:a:$(git rev-parse base~1:a)
 	blobs:5
 	commits:2
+	tags:0
 	trees:5
 	EOF
 
 	test_cmp_sorted expect out
 '
 
+test_expect_success 'trees are reported exactly once' '
+	test_when_finished "rm -rf unique-trees" &&
+	test_create_repo unique-trees &&
+	(
+		cd unique-trees &&
+		mkdir initial &&
+		test_commit initial/file &&
+
+		git switch -c move-to-top &&
+		git mv initial/file.t ./ &&
+		test_tick &&
+		git commit -m moved &&
+
+		git update-ref refs/heads/other HEAD
+	) &&
+
+	test-tool -C unique-trees path-walk -- --all >out &&
+	tree=$(git -C unique-trees rev-parse HEAD:) &&
+	grep "$tree" out >out-filtered &&
+	test_line_count = 1 out-filtered
+'
+
 test_done

From a4aaa3b001b75c19c96130f1c057157a29f9a7f5 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Tue, 3 Sep 2024 21:55:47 -0400
Subject: [PATCH 06/16] path-walk: mark trees and blobs as UNINTERESTING

When the input rev_info has UNINTERESTING starting points, we want to be
sure that the UNINTERESTING flag is passed appropriately through the
objects. To match how this is done in places such as 'git pack-objects', we
use the mark_edges_uninteresting() method.

This method has an option for using the "sparse" walk, which is similar in
spirit to the path-walk API's walk. To be sure to keep it independent, add a
new 'prune_all_uninteresting' option to the path_walk_info struct.
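
The pruning decision itself is small: a path is skipped only when every
object collected for it carries the UNINTERESTING flag. A standalone
sketch of that check follows. The 'toy_' names and the flag value are
placeholders for illustration, and it operates on a plain array of flag
words rather than on Git's object structs.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder bit for this sketch only. */
#define TOY_UNINTERESTING (1u << 1)

/*
 * Return 1 if every entry has TOY_UNINTERESTING set, which is the
 * condition under which the path would be pruned. Return 0 as soon
 * as one interesting entry is found. An empty list returns 1.
 */
static int toy_all_uninteresting(const unsigned int *flags, size_t nr)
{
	assert(flags || !nr);
	for (size_t i = 0; i < nr; i++)
		if (!(flags[i] & TOY_UNINTERESTING))
			return 0;
	return 1;
}
```

In the patch, walk_path() first consults the cached 'maybe_interesting'
bit and only rescans the list when that bit is set, since objects can
be marked UNINTERESTING after they were added.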

To check how the UNINTERESTING flag is spread through our objects, extend the
'test-tool path-walk' command to output whether or not an object has that
flag. This changes our tests significantly, including the removal of some
objects that were previously visited due to the incomplete implementation.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/technical/api-path-walk.txt |  8 +++
 path-walk.c                               | 73 +++++++++++++++++++++
 path-walk.h                               |  8 +++
 t/helper/test-path-walk.c                 | 12 +++-
 t/t6601-path-walk.sh                      | 79 +++++++++++++++++------
 5 files changed, 158 insertions(+), 22 deletions(-)

diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 6022c381b7c3f0..7075d0d5ab50fd 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -48,6 +48,14 @@ commits.
 While it is possible to walk only commits in this way, consumers would be
 better off using the revision walk API instead.
 
+`prune_all_uninteresting`::
+	By default, all reachable paths are emitted by the path-walk API.
+	This option allows consumers to declare that they are not
+	interested in paths where all included objects are marked with the
+	`UNINTERESTING` flag. This requires using the `boundary` option in
+	the revision walk so that the walk emits commits marked with the
+	`UNINTERESTING` flag.
+
 Examples
 --------
 
diff --git a/path-walk.c b/path-walk.c
index a1f539dcd46204..896ec0c4779101 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -8,6 +8,7 @@
 #include "dir.h"
 #include "hashmap.h"
 #include "hex.h"
+#include "list-objects.h"
 #include "object.h"
 #include "oid-array.h"
 #include "revision.h"
@@ -24,6 +25,7 @@ struct type_and_oid_list
 {
 	enum object_type type;
 	struct oid_array oids;
+	int maybe_interesting;
 };
 
 #define TYPE_AND_OID_LIST_INIT { \
@@ -140,6 +142,9 @@ static int add_children(struct path_walk_context *ctx,
 		if (o->flags & SEEN)
 			continue;
 		o->flags |= SEEN;
+
+		if (!(o->flags & UNINTERESTING))
+			list->maybe_interesting = 1;
 		oid_array_append(&list->oids, &entry.oid);
 	}
 
@@ -167,6 +172,43 @@ static int walk_path(struct path_walk_context *ctx,
 	if (!list->oids.nr)
 		return 0;
 
+	if (ctx->info->prune_all_uninteresting) {
+		/*
+		 * This is true if all objects were UNINTERESTING
+		 * when added to the list.
+		 */
+		if (!list->maybe_interesting)
+			return 0;
+
+		/*
+		 * But it's still possible that the objects were set
+		 * as UNINTERESTING after being added. Do a quick check.
+		 */
+		list->maybe_interesting = 0;
+		for (size_t i = 0;
+		     !list->maybe_interesting && i < list->oids.nr;
+		     i++) {
+			if (list->type == OBJ_TREE) {
+				struct tree *t = lookup_tree(ctx->repo,
+							     &list->oids.oid[i]);
+				if (t && !(t->object.flags & UNINTERESTING))
+					list->maybe_interesting = 1;
+			} else if (list->type == OBJ_BLOB) {
+				struct blob *b = lookup_blob(ctx->repo,
+							     &list->oids.oid[i]);
+				if (b && !(b->object.flags & UNINTERESTING))
+					list->maybe_interesting = 1;
+			} else {
+				/* Tags are always interesting if visited. */
+				list->maybe_interesting = 1;
+			}
+		}
+
+		/* We have confirmed that all objects are UNINTERESTING. */
+		if (!list->maybe_interesting)
+			return 0;
+	}
+
 	/* Evaluate function pointer on this data, if requested. */
 	if ((list->type == OBJ_TREE && ctx->info->trees) ||
 	    (list->type == OBJ_BLOB && ctx->info->blobs) ||
@@ -201,6 +243,26 @@ static void clear_strmap(struct strmap *map)
 	strmap_init(map);
 }
 
+static struct repository *edge_repo;
+static struct type_and_oid_list *edge_tree_list;
+
+static void show_edge(struct commit *commit)
+{
+	struct tree *t = repo_get_commit_tree(edge_repo, commit);
+
+	if (!t)
+		return;
+
+	if (commit->object.flags & UNINTERESTING)
+		t->object.flags |= UNINTERESTING;
+
+	if (t->object.flags & SEEN)
+		return;
+	t->object.flags |= SEEN;
+
+	oid_array_append(&edge_tree_list->oids, &t->object.oid);
+}
+
 static void setup_pending_objects(struct path_walk_info *info,
 				  struct path_walk_context *ctx)
 {
@@ -309,6 +371,7 @@ static void setup_pending_objects(struct path_walk_info *info,
 		if (tagged_blobs->oids.nr) {
 			const char *tagged_blob_path = "/tagged-blobs";
 			tagged_blobs->type = OBJ_BLOB;
+			tagged_blobs->maybe_interesting = 1;
 			push_to_stack(ctx, tagged_blob_path);
 			strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
 		} else {
@@ -320,6 +383,7 @@ static void setup_pending_objects(struct path_walk_info *info,
 		if (tags->oids.nr) {
 			const char *tag_path = "/tags";
 			tags->type = OBJ_TAG;
+			tags->maybe_interesting = 1;
 			push_to_stack(ctx, tag_path);
 			strmap_put(&ctx->paths_to_lists, tag_path, tags);
 		} else {
@@ -362,6 +426,7 @@ int walk_objects_by_path(struct path_walk_info *info)
 	/* Insert a single list for the root tree into the paths. */
 	CALLOC_ARRAY(root_tree_list, 1);
 	root_tree_list->type = OBJ_TREE;
+	root_tree_list->maybe_interesting = 1;
 	strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
 	push_to_stack(&ctx, root_path);
 
@@ -375,6 +440,14 @@ int walk_objects_by_path(struct path_walk_info *info)
 	if (prepare_revision_walk(info->revs))
 		die(_("failed to setup revision walk"));
 
+	/* Walk trees to mark them as UNINTERESTING. */
+	edge_repo = info->revs->repo;
+	edge_tree_list = root_tree_list;
+	mark_edges_uninteresting(info->revs, show_edge,
+				 info->prune_all_uninteresting);
+	edge_repo = NULL;
+	edge_tree_list = NULL;
+
 	info->revs->blob_objects = info->revs->tree_objects = 0;
 
 	trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index ca839f873e4dea..de0db007dc9a5f 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -39,6 +39,14 @@ struct path_walk_info {
 	int trees;
 	int blobs;
 	int tags;
+
+	/**
+	 * When 'prune_all_uninteresting' is set and a path has all objects
+	 * marked as UNINTERESTING, then the path-walk will not visit those
+	 * objects. It will not call path_fn on those objects and will not
+	 * walk the children of such trees.
+	 */
+	int prune_all_uninteresting;
 };
 
 #define PATH_WALK_INFO_INIT {   \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 265bd0b443bde8..7e791cfaf979b7 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -62,10 +62,14 @@ static int emit_block(const char *path, struct oid_array *oids,
 		printf("%"PRIuMAX":%s:%s:EMPTY\n",
 		       tdata->batch_nr, typestr, path);
 
-	for (size_t i = 0; i < oids->nr; i++)
-		printf("%"PRIuMAX":%s:%s:%s\n",
+	for (size_t i = 0; i < oids->nr; i++) {
+		struct object *o = lookup_unknown_object(the_repository,
+							 &oids->oid[i]);
+		printf("%"PRIuMAX":%s:%s:%s%s\n",
 		       tdata->batch_nr, typestr, path,
-		       oid_to_hex(&oids->oid[i]));
+		       oid_to_hex(&oids->oid[i]),
+		       o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
+	}
 
 	tdata->batch_nr++;
 	return 0;
@@ -86,6 +90,8 @@ int cmd__path_walk(int argc, const char **argv)
 			 N_("toggle inclusion of tag objects")),
 		OPT_BOOL(0, "trees", &info.trees,
 			 N_("toggle inclusion of tree objects")),
+		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
+			 N_("toggle pruning of uninteresting paths")),
 		OPT_END(),
 	};
 
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index bf43ab0e22a73e..d3c0015319a354 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -211,11 +211,11 @@ test_expect_success 'topic, not base' '
 	0:COMMIT::$(git rev-parse topic)
 	1:TREE::$(git rev-parse topic^{tree})
 	2:TREE:right/:$(git rev-parse topic:right)
-	3:BLOB:right/d:$(git rev-parse topic:right/d)
+	3:BLOB:right/d:$(git rev-parse topic:right/d):UNINTERESTING
 	4:BLOB:right/c:$(git rev-parse topic:right/c)
-	5:TREE:left/:$(git rev-parse topic:left)
-	6:BLOB:left/b:$(git rev-parse topic:left/b)
-	7:BLOB:a:$(git rev-parse topic:a)
+	5:TREE:left/:$(git rev-parse topic:left):UNINTERESTING
+	6:BLOB:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+	7:BLOB:a:$(git rev-parse topic:a):UNINTERESTING
 	blobs:4
 	commits:1
 	tags:0
@@ -225,15 +225,38 @@ test_expect_success 'topic, not base' '
 	test_cmp_sorted expect out
 '
 
+test_expect_success 'fourth, blob-tag2, not base' '
+	test-tool path-walk -- fourth blob-tag2 --not base >out &&
+
+	cat >expect <<-EOF &&
+	0:COMMIT::$(git rev-parse topic)
+	1:TAG:/tags:$(git rev-parse fourth)
+	2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+	3:TREE::$(git rev-parse topic^{tree})
+	4:TREE:right/:$(git rev-parse topic:right)
+	5:BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+	6:BLOB:right/c:$(git rev-parse topic:right/c)
+	7:TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
+	8:BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+	9:BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+	blobs:5
+	commits:1
+	tags:1
+	trees:3
+	EOF
+
+	test_cmp_sorted expect out
+'
+
 test_expect_success 'topic, not base, only blobs' '
 	test-tool path-walk --no-trees --no-commits \
 		-- topic --not base >out &&
 
 	cat >expect <<-EOF &&
-	0:BLOB:right/d:$(git rev-parse topic:right/d)
+	0:BLOB:right/d:$(git rev-parse topic:right/d):UNINTERESTING
 	1:BLOB:right/c:$(git rev-parse topic:right/c)
-	2:BLOB:left/b:$(git rev-parse topic:left/b)
-	3:BLOB:a:$(git rev-parse topic:a)
+	2:BLOB:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+	3:BLOB:a:$(git rev-parse topic:a):UNINTERESTING
 	blobs:4
 	commits:0
 	tags:0
@@ -267,7 +290,7 @@ test_expect_success 'topic, not base, only trees' '
 	cat >expect <<-EOF &&
 	0:TREE::$(git rev-parse topic^{tree})
 	1:TREE:right/:$(git rev-parse topic:right)
-	2:TREE:left/:$(git rev-parse topic:left)
+	2:TREE:left/:$(git rev-parse topic:left):UNINTERESTING
 	commits:0
 	blobs:0
 	tags:0
@@ -282,17 +305,17 @@ test_expect_success 'topic, not base, boundary' '
 
 	cat >expect <<-EOF &&
 	0:COMMIT::$(git rev-parse topic)
-	0:COMMIT::$(git rev-parse base~1)
+	0:COMMIT::$(git rev-parse base~1):UNINTERESTING
 	1:TREE::$(git rev-parse topic^{tree})
-	1:TREE::$(git rev-parse base~1^{tree})
+	1:TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
 	2:TREE:right/:$(git rev-parse topic:right)
-	2:TREE:right/:$(git rev-parse base~1:right)
-	3:BLOB:right/d:$(git rev-parse base~1:right/d)
-	4:BLOB:right/c:$(git rev-parse base~1:right/c)
+	2:TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
+	3:BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+	4:BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
 	4:BLOB:right/c:$(git rev-parse topic:right/c)
-	5:TREE:left/:$(git rev-parse base~1:left)
-	6:BLOB:left/b:$(git rev-parse base~1:left/b)
-	7:BLOB:a:$(git rev-parse base~1:a)
+	5:TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
+	6:BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+	7:BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
 	blobs:5
 	commits:2
 	tags:0
@@ -302,6 +325,27 @@ test_expect_success 'topic, not base, boundary' '
 	test_cmp_sorted expect out
 '
 
+test_expect_success 'topic, not base, boundary with pruning' '
+	test-tool path-walk --prune -- --boundary topic --not base >out &&
+
+	cat >expect <<-EOF &&
+	0:COMMIT::$(git rev-parse topic)
+	0:COMMIT::$(git rev-parse base~1):UNINTERESTING
+	1:TREE::$(git rev-parse topic^{tree})
+	1:TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
+	2:TREE:right/:$(git rev-parse topic:right)
+	2:TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
+	3:BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+	3:BLOB:right/c:$(git rev-parse topic:right/c)
+	blobs:2
+	commits:2
+	tags:0
+	trees:4
+	EOF
+
+	test_cmp_sorted expect out
+'
+
 test_expect_success 'trees are reported exactly once' '
 	test_when_finished "rm -rf unique-trees" &&
 	test_create_repo unique-trees &&
@@ -309,15 +353,12 @@ test_expect_success 'trees are reported exactly once' '
 		cd unique-trees &&
 		mkdir initial &&
 		test_commit initial/file &&
-
 		git switch -c move-to-top &&
 		git mv initial/file.t ./ &&
 		test_tick &&
 		git commit -m moved &&
-
 		git update-ref refs/heads/other HEAD
 	) &&
-
 	test-tool -C unique-trees path-walk -- --all >out &&
 	tree=$(git -C unique-trees rev-parse HEAD:) &&
 	grep "$tree" out >out-filtered &&

From 512f033e899b6fff554da7ea7da5e050cd3d8feb Mon Sep 17 00:00:00 2001
From: Jeff Hostetler <jeffhostetler@github.com>
Date: Mon, 29 Apr 2024 08:55:03 -0400
Subject: [PATCH 07/16] survey: stub in new experimental 'git-survey' command

Start work on a new 'git survey' command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed
by the Go-based 'git-sizer' (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from Go to gain further insight into potential scaling
problems.

Co-authored-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 .gitignore                      |  1 +
 Documentation/config.txt        |  2 +
 Documentation/config/survey.txt | 11 +++++
 Documentation/git-survey.txt    | 36 ++++++++++++++++
 Makefile                        |  1 +
 builtin.h                       |  1 +
 builtin/survey.c                | 74 +++++++++++++++++++++++++++++++++
 command-list.txt                |  1 +
 git.c                           |  1 +
 t/t8100-git-survey.sh           | 18 ++++++++
 10 files changed, 146 insertions(+)
 create mode 100644 Documentation/config/survey.txt
 create mode 100644 Documentation/git-survey.txt
 create mode 100644 builtin/survey.c
 create mode 100755 t/t8100-git-survey.sh

diff --git a/.gitignore b/.gitignore
index 6687bd6db4c0a6..8a775af2635fe7 100644
--- a/.gitignore
+++ b/.gitignore
@@ -165,6 +165,7 @@
 /git-submodule
 /git-submodule--helper
 /git-subtree
+/git-survey
 /git-svn
 /git-switch
 /git-symbolic-ref
diff --git a/Documentation/config.txt b/Documentation/config.txt
index 8c0b3ed8075214..4ee8b693022ef7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -534,6 +534,8 @@ include::config/status.txt[]
 
 include::config/submodule.txt[]
 
+include::config/survey.txt[]
+
 include::config/tag.txt[]
 
 include::config/tar.txt[]
diff --git a/Documentation/config/survey.txt b/Documentation/config/survey.txt
new file mode 100644
index 00000000000000..c1b0f852a1250e
--- /dev/null
+++ b/Documentation/config/survey.txt
@@ -0,0 +1,11 @@
+survey.*::
+	These variables adjust the default behavior of the `git survey`
+	command. The intention is that this command could be run in the
+	background with these options.
++
+--
+	verbose::
+		This boolean value implies the `--[no-]verbose` option.
+	progress::
+		This boolean value implies the `--[no-]progress` option.
+--
diff --git a/Documentation/git-survey.txt b/Documentation/git-survey.txt
new file mode 100644
index 00000000000000..cdd1ec4358b8bb
--- /dev/null
+++ b/Documentation/git-survey.txt
@@ -0,0 +1,36 @@
+git-survey(1)
+=============
+
+NAME
+----
+git-survey - EXPERIMENTAL: Measure various repository dimensions of scale
+
+SYNOPSIS
+--------
+[verse]
+(EXPERIMENTAL!) `git survey` <options>
+
+DESCRIPTION
+-----------
+
+Survey the repository and measure various dimensions of scale.
+
+As repositories grow to "monorepo" size, certain data shapes can cause
+performance problems.  `git-survey` attempts to measure and report on
+known problem areas.
+
+OPTIONS
+-------
+
+--progress::
+	Show progress.  This is automatically enabled when interactive.
+
+OUTPUT
+------
+
+By default, `git survey` will print information about the repository in a
+human-readable format that includes overviews and tables.
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index 50413d964920e7..176d9aa14ebf6c 100644
--- a/Makefile
+++ b/Makefile
@@ -1307,6 +1307,7 @@ BUILTIN_OBJS += builtin/sparse-checkout.o
 BUILTIN_OBJS += builtin/stash.o
 BUILTIN_OBJS += builtin/stripspace.o
 BUILTIN_OBJS += builtin/submodule--helper.o
+BUILTIN_OBJS += builtin/survey.o
 BUILTIN_OBJS += builtin/symbolic-ref.o
 BUILTIN_OBJS += builtin/tag.o
 BUILTIN_OBJS += builtin/unpack-file.o
diff --git a/builtin.h b/builtin.h
index f7b166b33484d3..0739d969ad26b5 100644
--- a/builtin.h
+++ b/builtin.h
@@ -231,6 +231,7 @@ int cmd_status(int argc, const char **argv, const char *prefix, struct repositor
 int cmd_stash(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_stripspace(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_submodule__helper(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_survey(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_switch(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_symbolic_ref(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_tag(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/survey.c b/builtin/survey.c
new file mode 100644
index 00000000000000..105175bbafec6e
--- /dev/null
+++ b/builtin/survey.c
@@ -0,0 +1,74 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static const char * const survey_usage[] = {
+	N_("(EXPERIMENTAL!) git survey <options>"),
+	NULL,
+};
+
+struct survey_opts {
+	int verbose;
+	int show_progress;
+};
+
+struct survey_context {
+	struct repository *repo;
+
+	/* Options that control what is done. */
+	struct survey_opts opts;
+};
+
+static int survey_load_config_cb(const char *var, const char *value,
+				 const struct config_context *cctx, void *pvoid)
+{
+	struct survey_context *ctx = pvoid;
+
+	if (!strcmp(var, "survey.verbose")) {
+		ctx->opts.verbose = git_config_bool(var, value);
+		return 0;
+	}
+	if (!strcmp(var, "survey.progress")) {
+		ctx->opts.show_progress = git_config_bool(var, value);
+		return 0;
+	}
+
+	return git_default_config(var, value, cctx, pvoid);
+}
+
+static void survey_load_config(struct survey_context *ctx)
+{
+	repo_config(ctx->repo, survey_load_config_cb, ctx);
+}
+
+int cmd_survey(int argc, const char **argv, const char *prefix,
+	       struct repository *repo)
+{
+	static struct survey_context ctx = {
+		.opts = {
+			.verbose = 0,
+			.show_progress = -1, /* defaults to isatty(2) */
+		},
+	};
+
+	static struct option survey_options[] = {
+		OPT__VERBOSE(&ctx.opts.verbose, N_("verbose output")),
+		OPT_BOOL(0, "progress", &ctx.opts.show_progress, N_("show progress")),
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(survey_usage, survey_options);
+
+	ctx.repo = repo;
+
+	prepare_repo_settings(ctx.repo);
+	survey_load_config(&ctx);
+
+	argc = parse_options(argc, argv, prefix, survey_options, survey_usage, 0);
+
+	if (ctx.opts.show_progress < 0)
+		ctx.opts.show_progress = isatty(2);
+
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index e0bb87b3b5c278..d389561a5f1161 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -186,6 +186,7 @@ git-stash                               mainporcelain
 git-status                              mainporcelain           info
 git-stripspace                          purehelpers
 git-submodule                           mainporcelain
+git-survey                              mainporcelain
 git-svn                                 foreignscminterface
 git-switch                              mainporcelain           history
 git-symbolic-ref                        plumbingmanipulators
diff --git a/git.c b/git.c
index 2fbea24ec921e0..18c35342cdb2cd 100644
--- a/git.c
+++ b/git.c
@@ -629,6 +629,7 @@ static struct cmd_struct commands[] = {
 	{ "status", cmd_status, RUN_SETUP | NEED_WORK_TREE },
 	{ "stripspace", cmd_stripspace },
 	{ "submodule--helper", cmd_submodule__helper, RUN_SETUP },
+	{ "survey", cmd_survey, RUN_SETUP },
 	{ "switch", cmd_switch, RUN_SETUP | NEED_WORK_TREE },
 	{ "symbolic-ref", cmd_symbolic_ref, RUN_SETUP },
 	{ "tag", cmd_tag, RUN_SETUP | DELAY_PAGER_CONFIG },
diff --git a/t/t8100-git-survey.sh b/t/t8100-git-survey.sh
new file mode 100755
index 00000000000000..2df7fa83629301
--- /dev/null
+++ b/t/t8100-git-survey.sh
@@ -0,0 +1,18 @@
+#!/bin/sh
+
+test_description='git survey'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+TEST_PASSES_SANITIZE_LEAK=0
+export TEST_PASSES_SANITIZE_LEAK
+
+. ./test-lib.sh
+
+test_expect_success 'git survey -h shows experimental warning' '
+	test_expect_code 129 git survey -h 2>usage &&
+	grep "EXPERIMENTAL!" usage
+'
+
+test_done

From 842d1263acfd6da6738390857e2d8e945ed071eb Mon Sep 17 00:00:00 2001
From: Jeff Hostetler <jeffhostetler@github.com>
Date: Mon, 29 Apr 2024 09:51:34 -0400
Subject: [PATCH 08/16] survey: add command line opts to select references

By default we will scan all references in "refs/heads/", "refs/tags/"
and "refs/remotes/".

Add command line options that let the user ask for all refs or a subset
of them, and to include a detached HEAD.
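
The selection rule can be sketched outside of Git as a tri-state
resolution, where -1 means "not given on the command line". This is only
an illustration under assumptions: the struct and function names below
are hypothetical, and the fallback follows the default described above
(branches, tags, and remotes) rather than any particular implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the ref-selection flags; -1 means "not given". */
struct sketch_refs_wanted {
	int all_refs;
	int branches;
	int tags;
	int remotes;
	int detached;
	int other;
};

#define SKETCH_REFS_UNSET { -1, -1, -1, -1, -1, -1 }

/* Resolve every flag to 0 or 1 after option parsing. */
static void sketch_resolve_refs(struct sketch_refs_wanted *rw)
{
	int *kinds[] = { &rw->branches, &rw->tags, &rw->remotes,
			 &rw->detached, &rw->other };
	size_t nr = sizeof(kinds) / sizeof(kinds[0]);
	int any_given = 0;

	assert(rw);

	/* An explicit "all refs" request overrides everything else. */
	if (rw->all_refs == 1) {
		for (size_t i = 0; i < nr; i++)
			*kinds[i] = 1;
		return;
	}
	rw->all_refs = 0;

	for (size_t i = 0; i < nr; i++)
		if (*kinds[i] != -1)
			any_given = 1;

	/* Nothing requested: assumed default of branches, tags, remotes. */
	if (!any_given) {
		rw->branches = rw->tags = rw->remotes = 1;
		rw->detached = rw->other = 0;
		return;
	}

	/* Something requested: every unspecified kind becomes false. */
	for (size_t i = 0; i < nr; i++)
		if (*kinds[i] == -1)
			*kinds[i] = 0;
}
```

Starting from SKETCH_REFS_UNSET and setting only 'tags' to 1 leaves every
other kind at 0 after the call, which is the "default set is cleared"
behavior.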

Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-survey.txt |  34 +++++
 builtin/survey.c             | 247 +++++++++++++++++++++++++++++++++++
 t/t8100-git-survey.sh        |   9 ++
 3 files changed, 290 insertions(+)

diff --git a/Documentation/git-survey.txt b/Documentation/git-survey.txt
index cdd1ec4358b8bb..c648ef704e3806 100644
--- a/Documentation/git-survey.txt
+++ b/Documentation/git-survey.txt
@@ -19,12 +19,46 @@ As repositories grow to "monorepo" size, certain data shapes can cause
 performance problems.  `git-survey` attempts to measure and report on
 known problem areas.
 
+Ref Selection and Reachable Objects
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this first analysis phase, `git survey` will iterate over the set of
+requested branches, tags, and other refs and treewalk over all of the
+reachable commits, trees, and blobs and generate various statistics.
+
 OPTIONS
 -------
 
 --progress::
 	Show progress.  This is automatically enabled when interactive.
 
+Ref Selection
+~~~~~~~~~~~~~
+
+The following options control the set of refs that `git survey` will examine.
+By default, `git survey` will look at tags, local branches, and remote refs.
+If any of the following options are given, the default set is cleared and
+only refs for the given options are added.
+
+--all-refs::
+	Use all refs.  This includes local branches, tags, remote refs,
+	notes, and stashes.  This option overrides all of the following.
+
+--branches::
+	Add local branches (`refs/heads/`) to the set.
+
+--tags::
+	Add tags (`refs/tags/`) to the set.
+
+--remotes::
+	Add remote branches (`refs/remotes/`) to the set.
+
+--detached::
+	Add HEAD to the set.
+
+--other::
+	Add notes (`refs/notes/`) and stashes (`refs/stash/`) to the set.
+
 OUTPUT
 ------
 
diff --git a/builtin/survey.c b/builtin/survey.c
index 105175bbafec6e..3a1ad354c71932 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -1,15 +1,54 @@
 #include "builtin.h"
 #include "config.h"
+#include "object.h"
+#include "object-store-ll.h"
 #include "parse-options.h"
+#include "progress.h"
+#include "ref-filter.h"
+#include "strvec.h"
+#include "trace2.h"
 
 static const char * const survey_usage[] = {
 	N_("(EXPERIMENTAL!) git survey <options>"),
 	NULL,
 };
 
+struct survey_refs_wanted {
+	int want_all_refs; /* special override */
+
+	int want_branches;
+	int want_tags;
+	int want_remotes;
+	int want_detached;
+	int want_other; /* see FILTER_REFS_OTHERS -- refs/notes/, refs/stash/ */
+};
+
+static struct survey_refs_wanted default_ref_options = {
+	.want_all_refs = 1,
+};
+
 struct survey_opts {
 	int verbose;
 	int show_progress;
+	struct survey_refs_wanted refs;
+};
+
+struct survey_report_ref_summary {
+	size_t refs_nr;
+	size_t branches_nr;
+	size_t remote_refs_nr;
+	size_t tags_nr;
+	size_t tags_annotated_nr;
+	size_t others_nr;
+	size_t unknown_nr;
+};
+
+/**
+ * This struct contains all of the information that needs to be printed
+ * at the end of the exploration of the repository and its references.
+ */
+struct survey_report {
+	struct survey_report_ref_summary refs;
 };
 
 struct survey_context {
@@ -17,8 +56,84 @@ struct survey_context {
 
 	/* Options that control what is done. */
 	struct survey_opts opts;
+
+	/* Info for output only. */
+	struct survey_report report;
+
+	/*
+	 * The rest of the members are about enabling the activity
+	 * of the 'git survey' command, including ref listings, object
+	 * pointers, and progress.
+	 */
+
+	struct progress *progress;
+	size_t progress_nr;
+	size_t progress_total;
+
+	struct strvec refs;
 };
 
+static void clear_survey_context(struct survey_context *ctx)
+{
+	strvec_clear(&ctx->refs);
+}
+
+/*
+ * After parsing the command line arguments, figure out which refs we
+ * should scan.
+ *
+ * If ANY were given in positive sense, then we ONLY include them and
+ * do not use the builtin values.
+ */
+static void fixup_refs_wanted(struct survey_context *ctx)
+{
+	struct survey_refs_wanted *rw = &ctx->opts.refs;
+
+	/*
+	 * `--all-refs` overrides and enables everything.
+	 */
+	if (rw->want_all_refs == 1) {
+		rw->want_branches = 1;
+		rw->want_tags = 1;
+		rw->want_remotes = 1;
+		rw->want_detached = 1;
+		rw->want_other = 1;
+		return;
+	}
+
+	/*
+	 * If none of the `--<ref-type>` options were given, fall back
+	 * to the builtin defaults.
+	 */
+	if (rw->want_branches == -1 &&
+	    rw->want_tags == -1 &&
+	    rw->want_remotes == -1 &&
+	    rw->want_detached == -1 &&
+	    rw->want_other == -1) {
+		*rw = default_ref_options;
+		return;
+	}
+
+	/*
+	 * Since we only allow positive boolean values on the command
+	 * line, we will only have true values where they specified
+	 * a `--<ref-type>`. Anything that still has an unspecified
+	 * value (including `--all-refs`) should be set to false.
+	 */
+	if (rw->want_all_refs == -1)
+		rw->want_all_refs = 0;
+	if (rw->want_branches == -1)
+		rw->want_branches = 0;
+	if (rw->want_tags == -1)
+		rw->want_tags = 0;
+	if (rw->want_remotes == -1)
+		rw->want_remotes = 0;
+	if (rw->want_detached == -1)
+		rw->want_detached = 0;
+	if (rw->want_other == -1)
+		rw->want_other = 0;
+}
+
 static int survey_load_config_cb(const char *var, const char *value,
 				 const struct config_context *cctx, void *pvoid)
 {
@@ -41,6 +156,115 @@ static void survey_load_config(struct survey_context *ctx)
 	repo_config(ctx->repo, survey_load_config_cb, ctx);
 }
 
+static void do_load_refs(struct survey_context *ctx,
+			 struct ref_array *ref_array)
+{
+	struct ref_filter filter = REF_FILTER_INIT;
+	struct ref_sorting *sorting;
+	struct string_list sorting_options = STRING_LIST_INIT_DUP;
+
+	string_list_append(&sorting_options, "objectname");
+	sorting = ref_sorting_options(&sorting_options);
+
+	if (ctx->opts.refs.want_detached)
+		strvec_push(&ctx->refs, "HEAD");
+
+	if (ctx->opts.refs.want_all_refs) {
+		strvec_push(&ctx->refs, "refs/");
+	} else {
+		if (ctx->opts.refs.want_branches)
+			strvec_push(&ctx->refs, "refs/heads/");
+		if (ctx->opts.refs.want_tags)
+			strvec_push(&ctx->refs, "refs/tags/");
+		if (ctx->opts.refs.want_remotes)
+			strvec_push(&ctx->refs, "refs/remotes/");
+		if (ctx->opts.refs.want_other) {
+			strvec_push(&ctx->refs, "refs/notes/");
+			strvec_push(&ctx->refs, "refs/stash/");
+		}
+	}
+
+	filter.name_patterns = ctx->refs.v;
+	filter.ignore_case = 0;
+	filter.match_as_path = 1;
+
+	if (ctx->opts.show_progress) {
+		ctx->progress_total = 0;
+		ctx->progress = start_progress(_("Scanning refs..."), 0);
+	}
+
+	filter_refs(ref_array, &filter, FILTER_REFS_KIND_MASK);
+
+	if (ctx->opts.show_progress) {
+		ctx->progress_total = ref_array->nr;
+		display_progress(ctx->progress, ctx->progress_total);
+	}
+
+	ref_array_sort(sorting, ref_array);
+
+	stop_progress(&ctx->progress);
+	ref_filter_clear(&filter);
+	ref_sorting_release(sorting);
+}
+
+/*
+ * The REFS phase:
+ *
+ * Load the set of requested refs and assess them for scalability problems.
+ * Use that set to start a treewalk to all reachable objects and assess
+ * them.
+ *
+ * This data will give us insights into the repository itself (the number
+ * of refs, the size and shape of the DAG, the number and size of the
+ * objects).
+ *
+ * Theoretically, this data is independent of the on-disk representation
+ * (e.g. independent of packing concerns).
+ */
+static void survey_phase_refs(struct survey_context *ctx)
+{
+	struct ref_array ref_array = { 0 };
+
+	trace2_region_enter("survey", "phase/refs", ctx->repo);
+	do_load_refs(ctx, &ref_array);
+
+	ctx->report.refs.refs_nr = ref_array.nr;
+	for (size_t i = 0; i < ref_array.nr; i++) {
+		unsigned long size;
+		struct ref_array_item *item = ref_array.items[i];
+
+		switch (item->kind) {
+		case FILTER_REFS_TAGS:
+			ctx->report.refs.tags_nr++;
+			if (oid_object_info(ctx->repo,
+					    &item->objectname,
+					    &size) == OBJ_TAG)
+				ctx->report.refs.tags_annotated_nr++;
+			break;
+
+		case FILTER_REFS_BRANCHES:
+			ctx->report.refs.branches_nr++;
+			break;
+
+		case FILTER_REFS_REMOTES:
+			ctx->report.refs.remote_refs_nr++;
+			break;
+
+		case FILTER_REFS_OTHERS:
+			ctx->report.refs.others_nr++;
+			break;
+
+		default:
+			ctx->report.refs.unknown_nr++;
+			break;
+		}
+	}
+
+	trace2_region_leave("survey", "phase/refs", ctx->repo);
+
+	ref_array_clear(&ref_array);
+}
+
 int cmd_survey(int argc, const char **argv, const char *prefix,
 	       struct repository *repo)
 {
@@ -48,12 +272,30 @@ int cmd_survey(int argc, const char **argv, const char *prefix,
 		.opts = {
 			.verbose = 0,
 			.show_progress = -1, /* defaults to isatty(2) */
+
+			.refs.want_all_refs = -1,
+
+			.refs.want_branches = -1, /* default these to undefined */
+			.refs.want_tags = -1,
+			.refs.want_remotes = -1,
+			.refs.want_detached = -1,
+			.refs.want_other = -1,
 		},
+		.refs = STRVEC_INIT,
 	};
 
 	static struct option survey_options[] = {
 		OPT__VERBOSE(&ctx.opts.verbose, N_("verbose output")),
 		OPT_BOOL(0, "progress", &ctx.opts.show_progress, N_("show progress")),
+
+		OPT_BOOL_F(0, "all-refs", &ctx.opts.refs.want_all_refs, N_("include all refs"),          PARSE_OPT_NONEG),
+
+		OPT_BOOL_F(0, "branches", &ctx.opts.refs.want_branches, N_("include branches"),          PARSE_OPT_NONEG),
+		OPT_BOOL_F(0, "tags",     &ctx.opts.refs.want_tags,     N_("include tags"),              PARSE_OPT_NONEG),
+		OPT_BOOL_F(0, "remotes",  &ctx.opts.refs.want_remotes,  N_("include all remote refs"),   PARSE_OPT_NONEG),
+		OPT_BOOL_F(0, "detached", &ctx.opts.refs.want_detached, N_("include detached HEAD"),     PARSE_OPT_NONEG),
+		OPT_BOOL_F(0, "other",    &ctx.opts.refs.want_other,    N_("include notes and stashes"), PARSE_OPT_NONEG),
+
 		OPT_END(),
 	};
 
@@ -70,5 +312,10 @@ int cmd_survey(int argc, const char **argv, const char *prefix,
 	if (ctx.opts.show_progress < 0)
 		ctx.opts.show_progress = isatty(2);
 
+	fixup_refs_wanted(&ctx);
+
+	survey_phase_refs(&ctx);
+
+	clear_survey_context(&ctx);
 	return 0;
 }
diff --git a/t/t8100-git-survey.sh b/t/t8100-git-survey.sh
index 2df7fa83629301..6656cf20bf7a17 100755
--- a/t/t8100-git-survey.sh
+++ b/t/t8100-git-survey.sh
@@ -15,4 +15,13 @@ test_expect_success 'git survey -h shows experimental warning' '
 	grep "EXPERIMENTAL!" usage
 '
 
+test_expect_success 'create a semi-interesting repo' '
+	test_commit_bulk 10
+'
+
+test_expect_success 'git survey (default)' '
+	git survey >out 2>err &&
+	test_line_count = 0 err
+'
+
 test_done

From d2b33bed5300b2832b79d83694d818054119dcf9 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Sun, 1 Sep 2024 15:58:32 -0400
Subject: [PATCH 09/16] survey: start pretty printing data in table form

When 'git survey' provides information to the user, this will be presented
in one of two formats: plaintext and JSON. The JSON implementation will be
delayed until the functionality is complete for the plaintext format.

The most important parts of the plaintext format are headers specifying the
different sections of the report and tables providing concrete data.

Create a custom table data structure that allows specifying a list of
strings for the row values. When printing the table, check each column for
the maximum width so we can create a table of the correct size from the
start.

The table structure is designed to be flexible enough for the different
kinds of output that will be implemented in future changes.
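
As a rough standalone illustration of the width calculation (not the
code in this patch), the sketch below finds the widest cell per column
and then right-aligns each cell to that width. It assumes a hypothetical
flat, row-major array of cells whose first row is the header; the
function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Record the widest cell of each column, header row included. */
static void sketch_column_widths(const char *const *cells, size_t rows,
				 size_t cols, size_t *widths)
{
	for (size_t c = 0; c < cols; c++) {
		widths[c] = 0;
		for (size_t r = 0; r < rows; r++) {
			size_t len = strlen(cells[r * cols + c]);
			if (len > widths[c])
				widths[c] = len;
		}
	}
}

/*
 * Format one row into 'out', right-aligning each cell and joining
 * columns with " | ". Returns 0 on success, -1 if 'out' is too small.
 */
static int sketch_format_row(char *out, size_t out_size,
			     const char *const *row, size_t cols,
			     const size_t *widths)
{
	size_t used = 0;

	assert(out_size > 0);
	out[0] = '\0';

	for (size_t c = 0; c < cols; c++) {
		int n = snprintf(out + used, out_size - used, "%s%*s",
				 c ? " | " : "", (int)widths[c], row[c]);
		if (n < 0 || (size_t)n >= out_size - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

Computing all widths before printing anything is what lets the title
underline and the divider row match the table without a second pass.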

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-survey.txt |   7 ++
 builtin/survey.c             | 157 +++++++++++++++++++++++++++++++++++
 t/t8100-git-survey.sh        |  18 +++-
 3 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-survey.txt b/Documentation/git-survey.txt
index c648ef704e3806..25d10781831c99 100644
--- a/Documentation/git-survey.txt
+++ b/Documentation/git-survey.txt
@@ -65,6 +65,13 @@ OUTPUT
 By default, `git survey` will print information about the repository in a
 human-readable format that includes overviews and tables.
 
+References Summary
+~~~~~~~~~~~~~~~~~~
+
+The references summary includes a count of each kind of reference,
+including branches, remote refs, and tags (split by "all" and
+"annotated").
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/survey.c b/builtin/survey.c
index 3a1ad354c71932..9c490fc26da0fa 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -5,6 +5,7 @@
 #include "parse-options.h"
 #include "progress.h"
 #include "ref-filter.h"
+#include "strbuf.h"
 #include "strvec.h"
 #include "trace2.h"
 
@@ -78,6 +79,160 @@ static void clear_survey_context(struct survey_context *ctx)
 	strvec_clear(&ctx->refs);
 }
 
+struct survey_table {
+	const char *table_name;
+	struct strvec header;
+	struct strvec *rows;
+	size_t rows_nr;
+	size_t rows_alloc;
+};
+
+#define SURVEY_TABLE_INIT {	\
+	.header = STRVEC_INIT,	\
+}
+
+static void clear_table(struct survey_table *table)
+{
+	strvec_clear(&table->header);
+	for (size_t i = 0; i < table->rows_nr; i++)
+		strvec_clear(&table->rows[i]);
+	free(table->rows);
+}
+
+static void insert_table_rowv(struct survey_table *table, ...)
+{
+	va_list ap;
+	char *arg;
+	ALLOC_GROW(table->rows, table->rows_nr + 1, table->rows_alloc);
+
+	memset(&table->rows[table->rows_nr], 0, sizeof(struct strvec));
+
+	va_start(ap, table);
+	while ((arg = va_arg(ap, char *)))
+		strvec_push(&table->rows[table->rows_nr], arg);
+	va_end(ap);
+
+	table->rows_nr++;
+}
+
+#define SECTION_SEGMENT "========================================"
+#define SECTION_SEGMENT_LEN 40
+static const char *section_line = SECTION_SEGMENT
+				  SECTION_SEGMENT
+				  SECTION_SEGMENT
+				  SECTION_SEGMENT;
+static const size_t section_len = 4 * SECTION_SEGMENT_LEN;
+
+static void print_table_title(const char *name, size_t *widths, size_t nr)
+{
+	size_t width = 3 * (nr - 1);
+
+	for (size_t i = 0; i < nr; i++)
+		width += widths[i];
+
+	if (width > section_len)
+		width = section_len;
+
+	printf("\n%s\n%.*s\n", name, (int)width, section_line);
+}
+
+static void print_row_plaintext(struct strvec *row, size_t *widths)
+{
+	static struct strbuf line = STRBUF_INIT;
+	strbuf_setlen(&line, 0);
+
+	for (size_t i = 0; i < row->nr; i++) {
+		const char *str = row->v[i];
+		size_t len = strlen(str);
+		if (i)
+			strbuf_add(&line, " | ", 3);
+		strbuf_addchars(&line, ' ', widths[i] - len);
+		strbuf_add(&line, str, len);
+	}
+	printf("%s\n", line.buf);
+}
+
+static void print_divider_plaintext(size_t *widths, size_t nr)
+{
+	static struct strbuf line = STRBUF_INIT;
+	strbuf_setlen(&line, 0);
+
+	for (size_t i = 0; i < nr; i++) {
+		if (i)
+			strbuf_add(&line, "-+-", 3);
+		strbuf_addchars(&line, '-', widths[i]);
+	}
+	printf("%s\n", line.buf);
+}
+
+static void print_table_plaintext(struct survey_table *table)
+{
+	size_t *column_widths;
+	size_t columns_nr = table->header.nr;
+	CALLOC_ARRAY(column_widths, columns_nr);
+
+	for (size_t i = 0; i < columns_nr; i++) {
+		column_widths[i] = strlen(table->header.v[i]);
+
+		for (size_t j = 0; j < table->rows_nr; j++) {
+			size_t rowlen = strlen(table->rows[j].v[i]);
+			if (column_widths[i] < rowlen)
+				column_widths[i] = rowlen;
+		}
+	}
+
+	print_table_title(table->table_name, column_widths, columns_nr);
+	print_row_plaintext(&table->header, column_widths);
+	print_divider_plaintext(column_widths, columns_nr);
+
+	for (size_t j = 0; j < table->rows_nr; j++)
+		print_row_plaintext(&table->rows[j], column_widths);
+
+	free(column_widths);
+}
+
+static void survey_report_plaintext_refs(struct survey_context *ctx)
+{
+	struct survey_report_ref_summary *refs = &ctx->report.refs;
+	struct survey_table table = SURVEY_TABLE_INIT;
+
+	table.table_name = _("REFERENCES SUMMARY");
+
+	strvec_push(&table.header, _("Ref Type"));
+	strvec_push(&table.header, _("Count"));
+
+	if (ctx->opts.refs.want_all_refs || ctx->opts.refs.want_branches) {
+		char *fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->branches_nr);
+		insert_table_rowv(&table, _("Branches"), fmt, NULL);
+		free(fmt);
+	}
+
+	if (ctx->opts.refs.want_all_refs || ctx->opts.refs.want_remotes) {
+		char *fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->remote_refs_nr);
+		insert_table_rowv(&table, _("Remote refs"), fmt, NULL);
+		free(fmt);
+	}
+
+	if (ctx->opts.refs.want_all_refs || ctx->opts.refs.want_tags) {
+		char *fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->tags_nr);
+		insert_table_rowv(&table, _("Tags (all)"), fmt, NULL);
+		free(fmt);
+		fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)refs->tags_annotated_nr);
+		insert_table_rowv(&table, _("Tags (annotated)"), fmt, NULL);
+		free(fmt);
+	}
+
+	print_table_plaintext(&table);
+	clear_table(&table);
+}
+
+static void survey_report_plaintext(struct survey_context *ctx)
+{
+	printf("GIT SURVEY for \"%s\"\n", ctx->repo->worktree);
+	printf("-----------------------------------------------------\n");
+	survey_report_plaintext_refs(ctx);
+}
+
 /*
  * After parsing the command line arguments, figure out which refs we
  * should scan.
@@ -316,6 +471,8 @@ int cmd_survey(int argc, const char **argv, const char *prefix,
 
 	survey_phase_refs(&ctx);
 
+	survey_report_plaintext(&ctx);
+
 	clear_survey_context(&ctx);
 	return 0;
 }
diff --git a/t/t8100-git-survey.sh b/t/t8100-git-survey.sh
index 6656cf20bf7a17..b76064b2a867ac 100755
--- a/t/t8100-git-survey.sh
+++ b/t/t8100-git-survey.sh
@@ -21,7 +21,23 @@ test_expect_success 'create a semi-interesting repo' '
 
 test_expect_success 'git survey (default)' '
 	git survey >out 2>err &&
-	test_line_count = 0 err
+	test_line_count = 0 err &&
+
+	tr , " " >expect <<-EOF &&
+	GIT SURVEY for "$(pwd)"
+	-----------------------------------------------------
+
+	REFERENCES SUMMARY
+	========================
+	,       Ref Type | Count
+	-----------------+------
+	,       Branches |     1
+	     Remote refs |     0
+	      Tags (all) |     0
+	Tags (annotated) |     0
+	EOF
+
+	test_cmp expect out
 '
 
 test_done

From 9b211cfd66413465fe5ecd1686cb28c07453bbe7 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Sun, 1 Sep 2024 20:33:47 -0400
Subject: [PATCH 10/16] survey: add object count summary

The reason for using the path-walk API is not obvious at this point, but it
will become more apparent in later changes. For now, use the path-walk API
to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |   1343
    Commits | 179344
      Trees | 314350
      Blobs | 184030
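
As a rough illustration of the counting step, the callback only needs the
type of a batch and the number of objects in it. The sketch below is
standalone C and does not use Git's headers: 'demo_type', 'demo_counts' and
demo_count_batch() are hypothetical stand-ins for 'enum object_type',
'struct survey_report_object_summary' and the path-walk callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Git's 'enum object_type'. */
enum demo_type { DEMO_COMMIT, DEMO_TREE, DEMO_BLOB, DEMO_TAG, DEMO_TYPE_NR };

/* Hypothetical stand-in for the per-type counters in the report. */
struct demo_counts {
	size_t nr[DEMO_TYPE_NR];
};

/*
 * Called once per batch: every object in a batch has the same type,
 * so the whole batch is added to a single counter.
 */
static void demo_count_batch(struct demo_counts *counts,
			     enum demo_type type, size_t batch_nr)
{
	assert((int)type >= 0 && (int)type < DEMO_TYPE_NR);
	counts->nr[type] += batch_nr;
}
```

Since the path-walk API reports an object only the first time it is
discovered, summing batch sizes this way should not double-count an object
that appears at several paths.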

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-survey.txt              |   6 +
 Documentation/technical/api-path-walk.txt |   3 +-
 builtin/survey.c                          | 137 ++++++++++++++++++++--
 t/t8100-git-survey.sh                     |  23 +++-
 4 files changed, 157 insertions(+), 12 deletions(-)

diff --git a/Documentation/git-survey.txt b/Documentation/git-survey.txt
index 25d10781831c99..894c7be3053eb9 100644
--- a/Documentation/git-survey.txt
+++ b/Documentation/git-survey.txt
@@ -72,6 +72,12 @@ The references summary includes a count of each kind of reference,
 including branches, remote refs, and tags (split by "all" and
 "annotated").
 
+Reachable Object Summary
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The reachable object summary shows the total number of each kind of Git
+object, including tags, commits, trees, and blobs.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 7075d0d5ab50fd..21011048d7c7f6 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -60,4 +60,5 @@ Examples
 --------
 
 See example usages in:
-	`t/helper/test-path-walk.c`
+	`t/helper/test-path-walk.c`,
+	`builtin/survey.c`
diff --git a/builtin/survey.c b/builtin/survey.c
index 9c490fc26da0fa..541fdac01abdc5 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -1,12 +1,25 @@
+/*
+ * This macro is necessary to access 'default_abbrev'
+ * within REV_INFO_INIT.
+ */
+#define USE_THE_REPOSITORY_VARIABLE
+
 #include "builtin.h"
 #include "config.h"
+#include "environment.h"
+#include "hex.h"
 #include "object.h"
+#include "object-name.h"
 #include "object-store-ll.h"
 #include "parse-options.h"
+#include "path-walk.h"
 #include "progress.h"
 #include "ref-filter.h"
+#include "refs.h"
+#include "revision.h"
 #include "strbuf.h"
 #include "strvec.h"
+#include "tag.h"
 #include "trace2.h"
 
 static const char * const survey_usage[] = {
@@ -44,12 +57,20 @@ struct survey_report_ref_summary {
 	size_t unknown_nr;
 };
 
+struct survey_report_object_summary {
+	size_t commits_nr;
+	size_t tags_nr;
+	size_t trees_nr;
+	size_t blobs_nr;
+};
+
 /**
  * This struct contains all of the information that needs to be printed
  * at the end of the exploration of the repository and its references.
  */
 struct survey_report {
 	struct survey_report_ref_summary refs;
+	struct survey_report_object_summary reachable_objects;
 };
 
 struct survey_context {
@@ -72,10 +93,12 @@ struct survey_context {
 	size_t progress_total;
 
 	struct strvec refs;
+	struct ref_array ref_array;
 };
 
 static void clear_survey_context(struct survey_context *ctx)
 {
+	ref_array_clear(&ctx->ref_array);
 	strvec_clear(&ctx->refs);
 }
 
@@ -126,10 +149,14 @@ static const size_t section_len = 4 * SECTION_SEGMENT_LEN;
 static void print_table_title(const char *name, size_t *widths, size_t nr)
 {
 	size_t width = 3 * (nr - 1);
+	size_t min_width = strlen(name);
 
 	for (size_t i = 0; i < nr; i++)
 		width += widths[i];
 
+	if (width < min_width)
+		width = min_width;
+
 	if (width > section_len)
 		width = section_len;
 
@@ -226,11 +253,43 @@ static void survey_report_plaintext_refs(struct survey_context *ctx)
 	clear_table(&table);
 }
 
+static void survey_report_plaintext_reachable_object_summary(struct survey_context *ctx)
+{
+	struct survey_report_object_summary *objs = &ctx->report.reachable_objects;
+	struct survey_table table = SURVEY_TABLE_INIT;
+	char *fmt;
+
+	table.table_name = _("REACHABLE OBJECT SUMMARY");
+
+	strvec_push(&table.header, _("Object Type"));
+	strvec_push(&table.header, _("Count"));
+
+	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->tags_nr);
+	insert_table_rowv(&table, _("Tags"), fmt, NULL);
+	free(fmt);
+
+	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->commits_nr);
+	insert_table_rowv(&table, _("Commits"), fmt, NULL);
+	free(fmt);
+
+	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->trees_nr);
+	insert_table_rowv(&table, _("Trees"), fmt, NULL);
+	free(fmt);
+
+	fmt = xstrfmt("%"PRIuMAX"", (uintmax_t)objs->blobs_nr);
+	insert_table_rowv(&table, _("Blobs"), fmt, NULL);
+	free(fmt);
+
+	print_table_plaintext(&table);
+	clear_table(&table);
+}
+
 static void survey_report_plaintext(struct survey_context *ctx)
 {
 	printf("GIT SURVEY for \"%s\"\n", ctx->repo->worktree);
 	printf("-----------------------------------------------------\n");
 	survey_report_plaintext_refs(ctx);
+	survey_report_plaintext_reachable_object_summary(ctx);
 }
 
 /*
@@ -378,15 +437,13 @@ static void do_load_refs(struct survey_context *ctx,
  */
 static void survey_phase_refs(struct survey_context *ctx)
 {
-	struct ref_array ref_array = { 0 };
-
 	trace2_region_enter("survey", "phase/refs", ctx->repo);
-	do_load_refs(ctx, &ref_array);
+	do_load_refs(ctx, &ctx->ref_array);
 
-	ctx->report.refs.refs_nr = ref_array.nr;
-	for (size_t i = 0; i < ref_array.nr; i++) {
+	ctx->report.refs.refs_nr = ctx->ref_array.nr;
+	for (size_t i = 0; i < ctx->ref_array.nr; i++) {
 		unsigned long size;
-		struct ref_array_item *item = ref_array.items[i];
+		struct ref_array_item *item = ctx->ref_array.items[i];
 
 		switch (item->kind) {
 		case FILTER_REFS_TAGS:
@@ -416,8 +473,72 @@ static void survey_phase_refs(struct survey_context *ctx)
 	}
 
 	trace2_region_leave("survey", "phase/refs", ctx->repo);
+}
 
-	ref_array_clear(&ref_array);
+static void increment_object_counts(
+		struct survey_report_object_summary *summary,
+		enum object_type type,
+		size_t nr)
+{
+	switch (type) {
+	case OBJ_COMMIT:
+		summary->commits_nr += nr;
+		break;
+
+	case OBJ_TREE:
+		summary->trees_nr += nr;
+		break;
+
+	case OBJ_BLOB:
+		summary->blobs_nr += nr;
+		break;
+
+	case OBJ_TAG:
+		summary->tags_nr += nr;
+		break;
+
+	default:
+		break;
+	}
+}
+
+static int survey_objects_path_walk_fn(const char *path UNUSED,
+				       struct oid_array *oids,
+				       enum object_type type,
+				       void *data)
+{
+	struct survey_context *ctx = data;
+
+	increment_object_counts(&ctx->report.reachable_objects,
+				type, oids->nr);
+
+	return 0;
+}
+
+static void survey_phase_objects(struct survey_context *ctx)
+{
+	struct rev_info revs = REV_INFO_INIT;
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	unsigned int add_flags = 0;
+
+	trace2_region_enter("survey", "phase/objects", ctx->repo);
+
+	info.revs = &revs;
+	info.path_fn = survey_objects_path_walk_fn;
+	info.path_fn_data = ctx;
+
+	repo_init_revisions(ctx->repo, &revs, "");
+	revs.tag_objects = 1;
+
+	for (size_t i = 0; i < ctx->ref_array.nr; i++) {
+		struct ref_array_item *item = ctx->ref_array.items[i];
+		add_pending_oid(&revs, NULL, &item->objectname, add_flags);
+	}
+
+	walk_objects_by_path(&info);
+
+	release_revisions(&revs);
+	trace2_region_leave("survey", "phase/objects", ctx->repo);
 }
 
 int cmd_survey(int argc, const char **argv, const char *prefix,
@@ -471,6 +592,8 @@ int cmd_survey(int argc, const char **argv, const char *prefix,
 
 	survey_phase_refs(&ctx);
 
+	survey_phase_objects(&ctx);
+
 	survey_report_plaintext(&ctx);
 
 	clear_survey_context(&ctx);
diff --git a/t/t8100-git-survey.sh b/t/t8100-git-survey.sh
index b76064b2a867ac..7a37da1bb2dadc 100755
--- a/t/t8100-git-survey.sh
+++ b/t/t8100-git-survey.sh
@@ -16,11 +16,17 @@ test_expect_success 'git survey -h shows experimental warning' '
 '
 
 test_expect_success 'create a semi-interesting repo' '
-	test_commit_bulk 10
+	test_commit_bulk 10 &&
+	git tag -a -m one one HEAD~5 &&
+	git tag -a -m two two HEAD~3 &&
+	git tag -a -m three three two &&
+	git tag -a -m four four three &&
+	git update-ref -d refs/tags/three &&
+	git update-ref -d refs/tags/two
 '
 
 test_expect_success 'git survey (default)' '
-	git survey >out 2>err &&
+	git survey --all-refs >out 2>err &&
 	test_line_count = 0 err &&
 
 	tr , " " >expect <<-EOF &&
@@ -33,8 +39,17 @@ test_expect_success 'git survey (default)' '
 	-----------------+------
 	,       Branches |     1
 	     Remote refs |     0
-	      Tags (all) |     0
-	Tags (annotated) |     0
+	      Tags (all) |     2
+	Tags (annotated) |     2
+
+	REACHABLE OBJECT SUMMARY
+	========================
+	Object Type | Count
+	------------+------
+	       Tags |     4
+	    Commits |    10
+	      Trees |    10
+	      Blobs |    10
 	EOF
 
 	test_cmp expect out

From 0c15c1ec7d113a1e00a66e191e2ff9f8ff178066 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Sun, 1 Sep 2024 20:58:35 -0400
Subject: [PATCH 11/16] survey: summarize total sizes by object type

Now that we have explored objects by count, expand the report to summarize
the on-disk and inflated sizes of those objects. This information helps
diagnose two kinds of problems: why disk space (and perhaps clone or fetch
time) is growing, and why certain operations are slow because the inflated
size of the objects they must process is so large.
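
The accumulation can be pictured as follows. This is a simplified sketch
and not the code in this patch: 'demo_object', 'demo_size_summary' and
demo_add_objects() are hypothetical names, and the 'present' flag stands in
for a successful object lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record describing one object after a lookup. */
struct demo_object {
	int present;		/* 0 if the object could not be found */
	size_t disk_size;	/* bytes used by the stored form */
	size_t inflated_size;	/* bytes after deltas and zlib are undone */
};

/* Hypothetical running totals for one category of objects. */
struct demo_size_summary {
	size_t nr;
	size_t disk_size;
	size_t inflated_size;
	size_t num_missing;
};

/* Add a batch of objects to the totals, counting missing ones apart. */
static void demo_add_objects(struct demo_size_summary *total,
			     const struct demo_object *objs, size_t objs_nr)
{
	for (size_t i = 0; i < objs_nr; i++) {
		if (!objs[i].present) {
			total->num_missing++;
			continue;
		}
		total->nr++;
		total->disk_size += objs[i].disk_size;
		total->inflated_size += objs[i].inflated_size;
	}
}
```

Missing objects are tracked separately so that they do not silently shrink
the reported sizes.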

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/survey.c      | 132 ++++++++++++++++++++++++++++++++++++++++++
 t/t8100-git-survey.sh |  29 ++++++++++
 2 files changed, 161 insertions(+)

diff --git a/builtin/survey.c b/builtin/survey.c
index 541fdac01abdc5..8686415bd27241 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -64,6 +64,19 @@ struct survey_report_object_summary {
 	size_t blobs_nr;
 };
 
+/**
+ * For some category given by 'label', count the number of objects
+ * that match that label along with the on-disk size and the size
+ * after decompressing (both with delta bases and zlib).
+ */
+struct survey_report_object_size_summary {
+	char *label;
+	size_t nr;
+	size_t disk_size;
+	size_t inflated_size;
+	size_t num_missing;
+};
+
 /**
  * This struct contains all of the information that needs to be printed
  * at the end of the exploration of the repository and its references.
@@ -71,8 +84,16 @@ struct survey_report_object_summary {
 struct survey_report {
 	struct survey_report_ref_summary refs;
 	struct survey_report_object_summary reachable_objects;
+
+	struct survey_report_object_size_summary *by_type;
 };
 
+#define REPORT_TYPE_COMMIT 0
+#define REPORT_TYPE_TREE 1
+#define REPORT_TYPE_BLOB 2
+#define REPORT_TYPE_TAG 3
+#define REPORT_TYPE_COUNT 4
+
 struct survey_context {
 	struct repository *repo;
 
@@ -284,12 +305,48 @@ static void survey_report_plaintext_reachable_object_summary(struct survey_conte
 	clear_table(&table);
 }
 
+static void survey_report_object_sizes(const char *title,
+				       const char *categories,
+				       struct survey_report_object_size_summary *summary,
+				       size_t summary_nr)
+{
+	struct survey_table table = SURVEY_TABLE_INIT;
+	table.table_name = title;
+
+	strvec_push(&table.header, categories);
+	strvec_push(&table.header, _("Count"));
+	strvec_push(&table.header, _("Disk Size"));
+	strvec_push(&table.header, _("Inflated Size"));
+
+	for (size_t i = 0; i < summary_nr; i++) {
+		char *label_str = xstrdup(summary[i].label);
+		char *nr_str = xstrfmt("%"PRIuMAX, (uintmax_t)summary[i].nr);
+		char *disk_str = xstrfmt("%"PRIuMAX, (uintmax_t)summary[i].disk_size);
+		char *inflate_str = xstrfmt("%"PRIuMAX, (uintmax_t)summary[i].inflated_size);
+
+		insert_table_rowv(&table, label_str, nr_str,
+				  disk_str, inflate_str, NULL);
+
+		free(label_str);
+		free(nr_str);
+		free(disk_str);
+		free(inflate_str);
+	}
+
+	print_table_plaintext(&table);
+	clear_table(&table);
+}
+
 static void survey_report_plaintext(struct survey_context *ctx)
 {
 	printf("GIT SURVEY for \"%s\"\n", ctx->repo->worktree);
 	printf("-----------------------------------------------------\n");
 	survey_report_plaintext_refs(ctx);
 	survey_report_plaintext_reachable_object_summary(ctx);
+	survey_report_object_sizes(_("TOTAL OBJECT SIZES BY TYPE"),
+				   _("Object Type"),
+				   ctx->report.by_type,
+				   REPORT_TYPE_COUNT);
 }
 
 /*
@@ -502,6 +559,68 @@ static void increment_object_counts(
 	}
 }
 
+static void increment_totals(struct survey_context *ctx,
+			     struct oid_array *oids,
+			     struct survey_report_object_size_summary *summary)
+{
+	for (size_t i = 0; i < oids->nr; i++) {
+		struct object_info oi = OBJECT_INFO_INIT;
+		unsigned oi_flags = OBJECT_INFO_FOR_PREFETCH;
+		unsigned long object_length = 0;
+		off_t disk_sizep = 0;
+		enum object_type type;
+
+		oi.typep = &type;
+		oi.sizep = &object_length;
+		oi.disk_sizep = &disk_sizep;
+
+		if (oid_object_info_extended(ctx->repo, &oids->oid[i],
+					     &oi, oi_flags) < 0) {
+			summary->num_missing++;
+		} else {
+			summary->nr++;
+			summary->disk_size += disk_sizep;
+			summary->inflated_size += object_length;
+		}
+	}
+}
+
+static void increment_object_totals(struct survey_context *ctx,
+				    struct oid_array *oids,
+				    enum object_type type)
+{
+	struct survey_report_object_size_summary *total;
+	struct survey_report_object_size_summary summary = { 0 };
+
+	increment_totals(ctx, oids, &summary);
+
+	switch (type) {
+	case OBJ_COMMIT:
+		total = &ctx->report.by_type[REPORT_TYPE_COMMIT];
+		break;
+
+	case OBJ_TREE:
+		total = &ctx->report.by_type[REPORT_TYPE_TREE];
+		break;
+
+	case OBJ_BLOB:
+		total = &ctx->report.by_type[REPORT_TYPE_BLOB];
+		break;
+
+	case OBJ_TAG:
+		total = &ctx->report.by_type[REPORT_TYPE_TAG];
+		break;
+
+	default:
+		BUG("No other type allowed");
+	}
+
+	total->nr += summary.nr;
+	total->disk_size += summary.disk_size;
+	total->inflated_size += summary.inflated_size;
+	total->num_missing += summary.num_missing;
+}
+
 static int survey_objects_path_walk_fn(const char *path UNUSED,
 				       struct oid_array *oids,
 				       enum object_type type,
@@ -511,10 +630,20 @@ static int survey_objects_path_walk_fn(const char *path UNUSED,
 
 	increment_object_counts(&ctx->report.reachable_objects,
 				type, oids->nr);
+	increment_object_totals(ctx, oids, type);
 
 	return 0;
 }
 
+static void initialize_report(struct survey_context *ctx)
+{
+	CALLOC_ARRAY(ctx->report.by_type, REPORT_TYPE_COUNT);
+	ctx->report.by_type[REPORT_TYPE_COMMIT].label = xstrdup(_("Commits"));
+	ctx->report.by_type[REPORT_TYPE_TREE].label = xstrdup(_("Trees"));
+	ctx->report.by_type[REPORT_TYPE_BLOB].label = xstrdup(_("Blobs"));
+	ctx->report.by_type[REPORT_TYPE_TAG].label = xstrdup(_("Tags"));
+}
+
 static void survey_phase_objects(struct survey_context *ctx)
 {
 	struct rev_info revs = REV_INFO_INIT;
@@ -527,12 +656,15 @@ static void survey_phase_objects(struct survey_context *ctx)
 	info.path_fn = survey_objects_path_walk_fn;
 	info.path_fn_data = ctx;
 
+	initialize_report(ctx);
+
 	repo_init_revisions(ctx->repo, &revs, "");
 	revs.tag_objects = 1;
 
 	for (size_t i = 0; i < ctx->ref_array.nr; i++) {
 		struct ref_array_item *item = ctx->ref_array.items[i];
 		add_pending_oid(&revs, NULL, &item->objectname, add_flags);
+		display_progress(ctx->progress, ++(ctx->progress_nr));
 	}
 
 	walk_objects_by_path(&info);
diff --git a/t/t8100-git-survey.sh b/t/t8100-git-survey.sh
index 7a37da1bb2dadc..e738d6421a3224 100755
--- a/t/t8100-git-survey.sh
+++ b/t/t8100-git-survey.sh
@@ -29,6 +29,26 @@ test_expect_success 'git survey (default)' '
 	git survey --all-refs >out 2>err &&
 	test_line_count = 0 err &&
 
+	test_oid_cache <<-EOF &&
+	commits_size_on_disk sha1:     1523
+	commits_size_on_disk sha256:     1811
+
+	commits_size sha1:         2153
+	commits_size sha256:         2609
+
+	trees_size_on_disk sha1:      495
+	trees_size_on_disk sha256:      635
+
+	trees_size sha1:         1706
+	trees_size sha256:         2366
+
+	tags_size sha1:          528
+	tags_size sha256:          624
+
+	tags_size_on_disk sha1:      510
+	tags_size_on_disk sha256:      569
+	EOF
+
 	tr , " " >expect <<-EOF &&
 	GIT SURVEY for "$(pwd)"
 	-----------------------------------------------------
@@ -50,6 +70,15 @@ test_expect_success 'git survey (default)' '
 	    Commits |    10
 	      Trees |    10
 	      Blobs |    10
+
+	TOTAL OBJECT SIZES BY TYPE
+	===============================================
+	Object Type | Count | Disk Size | Inflated Size
+	------------+-------+-----------+--------------
+	    Commits |    10 | $(test_oid commits_size_on_disk) | $(test_oid commits_size)
+	      Trees |    10 | $(test_oid trees_size_on_disk) | $(test_oid trees_size)
+	      Blobs |    10 |       191 |           101
+	       Tags |     4 | $(test_oid tags_size_on_disk) | $(test_oid tags_size)
 	EOF
 
 	test_cmp expect out

From 7e80a8b815fa7351b573f4dad3e697747abad21e Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Sun, 1 Sep 2024 21:21:54 -0400
Subject: [PATCH 12/16] survey: show progress during object walk

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/survey.c      | 14 ++++++++++++++
 t/t8100-git-survey.sh |  5 +++++
 2 files changed, 19 insertions(+)

diff --git a/builtin/survey.c b/builtin/survey.c
index 8686415bd27241..d9bdf835d974ef 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -632,6 +632,9 @@ static int survey_objects_path_walk_fn(const char *path UNUSED,
 				type, oids->nr);
 	increment_object_totals(ctx, oids, type);
 
+	ctx->progress_nr += oids->nr;
+	display_progress(ctx->progress, ctx->progress_nr);
+
 	return 0;
 }
 
@@ -661,13 +664,24 @@ static void survey_phase_objects(struct survey_context *ctx)
 	repo_init_revisions(ctx->repo, &revs, "");
 	revs.tag_objects = 1;
 
+	ctx->progress_nr = 0;
+	ctx->progress_total = ctx->ref_array.nr;
+	if (ctx->opts.show_progress)
+		ctx->progress = start_progress(_("Preparing object walk"),
+					       ctx->progress_total);
 	for (size_t i = 0; i < ctx->ref_array.nr; i++) {
 		struct ref_array_item *item = ctx->ref_array.items[i];
 		add_pending_oid(&revs, NULL, &item->objectname, add_flags);
 		display_progress(ctx->progress, ++(ctx->progress_nr));
 	}
+	stop_progress(&ctx->progress);
 
+	ctx->progress_nr = 0;
+	ctx->progress_total = 0;
+	if (ctx->opts.show_progress)
+		ctx->progress = start_progress(_("Walking objects"), 0);
 	walk_objects_by_path(&info);
+	stop_progress(&ctx->progress);
 
 	release_revisions(&revs);
 	trace2_region_leave("survey", "phase/objects", ctx->repo);
diff --git a/t/t8100-git-survey.sh b/t/t8100-git-survey.sh
index e738d6421a3224..6c2867c11c323c 100755
--- a/t/t8100-git-survey.sh
+++ b/t/t8100-git-survey.sh
@@ -25,6 +25,11 @@ test_expect_success 'create a semi-interesting repo' '
 	git update-ref -d refs/tags/two
 '
 
+test_expect_success 'git survey --progress' '
+	GIT_PROGRESS_DELAY=0 git survey --all-refs --progress >out 2>err &&
+	grep "Preparing object walk" err
+'
+
 test_expect_success 'git survey (default)' '
 	git survey --all-refs >out 2>err &&
 	test_line_count = 0 err &&

From cce275f58ea9ec7901e05974913d86c9a6971d75 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Sun, 1 Sep 2024 22:35:06 -0400
Subject: [PATCH 13/16] survey: add ability to track prioritized lists

Later changes will make use of these methods. The intention is to keep
track of the top contributors according to some metric. We don't want to
store all of the entries and sort them at the end, so track a constant-size
table and drop rows that get pushed out, as decided by the chosen
comparison function.
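
The bounded insertion can be sketched on plain integers. This is a minimal
illustration under simplifying assumptions, not the code added below:
'demo_top', DEMO_TOP_LIMIT and demo_top_insert() are hypothetical names, the
entries are ints instead of summary structs, and the comparison is a
hard-coded "larger ranks higher".

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TOP_LIMIT 3

/* Hypothetical fixed-capacity table, sorted from largest to smallest. */
struct demo_top {
	size_t nr;
	int values[DEMO_TOP_LIMIT];
};

/* Insert 'value' if it ranks within the top DEMO_TOP_LIMIT entries. */
static void demo_top_insert(struct demo_top *top, int value)
{
	size_t pos = top->nr;

	/* Scan from the bottom to find where 'value' belongs. */
	while (pos > 0 && top->values[pos - 1] < value)
		pos--;

	/* The table is full and 'value' ranks below every entry. */
	if (pos >= DEMO_TOP_LIMIT)
		return;

	/* Grow if there is room; otherwise the last entry is pushed out. */
	if (top->nr < DEMO_TOP_LIMIT)
		top->nr++;

	/* Shift the lower-ranked entries down by one slot. */
	for (size_t i = top->nr - 1; i > pos; i--)
		top->values[i] = top->values[i - 1];

	top->values[pos] = value;
}
```

Each insertion costs at most DEMO_TOP_LIMIT comparisons and moves, and the
memory used does not grow with the number of candidates.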

Co-authored-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/survey.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/builtin/survey.c b/builtin/survey.c
index d9bdf835d974ef..2385688d24219f 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -77,6 +77,119 @@ struct survey_report_object_size_summary {
 	size_t num_missing;
 };
 
+typedef int (*survey_top_cmp)(void *v1, void *v2);
+
+MAYBE_UNUSED
+static int cmp_by_nr(void *v1, void *v2)
+{
+	struct survey_report_object_size_summary *s1 = v1;
+	struct survey_report_object_size_summary *s2 = v2;
+
+	if (s1->nr < s2->nr)
+		return -1;
+	if (s1->nr > s2->nr)
+		return 1;
+	return 0;
+}
+
+MAYBE_UNUSED
+static int cmp_by_disk_size(void *v1, void *v2)
+{
+	struct survey_report_object_size_summary *s1 = v1;
+	struct survey_report_object_size_summary *s2 = v2;
+
+	if (s1->disk_size < s2->disk_size)
+		return -1;
+	if (s1->disk_size > s2->disk_size)
+		return 1;
+	return 0;
+}
+
+MAYBE_UNUSED
+static int cmp_by_inflated_size(void *v1, void *v2)
+{
+	struct survey_report_object_size_summary *s1 = v1;
+	struct survey_report_object_size_summary *s2 = v2;
+
+	if (s1->inflated_size < s2->inflated_size)
+		return -1;
+	if (s1->inflated_size > s2->inflated_size)
+		return 1;
+	return 0;
+}
+
+/**
+ * Store a list of "top" categories by some sorting function. When
+ * inserting a new category, reorder the list and free the one that
+ * got ejected (if any).
+ */
+struct survey_report_top_table {
+	const char *name;
+	survey_top_cmp cmp_fn;
+	size_t nr;
+	size_t alloc;
+
+	/**
+	 * 'data' stores an array of structs and must be cast into
+	 * the proper array type before evaluating an index.
+	 */
+	void *data;
+};
+
+MAYBE_UNUSED
+static void init_top_sizes(struct survey_report_top_table *top,
+			   size_t limit, const char *name,
+			   survey_top_cmp cmp)
+{
+	struct survey_report_object_size_summary *sz_array;
+
+	top->name = name;
+	top->cmp_fn = cmp;
+	top->alloc = limit;
+	top->nr = 0;
+
+	CALLOC_ARRAY(sz_array, limit);
+	top->data = sz_array;
+}
+
+MAYBE_UNUSED
+static void clear_top_sizes(struct survey_report_top_table *top)
+{
+	struct survey_report_object_size_summary *sz_array = top->data;
+
+	for (size_t i = 0; i < top->nr; i++)
+		free(sz_array[i].label);
+	free(sz_array);
+}
+
+MAYBE_UNUSED
+static void maybe_insert_into_top_size(struct survey_report_top_table *top,
+				       struct survey_report_object_size_summary *summary)
+{
+	struct survey_report_object_size_summary *sz_array = top->data;
+	size_t pos = top->nr;
+
+	/* Compare against list from the bottom. */
+	while (pos > 0 && top->cmp_fn(&sz_array[pos - 1], summary) < 0)
+		pos--;
+
+	/* Not big enough! */
+	if (pos >= top->alloc)
+		return;
+
+	/* We need to shift the data. */
+	if (top->nr == top->alloc)
+		free(sz_array[top->nr - 1].label);
+	else
+		top->nr++;
+
+	for (size_t i = top->nr - 1; i > pos; i--)
+		memcpy(&sz_array[i], &sz_array[i - 1], sizeof(*sz_array));
+
+	memcpy(&sz_array[pos], summary, sizeof(*summary));
+	sz_array[pos].label = xstrdup(summary->label);
+}
+
 /**
  * This struct contains all of the information that needs to be printed
  * at the end of the exploration of the repository and its references.

From b70eee6b3f41a3e3ec46f56bb7797ff4f2bf264e Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Sun, 1 Sep 2024 22:35:40 -0400
Subject: [PATCH 14/16] survey: add report of "largest" paths

Since we are already walking our reachable objects using the path-walk API,
let's now collect lists of the paths that contribute most to different
metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the
repository are causing the most growth, especially on-disk size. Different
packing strategies might help compress data more efficiently, but the total
inflated size is a representation of the raw size of all snapshots of those
paths. Even when stored efficiently on disk, that size represents how much
information must be processed to complete a command such as 'git blame'.
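
To see why the three metrics are reported separately, consider that one
path can lead under one metric and trail under another. The sketch below is
only an illustration: 'demo_path_summary', the demo_cmp_*() functions and
demo_largest() are hypothetical, and any numbers fed to it are invented
rather than measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-path totals, one entry per path in the walk. */
struct demo_path_summary {
	const char *path;
	size_t nr;		/* number of versions */
	size_t disk_size;	/* total bytes on disk */
	size_t inflated_size;	/* total bytes once fully inflated */
};

typedef int (*demo_cmp)(const struct demo_path_summary *,
			const struct demo_path_summary *);

static int demo_cmp_nr(const struct demo_path_summary *a,
		       const struct demo_path_summary *b)
{
	return (a->nr > b->nr) - (a->nr < b->nr);
}

static int demo_cmp_disk(const struct demo_path_summary *a,
			 const struct demo_path_summary *b)
{
	return (a->disk_size > b->disk_size) - (a->disk_size < b->disk_size);
}

static int demo_cmp_inflated(const struct demo_path_summary *a,
			     const struct demo_path_summary *b)
{
	return (a->inflated_size > b->inflated_size) -
	       (a->inflated_size < b->inflated_size);
}

/* Return the entry that ranks highest under 'cmp', or NULL if empty. */
static const struct demo_path_summary *demo_largest(
		const struct demo_path_summary *paths, size_t paths_nr,
		demo_cmp cmp)
{
	const struct demo_path_summary *best = NULL;

	for (size_t i = 0; i < paths_nr; i++)
		if (!best || cmp(best, &paths[i]) < 0)
			best = &paths[i];
	return best;
}
```

A frequently edited text file may win by count, a binary asset that deltas
poorly may win by disk size, and a large file that compresses well may win
by inflated size.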

Since the on-disk size is likely to be fragile, do not test the exact
contents of the new tables. Keep comparing the earlier part of the output
exactly, and only check that the correct set of headers is output for the
new tables.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 builtin/survey.c      | 79 ++++++++++++++++++++++++++++++++++++++-----
 t/t8100-git-survey.sh | 12 ++++++-
 2 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/builtin/survey.c b/builtin/survey.c
index 2385688d24219f..ba72eb7d69a9b8 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -79,7 +79,6 @@ struct survey_report_object_size_summary {
 
 typedef int (*survey_top_cmp)(void *v1, void *v2);
 
-MAYBE_UNUSED
 static int cmp_by_nr(void *v1, void *v2)
 {
 	struct survey_report_object_size_summary *s1 = v1;
@@ -92,7 +91,6 @@ static int cmp_by_nr(void *v1, void *v2)
 	return 0;
 }
 
-MAYBE_UNUSED
 static int cmp_by_disk_size(void *v1, void *v2)
 {
 	struct survey_report_object_size_summary *s1 = v1;
@@ -105,7 +103,6 @@ static int cmp_by_disk_size(void *v1, void *v2)
 	return 0;
 }
 
-MAYBE_UNUSED
 static int cmp_by_inflated_size(void *v1, void *v2)
 {
 	struct survey_report_object_size_summary *s1 = v1;
@@ -136,7 +133,6 @@ struct survey_report_top_table {
 	void *data;
 };
 
-MAYBE_UNUSED
 static void init_top_sizes(struct survey_report_top_table *top,
 			   size_t limit, const char *name,
 			   survey_top_cmp cmp)
@@ -162,7 +158,6 @@ static void clear_top_sizes(struct survey_report_top_table *top)
 	free(sz_array);
 }
 
-MAYBE_UNUSED
 static void maybe_insert_into_top_size(struct survey_report_top_table *top,
 				       struct survey_report_object_size_summary *summary)
 {
@@ -199,6 +194,10 @@ struct survey_report {
 	struct survey_report_object_summary reachable_objects;
 
 	struct survey_report_object_size_summary *by_type;
+
+	struct survey_report_top_table *top_paths_by_count;
+	struct survey_report_top_table *top_paths_by_disk;
+	struct survey_report_top_table *top_paths_by_inflate;
 };
 
 #define REPORT_TYPE_COMMIT 0
@@ -450,6 +449,13 @@ static void survey_report_object_sizes(const char *title,
 	clear_table(&table);
 }
 
+static void survey_report_plaintext_sorted_size(
+		struct survey_report_top_table *top)
+{
+	survey_report_object_sizes(top->name, _("Path"),
+				   top->data, top->nr);
+}
+
 static void survey_report_plaintext(struct survey_context *ctx)
 {
 	printf("GIT SURVEY for \"%s\"\n", ctx->repo->worktree);
@@ -460,6 +466,21 @@ static void survey_report_plaintext(struct survey_context *ctx)
 				   _("Object Type"),
 				   ctx->report.by_type,
 				   REPORT_TYPE_COUNT);
+
+	survey_report_plaintext_sorted_size(
+		&ctx->report.top_paths_by_count[REPORT_TYPE_TREE]);
+	survey_report_plaintext_sorted_size(
+		&ctx->report.top_paths_by_count[REPORT_TYPE_BLOB]);
+
+	survey_report_plaintext_sorted_size(
+		&ctx->report.top_paths_by_disk[REPORT_TYPE_TREE]);
+	survey_report_plaintext_sorted_size(
+		&ctx->report.top_paths_by_disk[REPORT_TYPE_BLOB]);
+
+	survey_report_plaintext_sorted_size(
+		&ctx->report.top_paths_by_inflate[REPORT_TYPE_TREE]);
+	survey_report_plaintext_sorted_size(
+		&ctx->report.top_paths_by_inflate[REPORT_TYPE_BLOB]);
 }
 
 /*
@@ -700,7 +721,8 @@ static void increment_totals(struct survey_context *ctx,
 
 static void increment_object_totals(struct survey_context *ctx,
 				    struct oid_array *oids,
-				    enum object_type type)
+				    enum object_type type,
+				    const char *path)
 {
 	struct survey_report_object_size_summary *total;
 	struct survey_report_object_size_summary summary = { 0 };
@@ -732,9 +754,30 @@ static void increment_object_totals(struct survey_context *ctx,
 	total->disk_size += summary.disk_size;
 	total->inflated_size += summary.inflated_size;
 	total->num_missing += summary.num_missing;
+
+	if (type == OBJ_TREE || type == OBJ_BLOB) {
+		int index = type == OBJ_TREE ?
+			    REPORT_TYPE_TREE : REPORT_TYPE_BLOB;
+		struct survey_report_top_table *top;
+
+		/*
+		 * Temporarily store (const char *) here, but it will
+		 * be duped if inserted and will not be freed.
+		 */
+		summary.label = (char *)path;
+
+		top = ctx->report.top_paths_by_count;
+		maybe_insert_into_top_size(&top[index], &summary);
+
+		top = ctx->report.top_paths_by_disk;
+		maybe_insert_into_top_size(&top[index], &summary);
+
+		top = ctx->report.top_paths_by_inflate;
+		maybe_insert_into_top_size(&top[index], &summary);
+	}
 }
 
-static int survey_objects_path_walk_fn(const char *path UNUSED,
+static int survey_objects_path_walk_fn(const char *path,
 				       struct oid_array *oids,
 				       enum object_type type,
 				       void *data)
@@ -743,7 +786,7 @@ static int survey_objects_path_walk_fn(const char *path UNUSED,
 
 	increment_object_counts(&ctx->report.reachable_objects,
 				type, oids->nr);
-	increment_object_totals(ctx, oids, type);
+	increment_object_totals(ctx, oids, type, path);
 
 	ctx->progress_nr += oids->nr;
 	display_progress(ctx->progress, ctx->progress_nr);
@@ -753,11 +796,31 @@ static int survey_objects_path_walk_fn(const char *path UNUSED,
 
 static void initialize_report(struct survey_context *ctx)
 {
+	const int top_limit = 100;
+
 	CALLOC_ARRAY(ctx->report.by_type, REPORT_TYPE_COUNT);
 	ctx->report.by_type[REPORT_TYPE_COMMIT].label = xstrdup(_("Commits"));
 	ctx->report.by_type[REPORT_TYPE_TREE].label = xstrdup(_("Trees"));
 	ctx->report.by_type[REPORT_TYPE_BLOB].label = xstrdup(_("Blobs"));
 	ctx->report.by_type[REPORT_TYPE_TAG].label = xstrdup(_("Tags"));
+
+	CALLOC_ARRAY(ctx->report.top_paths_by_count, REPORT_TYPE_COUNT);
+	init_top_sizes(&ctx->report.top_paths_by_count[REPORT_TYPE_TREE],
+		       top_limit, _("TOP DIRECTORIES BY COUNT"), cmp_by_nr);
+	init_top_sizes(&ctx->report.top_paths_by_count[REPORT_TYPE_BLOB],
+		       top_limit, _("TOP FILES BY COUNT"), cmp_by_nr);
+
+	CALLOC_ARRAY(ctx->report.top_paths_by_disk, REPORT_TYPE_COUNT);
+	init_top_sizes(&ctx->report.top_paths_by_disk[REPORT_TYPE_TREE],
+		       top_limit, _("TOP DIRECTORIES BY DISK SIZE"), cmp_by_disk_size);
+	init_top_sizes(&ctx->report.top_paths_by_disk[REPORT_TYPE_BLOB],
+		       top_limit, _("TOP FILES BY DISK SIZE"), cmp_by_disk_size);
+
+	CALLOC_ARRAY(ctx->report.top_paths_by_inflate, REPORT_TYPE_COUNT);
+	init_top_sizes(&ctx->report.top_paths_by_inflate[REPORT_TYPE_TREE],
+		       top_limit, _("TOP DIRECTORIES BY INFLATED SIZE"), cmp_by_inflated_size);
+	init_top_sizes(&ctx->report.top_paths_by_inflate[REPORT_TYPE_BLOB],
+		       top_limit, _("TOP FILES BY INFLATED SIZE"), cmp_by_inflated_size);
 }
 
 static void survey_phase_objects(struct survey_context *ctx)
diff --git a/t/t8100-git-survey.sh b/t/t8100-git-survey.sh
index 6c2867c11c323c..8c6edfcae0c6c2 100755
--- a/t/t8100-git-survey.sh
+++ b/t/t8100-git-survey.sh
@@ -86,7 +86,17 @@ test_expect_success 'git survey (default)' '
 	       Tags |     4 | $(test_oid tags_size_on_disk) | $(test_oid tags_size)
 	EOF
 
-	test_cmp expect out
+	lines=$(wc -l <expect) &&
+	head -n $lines out >out-trimmed &&
+	test_cmp expect out-trimmed &&
+
+	for type in "DIRECTORIES" "FILES"
+	do
+		for metric in "COUNT" "DISK SIZE" "INFLATED SIZE"
+		do
+			grep "TOP $type BY $metric" out || return 1
+		done || return 1
+	done
 '
 
 test_done

From dd54ff7e2eca6a7495df0ace3fb5197f33aeccbc Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Mon, 23 Sep 2024 15:38:25 -0400
Subject: [PATCH 15/16] survey: add --top=<N> option and config

The 'git survey' builtin provides several detail tables, such as "top
files by on-disk size". Each of these tables currently holds at most 100
entries, a limit that is hard-coded in initialize_report().

Allow the user to specify this number via a new --top=<N> option or the
new survey.top config key.
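
As a rough illustration of how the three sources of this value could
combine, here is a minimal standalone sketch. It assumes the usual
precedence (built-in default, then config, then command line), which
this patch does not spell out. The helpers parse_int() and
resolve_top_nr() are hypothetical and are not part of builtin/survey.c,
which relies on Git's config and parse-options machinery instead.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal integer; returns 0 on success. */
int parse_int(const char *text, int *out)
{
	char *end;
	long val;

	if (!text || !*text)
		return -1;
	errno = 0;
	val = strtol(text, &end, 10);
	if (errno || *end || val < INT_MIN || val > INT_MAX)
		return -1;
	*out = (int)val;
	return 0;
}

/*
 * Hypothetical helper: start from the built-in default, let a config
 * value override it, then let a command-line value override both.
 * A NULL or unparsable string leaves the previous value in place.
 */
int resolve_top_nr(const char *config_value, const char *cli_value)
{
	int top_nr = 100; /* default used by this patch */

	parse_int(config_value, &top_nr);
	parse_int(cli_value, &top_nr);
	return top_nr;
}
```

Under these assumptions, a repository with survey.top set to 25 would
get 25-entry tables, while also passing --top=10 would shrink them to 10.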

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/config/survey.txt |  3 +++
 builtin/survey.c                | 22 ++++++++++++++--------
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/survey.txt b/Documentation/config/survey.txt
index c1b0f852a1250e..9e594a2092f225 100644
--- a/Documentation/config/survey.txt
+++ b/Documentation/config/survey.txt
@@ -8,4 +8,7 @@ survey.*::
 		This boolean value implies the `--[no-]verbose` option.
 	progress::
 		This boolean value implies the `--[no-]progress` option.
+	top::
+		This integer value implies `--top=<N>`, specifying the
+		number of entries in the detail tables.
 --
diff --git a/builtin/survey.c b/builtin/survey.c
index ba72eb7d69a9b8..03a6727b4dc61d 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -44,6 +44,7 @@ static struct survey_refs_wanted default_ref_options = {
 struct survey_opts {
 	int verbose;
 	int show_progress;
+	int top_nr;
 	struct survey_refs_wanted refs;
 };
 
@@ -552,6 +553,10 @@ static int survey_load_config_cb(const char *var, const char *value,
 		ctx->opts.show_progress = git_config_bool(var, value);
 		return 0;
 	}
+	if (!strcmp(var, "survey.top")) {
+		ctx->opts.top_nr = git_config_int(var, value, cctx->kvi);
+		return 0;
+	}
 
 	return git_default_config(var, value, cctx, pvoid);
 }
@@ -796,8 +801,6 @@ static int survey_objects_path_walk_fn(const char *path,
 
 static void initialize_report(struct survey_context *ctx)
 {
-	const int top_limit = 100;
-
 	CALLOC_ARRAY(ctx->report.by_type, REPORT_TYPE_COUNT);
 	ctx->report.by_type[REPORT_TYPE_COMMIT].label = xstrdup(_("Commits"));
 	ctx->report.by_type[REPORT_TYPE_TREE].label = xstrdup(_("Trees"));
@@ -806,21 +809,21 @@ static void initialize_report(struct survey_context *ctx)
 
 	CALLOC_ARRAY(ctx->report.top_paths_by_count, REPORT_TYPE_COUNT);
 	init_top_sizes(&ctx->report.top_paths_by_count[REPORT_TYPE_TREE],
-		       top_limit, _("TOP DIRECTORIES BY COUNT"), cmp_by_nr);
+		       ctx->opts.top_nr, _("TOP DIRECTORIES BY COUNT"), cmp_by_nr);
 	init_top_sizes(&ctx->report.top_paths_by_count[REPORT_TYPE_BLOB],
-		       top_limit, _("TOP FILES BY COUNT"), cmp_by_nr);
+		       ctx->opts.top_nr, _("TOP FILES BY COUNT"), cmp_by_nr);
 
 	CALLOC_ARRAY(ctx->report.top_paths_by_disk, REPORT_TYPE_COUNT);
 	init_top_sizes(&ctx->report.top_paths_by_disk[REPORT_TYPE_TREE],
-		       top_limit, _("TOP DIRECTORIES BY DISK SIZE"), cmp_by_disk_size);
+		       ctx->opts.top_nr, _("TOP DIRECTORIES BY DISK SIZE"), cmp_by_disk_size);
 	init_top_sizes(&ctx->report.top_paths_by_disk[REPORT_TYPE_BLOB],
-		       top_limit, _("TOP FILES BY DISK SIZE"), cmp_by_disk_size);
+		       ctx->opts.top_nr, _("TOP FILES BY DISK SIZE"), cmp_by_disk_size);
 
 	CALLOC_ARRAY(ctx->report.top_paths_by_inflate, REPORT_TYPE_COUNT);
 	init_top_sizes(&ctx->report.top_paths_by_inflate[REPORT_TYPE_TREE],
-		       top_limit, _("TOP DIRECTORIES BY INFLATED SIZE"), cmp_by_inflated_size);
+		       ctx->opts.top_nr, _("TOP DIRECTORIES BY INFLATED SIZE"), cmp_by_inflated_size);
 	init_top_sizes(&ctx->report.top_paths_by_inflate[REPORT_TYPE_BLOB],
-		       top_limit, _("TOP FILES BY INFLATED SIZE"), cmp_by_inflated_size);
+		       ctx->opts.top_nr, _("TOP FILES BY INFLATED SIZE"), cmp_by_inflated_size);
 }
 
 static void survey_phase_objects(struct survey_context *ctx)
@@ -870,6 +873,7 @@ int cmd_survey(int argc, const char **argv, const char *prefix,
 		.opts = {
 			.verbose = 0,
 			.show_progress = -1, /* defaults to isatty(2) */
+			.top_nr = 100,
 
 			.refs.want_all_refs = -1,
 
@@ -885,6 +889,8 @@ int cmd_survey(int argc, const char **argv, const char *prefix,
 	static struct option survey_options[] = {
 		OPT__VERBOSE(&ctx.opts.verbose, N_("verbose output")),
 		OPT_BOOL(0, "progress", &ctx.opts.show_progress, N_("show progress")),
+		OPT_INTEGER('n', "top", &ctx.opts.top_nr,
+			    N_("number of entries to include in detail tables")),
 
 		OPT_BOOL_F(0, "all-refs", &ctx.opts.refs.want_all_refs, N_("include all refs"),          PARSE_OPT_NONEG),
 

From 38e116845da1032460f7d7ee9e26a9c8aa47e0ca Mon Sep 17 00:00:00 2001
From: Johannes Schindelin <johannes.schindelin@gmx.de>
Date: Mon, 1 Jul 2024 23:28:45 +0200
Subject: [PATCH 16/16] survey: clearly note the experimental nature in the
 output

While this command is definitely something we _want_, chances are that
upstreaming this will require substantial changes.

We still want to be able to experiment with this before that, to focus
on what we need out of this command: to assist with diagnosing issues
with large repositories, as well as to help monitor the growth and
the associated pain points of such repositories.

To that end, we are about to integrate this command into
`microsoft/git`, to get the tool into the hands of users who need it
most, with the idea of iterating in close collaboration between these
users and the developers familiar with Git's internals.

However, we will definitely want to avoid letting anybody have the
impression that this command, its exact inner workings, as well as its
output format, are anywhere close to stable. To make that fact utterly
clear (and thereby protect the freedom to iterate and innovate freely
before upstreaming the command), let's mark its output as experimental
in all-caps, as the first thing we do.
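
The idea can be sketched outside of Git as follows. This is a simplified
illustration, not the code added by this patch: format_banner() and
warn_experimental() are hypothetical names, the ANSI escape sequences
are written out by hand, and color is used whenever stderr is a
terminal, whereas the patch asks want_color_fd() to decide.

```c
#include <stdio.h>
#include <unistd.h>

#define ANSI_YELLOW "\033[33m"
#define ANSI_RESET "\033[m"

/* Hypothetical helper: format the banner, with or without color. */
int format_banner(char *buf, size_t len, int use_color)
{
	return snprintf(buf, len, "%s%s%s",
			use_color ? ANSI_YELLOW : "",
			"(THIS IS EXPERIMENTAL, EXPECT THE OUTPUT FORMAT TO CHANGE!)",
			use_color ? ANSI_RESET : "");
}

/*
 * Hypothetical helper: print the banner only when stderr is a
 * terminal, so that redirected or scripted output stays untouched.
 */
void warn_experimental(void)
{
	char buf[128];

	if (!isatty(STDERR_FILENO))
		return;
	format_banner(buf, sizeof(buf), 1);
	fprintf(stderr, "%s\n", buf);
}
```

Writing to stderr rather than stdout keeps the warning out of any
output that a script might capture and parse.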

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 builtin/survey.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/builtin/survey.c b/builtin/survey.c
index 03a6727b4dc61d..eee0f8d7abb65b 100644
--- a/builtin/survey.c
+++ b/builtin/survey.c
@@ -21,6 +21,7 @@
 #include "strvec.h"
 #include "tag.h"
 #include "trace2.h"
+#include "color.h"
 
 static const char * const survey_usage[] = {
 	N_("(EXPERIMENTAL!) git survey <options>"),
@@ -906,6 +907,11 @@ int cmd_survey(int argc, const char **argv, const char *prefix,
 	if (argc == 2 && !strcmp(argv[1], "-h"))
 		usage_with_options(survey_usage, survey_options);
 
+	if (isatty(2))
+		color_fprintf_ln(stderr,
+				 want_color_fd(2, GIT_COLOR_AUTO) ? GIT_COLOR_YELLOW : "",
+				 "(THIS IS EXPERIMENTAL, EXPECT THE OUTPUT FORMAT TO CHANGE!)");
+
 	ctx.repo = repo;
 
 	prepare_repo_settings(ctx.repo);