FAQ: wip
newren committed Nov 26, 2024
1 parent dfc2d00 commit f1ec35b
Showing 1 changed file with 230 additions and 0 deletions.
230 changes: 230 additions & 0 deletions Documentation/FAQ.md
# Frequently Answered Questions

## Table of Contents

* [Why did `git-filter-repo` rewrite commit hashes?](#why-did-git-filter-repo-rewrite-commit-hashes)
* [Why did `git-filter-repo` rewrite more commit hashes than I expected?](#why-did-git-filter-repo-rewrite-more-commit-hashes-than-i-expected)
* [Why did `git-filter-repo` rewrite other branches too?](#why-did-git-filter-repo-rewrite-other-branches-too)
* [Help! Can I recover or undo the filtering?](#help-can-i-recover-or-undo-the-filtering)
* [Can you change `git-filter-repo` to allow future folks to recover from `--force`'d rewrites?](#can-you-change-git-filter-repo-to-allow-future-folks-to-recover-from---forced-rewrites)
* [Can I use `git-filter-repo` to fix a repository with corruption?](#can-i-use-git-filter-repo-to-fix-a-repository-with-corruption)
* [What kinds of problems does `git-filter-repo` not try to solve?](#what-kinds-of-problems-does-git-filter-repo-not-try-to-solve)


## Why did `git-filter-repo` rewrite commit hashes?

This is fundamental to how Git operates. In more detail...

Each commit in Git is identified by a hash of its contents. Those contents include
the commit message, the author (name, email, and time authored), the
committer (name, email and time committed), the toplevel tree hash,
and the parent(s) of the commit. This means that if any of the commit
fields change, including the tree hash or the hash of the parent(s) of
the commit, then the hash for the commit will change.

(The same is true for files ("blobs") and trees stored in Git: each is
identified by a hash of its contents, so if anything in a file or tree
changes, its hash changes, and so does the hash of any commit whose
tree includes it.)

If you attempt to write a commit (or tree or blob) object with an
incorrect hash, Git will reject it as corrupt.
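
If you want to see this for yourself, here is a minimal Python sketch of how an object ID is derived. It assumes a repository using the default SHA-1 object format; `git_object_id` and `commit_body` are hypothetical helper names for illustration, not part of Git or `git-filter-repo`, and the author line is made up.

```python
import hashlib
from typing import List


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the ID Git gives an object (assumption: SHA-1 object format)."""
    # Git hashes a header "<type> <size>\0" followed by the raw object body.
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parents: List[str], author: str,
                committer: str, message: str) -> bytes:
    """Serialize commit fields in the order Git stores them."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return "\n".join(lines).encode()


EMPTY_BLOB = git_object_id("blob", b"")
EMPTY_TREE = git_object_id("tree", b"")

# Hypothetical identity and timestamp, used for both author and committer.
who = "A U Thor <author@example.com> 1700000000 +0000"
original = git_object_id(
    "commit", commit_body(EMPTY_TREE, [], who, who, "Initial commit\n"))
reworded = git_object_id(
    "commit", commit_body(EMPTY_TREE, [], who, who, "Initial commit!\n"))

print(original)
print(reworded)
print(original != reworded)
```

Changing a single character of the message is enough to produce a different commit ID, and the same holds for any other field fed into `commit_body`.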

## Why did `git-filter-repo` rewrite more commit hashes than I expected?

There are two aspects to this, or two possible underlying questions users
might be asking here:
* Why did commits newer than the ones I expected have their hash change?
* Why did commits older than the ones I expected have their hash change?

For the first question, see [why filter-repo rewrites commit
hashes](#why-did-git-filter-repo-rewrite-commit-hashes), and note that
if you modify some old commit, perhaps to remove a file, then obviously
that commit's hash must change. Further, since that commit will have a
new hash, any other commit with that commit as a parent will need to
have a new hash. That will need to chain all the way to the most recent
commits in history. This is fundamental to Git and there is nothing you
can do to change this.
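
The chaining can be sketched with a toy model. This is not Git's real commit format; `commit_id` and `build_history` are hypothetical helpers that only capture the one property that matters here, namely that a commit's ID covers its parent's ID.

```python
import hashlib
from typing import List, Optional


def commit_id(parent: Optional[str], content: str) -> str:
    """Toy commit ID: a hash over the parent's ID and this commit's content."""
    data = f"parent {parent}\n{content}".encode()
    return hashlib.sha1(data).hexdigest()


def build_history(contents: List[str]) -> List[str]:
    """Return commit IDs for a linear history, oldest first."""
    ids: List[str] = []
    parent: Optional[str] = None
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids


before = build_history(["add README", "add secret.txt", "fix typo", "release 1.0"])
# Rewrite only the second commit, e.g. to drop a file it introduced.
after = build_history(["add README", "add notes.txt", "fix typo", "release 1.0"])

changed = [old != new for old, new in zip(before, after)]
print(changed)  # [False, True, True, True]
```

The commit before the modified one keeps its ID; the modified commit and everything after it get new IDs, even though the later commits' own content is untouched.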

For the second question, there are two causes: (1) the filter you
specified applies to the older commits too, or (2) git-fast-export and
git-fast-import (both of which git-filter-repo uses) canonicalize
history in various ways. The second cause means that even if you have
no filter, these tools sometimes change commit hashes. This can happen
in any of these cases:

* If you have signed commits, the signatures will be stripped
* If you have commits with extended headers, the extended headers will
be stripped (signed commits are actually a special case of this)
* If you have commits in an encoding other than UTF-8, they will by
default be re-encoded into UTF-8
* If you have a commit without an author, one will be added that
matches the committer
* If you have trees that are not canonical (e.g. incorrect sorting
order), they will be canonicalized
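
To illustrate the last case, here is a rough sketch of why fixing a tree's sort order changes its hash even though no file changed. It models the on-disk tree format under the assumption of SHA-1 object IDs; `tree_id` and `canonicalize` are hypothetical helpers, not the code fast-import actually runs.

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, object ID in hex)

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # ID of a zero-length file


def tree_id(entries: List[Entry]) -> str:
    """Hash tree entries in exactly the order given."""
    # Each entry is "<mode> <name>\0" followed by the raw 20-byte object ID.
    body = b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
        for mode, name, oid in entries
    )
    header = f"tree {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def canonicalize(entries: List[Entry]) -> List[Entry]:
    """Sort the way Git expects: bytewise by name, directories as 'name/'."""
    def sort_key(entry: Entry) -> bytes:
        mode, name, _ = entry
        return (name + "/" if mode == "40000" else name).encode()
    return sorted(entries, key=sort_key)


missorted = [("100644", "b.txt", EMPTY_BLOB), ("100644", "a.txt", EMPTY_BLOB)]
fixed = canonicalize(missorted)

print(tree_id(missorted) != tree_id(fixed))  # True: same files, different hash
```

Since the tree hash is part of the commit, any commit pointing at the missorted tree gets a new hash once the tree is written back in canonical order, and so do all of its descendants.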

If this affects you and you really only want to rewrite newer commits in
history, you can use the `--refs` argument to git-filter-repo to specify
a range of history that you want rewritten.

(For those attempting to be clever and use `--refs` for the first
question: Note that if you attempt to only rewrite a few old commits,
then all you'll succeed in is adding new commits that won't be part of
any branch and will be subject to garbage collection. The branches will
still hold on to the unrewritten versions of the commits. Thus, you
have to rewrite all the way to the branch tip for the rewrite to be
meaningful. Said another way, the `--refs` trick is only useful for
restricting the rewrite to newer commits, never for restricting the
rewrite to older commits.)
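
A toy model of that parenthetical, for the skeptical. The object store, `add_commit`, `add_history`, and `reachable` are all hypothetical stand-ins rather than Git internals; the only assumption is that a branch is a pointer to a tip commit and that history is found by following parents.

```python
import hashlib
from typing import Dict, List, Optional, Set

# Toy object store: commit ID -> parent ID (None for a root commit).
store: Dict[str, Optional[str]] = {}


def add_commit(parent: Optional[str], content: str) -> str:
    """Store a toy commit and return its ID (a hash over parent and content)."""
    oid = hashlib.sha1(f"parent {parent}\n{content}".encode()).hexdigest()
    store[oid] = parent
    return oid


def add_history(contents: List[str]) -> List[str]:
    """Store a linear history, oldest first, and return its commit IDs."""
    ids: List[str] = []
    parent: Optional[str] = None
    for content in contents:
        parent = add_commit(parent, content)
        ids.append(parent)
    return ids


def reachable(tip: str) -> Set[str]:
    """Collect every commit found by following parents from a ref's tip."""
    seen: Set[str] = set()
    current: Optional[str] = tip
    while current is not None:
        seen.add(current)
        current = store[current]
    return seen


original = add_history(["add README", "add secret.txt", "fix typo", "release 1.0"])
main = original[-1]  # the branch ref points at the newest commit

# Rewrite only the two oldest commits and leave the branch ref alone.
partial = add_history(["add README", "add notes.txt"])
print(partial[1] in reachable(main))  # False: the rewritten commit is orphaned

# Rewrite through to the tip and move the ref to the new tip.
full = add_history(["add README", "add notes.txt", "fix typo", "release 1.0"])
main = full[-1]
print(partial[1] in reachable(main))  # True: now part of the branch
```

Until the rewrite reaches the tip and the ref moves, the branch still walks back through the unrewritten commits, and the new ones are left unreachable.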

## Why did `git-filter-repo` rewrite other branches too?

git-filter-repo's name is git-filter-**_repo_**. Obviously it is going
to rewrite all branches by default.

`git-filter-repo` can restrict its rewriting to a subset of history,
such as a single branch, using the `--refs` option. However, using that
comes with the risk that one branch now has a different version of some
commits than other branches do; usually, when you rewrite history, you
want all branches that depend on what you are rewriting to be updated.

## Help! Can I recover or undo the filtering?

Sure, _if_ you followed the instructions. The instructions told you to
make a fresh clone before running git-filter-repo. If you did that, you
can just throw away your clone with the flubbed rewrite, and make a new
clone.

If you didn't make a fresh clone, and you didn't run with `--force`, you
would have seen the following warning:
```
Aborting: Refusing to destructively overwrite repo history since
this does not look like a fresh clone.
[...]
Please operate on a fresh clone instead. If you want to proceed
anyway, use --force.
```
If you then added `--force`, well, you were warned.

If you didn't make a fresh clone, and you started with `--force`, and you
didn't think to read the description of the `--force` option:
```
Ignore fresh clone checks and rewrite history (an irreversible
operation, especially since it by default ends with an
immediate pruning of reflogs and old objects).
```
and you didn't read even the beginning of the manual:
```
git-filter-repo destructively rewrites history
```
and you think it's okay to run a command with `--force` in it on something
you don't have a backup of, then now is the time to reassess your life
choices. `--force` should be a pretty clear warning sign. (If someone
on the internet suggested `--force`, you should complain at them very
loudly, especially if it was on Stack Overflow or some similar site. And
you should also learn to carefully vet commands suggested by others on the
internet.)

See also the next question.

## Can you change `git-filter-repo` to allow future folks to recover from `--force`'d rewrites?

This will never be supported.

* Providing an alternate method to restore would require storing both
the original history and the new history, meaning that those who are
trying to shrink their repository size instead see it grow and have to
figure out extra steps to expunge the old history to see the actual
size savings. Experience with other tools showed that this was
frustrating and difficult to figure out for many users.

* Providing an alternate method to restore would mean that users who
are trying to purge sensitive data from their repository still find
the sensitive data after the rewrite because it hasn't actually been
purged. In order to actually purge it, they have to take extra steps,
which again has made things difficult for users in the past with
other tools.

* Providing an alternate method to restore would also mean trying to
figure out what should be backed up and how. The obvious choices used
by previous tools only actually provided partial backups (reflogs
would be ignored for example, as would uncommitted changes whether
staged or not). The only reasonable full backup mechanism is making a
separate clone, which is both expensive and something the user can and
should understand how to do on their own.

* Providing an alternate method to restore would also mean providing
documentation on how to restore. Experience with the restore methods
of other history-rewriting tools suggests that they were rather
difficult for users to figure out. Difficult enough, in fact, that
users simply didn't ever use them. They instead made a separate clone
before rewriting history and if they didn't like the rewrite, then
they just blew it away and made a new clone to work with. Since that
was observed to be the easy restoration method, I simply enforced it
with this tool, requiring users who look like they might not be
operating on a fresh clone to use the `--force` flag.

But more than all that, if there were an alternate method to restore,
why would you have needed to specify the `--force` flag? Doesn't its
existence (and the wording of its documentation) make it pretty clear on
its own that there isn't going to be a way to restore?

## Can I use `git-filter-repo` to fix a repository with corruption?

Some kinds of corruption can be fixed, in conjunction with `git
replace`. If `git fsck` reports warnings/errors for certain objects,
you can often [replace them and rewrite
history](examples-from-user-filed-issues.md#Handling-repository-corruption).

## What kinds of problems does `git-filter-repo` not try to solve?

This question is often asked in the form of "How do I..." or even
written as a statement such as "I found a bug with `git-filter-repo`;
the behavior I got was different than I expected..." But if you're
trying to do one of the things below, then `git-filter-repo` is behaving
as designed, and the way to solve your problem is to use a different
tool.

### Filtering history but magically keeping the same commit IDs

This is impossible. If you modify commits, or the files contained in
them, then you change their commit IDs; this is [fundamental to
Git](#why-did-git-filter-repo-rewrite-commit-hashes).

However, _if_ you don't need to modify commits, but just don't want to
download everything, then look into one of the following:
* [partial clones](https://git-scm.com/docs/partial-clone)
* the ugly hack known as [shallow clones](https://git-scm.com/docs/shallow)
* a massive hack like [cheap fake
clones](https://github.com/newren/sequester-old-big-blobs) that at
least let you put your evil overlord laugh to use

### Bidirectional development between filtered and unfiltered repository (josh)

Some folks want to extract a subset of a repository, do development work
on it, then bring those changes back to the original repository, and
send further changes in both directions. Such a tool can be written
using fast-export and fast-import, but would need to make very different
design decisions than `git-filter-repo` did. Such a tool would be
capable of supporting this kind of development, but lose the ability
["to write arbitrary filters using a scripting
language"](https://josh-project.github.io/josh/#concept) among other
features that `git-filter-repo` has.

Such a tool exists; it's called [Josh](https://github.com/josh-project/josh).
Its documentation describes the tradeoff this way:

```
To guarantee filters are reversible we have to restrict the kind of
filter that can be used; It is not possible to write arbitrary filters
using a scripting language like is allowed in other tools
```

### Filtering based on the difference (a.k.a. patch or change) between commits (rebase)
### Conversion between different version control systems (reposurgeon)
### Having two people filter their clone of the repository (with the same filtering command) and getting the same new commit IDs

<!--
## How do I see what was removed?
* Give answer in terms of `git rev-list --objects --all` in both a
separate fresh clone from before the rewrite and in the repo where
the rewrite was done. Then find the objects that exist in the old
but not the new.
-->
