Commit a748924

FAQ: wip
1 parent dfc2d00 commit a748924

1 file changed

+301
-0
lines changed

Documentation/FAQ.md

Lines changed: 301 additions & 0 deletions
@@ -0,0 +1,301 @@
1+
# Frequently Answered Questions
2+
3+
## Table of Contents
4+
5+
* [Why did `git-filter-repo` rewrite commit hashes?](#why-did-git-filter-repo-rewrite-commit-hashes)
6+
* [Why did `git-filter-repo` rewrite more commit hashes than I expected?](#why-did-git-filter-repo-rewrite-more-commit-hashes-than-i-expected)
7+
* [Why did `git-filter-repo` rewrite other branches too?](#why-did-git-filter-repo-rewrite-other-branches-too)
8+
* [Help! Can I recover or undo the filtering?](#help-can-i-recover-or-undo-the-filtering)
9+
* [Can you change `git-filter-repo` to allow future folks to recover from `--force`'d rewrites?](#can-you-change-git-filter-repo-to-allow-future-folks-to-recover-from---forced-rewrites)
10+
* [Can I use `git-filter-repo` to fix a repository with corruption?](#can-i-use-git-filter-repo-to-fix-a-repository-with-corruption)
11+
* [What kinds of problems does `git-filter-repo` not try to solve?](#what-kinds-of-problems-does-git-filter-repo-not-try-to-solve)
12+
  * [Filtering history but magically keeping the same commit IDs](#filtering-history-but-magically-keeping-the-same-commit-ids)
13+
  * [Bidirectional development between a filtered and unfiltered repository](#bidirectional-development-between-a-filtered-and-unfiltered-repository)
14+
  * [Removing specific commits, or filtering based on the difference (a.k.a. patch or change) between commits](#removing-specific-commits-or-filtering-based-on-the-difference-aka-patch-or-change-between-commits)
15+
  * [Filtering two different clones of the same repository and getting the same new commit IDs](#filtering-two-different-clones-of-the-same-repository-and-getting-the-same-new-commit-ids)
16+
17+
## Why did `git-filter-repo` rewrite commit hashes?
18+
19+
This is fundamental to how Git operates. In more detail...
20+
21+
Each commit in Git is identified by a hash of its contents. Those contents include
22+
the commit message, the author (name, email, and time authored), the
23+
committer (name, email and time committed), the toplevel tree hash,
24+
and the parent(s) of the commit. This means that if any of the commit
25+
fields change, including the tree hash or the hash of the parent(s) of
26+
the commit, then the hash for the commit will change.
27+
28+
(The same is true for files ("blobs") and trees stored in git as well;
29+
each is a hash of its contents, so literally if anything changes, the
30+
commit hash will change.)
31+
32+
If you attempt to write commit (or tree or blob) objects with an
33+
incorrect hash, Git will reject it as corrupt.
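
To make this concrete, here is a minimal Python sketch of how an object ID is derived, assuming the default SHA-1 object format. The helper names `git_object_id` and `commit_payload` are illustrative only (they are not part of Git or `git-filter-repo`), and the author and committer values are made up.

```python
import hashlib


def git_object_id(obj_type, payload):
    """Return the SHA-1 ID Git would assign to an object of this type."""
    # Git hashes a header "<type> <size>\0" followed by the raw payload.
    header = obj_type + b" " + str(len(payload)).encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def commit_payload(tree, parents, message):
    """Build a commit body with fixed, made-up author/committer fields."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines.append("author A U Thor <author@example.com> 1700000000 +0000")
    lines.append("committer C O Mitter <committer@example.com> 1700000000 +0000")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


# The empty blob and the empty tree have well-known IDs.
empty_blob = git_object_id(b"blob", b"")
empty_tree = git_object_id(b"tree", b"")
print(empty_blob)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(empty_tree)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# Same tree, same parents, same people and times: only the message differs.
commit_a = git_object_id(b"commit", commit_payload(empty_tree, [], "initial"))
commit_b = git_object_id(b"commit", commit_payload(empty_tree, [], "initial (edited)"))
print(commit_a != commit_b)  # True
```

Any change to a field that feeds `commit_payload`, including the tree or a parent ID, produces a different commit ID.
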
34+
35+
## Why did `git-filter-repo` rewrite more commit hashes than I expected?
36+
37+
There are two aspects to this, or two possible underlying questions users
38+
might be asking here:
39+
* Why did commits newer than the ones I expected have their hash change?
40+
* Why did commits older than the ones I expected have their hash change?
41+
42+
For the first question, see [why filter-repo rewrites commit
43+
hashes](#why-did-git-filter-repo-rewrite-commit-hashes), and note that
44+
if you modify some old commit, perhaps to remove a file, then obviously
45+
that commit's hash must change. Further, since that commit will have a
46+
new hash, any other commit with that commit as a parent will need to
47+
have a new hash. That will need to chain all the way to the most recent
48+
commits in history. This is fundamental to Git and there is nothing you
49+
can do to change this.
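
The chaining described above can be sketched with a toy model. This is not Git's real commit format: `toy_commit_id` is a hypothetical stand-in that only captures the fact that a commit's ID depends on its parent's ID and its own content.

```python
import hashlib


def toy_commit_id(parent_id, content):
    """Toy stand-in for a commit hash: depends on the parent ID and the content."""
    return hashlib.sha1((parent_id + "\n" + content).encode()).hexdigest()


def chain_ids(contents):
    """Hash a linear history, oldest first, feeding each ID into the next commit."""
    ids, parent = [], ""
    for content in contents:
        parent = toy_commit_id(parent, content)
        ids.append(parent)
    return ids


original = chain_ids(["A", "B", "C", "D"])
# Rewrite only the second commit (index 1), e.g. to remove a file from it.
rewritten = chain_ids(["A", "B without secret.txt", "C", "D"])

changed = [i for i, (old, new) in enumerate(zip(original, rewritten)) if old != new]
print(changed)  # [1, 2, 3]
```

The commit before the modified one keeps its ID, while the modified commit and every descendant get new IDs.
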
50+
51+
For the second question, there are two causes: (1) the filter you
52+
specified applies to the older commits too, or (2) git-fast-export and
53+
git-fast-import (both of which git-filter-repo uses) canonicalize
54+
history in various ways. The second cause means that even if you have
55+
no filter, these tools sometimes change commit hashes. This can happen
56+
in any of these cases:
57+
58+
* If you have signed commits, the signatures will be stripped
59+
* If you have commits with extended headers, the extended headers will
60+
be stripped (signed commits are actually a special case of this)
61+
* If you have commits in an encoding other than UTF-8, they will by
62+
default be re-encoded into UTF-8
63+
* If you have a commit without an author, one will be added that
64+
matches the committer
65+
* If you have trees that are not canonical (e.g. incorrect sorting
66+
order), they will be canonicalized
67+
68+
If this affects you and you really only want to rewrite newer commits in
69+
history, you can use the `--refs` argument to git-filter-repo to specify
70+
a range of history that you want rewritten.
71+
72+
(For those attempting to be clever and use `--refs` for the first
73+
question: Note that if you attempt to only rewrite a few old commits,
74+
then all you'll succeed in is adding new commits that won't be part of
75+
any branch and will be subject to garbage collection. The branches will
76+
still hold on to the unrewritten versions of the commits. Thus, you
77+
have to rewrite all the way to the branch tip for the rewrite to be
78+
meaningful. Said another way, the `--refs` trick is only useful for
79+
restricting the rewrite to newer commits, never for restricting the
80+
rewrite to older commits.)
81+
82+
## Why did `git-filter-repo` rewrite other branches too?
83+
84+
git-filter-repo's name is git-filter-**_repo_**. Obviously it is going
85+
to rewrite all branches by default.
86+
87+
`git-filter-repo` can restrict its rewriting to a subset of history,
88+
such as a single branch, using the `--refs` option. However, using that
89+
comes with the risk that one branch now has a different version of some
90+
commits than other branches do; usually, when you rewrite history, you
91+
want all branches that depend on what you are rewriting to be updated.
92+
93+
## Help! Can I recover or undo the filtering?
94+
95+
Sure, _if_ you followed the instructions. The instructions told you to
96+
make a fresh clone before running git-filter-repo. If you did that (and
97+
didn't force push your rewritten history back over it), you can just
98+
throw away your clone with the flubbed rewrite, and make a new clone.
99+
100+
If you didn't make a fresh clone, and you didn't run with `--force`, you
101+
would have seen the following warning:
102+
```
103+
Aborting: Refusing to destructively overwrite repo history since
104+
this does not look like a fresh clone.
105+
[...]
106+
Please operate on a fresh clone instead. If you want to proceed
107+
anyway, use --force.
108+
```
109+
If you then added `--force`, well, you were warned.
110+
111+
If you didn't make a fresh clone, and you started with `--force`, and you
112+
didn't think to read the description of the `--force` option:
113+
```
114+
Ignore fresh clone checks and rewrite history (an irreversible
115+
operation, especially since it by default ends with an
116+
immediate pruning of reflogs and old objects).
117+
```
118+
and you didn't read even the beginning of the manual
119+
```
120+
git-filter-repo destructively rewrites history
121+
```
122+
and you think it's okay to run a command with `--force` in it on
123+
something you don't have a backup of, then now is the time to reassess
124+
your life choices. `--force` should be a pretty clear warning sign.
125+
(If someone on the internet suggested `--force`, you can complain at
126+
_them_, but either way you should learn to carefully vet commands
127+
suggested by others on the internet. Sadly, even sites like Stack
128+
Overflow where someone really ought to be able to correct bad guidance
129+
still unfortunately have a fair amount of this bad advice.)
130+
131+
See also the next question.
132+
133+
## Can you change `git-filter-repo` to allow future folks to recover from `--force`'d rewrites?
134+
135+
This will never be supported.
136+
137+
* Providing an alternate method to restore would require storing both
138+
the original history and the new history, meaning that those who are
139+
trying to shrink their repository size instead see it grow and have to
140+
figure out extra steps to expunge the old history to see the actual
141+
size savings. Experience with other tools showed that this was
142+
frustrating and difficult to figure out for many users. Providing an
143+
alternate method to restore would mean that users who are trying to
144+
purge sensitive data from their repository still find the sensitive
145+
data after the rewrite because it hasn't actually been purged. In
146+
order to actually purge it, they have to take extra steps, which again
147+
has made things difficult for users in the past with other tools.
148+
149+
* Providing an alternate method to restore would also mean trying to
150+
figure out what should be backed up and how. The obvious choices used
151+
by previous tools only actually provided partial backups (reflogs
152+
would be ignored for example, as would uncommitted changes whether
153+
staged or not). The only reasonable full backup mechanism is making a
154+
separate clone, which is both expensive and something the user can and
155+
should understand how to do on their own.
156+
157+
* Providing an alternate method to restore would also mean providing
158+
documentation on how to restore. Past methods by other tools in the
159+
history rewriting space suggested that it was rather difficult for
160+
users to figure out. Difficult enough, in fact, that users simply
161+
didn't ever use them. They instead made a separate clone before
162+
rewriting history and if they didn't like the rewrite, then they just
163+
blew it away and made a new clone to work with. Since that was
164+
observed to be the easy restoration method, I simply enforced it with
165+
this tool, requiring users who look like they might not be operating
166+
on a fresh clone to use the `--force` flag.
167+
168+
But more than all that, if there were an alternate method to restore,
169+
why would you have needed to specify the `--force` flag? Doesn't its
170+
existence (and the wording of its documentation) make it pretty clear on
171+
its own that there isn't going to be a way to restore?
172+
173+
## Can I use `git-filter-repo` to fix a repository with corruption?
174+
175+
Some kinds of corruption can be fixed, in conjunction with `git
176+
replace`. If `git fsck` reports warnings/errors for certain objects,
177+
you can often [replace them and rewrite
178+
history](examples-from-user-filed-issues.md#Handling-repository-corruption).
179+
180+
## What kinds of problems does `git-filter-repo` not try to solve?
181+
182+
This question is often asked in the form of "How do I..." or even
183+
written as a statement such as "I found a bug with `git-filter-repo`;
184+
the behavior I got was different than I expected..." But if you're
185+
trying to do one of the things below, then `git-filter-repo` is behaving
186+
as designed and either there is no solution to your problem, or you need
187+
to use a different tool to solve your problem. The following subsections
188+
address some of these common requests:
189+
190+
### Filtering history but magically keeping the same commit IDs
191+
192+
This is impossible. If you modify commits, or the files contained in
193+
them, then you change their commit IDs; this is [fundamental to
194+
Git](#why-did-git-filter-repo-rewrite-commit-hashes).
195+
196+
However, _if_ you don't need to modify commits, but just don't want to
197+
download everything, then look into one of the following:
198+
* [partial clones](https://git-scm.com/docs/partial-clone)
199+
* the ugly hack known as [shallow clones](https://git-scm.com/docs/shallow)
200+
* a massive hack like [cheap fake
201+
clones](https://github.com/newren/sequester-old-big-blobs) that at
202+
least let you put your evil overlord laugh to use
203+
204+
### Bidirectional development between a filtered and unfiltered repository
205+
206+
Some folks want to extract a subset of a repository, do development work
207+
on it, then bring those changes back to the original repository, and
208+
send further changes in both directions. Such a tool can be written
209+
using fast-export and fast-import, but would need to make very different
210+
design decisions than `git-filter-repo` did. Such a tool would be
211+
capable of supporting this kind of development, but lose the ability
212+
["to write arbitrary filters using a scripting
213+
language"](https://josh-project.github.io/josh/#concept) and other
214+
features that `git-filter-repo` has.
215+
216+
Such a tool exists; it's called [Josh](https://github.com/josh-project/josh).
217+
Use it if this is your use case.
218+
219+
### Removing specific commits, or filtering based on the difference (a.k.a. patch or change) between commits
220+
221+
You are probably looking for `git rebase`. `git rebase` operates on the
222+
difference between commits ("diff"), allowing you to e.g. drop or modify
223+
the diff, but then runs the risk of conflicts as it attempts to apply
224+
future diffs. If you tweak one diff in the middle, since it just applies
225+
more diffs for the remaining patches, you'll still see your changes at
226+
the end.
227+
228+
filter-repo, by contrast, uses fast-export and fast-import. Those tools
229+
treat every commit not as a diff but as a "use the same versions of most
230+
files from the parent commit, but make these five files have these exact
231+
contents". Since you don't have either the diff or ready access to the
232+
version of files from the parent commit, that makes it hard to "undo"
233+
part of the changes to some file. Further, if you attempt to drop an
234+
entire commit or tweak the contents of those new files in that commit,
235+
those changes will be reverted by the next commit in the stream that
236+
mentions that file because handling the next commit does not involve
237+
applying a diff but a "make this file have these exact contents". So,
238+
filter-repo works well for things like removing a file entirely, but if
239+
you want to make any tweaks to any files you have to make the exact same
240+
tweak over and over for every single commit that touches that file.
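
The difference between the two models can be sketched in a few lines of Python. This is a simplified illustration, not how `git rebase` or fast-import are implemented: `replay_snapshots` and `replay_diffs` are hypothetical helpers, and the "patches" here can only append text to a file.

```python
def replay_snapshots(commits):
    """Snapshot model: each commit gives exact contents for the files it mentions."""
    tree, history = {}, []
    for changes in commits:
        tree = {**tree, **changes}  # unmentioned files carry over from the parent
        history.append(tree)
    return history


def replay_diffs(patches):
    """Diff model: each patch is applied on top of whatever came before."""
    tree, history = {}, []
    for patch in patches:
        tree = dict(tree)
        for path, added in patch.items():
            tree[path] = tree.get(path, "") + added
        history.append(tree)
    return history


snapshots = [
    {"notes.txt": "one\n"},
    {"notes.txt": "one\ntwo\n"},
    {"notes.txt": "one\ntwo\nthree\n"},
]
# Tweak only the middle snapshot to drop the "two" line.
tweaked = [snapshots[0], {"notes.txt": "one\n"}, snapshots[2]]
final_snapshot = replay_snapshots(tweaked)[-1]["notes.txt"]
print(repr(final_snapshot))  # 'one\ntwo\nthree\n' (the tweak was undone)

patches = [{"notes.txt": "one\n"}, {"notes.txt": "two\n"}, {"notes.txt": "three\n"}]
# Drop the middle patch; the later patch applies on top of the result.
final_diff = replay_diffs([patches[0], patches[2]])[-1]["notes.txt"]
print(repr(final_diff))  # 'one\nthree\n' (the change survives)
```

In the snapshot model, the third commit restores the "two" line because it states the file's full contents, so the same tweak would have to be repeated in every later commit that touches the file.
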
241+
242+
In short, `git rebase` is the tool you want for removing specific
243+
commits or otherwise operating on the diff between commits.
244+
245+
### Filtering two different clones of the same repository and getting the same new commit IDs
246+
247+
Sometimes two co-workers have a clone of the same repository and they
248+
run the same `git-filter-repo` command, and they expect to get the same
249+
new commit IDs. Often they do get the same new commit IDs, but
250+
sometimes they don't.
251+
252+
When people get the same commit IDs, it is only by luck, not by design.
253+
There are three reasons this is unsupported and will never be reliable:
254+
255+
* Different Git versions used could cause differences in filtering
256+
257+
Since `git fast-export` and `git fast-import` do various
258+
canonicalizations of history, and these could change over time,
259+
having different versions of Git installed can result in differences
260+
in filtering.
261+
262+
* Different git-filter-repo versions used could cause differences in
263+
filtering
264+
265+
Over time, `git-filter-repo` may include new filterings by default,
266+
or fix existing filterings, or make any other number of changes. As
267+
such, having different versions of `git-filter-repo` installed can
268+
result in differences in filtering.
269+
270+
* Different amounts of the repository cloned or differences in
271+
local-only commits can cause differences in filtering
272+
273+
If the clones weren't made at the same time, one clone may have more
274+
commits than the other. Also, both may have made local commits the
275+
other doesn't have. These additional commits could cause history to
276+
be traversed in a different order, and filtering rules are allowed
277+
to have order-dependent rules for how they filter. Further,
278+
filtering rules are allowed to depend upon what history exists in
279+
your clone. As one example, filter-repo's default to update commit
280+
messages which refer to other commits by abbreviated hash, may be
281+
unable to find these other commits in your clone but find them in
282+
your coworker's clone. Relatedly, filter-repo's update of
283+
abbreviated hashes in commit messages only works for commits that
284+
have already been filtered, and thus depends on the order in which
285+
fast-export traverses the history.
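
The order dependence in the last point can be illustrated with a toy model. This is a sketch under stated assumptions, not filter-repo's actual implementation: `rewrite_messages` is a hypothetical helper, and the commit IDs are made-up, shortened hex strings.

```python
import re


def rewrite_messages(commits, new_ids):
    """Process commits in order, updating references only to commits already seen."""
    seen, out = {}, []
    for old_id, message in commits:
        def swap(match):
            abbrev = match.group(0)
            hits = [new for old, new in seen.items() if old.startswith(abbrev)]
            # Only rewrite when the abbreviation resolves to exactly one seen commit.
            return hits[0][:len(abbrev)] if len(hits) == 1 else abbrev
        out.append(re.sub(r"\b[0-9a-f]{7,40}\b", swap, message))
        seen[old_id] = new_ids[old_id]
    return out


fix = ("1111111aaaaa", "Fix parser")
pick = ("2222222bbbbb", "Cherry-pick of 1111111")  # refers to a commit on another branch
new_ids = {"1111111aaaaa": "9999999ccccc", "2222222bbbbb": "8888888ddddd"}

fix_first = rewrite_messages([fix, pick], new_ids)
pick_first = rewrite_messages([pick, fix], new_ids)
print(fix_first)   # ['Fix parser', 'Cherry-pick of 9999999']
print(pick_first)  # ['Cherry-pick of 1111111', 'Fix parser']
```

The two traversal orders yield different commit messages for the same input commits, and since the message is part of what gets hashed, they would yield different commit IDs too.
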
286+
287+
`git-filter-repo` is designed as a _one_-shot history rewriting tool.
288+
Once you have filtered one clone of the repository, you should not be
289+
using it to filter other clones. All other clones of the repository
290+
should either be discarded and recloned, or [have all their history
291+
rebased on top of the rewritten
292+
history](https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#_sensitive_data_removal).
293+
294+
<!--
295+
## How do I see what was removed?
296+
297+
Run `git rev-list --objects --all` in both a separate fresh clone from
298+
before the rewrite and in the repo where the rewrite was done. Then
299+
find the objects that exist in the old but not the new.
300+
301+
-->
