Commit 3206638

FAQ: wip
1 parent dfc2d00 commit 3206638

1 file changed

+298
-0
lines changed

Documentation/FAQ.md

@@ -0,0 +1,298 @@
1+
# Frequently Answered Questions
2+
3+
## Table of Contents
4+
5+
* [Why did `git-filter-repo` rewrite commit hashes?](#why-did-git-filter-repo-rewrite-commit-hashes)
6+
* [Why did `git-filter-repo` rewrite more commit hashes than I expected?](#why-did-git-filter-repo-rewrite-more-commit-hashes-than-i-expected)
7+
* [Why did `git-filter-repo` rewrite other branches too?](#why-did-git-filter-repo-rewrite-other-branches-too)
8+
* [Help! Can I recover or undo the filtering?](#help-can-i-recover-or-undo-the-filtering)
9+
* [Can you change `git-filter-repo` to allow future folks to recover from `--force`'d rewrites?](#can-you-change-git-filter-repo-to-allow-future-folks-to-recover-from---forced-rewrites)
10+
* [Can I use `git-filter-repo` to fix a repository with corruption?](#can-i-use-git-filter-repo-to-fix-a-repository-with-corruption)
11+
* [What kinds of problems does `git-filter-repo` not try to solve?](#what-kinds-of-problems-does-git-filter-repo-not-try-to-solve)
12+
13+
14+
## Why did `git-filter-repo` rewrite commit hashes?
15+
16+
This is fundamental to how Git operates. In more detail...
17+
18+
Each commit ID in Git is a hash of the commit's contents. Those contents include
19+
the commit message, the author (name, email, and time authored), the
20+
committer (name, email, and time committed), the top-level tree hash,
21+
and the parent(s) of the commit. This means that if any of the commit
22+
fields change, including the tree hash or the hash of the parent(s) of
23+
the commit, then the hash for the commit will change.
24+
25+
(The same is true for files ("blobs") and trees stored in Git as well;
26+
each is a hash of its contents, so literally if anything changes, the
27+
commit hash will change.)
28+
29+
If you attempt to write commit (or tree or blob) objects with an
30+
incorrect hash, Git will reject them as corrupt.
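
As a rough illustration of the point above, here is a minimal Python sketch of how an object ID can be derived from an object's contents. It assumes the default SHA-1 object format; the helper names (`git_object_id`, `commit_body`) and the identity string are hypothetical and exist only for this example.

```python
import hashlib

# Well-known ID of the empty tree under the SHA-1 object format.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
# Hypothetical identity line: name, email, timestamp, timezone.
IDENT = "A U Thor <author@example.com> 1700000000 +0000"


def git_object_id(obj_type: bytes, body: bytes) -> str:
    """Hash '<type> <size>\\0<body>', which is how Git names an object."""
    header = obj_type + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, author, committer, message) -> bytes:
    """Assemble the text of a simple commit object (no extended headers)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return "\n".join(lines).encode()


# Two root commits that differ only in one letter of the message.
root = git_object_id(
    b"commit", commit_body(EMPTY_TREE, [], IDENT, IDENT, "initial\n"))
reworded_root = git_object_id(
    b"commit", commit_body(EMPTY_TREE, [], IDENT, IDENT, "Initial\n"))

# Otherwise identical children; only the recorded parent differs.
child = git_object_id(
    b"commit", commit_body(EMPTY_TREE, [root], IDENT, IDENT, "next\n"))
child_of_reworded = git_object_id(
    b"commit", commit_body(EMPTY_TREE, [reworded_root], IDENT, IDENT, "next\n"))

print(root != reworded_root)       # the edited commit gets a new ID
print(child != child_of_reworded)  # and so does its child
```

The blob case can be cross-checked against `git hash-object --stdin`, which should report the same ID for the same bytes.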
31+
32+
## Why did `git-filter-repo` rewrite more commit hashes than I expected?
33+
34+
There are two aspects to this, or two possible underlying questions users
35+
might be asking here:
36+
* Why did commits newer than the ones I expected have their hash change?
37+
* Why did commits older than the ones I expected have their hash change?
38+
39+
For the first question, see [why filter-repo rewrites commit
40+
hashes](#why-did-git-filter-repo-rewrite-commit-hashes), and note that
41+
if you modify some old commit, perhaps to remove a file, then obviously
42+
that commit's hash must change. Further, since that commit will have a
43+
new hash, any other commit with that commit as a parent will need to
44+
have a new hash. That will need to chain all the way to the most recent
45+
commits in history. This is fundamental to Git and there is nothing you
46+
can do to change this.
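
The chaining described above can be sketched with a toy model. This is not Git's real object format: `toy_commit_id` is a hypothetical stand-in that hashes only a parent ID and a payload string, which is enough to show why a change to an old commit ripples forward but never backward.

```python
import hashlib


def toy_commit_id(parent_id: str, payload: str) -> str:
    """Toy stand-in for a commit hash: depends on the parent ID and content."""
    return hashlib.sha1(f"{parent_id}\n{payload}".encode()).hexdigest()


def build_chain(payloads):
    """Return the IDs of a linear history, oldest first."""
    ids, parent = [], ""
    for payload in payloads:
        parent = toy_commit_id(parent, payload)
        ids.append(parent)
    return ids


history = ["add README", "add secrets.txt", "fix typo", "release 1.0"]
before = build_chain(history)

# Rewrite only the second commit (index 1), e.g. to drop a file.
edited = list(history)
edited[1] = "add nothing (secrets.txt removed)"
after = build_chain(edited)

changed = [i for i, (old, new) in enumerate(zip(before, after)) if old != new]
print(changed)  # [1, 2, 3]: the edited commit and everything after it
```

The commit before the edit (index 0) keeps its ID, while every descendant of the edited commit gets a new one.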
47+
48+
For the second question, there are two causes: (1) the filter you
49+
specified applies to the older commits too, or (2) git-fast-export and
50+
git-fast-import (both of which git-filter-repo uses) canonicalize
51+
history in various ways. The second cause means that even if you have
52+
no filter, these tools sometimes change commit hashes. This can happen
53+
in any of these cases:
54+
55+
* If you have signed commits, the signatures will be stripped
56+
* If you have commits with extended headers, the extended headers will
57+
be stripped (signed commits are actually a special case of this)
58+
* If you have commits in an encoding other than UTF-8, they will by
59+
default be re-encoded into UTF-8
60+
* If you have a commit without an author, one will be added that
61+
  matches the committer
62+
* If you have trees that are not canonical (e.g. incorrect sorting
63+
order), they will be canonicalized
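
To make the last case concrete, here is a hedged Python sketch of why entry order matters for a tree's hash. It assumes the SHA-1 object format and Git's rule that tree entries are ordered bytewise by name, with directory names compared as if they ended in `/`. The helper names and the object IDs used for the entries are placeholders chosen for illustration.

```python
import hashlib

# Placeholder object IDs (the well-known empty blob and empty tree).
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def canonical_key(entry):
    """Sort key: directories compare as if their name had a trailing '/'."""
    mode, name, _oid = entry
    return name + b"/" if mode == b"40000" else name


def tree_id(entries):
    """Hash a tree whose entries are '<mode> <name>\\0<20 raw ID bytes>'."""
    body = b"".join(
        mode + b" " + name + b"\0" + bytes.fromhex(oid)
        for mode, name, oid in entries
    )
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


entries = [
    (b"40000", b"foo", EMPTY_TREE),       # a directory named "foo"
    (b"100644", b"foo.txt", EMPTY_BLOB),  # a file named "foo.txt"
]

naive = sorted(entries, key=lambda e: e[1])    # plain name sort: foo, foo.txt
canonical = sorted(entries, key=canonical_key)  # Git's order: foo.txt, foo

print([e[1] for e in canonical])
print(tree_id(naive) != tree_id(canonical))  # same entries, different hash
```

A tree written in the naive order would get a new hash once its entries are re-sorted, and every commit pointing at it would change as well.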
64+
65+
If this affects you and you really only want to rewrite newer commits in
66+
history, you can use the `--refs` argument to git-filter-repo to specify
67+
a range of history that you want rewritten.
68+
69+
(For those attempting to be clever and use `--refs` for the first
70+
question: Note that if you attempt to only rewrite a few old commits,
71+
then all you'll succeed in is adding new commits that won't be part of
72+
any branch and will be subject to garbage collection. The branches will
73+
still hold on to the unrewritten versions of the commits. Thus, you
74+
have to rewrite all the way to the branch tip for the rewrite to be
75+
meaningful. Said another way, the `--refs` trick is only useful for
76+
restricting the rewrite to newer commits, never for restricting the
77+
rewrite to older commits.)
78+
79+
## Why did `git-filter-repo` rewrite other branches too?
80+
81+
git-filter-repo's name is git-filter-**_repo_**. Obviously it is going
82+
to rewrite all branches by default.
83+
84+
`git-filter-repo` can restrict its rewriting to a subset of history,
85+
such as a single branch, using the `--refs` option. However, using that
86+
comes with the risk that one branch now has a different version of some
87+
commits than other branches do; usually, when you rewrite history, you
88+
want all branches that depend on what you are rewriting to be updated.
89+
90+
## Help! Can I recover or undo the filtering?
91+
92+
Sure, _if_ you followed the instructions. The instructions told you to
93+
make a fresh clone before running git-filter-repo. If you did that, you
94+
can just throw away your clone with the flubbed rewrite, and make a new
95+
clone.
96+
97+
If you didn't make a fresh clone, and you didn't run with `--force`, you
98+
would have seen the following warning:
99+
```
100+
Aborting: Refusing to destructively overwrite repo history since
101+
this does not look like a fresh clone.
102+
[...]
103+
Please operate on a fresh clone instead. If you want to proceed
104+
anyway, use --force.
105+
```
106+
If you then added `--force`, well, you were warned.
107+
108+
If you didn't make a fresh clone, and you started with `--force`, and you
109+
didn't think to read the description of the `--force` option:
110+
```
111+
Ignore fresh clone checks and rewrite history (an irreversible
112+
operation, especially since it by default ends with an
113+
immediate pruning of reflogs and old objects).
114+
```
115+
and you didn't read even the beginning of the manual
116+
```
117+
git-filter-repo destructively rewrites history
118+
```
119+
and you think it's okay to run a command with `--force` in it on something
120+
you don't have a backup of, then now is the time to reassess your life
121+
choices. `--force` should be a pretty clear warning sign. (If someone
122+
on the internet suggested `--force`, you should complain at them very
123+
loudly, especially if it was on Stack Overflow or some similar site. And
124+
you should also learn to carefully vet commands suggested by others on the
125+
internet.)
126+
127+
See also the next question.
128+
129+
## Can you change `git-filter-repo` to allow future folks to recover from `--force`'d rewrites?
130+
131+
This will never be supported.
132+
133+
* Providing an alternate method to restore would require storing both
134+
the original history and the new history, meaning that those who are
135+
trying to shrink their repository size instead see it grow and have to
136+
figure out extra steps to expunge the old history to see the actual
137+
size savings. Experience showed with other tools that this was
138+
frustrating and difficult to figure out for many users. Providing an
139+
alternate method to restore would mean that users who are trying to
140+
purge sensitive data from their repository still find the sensitive
141+
data after the rewrite because it hasn't actually been purged. In
142+
order to actually purge it, they have to take extra steps, which again
143+
has made things difficult for users in the past with other tools.
144+
145+
* Providing an alternate method to restore would also mean trying to
146+
figure out what should be backed up and how. The obvious choices used
147+
by previous tools only actually provided partial backups (reflogs
148+
would be ignored for example, as would uncommitted changes whether
149+
staged or not). The only reasonable full backup mechanism is making a
150+
separate clone, which is both expensive and something the user can and
151+
should understand how to do on their own.
152+
153+
* Providing an alternate method to restore would also mean providing
154+
documentation on how to restore. Past methods by other tools in the
155+
history rewriting space suggested that it was rather difficult for
156+
users to figure out. Difficult enough, in fact, that users simply
157+
didn't ever use them. They instead made a separate clone before
158+
rewriting history and if they didn't like the rewrite, then they just
159+
blew it away and made a new clone to work with. Since that was
160+
observed to be the easy restoration method, I simply enforced it with
161+
this tool, requiring users who look like they might not be operating
162+
  on a fresh clone to use the `--force` flag.
163+
164+
But more than all that, if there were an alternate method to restore,
165+
why would you have needed to specify the `--force` flag? Doesn't its
166+
existence (and the wording of its documentation) make it pretty clear on
167+
its own that there isn't going to be a way to restore?
168+
169+
## Can I use `git-filter-repo` to fix a repository with corruption?
170+
171+
Some kinds of corruption can be fixed, in conjunction with `git
172+
replace`. If `git fsck` reports warnings/errors for certain objects,
173+
you can often [replace them and rewrite
174+
history](examples-from-user-filed-issues.md#Handling-repository-corruption).
175+
176+
## What kinds of problems does `git-filter-repo` not try to solve?
177+
178+
This question is often asked in the form of "How do I..." or even
179+
written as a statement such as "I found a bug with `git-filter-repo`;
180+
the behavior I got was different than I expected..." But if you're
181+
trying to do one of the things below, then `git-filter-repo` is behaving
182+
as designed and either there is no solution to your problem, or you need
183+
to use a different tool to solve your problem. The following subsections
184+
address some of these common requests:
185+
186+
### Filtering history but magically keeping the same commit IDs
187+
188+
This is impossible. If you modify commits, or the files contained in
189+
them, then you change their commit IDs; this is [fundamental to
190+
Git](#why-did-git-filter-repo-rewrite-commit-hashes).
191+
192+
However, _if_ you don't need to modify commits, but just don't want to
193+
download everything, then look into one of the following:
194+
* [partial clones](https://git-scm.com/docs/partial-clone)
195+
  * the ugly hack known as [shallow clones](https://git-scm.com/docs/shallow)
196+
* a massive hack like [cheap fake
197+
clones](https://github.com/newren/sequester-old-big-blobs) that at
198+
least let you put your evil overlord laugh to use
199+
200+
### Bidirectional development between a filtered and unfiltered repository
201+
202+
Some folks want to extract a subset of a repository, do development work
203+
on it, then bring those changes back to the original repository, and
204+
send further changes in both directions. Such a tool can be written
205+
using fast-export and fast-import, but would need to make very different
206+
design decisions than `git-filter-repo` did. Such a tool would be
207+
capable of supporting this kind of development, but would lose the ability
208+
["to write arbitrary filters using a scripting
209+
language"](https://josh-project.github.io/josh/#concept) and other
210+
features that `git-filter-repo` has.
211+
212+
Such a tool exists; it's called [Josh](https://github.com/josh-project/josh).
213+
Use it if this is your use case.
214+
215+
### Removing specific commits, or filtering based on the difference (a.k.a. patch or change) between commits
216+
217+
You are probably looking for `git rebase`. `git rebase` operates on the
218+
difference between commits ("diff"), allowing you to e.g. drop or modify
219+
the diff, but then runs the risk of conflicts as it attempts to apply
220+
future diffs. If you tweak one diff in the middle, since it just applies
221+
more diffs for the remaining patches, you'll still see your changes at
222+
the end.
223+
224+
filter-repo, by contrast, uses fast-export and fast-import. Those tools
225+
treat every commit not as a diff but as a "use the same versions of most
226+
files from the parent commit, but make these five files have these exact
227+
contents". So, you don't even have access to the version of files from
228+
the parent commit, making it hard to "undo" part of the changes to some
229+
file. Further, if you decide to drop a commit or tweak the contents of
230+
those new files in that commit, those changes will be reverted by the
231+
next commit in the stream that mentions that file because it's not
232+
applying a diff but a "make this file have these exact contents". So,
233+
filter-repo works well for things like removing a file entirely, but if
234+
you want to make any tweaks to any files you have to make the exact same
235+
tweak over and over for every single commit that touches that file.
236+
237+
In short, `git rebase` is the tool you want for removing specific
238+
commits or otherwise operating on the diff between commits.
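
The difference between the two models can be sketched with a toy simulation in Python. Nothing here calls Git, `git rebase`, or fast-export; `replay_snapshots` and `replay_diffs` are hypothetical helpers, and the file name and contents are made up for illustration.

```python
def replay_snapshots(commits):
    """fast-export style: each commit gives exact contents for touched files."""
    tree = {}
    for changes in commits:
        tree = {**tree, **changes}
    return tree


def replay_diffs(base, patches):
    """rebase style: each commit is a change applied to the previous result."""
    text = base
    for old, new in patches:
        text = text.replace(old, new)
    return text


# Two commits that both touch config.txt.
snapshots = [
    {"config.txt": "password=hunter2\nmode=fast\n"},
    {"config.txt": "password=hunter2\nmode=safe\n"},
]

# Tweak only the first commit in the snapshot model.
tweaked_once = [dict(c) for c in snapshots]
tweaked_once[0]["config.txt"] = (
    tweaked_once[0]["config.txt"].replace("hunter2", "REDACTED"))
final_once = replay_snapshots(tweaked_once)["config.txt"]

# To make the tweak stick, it must be repeated in every commit touching the file.
tweaked_all = [
    {path: text.replace("hunter2", "REDACTED") for path, text in c.items()}
    for c in snapshots
]
final_all = replay_snapshots(tweaked_all)["config.txt"]

# In the diff model, later commits only carry their own change forward.
final_rebased = replay_diffs(
    "password=REDACTED\nmode=fast\n", [("mode=fast", "mode=safe")])

print("hunter2" in final_once)     # True: the next snapshot reverted the tweak
print("hunter2" in final_all)      # False
print("hunter2" in final_rebased)  # False
```

In the snapshot model the one-off tweak is overwritten by the next commit that restates the file, which is why filtering file contents has to be applied to every commit that touches the file.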
239+
240+
### Filtering two different clones of the same repository and getting the same new commit IDs
241+
242+
Sometimes two co-workers have a clone of the same repository and they
243+
run the same `git-filter-repo` command, and they expect to get the same
244+
new commit IDs. Often they do get the same new commit IDs, but
245+
sometimes they don't.
246+
247+
When people get the same commit IDs, it is only by luck, not by design.
248+
There are three reasons this is unsupported and will never be reliable:
249+
250+
* Different Git versions used could cause differences in filtering
251+
252+
Since `git fast-export` and `git fast-import` do various
253+
canonicalizations of history, and these could change over time,
254+
having different versions of Git installed can result in differences
255+
in filtering.
256+
257+
* Different git-filter-repo versions used could cause differences in
258+
filtering
259+
260+
Over time, `git-filter-repo` may include new filterings by default,
261+
or fix existing filterings, or make any other number of changes. As
262+
such, having different versions of `git-filter-repo` installed can
263+
result in differences in filtering.
264+
265+
* Different amounts of the repository cloned or differences in
266+
local-only commits can cause differences in filtering
267+
268+
If the clones weren't made at the same time, one clone may have more
269+
commits than the other. Also, both may have made local commits the
270+
other doesn't have. These additional commits could cause history to
271+
be traversed in a different order, and filtering rules are allowed
272+
to have order-dependent rules for how they filter. Further,
273+
filtering rules are allowed to depend upon what history exists in
274+
your clone. As one example, filter-repo's default to update commit
275+
messages which refer to other commits by abbreviated hash, may be
276+
unable to find these other commits in your clone but find them in
277+
    your co-worker's clone. Relatedly, filter-repo's update of
278+
abbreviated hashes in commit messages only works for commits that
279+
have already been filtered, and thus depends on the order in which
280+
fast-export traverses the history.
281+
282+
`git-filter-repo` is designed as a _one_-shot history rewriting tool.
283+
Once you have filtered one clone of the repository, you should not be
284+
using it to filter other clones. All other clones of the repository
285+
should either be discarded and recloned, or [have all their history
286+
rebased on top of the rewritten
287+
history](https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#_sensitive_data_removal).
288+
289+
### Conversion between different version control systems (reposurgeon)
290+
291+
<!--
292+
## How do I see what was removed?
293+
294+
* Give answer in terms of `git rev-list --objects --all` in both a
295+
separate fresh clone from before the rewrite and in the repo where
296+
the rewrite was done. Then find the objects that exist in the old
297+
but not the new.
298+
-->
