Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --max-depth=N #110

Open
jgowdy opened this issue Oct 5, 2020 · 5 comments
Open

Add --max-depth=N #110

jgowdy opened this issue Oct 5, 2020 · 5 comments
Labels
enhancement New feature or request
Milestone

Comments

@jgowdy
Copy link
Contributor

jgowdy commented Oct 5, 2020

Feature Request

In order to better handle larger repos with a great deal of history, it would be valuable to provide an option to scan up to a certain depth of commits. By checking out / materializing the oldest commit specified, you can then scan the deltas only up to that commit for secrets.

The logic would effectively be:

Walk the git log for the Nth commit back

Checkout the selected commit

Scan the checked out commit and any deltas between the selected commit all the way to HEAD, ignoring commits prior to the selected commit.

tartufo --max-depth=50

@jgowdy jgowdy added the enhancement New feature or request label Oct 5, 2020
@tarkatronic
Copy link
Contributor

Hey @jgowdy, thanks for the feature request -- I'm pretty sure we've already got this.

@click.option(
"--max-depth",
default=1000000,
show_default=True,
help="The max commit depth to go back when searching for secrets.",
)

@click.option(
"--max-depth",
default=1000000,
show_default=True,
help="The max commit depth to go back when searching for secrets.",
)

tartufo/tartufo/types.py

Lines 23 to 27 in a345eb9

@dataclass
class GitOptions:
since_commit: Optional[str]
max_depth: int
branch: Optional[str]

tartufo/tartufo/scanner.py

Lines 438 to 440 in a345eb9

for curr_commit in repo.iter_commits(
branch.name, max_count=self.git_options.max_depth
):

Is there more to this that we should be doing? Or is this not working as advertised?

@jgowdy
Copy link
Contributor Author

jgowdy commented Oct 5, 2020

🤦‍♂️ I should have looked first.

@jgowdy jgowdy closed this as completed Oct 5, 2020
@jgowdy jgowdy reopened this Oct 5, 2020
@jgowdy
Copy link
Contributor Author

jgowdy commented Oct 5, 2020

I believe the existing --max-depth command should materialize / checkout the Nth commit and scan that checkout plus the 49 newer deltas, rather than only scanning the 50 newest deltas.

@tarkatronic
Copy link
Contributor

That's an interesting idea. So what you're suggesting is, if I do --max-depth=50, it will scroll back 50 commits, get the entire point-in-time copy of the codebase, scan that, then scan all deltas that were committed since then?

If so, I think this could use some different naming. I really like the idea, and I think it's a valuable one; but I think that the current --max-depth also has a use-case. So maybe we can look at some different naming for this option? --start-from? --materialize-from? Those are pretty terrible. I'd love to work on figuring this out!

@rscottbailey
Copy link
Contributor

How about --from-commit= and accept the usual git references? Especially if the git wrapper could process those transparently...

tartufo --from-commit=abcdef1234 ...
tartufo --from-commit=HEAD~50 ...
tartufo --from-commit=v2.0.0a2 ...

# For extra credit (maybe it would be better not to overload the flag this much)
tartufo --from-commit="01 Aug 2020 14:30" ...
tartufo --from-commit="2 weeks ago" ...

I purposely suggest from-commit instead of since-commit or after-commit in a Quixotic attempt to clarify that the specified commit is included within the scope of the scan. I don't hate start-from although you might consider also start-at.

At which point, if you can do that much folding, spindling, and mutilating, you might as well go whole-hog and do both start-at and end-at (with defaults "beginning of time" and "end of time", respectively). However, there's another problem that became apparent to me as I was researching this note. Quoting the gittutorial manpage on the subject of logging:

The git log command has a weakness: it must present commits in a list. When the history has lines of development
that diverged and then merged back together, the order in which git log presents those commits is meaningless.

More concretely, if the user says "check the last 50 commits" and backtracking down the commit chain leads you past the merge of two active branches, what does "the last 50 commits" even mean? The 50 most chronologically recent commits? The 50 most recent commits reachable from HEAD? All commits which are 50-degrees-of-Kevin-Bacon-from-HEAD or closer? Something else? Is your answer deterministic?

This leaves me thinking that it would be safer to think in terms of chronological boundaries (i.e. starting and/or ending at timestamps) rather than particular commits. That way you can always reason about what you are requesting. Arguably you could have another flag that constrains these further (say to commits on a specified branch, or which are ancestors of a specified commit). However, I wonder if you can implement that without having to scan the entire repo anyway to ensure you didn't miss anything, at which point we've sacrificed the entire motivation for the feature.

@mxhenry-godaddy mxhenry-godaddy added this to the Version 2.2 milestone Nov 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants