Skip to content
This repository has been archived by the owner on Apr 9, 2024. It is now read-only.

Document how to disable diff-aware mode on CI #352

Open
CAM-Gerlach opened this issue Aug 22, 2021 · 15 comments
Open

Document how to disable diff-aware mode on CI #352

CAM-Gerlach opened this issue Aug 22, 2021 · 15 comments

Comments

@CAM-Gerlach
Copy link

I've scoured the docs, searched Google, looked through every issue in this repo and tried a bunch of things myself, and I can't seem to find anywhere that documents how to disable diff-aware mode for PRs, e.g. for the initial run of Semgrep on a new repo, and to ensure that the entire codebase is clean. Could that be clarified? Thanks!

@CAM-Gerlach
Copy link
Author

In fact, I ended up screwing up my repo's public history and having to do a bunch of Git black magic over this issue, as I couldn't figure out a way to add semgrep in a PR, and then in that same PR have semgrep identify the outstanding issues with my repo so I could fix them. I had to merge the PR and then have my main branch build fail just to be able to see the issues. I then had to fixup solving the issues into the commit adding semgrep itself, so my commits would be clean and atomic and each pass CI and pre-commit (which is something that I naturally would otherwise avoid at all costs on the main branch). I ended up screwing up and fixing up the fixes into the merge commit instead, and ended up having to roll back the repo's entire history to the previous PR merge commit, make a new PR and re-merge that.

While this is doable, if hacky and uncough, on a small project that just went public, this would be completely untenable if and when I adopt semgrep on a larger one. Surely there must be a better way than pushing known-failing commits to the public main branch and then rewriting history, or hacky though somewhat safer workarounds like enabling the checks for a specific test branch, pushing to that branch, fixuping to fix any identified issues, then fixiupng setting the check back to run on the correct branches and finally opening and merging a PR off that branch. I'd appreciate any insight you have on how to do this properly, and ensure semgrep is run on my full repo for every PR commit.

@underyx
Copy link
Member

underyx commented Aug 22, 2021

Hi @CAM-Gerlach! Sorry this was such a painful experience. We were planning to change the default behavior to full scan on first run next week. I'll make sure to also document a manual override as part of that work!

@CAM-Gerlach
Copy link
Author

Great, @underyx ! Thanks for your quick response and all your work on the project, and looking forward to it!

@brendongo
Copy link
Member

@CAM-Gerlach yeah there's some magic going on under the hood that automatically does diff aware mode if the triggering event is a pull-request in github actions. You should be able to get a full scan by setting a triggering event to be push instead. Also if you want to keep the triggering event be a pull_request you can also set the SEMGREP_BASELINE_REF to be your first commit. (Unfortunately will not trigger on findings in that first commit). The DevEx of semgrep-action is due for some love sorry for the dust :D

@underyx
Copy link
Member

underyx commented Aug 26, 2021

@CAM-Gerlach other than the documentation task, I wanted to ask about your opinion on some further changes we're about to introduce. But first, a quick note about philosophy: the core semgrep project is designed to be a predictable, versatile tool with a clean API. This project however, the semgrep-agent, has less strictness in its design and has a budget for 'magical' behavior, so that it can behave as one would expect with zero configuration. With that in mind

1) Full scan on first run

Just like you wanted your initial scan to be a full scan, we suspect almost everyone else does too. So we're thinking of ways to detect when the first scan happens. For users who are connecting to the Semgrep App this is easy, as we keep a record of scans on our database.

But to be useful for users without an app account, we were thinking we could detect if this is likely the first scan by looking at the git diff being scanned. If the diff includes image: returntocorp/semgrep-agent in .gitlab-ci.yml or uses: returntocorp/semgrep-action in .github/**.yml, we'll guess that it's the first scan.

So based on this, we'd adjust the default behavior like so:

1st scan Nth scan
on push full scan full scan
on PR full scan diff scan

Of course, manually overriding the mode will also be a documented option.

2) Full scan's log output

We find that full scans often return too many results to be reasonably scanned by eye.
We were thinking of adjusting the scan output to look like in this gist:

https://gist.github.com/underyx/0367f8bca91ab1a60721a6af210448d4

I'd be very curious to hear any and all comments you'd have about this presentation! Either via comments here… or if by any chance you're up for a 30 minute call, we could schedule that as well and compensate your time with a Semgrep swag pack :)

@CAM-Gerlach
Copy link
Author

CAM-Gerlach commented Aug 28, 2021

@brendongo

You should be able to get a full scan by setting a triggering event to be push instead.

Yeah, thanks; as I describe in my comment, it was this very fact that I had to push to a main branch that had CIs enabled (or do some hacky stuff with a new throwaway branch on the public upstream repo, which is what I should have done in retrospect) to get a full baseline scan, resulting in my mainline build failing and requiring rewriting Git history directly on master to clean things up, thus the whole mess I describe in detail above. Furthermore, thanks to the particular project being brand new and I being the sole developer so far it was practical to recover from (one of its main purposes is to test things like this), but on a bigger project with continuing development and lots of contributors, it would have been much more of a problem.

Also if you want to keep the triggering event be a pull_request you can also set the SEMGREP_BASELINE_REF to be your first commit. (Unfortunately will not trigger on findings in that first commit).

Yeah, I should have mentioned it above but I actually tried this initially (and a few other settings), but the behavior didn't seem to change. I even tried the second commit (since I always commit an empty first commit), but that didn't seem to work either.

The DevEx of semgrep-action is due for some love sorry for the dust :D

No worries, it seems like a relatively young project relatively speaking (maybe I'm completely wrong about that, lol) and I appreciate the help!

@CAM-Gerlach
Copy link
Author

@underyx

But to be useful for users without an app account, we were thinking we could detect if this is likely the first scan by looking at the git diff being scanned. If the diff includes image: returntocorp/semgrep-agent in .gitlab-ci.yml or uses: returntocorp/semgrep-action in .github/**.yml, we'll guess that it's the first scan.

Sounds great! One thing you might consider is triggering a full scan by default if the semgrep workflow/pipeline are detected to have been modified, as this means errors from newly added rules/rule groups or other changes could be present in existing code and otherwise would go under the radar until the files get touched at some indefinite point in the future, and it seems a pretty safe assumption that if a developer modifies either of these files, they are specific interested in dealing with semgrep and its output, and thus it would make more sense to make them responsible for any new errors by default and ensure they are fixed immediately rather than an unsuspecting dev who happens to touch the files with the newly erroring code somewhere down the line.

If you had a config file ( semgrep/semgrep#1054 ) that listed the rules and any other config settings, it would be more straightforward since it had a deterministic name and exclusively semgrep content, but you could still detect it pretty accurately via hueristics, just like you are proposing already. For example, you could check for adding/modifying a yml|yaml file in the .github/workflows directory for Github, or the equivalent for Gitlab that contained semgrep in the name, which is likely to be the great majority of users given this is what at least Github (and I assume Gitlab) as well as what your instructions specify by default. For the perhaps small minority of users who embed semgrep in a later general purpose workflow file, you could fall back on the approach you suggest above. Assuming, of course, this doesn't exceed your magic budget 😉

@CAM-Gerlach
Copy link
Author

CAM-Gerlach commented Aug 28, 2021

We were thinking of adjusting the scan output to look like in this gist:

Happy to give me feedback; my UX experience is more on the GUI side of things (e.g. Spyder) but I can certainly share my perspective as a developer/user. Also, my usefulness is tempered by the fact that so far, I've only applied full linting suites like Semgrep to smaller (on the order of 10k LoC), well-linted projects with few/one active contributor, not the big ones like Spyder with lots of devs and inflight PRs (in part due to the magnitude of the scope, as well as difficulty getting widespread buyin from the community for the value of it relative to the effort and potential disruption.

Would this be the replacement for the logs I see in the regular GH Actions UI (here's the example of my initial run on the small project in question above https://github.com/r-spacex/submanager/runs/3391738017?check_suite_focus=true ), would this be a separate log file generated as an artifact of the build, or something else?

I like having a high-level summary, and what you have looks great overall so far; and is both more informative in most critical respects and easier to read/parse than the current form. There are some small things I might suggest fixing/clarifying, but let's just start with the main high level question: Would this be a replacement for the logs we currently see in GH Actions, that lists the actual errors and shows where they occur, along with the context? Or would this go somewhere else (if so, where? as a summary? as a comment?). If so, where would we be able to actually see what the errors are, without using the app?

Not sure if this is what you're looking for, but on a high level, what I might suggest is something like:

  • Revise the log output to look mostly like that depicted in the gist (with minor formatting tweaks, etc)
  • Retain the full display of issues identified, but sort them by category, with most important coming first (e.g. blocking, then audit, then inventory)
  • Have the action post a comment on the issue, like Mega-Linter does, with the summary/scan report and links to the full report, where practical.

But that's very much one newer user's opinion, so take that with a big grain of salt.

For a point of reference, I based some of these suggestions on what Mega-Linter, and by extension Github's popular Super-Linter, does; see e.g. r-spacex/submanager#11 for a simple but representative example. With respect to the log output, semgrep is already far less verbose than Mega Linter, and is much easier to quickly find errors. GraphQL is more of a comparable product, but its output is very different and integrated with Github, rather than being standalone.

I'd be flattered to hop on a call if you really think it would be helpful, given the caveats above. Just be aware I'm not even a "real" developer, I just play one on the internet. In "real life", I'm a NASA satellite data scientist and machine learning specialist.

@underyx
Copy link
Member

underyx commented Aug 30, 2021

rather than an unsuspecting dev who happens to touch the files with the newly erroring code somewhere down the line.

Before I send my other replies: just to clarify, touching a file with a pre-existing issue will not trigger a new finding. Even re-indenting or moving the exact line of code with existing issues will not trigger a new finding. I'm just noting this in case it helps with community adoption 🙃

@CAM-Gerlach
Copy link
Author

CAM-Gerlach commented Aug 30, 2021

@underyx Ah okay; sorry for misunderstanding and thanks for clarifying. In that case, are the new findings simply never raised at all in the situations I mentioned (adding a new policy, adding Semgrep as a whole) unless the specific line is touched, or what ends up happening?

@underyx
Copy link
Member

underyx commented Aug 30, 2021

@underyx Ah okay; sorry for misunderstanding and thanks for clarifying. In that case, are the new findings simply never raised at all in the situations I mentioned (adding a new policy, adding Semgrep as a whole) unless the specific line is touched, or what ends up happening?

What actually happens is that for any current finding, we scan the old version of the code from the base branch and see if the same rule fires on an identical line (minus whitespace.) If yes, we don't report the finding, assuming that it's pre-existing.

By the way, I'm working on some PRs for this issue right now, based on your feedback! Will respond to all of it soon!

@mykter
Copy link

mykter commented Nov 24, 2021

Just like you wanted your initial scan to be a full scan, we suspect almost everyone else does too.

In my case I don't really want special magic so it behaves differently on a first scan vs later ones. I want predictable behaviour, so I know whether there are any issues in the repo or not.

The workflow I prefer depends on the scale of the issues found, but is either:

  1. Add semgrep to CI and see all the issues
  2. Work through the backlog until they're all resolved, merge
  3. Know that every run is reporting on any findings

Or:

  1. Add semgrep to CI, see that there is an impractically large backlog, enable audit mode
  2. Merge
  3. work through backlog over time; diff-aware mode preventing introduction of new issues

In either scenario, I don't want my main branch ever failing and I don't want issues to be lurking but not shown.

I assumed that setting SEMGREP_BASELINE_REF to an empty string would do the job of disabling the diff-aware logic - that could be an option for how to configure it?

@CAM-Gerlach
Copy link
Author

👍

I assumed that setting SEMGREP_BASELINE_REF to an empty string would do the job of disabling the diff-aware logic - that could be an option for how to configure it?

Yeah; that was my assumption as well, but at least when I experimented with it, it didn't seem to work at least at present.

@olastor
Copy link

olastor commented Apr 19, 2022

I assumed that setting SEMGREP_BASELINE_REF to an empty string would do the job of disabling the diff-aware logic - that could be an option for how to configure it?

I second this. Please make it possible to do a full scan on merge requests, as well. I think semgrep is a great tool and it's nice to see you are taking care of sane defaults, but if you provide configuration options for people that should always override any defaults, please.

--baseline-commit TEXT Only show results that are not found in this commit
hash. Aborts run if not currently in a git directory,
there are unstaged changes, or given baseline hash
doesn't exist

I also tried --baseline-commit "" which does not abort the run, but automatically makes a baseline scan on merge requests, as well.

@amdei
Copy link

amdei commented Apr 25, 2022

As for me, I would be satisfied with command-line option, forcing full scan.
Thus it makes it possible to use such option in CI for MRs to default branch only.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

No branches or pull requests

6 participants