-
Notifications
You must be signed in to change notification settings - Fork 33
Document how to disable diff-aware mode on CI #352
Comments
In fact, I ended up screwing up my repo's public history and having to do a bunch of Git black magic over this issue, as I couldn't figure out a way to add semgrep in a PR, and then in that same PR have semgrep identify the outstanding issues with my repo so I could fix them. I had to merge the PR and then have my main branch build fail just to be able to see the issues. I then had to fixup solving the issues into the commit adding semgrep itself, so my commits would be clean and atomic and each pass CI and pre-commit (which is something that I naturally would otherwise avoid at all costs on the main branch). I ended up screwing up and fixing up the fixes into the merge commit instead, and ended up having to roll back the repo's entire history to the previous PR merge commit, make a new PR and re-merge that. While this is doable, if hacky and uncough, on a small project that just went public, this would be completely untenable if and when I adopt semgrep on a larger one. Surely there must be a better way than pushing known-failing commits to the public main branch and then rewriting history, or hacky though somewhat safer workarounds like enabling the checks for a specific test branch, pushing to that branch, fixuping to fix any identified issues, then fixiupng setting the check back to run on the correct branches and finally opening and merging a PR off that branch. I'd appreciate any insight you have on how to do this properly, and ensure semgrep is run on my full repo for every PR commit. |
Hi @CAM-Gerlach! Sorry this was such a painful experience. We were planning to change the default behavior to full scan on first run next week. I'll make sure to also document a manual override as part of that work! |
Great, @underyx ! Thanks for your quick response and all your work on the project, and looking forward to it! |
@CAM-Gerlach yeah there's some magic going on under the hood that automatically does diff aware mode if the triggering event is a pull-request in github actions. You should be able to get a full scan by setting a triggering event to be push instead. Also if you want to keep the triggering event be a pull_request you can also set the SEMGREP_BASELINE_REF to be your first commit. (Unfortunately will not trigger on findings in that first commit). The DevEx of semgrep-action is due for some love sorry for the dust :D |
@CAM-Gerlach other than the documentation task, I wanted to ask about your opinion on some further changes we're about to introduce. But first, a quick note about philosophy: the core semgrep project is designed to be a predictable, versatile tool with a clean API. This project however, the semgrep-agent, has less strictness in its design and has a budget for 'magical' behavior, so that it can behave as one would expect with zero configuration. With that in mind 1) Full scan on first runJust like you wanted your initial scan to be a full scan, we suspect almost everyone else does too. So we're thinking of ways to detect when the first scan happens. For users who are connecting to the Semgrep App this is easy, as we keep a record of scans on our database. But to be useful for users without an app account, we were thinking we could detect if this is likely the first scan by looking at the git diff being scanned. If the diff includes So based on this, we'd adjust the default behavior like so:
Of course, manually overriding the mode will also be a documented option. 2) Full scan's log outputWe find that full scans often return too many results to be reasonably scanned by eye. https://gist.github.com/underyx/0367f8bca91ab1a60721a6af210448d4 I'd be very curious to hear any and all comments you'd have about this presentation! Either via comments here… or if by any chance you're up for a 30 minute call, we could schedule that as well and compensate your time with a Semgrep swag pack :) |
Yeah, thanks; as I describe in my comment, it was this very fact that I had to push to a main branch that had CIs enabled (or do some hacky stuff with a new throwaway branch on the public upstream repo, which is what I should have done in retrospect) to get a full baseline scan, resulting in my mainline build failing and requiring rewriting Git history directly on
Yeah, I should have mentioned it above but I actually tried this initially (and a few other settings), but the behavior didn't seem to change. I even tried the second commit (since I always commit an empty first commit), but that didn't seem to work either.
No worries, it seems like a relatively young project relatively speaking (maybe I'm completely wrong about that, lol) and I appreciate the help! |
Sounds great! One thing you might consider is triggering a full scan by default if the semgrep workflow/pipeline are detected to have been modified, as this means errors from newly added rules/rule groups or other changes could be present in existing code and otherwise would go under the radar until the files get touched at some indefinite point in the future, and it seems a pretty safe assumption that if a developer modifies either of these files, they are specific interested in dealing with semgrep and its output, and thus it would make more sense to make them responsible for any new errors by default and ensure they are fixed immediately rather than an unsuspecting dev who happens to touch the files with the newly erroring code somewhere down the line. If you had a config file ( semgrep/semgrep#1054 ) that listed the rules and any other config settings, it would be more straightforward since it had a deterministic name and exclusively semgrep content, but you could still detect it pretty accurately via hueristics, just like you are proposing already. For example, you could check for adding/modifying a |
Happy to give me feedback; my UX experience is more on the GUI side of things (e.g. Spyder) but I can certainly share my perspective as a developer/user. Also, my usefulness is tempered by the fact that so far, I've only applied full linting suites like Semgrep to smaller (on the order of 10k LoC), well-linted projects with few/one active contributor, not the big ones like Spyder with lots of devs and inflight PRs (in part due to the magnitude of the scope, as well as difficulty getting widespread buyin from the community for the value of it relative to the effort and potential disruption. Would this be the replacement for the logs I see in the regular GH Actions UI (here's the example of my initial run on the small project in question above https://github.com/r-spacex/submanager/runs/3391738017?check_suite_focus=true ), would this be a separate log file generated as an artifact of the build, or something else? I like having a high-level summary, and what you have looks great overall so far; and is both more informative in most critical respects and easier to read/parse than the current form. There are some small things I might suggest fixing/clarifying, but let's just start with the main high level question: Would this be a replacement for the logs we currently see in GH Actions, that lists the actual errors and shows where they occur, along with the context? Or would this go somewhere else (if so, where? as a summary? as a comment?). If so, where would we be able to actually see what the errors are, without using the app? Not sure if this is what you're looking for, but on a high level, what I might suggest is something like:
But that's very much one newer user's opinion, so take that with a big grain of salt. For a point of reference, I based some of these suggestions on what Mega-Linter, and by extension Github's popular Super-Linter, does; see e.g. r-spacex/submanager#11 for a simple but representative example. With respect to the log output, semgrep is already far less verbose than Mega Linter, and is much easier to quickly find errors. GraphQL is more of a comparable product, but its output is very different and integrated with Github, rather than being standalone. I'd be flattered to hop on a call if you really think it would be helpful, given the caveats above. Just be aware I'm not even a "real" developer, I just play one on the internet. In "real life", I'm a NASA satellite data scientist and machine learning specialist. |
Before I send my other replies: just to clarify, touching a file with a pre-existing issue will not trigger a new finding. Even re-indenting or moving the exact line of code with existing issues will not trigger a new finding. I'm just noting this in case it helps with community adoption 🙃 |
@underyx Ah okay; sorry for misunderstanding and thanks for clarifying. In that case, are the new findings simply never raised at all in the situations I mentioned (adding a new policy, adding Semgrep as a whole) unless the specific line is touched, or what ends up happening? |
What actually happens is that for any current finding, we scan the old version of the code from the base branch and see if the same rule fires on an identical line (minus whitespace.) If yes, we don't report the finding, assuming that it's pre-existing. By the way, I'm working on some PRs for this issue right now, based on your feedback! Will respond to all of it soon! |
In my case I don't really want special magic so it behaves differently on a first scan vs later ones. I want predictable behaviour, so I know whether there are any issues in the repo or not. The workflow I prefer depends on the scale of the issues found, but is either:
Or:
In either scenario, I don't want my main branch ever failing and I don't want issues to be lurking but not shown. I assumed that setting |
👍
Yeah; that was my assumption as well, but at least when I experimented with it, it didn't seem to work at least at present. |
I second this. Please make it possible to do a full scan on merge requests, as well. I think semgrep is a great tool and it's nice to see you are taking care of sane defaults, but if you provide configuration options for people that should always override any defaults, please.
I also tried |
As for me, I would be satisfied with command-line option, forcing full scan. |
I've scoured the docs, searched Google, looked through every issue in this repo and tried a bunch of things myself, and I can't seem to find anywhere that documents how to disable diff-aware mode for PRs, e.g. for the initial run of Semgrep on a new repo, and to ensure that the entire codebase is clean. Could that be clarified? Thanks!
The text was updated successfully, but these errors were encountered: