
User-Agent filter list in the UI / new crawler request #12753

Open
dryoma opened this issue Apr 12, 2019 · 4 comments
Comments

@dryoma

dryoma commented Apr 12, 2019

Summary

Add a setting in the interface to ban bots based on their User-Agent. Or, at a minimum, could the offending User-Agents be recorded so that we could block them manually?

Motivation

There is a toggle called "Filter out known web crawlers" on the Inbound Filters settings page. Sometimes new crawlers appear that slip past that filter. The bad part is that ignoring them in beforeSend doesn't always work, even when the User-Agent clearly identifies the bot. They might be using some kind of cached versions of pages, or they might be stripping script objects from pages; in any case, for us the flood of about 200 events per hour hasn't stopped even after adding this code in beforeSend():

  // Guard against a missing User-Agent header before calling .search() on it.
  const ua = event.request && event.request.headers && event.request.headers['User-Agent'];
  if (/lyticsbot/.test(window.navigator.userAgent) || (ua && ua.search('lyticsbot') !== -1)) {
    return null;
  }
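A sketch of how this check could be factored into a small helper that guards against missing request data (the helper name and the wiring are illustrative, not part of the Sentry SDK):

```javascript
// Hypothetical helper (not a Sentry SDK API): returns true when the event's
// request User-Agent header matches any of the given bot patterns.
function isBotEvent(event, botPatterns) {
  const headers = (event && event.request && event.request.headers) || {};
  const ua = headers['User-Agent'] || '';
  return botPatterns.some((pattern) => pattern.test(ua));
}

// Possible wiring into the SDK (sketch):
// Sentry.init({
//   beforeSend(event) {
//     return isBotEvent(event, [/lyticsbot/]) ? null : event;
//   },
// });
```

Keeping the pattern list in one place would also make it easy to extend when the next crawler shows up.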

Additional Context

Here is the issue page: https://sentry.io/organizations/policeone/issues/982063112/events/latest/?project=67360 It actually started after upgrading to the new JS SDK (4.6.6) from raven (3.26.4). Prior to that, even the window.navigator.userAgent check alone was sufficient for blocking errors from that bot.

The crawler's UA in almost 100% of the cases is

User-Agent: lyticsbot-external
@donaldpipowitch

That would be awesome: a way to add custom crawlers (by adding a user-agent regex) that are not part of "Filter out known web crawlers". In my case the requests come from site24x7.com.

@josephwynn-sc

We would find this feature really useful. Right now we get a lot of events from a variety of bots, and it would be useful if we could apply inbound filters to ignore these. The majority of the bots we get are:

  • Performance testing tools (WebPageTest, SpeedCurve, Lighthouse, etc.)
  • Headless browsers that are scraping content or trying to find vulnerabilities
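For reference, a sketch of what User-Agent patterns for those tool categories might look like. The fragments below are illustrative guesses, not an official list; anyone adopting this should verify them against their own traffic first:

```javascript
// Illustrative User-Agent fragments for some synthetic-testing and headless tools.
// These are assumptions, not a vetted list - check your own event data.
const SYNTHETIC_UA_PATTERNS = [
  /PTST/,              // WebPageTest is believed to append "PTST/<version>" to the UA
  /Chrome-Lighthouse/, // Lighthouse audits
  /HeadlessChrome/,    // headless Chrome scrapers
];

function isSyntheticUA(ua) {
  return SYNTHETIC_UA_PATTERNS.some((pattern) => pattern.test(ua || ''));
}
```

Such a list could be dropped straight into a beforeSend check, or, if this feature request lands, into an inbound-filter text box.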

@getsantry
Contributor

getsantry bot commented Nov 14, 2024

Routing to @getsentry/product-owners-issues for triage ⏲️

@lobsterkatie
Member

One relatively easy first step would be to add a note in the product (or a link from the product to the docs) suggesting that users either file an issue asking for an update, or simply open a PR, against https://github.com/getsentry/relay/blob/master/relay-filter/src/web_crawlers.rs or https://github.com/getsentry/relay/blob/master/relay-filter/src/browser_extensions.rs, as appropriate.
