Monitoring in RDSS

Where to start

Look here first: RDSS Status Page

However, the status page only checks the front page of each of our production systems. Other things can be going wrong even if that page is all green.
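To illustrate why a green front page is not the whole story, here is a minimal sketch of a check that probes several paths of one application instead of only `/`. It is not how the status page is implemented. The base URL, the `/health` path, and the function names `http_status` and `check_paths` are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def check_paths(
    base_url: str,
    paths: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each path to True if it answered with a 2xx or 3xx status."""
    results = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        results[path] = 200 <= status < 400
    return results


if __name__ == "__main__":
    # Hypothetical host and paths; substitute a real application to try it.
    for path, ok in check_paths("https://example.org", ["/", "/health"]).items():
        print(f"{path}: {'up' if ok else 'DOWN'}")
```

The `fetch` parameter is injectable so the logic can be exercised without network access. A front page returning 200 while a deeper path fails is exactly the case a front-page-only check misses.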

Goals of monitoring:

  1. We should know before our users do when there is a problem with systems we manage.
  2. During emergency production outages, we should have information to help us resolve the issue quickly.
  3. During development, we should have performance telemetry telling us whether the systems we are building will scale and perform as needed.

The Monitoring Systems

  1. Honeybadger - for capturing exceptions in a running application. This makes it easier for us to fix bugs, because it gives us a stack trace and lots of context about how the exception occurred. We sometimes use Honeybadger's uptime monitoring too, but it is very simple (just a binary Up or Down) and only works for systems that are reachable from the open Internet. New staff need to be added to Honeybadger manually. Each team member should also subscribe to uptime notifications via email or SMS.
  2. DataDog - for Application Performance Monitoring (APM). This service can be expensive, so we only turn it on as needed, and we follow practices to keep costs down. However, it can be invaluable for diagnosing performance issues.
  3. Icinga - coming soon!

Monitoring by Application Group

Princeton Data Commons (PDC*)

The suite of applications that makes up the Princeton Data Commons includes (so far):