Look here first: RDSS Status Page
However, this only checks the front page of our production systems. There can be other things going wrong even if that page is all green.
- We should know before our users do when there is a problem with systems we manage.
- During emergency production outages, we should have information to help us resolve the issue quickly.
- During development, we should have performance telemetry telling us whether the systems we are building will scale and perform as needed.
- Honeybadger - for capturing exceptions in a running application. This makes it easier for us to fix bugs, because it gives us a stack trace and lots of context about how the exception occurred. We sometimes use Honeybadger uptime monitoring too, but it is very simple (just a binary Up or Down) and only works for systems that are available on the open Internet. New staff need to be added to Honeybadger manually. Each team member should also subscribe to uptime notifications via email or SMS.
- Datadog - for Application Performance Monitoring (APM). This service can be expensive, so we only turn it on as needed, and we follow practices to keep costs down. However, it can be invaluable for diagnosing performance issues.
- Icinga - coming soon!
The suite of applications that makes up the Princeton Data Commons includes (so far):
- PDC Discovery
  - Our current setup checks the application once a minute and reports after two failures (i.e., we are notified after about two minutes). See PDC Discovery Monitoring.
- PDC Describe
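
The "check once a minute, report after two failures" rule described for PDC Discovery can be sketched in a few lines of Python. This is only an illustration of the logic, not our actual monitoring configuration: the `UptimeMonitor` class, the `http_check` helper, and the constant names are hypothetical, and the sketch assumes the two failures must be consecutive, which the description above does not state explicitly.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable

FAILURE_THRESHOLD = 2        # report after two failed checks (from the setup above)
CHECK_INTERVAL_SECONDS = 60  # one check per minute (from the setup above)


@dataclass
class UptimeMonitor:
    """Hypothetical helper: counts failed checks and decides when to notify."""

    check: Callable[[], bool]           # returns True when the site is up
    threshold: int = FAILURE_THRESHOLD
    consecutive_failures: int = 0

    def poll(self) -> bool:
        """Run one check; return True if a notification should be sent now."""
        if self.check():
            # Assumption: any successful check resets the failure count.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Notify once, at the moment the threshold is reached, so an
        # extended outage does not produce a new alert every minute.
        return self.consecutive_failures == self.threshold


def http_check(url: str, timeout: float = 10.0) -> bool:
    """Hypothetical check: the site is 'up' if the URL answers without an error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections, timeouts.
        return False
```

To run this against a real site, a caller would build `UptimeMonitor(check=lambda: http_check(url))`, call `poll()` every `CHECK_INTERVAL_SECONDS`, and send a notification whenever it returns `True`. With these values, the first alert arrives roughly two minutes after an outage begins, which matches the behavior described above.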