diff --git a/docs/source/incident-reports/2022-01-27-prime-version.md b/docs/source/incident-reports/2022-01-27-prime-version.md
new file mode 100644
index 000000000..e80edcd5c
--- /dev/null
+++ b/docs/source/incident-reports/2022-01-27-prime-version.md
@@ -0,0 +1,131 @@
+# 2022-01-27, stale prime version
+
+## Summary
+
+A bug in the federation-redirect health checks
+kept the reference version from being updated,
+preventing traffic from being sent to other federation members.
+
+This effect lasted approximately nine hours before normal operation was restored without intervention.
+
+![Grafana chart showing traffic to other federation members dropping for hours, then rising again](img/2022-01-27-pods.png)
+
+## Timeline
+
+All times in CET
+
+### 2022-01-27 11:00
+
+A routine bump of binderhub and reop2docker versions is deployed.
+Traffic to federation members other than GKE begins to drop.
+
+At this point, the prime cluster (GKE) is 'full' according to its pod quota.
+
+Due to a bug in federation-redirect,
+as long as a cluster is full,
+its version is not checked.
+
+However, the version of the prime cluster is used as a reference
+when comparing with other federation members.
+
+As long as the prime cluster stays 'full' (or unhealthy for any reason),
+the reference version will be out of date,
+Any actually up-to-date federation members will be considered to have mismatched version,
+and will not receive any redirect traffic.
+
+### 14:35
+
+Federation members notice drop in traffic.
+
+### 18:00
+
+GKE load drops below quota,
+allowing the prime version to be updated in the federation-redirector.
+At this point, traffic to all federation members returns to normal.
+
+### 2022-01-28 09:00
+
+Investigations starts.
+
+federation-redirect logs show that the version mismatch is the cause of the drop in traffic,
+and that the 'prime' version is the one that's out-of-date.
+
+Initial suspicion was an out-of-date binder pod was still receiving traffic.
+Exploring logs in GCP logs explorer eventually eliminated this possibility.
+
+Useful log query:
+
+```
+resource.type="k8s_container" resource.labels.cluster_name="prod"
+resource.labels.container_name="binder"
+"GET / "
+```
+
+It was useful to expand the "Log Fields" to see the counts by pod name within the time frame:
+
+![GCP log fields showing log events per pod name](img/2022-01-27-log-fields.png)
+
+A brief digression reduced trust in GCP logs because it is clear that the "Kubernetes Pod" resource logs are missing many entries:
+
+```
+resource.type="k8s_pod" resource.labels.cluster_name="prod"
+resource.labels.pod_name=~"^binder.*"
+```
+
+
+
+### 2022-01-28 10:00
+
+Underlying issue in federation-redirect is identified and [fixed][fix-pr].
+
+## Lessons learnt
+
+### What went well
+
+1. Cluster healed itself when traffic dropped below the prime cluster's pod quota.
+2. federation-redirect logs clearly showed it was the version mismatch preventing traffic.
+
+
+### What went wrong
+
+1. federation-redirect did not have the correct 'prime version' for 9 hours, due to load on the prime cluster ([fixed][fix-pr])
+2. Response time was low. The issue went unnoticed for 3 hours,
+   and without action for at least 9 hours (when it resolved itself).
+   We do not currently have much int he way of proactive 'alerting' ([issue][alerting-issue]),
+   so only when grafana or the status page is being watched do we notice issues like this.
+3. The health of the cluster is not immediately visible except in logs or long-term traffic trends,
+   which can take a while to be noticed ([issue][metrics-issue]).
+   We do have most of these checks reimplemented on [the status page][status],
+   but not the actual values as considered by the redirector (bugs and all).
+   Notably, the status page does not include the version check,
+   which caused this incident.
+
+[status]: https://mybinder.readthedocs.io/en/latest/about/status.html
+
+
+### Where we got lucky
+
+1. Federation 'as a whole' handled the traffic okay,
+   because the prime cluster can actually handle more traffic than its pod quota.
+   That said, if we actually stopped routing traffic to the prime cluster while it is unhealthy,
+   the issue would hav been resolved in minutes
+   because it would not have stayed full.
+
+## Action items
+
+### Process improvements
+
+1. Implement proactive alerts. This is a longstanding issue we have yet to address ([issue][alerting-issue]).
+
+
+### Technical improvements
+
+1. Fix bug preventing prime cluster's version from being updated while it is full ([pull request][fix-pr])
+2. Add metrics for the federation-redirect service to improve visibility of federation health issues ([issue][metrics-issue])
+3. Consider allowing the prime cluster to stop accepting traffic when it is unhealthy ([issue][full-issue])
+
+
+[alerting-issue]: https://github.com/jupyterhub/mybinder.org-deploy/issues/611
+[metrics-issue]: https://github.com/jupyterhub/mybinder.org-deploy/issues/2114
+[fix-pr]: https://github.com/jupyterhub/mybinder.org-deploy/pull/2113
+[full-issue]: https://github.com/jupyterhub/mybinder.org-deploy/issues/2115
diff --git a/docs/source/incident-reports/img/2022-01-27-log-fields.png b/docs/source/incident-reports/img/2022-01-27-log-fields.png
new file mode 100644
index 000000000..c89bebf39
Binary files /dev/null and b/docs/source/incident-reports/img/2022-01-27-log-fields.png differ
diff --git a/docs/source/incident-reports/img/2022-01-27-pods.png b/docs/source/incident-reports/img/2022-01-27-pods.png
new file mode 100644
index 000000000..6e2d3ed65
Binary files /dev/null and b/docs/source/incident-reports/img/2022-01-27-pods.png differ
diff --git a/docs/source/incident-reports/index.rst b/docs/source/incident-reports/index.rst
index e70908b06..ab46fa77a 100644
--- a/docs/source/incident-reports/index.rst
+++ b/docs/source/incident-reports/index.rst
@@ -26,6 +26,7 @@ Incident history
 .. toctree::
    :maxdepth: 1
 
+   2022-01-27-prime-version.md
    2020-07-09-simultaneous-launches-scipy.md
    2019-04-03-ingress-cordoned
    2019-03-24-r2d-upgrade