path: root/docs/postmortems
author    Johannes Christ <[email protected]>    2024-07-25 20:41:55 +0200
committer Johannes Christ <[email protected]>    2024-07-25 20:56:43 +0200
commit  c1551cbce8527beb07e8cb38bbcc1f57f4185210 (patch)
tree    2b5ff651bc9922af239c58c30d18d4aa40f63a4a /docs/postmortems
parent  Fix index page formatting (diff)
Update document titles
Diffstat (limited to 'docs/postmortems')
-rw-r--r--  docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst   18
-rw-r--r--  docs/postmortems/2021-07-11-cascading-node-failures.rst   18
2 files changed, 18 insertions, 18 deletions
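
The underline swaps below follow the usual reStructuredText convention: each heading level uses its own underline character, the underline is at least as long as the heading text, and sections beneath the document title step down from "=" to "-". A minimal sketch of the resulting layout, using headings that appear in the diff (the document title shown here is inferred from the filename and is illustrative only):

    Site CPU/RAM exhaustion
    =======================

    ⚠️ Leadup
    ---------

    *List the sequence of events that led to the incident*

    πŸ₯ Impact
    ---------

With this structure, Sphinx treats the emoji-prefixed sections as children of the document title instead of siblings of it, which is what the title updates in this commit are aiming for.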
diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
index 57f9fd8..f621782 100644
--- a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
+++ b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
@@ -5,7 +5,7 @@ At 03:01 UTC on Tuesday 12th January we experienced a momentary outage
of our PostgreSQL database, causing some very minor service downtime.
⚠️ Leadup
-=========
+---------
*List the sequence of events that led to the incident*
@@ -35,7 +35,7 @@ before 3:40 where no statistics were reported is the actual outage
period where the Kubernetes node had some networking errors.
πŸ₯ Impact
-=========
+---------
*Describe how internal and external users were impacted during the
incident*
@@ -48,7 +48,7 @@ Most services stayed up that did not depend on PostgreSQL, and the site
remained stable after the sync had been cancelled.
πŸ‘οΈ Detection
-============
+---------------
*Report when the team detected the incident, and how we could improve
detection time*
@@ -69,7 +69,7 @@ time of crashing by examining the logs which pointed us towards the user
sync process.
πŸ™‹πŸΏβ€β™‚οΈ Response
-================
+-----------------------
*Who responded to the incident, and what obstacles did they encounter?*
@@ -78,7 +78,7 @@ encountered other than the node being less performant than we would like
due to the CPU starvation.
πŸ™†πŸ½β€β™€οΈ Recovery
-================
+---------------------------
*How was the incident resolved? How can we improve future mitigation?*
@@ -113,7 +113,7 @@ as you can see site hit this twice (during the periods where it was
trying to sync 80k users at once)
πŸ”Ž Five Why’s
-=============
+---------------------------
*Run a 5-whys analysis to understand the true cause of the incident.*
@@ -125,7 +125,7 @@ trying to sync 80k users at once)
resulting in 80k users needing updating.
🌱 Blameless root cause
-=======================
+-----------------------
*Note the final root cause and describe what needs to change to prevent
reoccurrance*
@@ -137,14 +137,14 @@ See the follow up tasks on exactly how we can avoid this in future, it’s
a relatively easy mitigation.
πŸ€” Lessons learned
-==================
+-----------------------
*What did we learn from this incident?*
- Django (or DRF) does not like huge update queries.
β˜‘οΈ Follow-up tasks
-==================
+------------------
*List any tasks we should complete that are relevant to this incident*
diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
index 6cd30f3..b2e5cdf 100644
--- a/docs/postmortems/2021-07-11-cascading-node-failures.rst
+++ b/docs/postmortems/2021-07-11-cascading-node-failures.rst
@@ -7,7 +7,7 @@ affected node volumes were placed into a state where they could not be
mounted.
⚠️ Leadup
-=========
+----------
*List the sequence of events that led to the incident*
@@ -32,7 +32,7 @@ mounted.
service was not healthy.
πŸ₯ Impact
-=========
+----------
*Describe how internal and external users were impacted during the
incident*
@@ -48,7 +48,7 @@ PostgreSQL was restored early on so for the most part Moderation could
continue.
πŸ‘οΈ Detection
-============
+---------------
*Report when the team detected the incident, and how we could improve
detection time*
@@ -61,7 +61,7 @@ conn delta vs.Β conn total), but for the most part alerting time was
satisfactory here.
πŸ™‹πŸΏβ€β™‚οΈ Response
-================
+-----------------
*Who responded to the incident, and what obstacles did they encounter?*
@@ -70,7 +70,7 @@ at Linode to remount the affected volumes, a support ticket has been
created.
πŸ™†πŸ½β€β™€οΈ Recovery
-================
+------------------
*How was the incident resolved? How can we improve future mitigation?*
@@ -262,7 +262,7 @@ Joe
</details>
πŸ”Ž Five Why’s
-=============
+---------------
*Run a 5-whys analysis to understand the true cause of the incident.*
@@ -305,7 +305,7 @@ have. I presume it’s some sort of worker race condition, but I’ve not
been able to reproduce it.
🌱 Blameless root cause
-=======================
+-----------------------
*Note the final root cause and describe what needs to change to prevent
reoccurrence*
@@ -315,7 +315,7 @@ mounting was difficult, it took multiple hours for this to be resolved
since we had to wait for the volumes to unlock so they could be cloned.
πŸ€” Lessons learned
-==================
+------------------
*What did we learn from this incident?*
@@ -325,7 +325,7 @@ We need to look at why Django is doing this and mitigations of the fault
to prevent this from occurring again.
β˜‘οΈ Follow-up tasks
-==================
+------------------
*List any tasks we should complete that are relevant to this incident*