path: root/docs/postmortems
author    Johannes Christ <[email protected]>    2024-07-25 20:41:55 +0200
committer Johannes Christ <[email protected]>    2024-07-25 20:56:43 +0200
commit  c1551cbce8527beb07e8cb38bbcc1f57f4185210 (patch)
tree    2b5ff651bc9922af239c58c30d18d4aa40f63a4a /docs/postmortems
parent  Fix index page formatting (diff)
Update document titles
Diffstat (limited to 'docs/postmortems')
-rw-r--r--  docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst   18
-rw-r--r--  docs/postmortems/2021-07-11-cascading-node-failures.rst   18
2 files changed, 18 insertions, 18 deletions
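
The underline swaps below follow the usual reStructuredText convention: each heading level uses its own underline character, the underline is at least as long as the heading text, and sections beneath the document title step down from "=" to "-". A minimal sketch of the resulting layout, using headings that appear in the diff (the document title shown here is inferred from the filename and is illustrative only):

    Site CPU/RAM exhaustion
    =======================

    ⚠️ Leadup
    ---------

    *List the sequence of events that led to the incident*

    πŸ₯ Impact
    ---------

With this structure, Sphinx treats the emoji-prefixed sections as children of the document title instead of siblings of it, which is what the title updates in this commit are aiming for.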
diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
index 57f9fd8..f621782 100644
--- a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
+++ b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
@@ -5,7 +5,7 @@ At 03:01 UTC on Tuesday 12th January we experienced a momentary outage
of our PostgreSQL database, causing some very minor service downtime.
⚠️ Leadup
-=========
+---------
*List the sequence of events that led to the incident*
@@ -35,7 +35,7 @@ before 3:40 where no statistics were reported is the actual outage
period where the Kubernetes node had some networking errors.
πŸ₯ Impact
-=========
+---------
*Describe how internal and external users were impacted during the
incident*
@@ -48,7 +48,7 @@ Most services stayed up that did not depend on PostgreSQL, and the site
remained stable after the sync had been cancelled.
πŸ‘οΈ Detection
-============
+---------------
*Report when the team detected the incident, and how we could improve
detection time*
@@ -69,7 +69,7 @@ time of crashing by examining the logs which pointed us towards the user
sync process.
πŸ™‹πŸΏβ€β™‚οΈ Response
-================
+-----------------------
*Who responded to the incident, and what obstacles did they encounter?*
@@ -78,7 +78,7 @@ encountered other than the node being less performant than we would like
due to the CPU starvation.
πŸ™†πŸ½β€β™€οΈ Recovery
-================
+---------------------------
*How was the incident resolved? How can we improve future mitigation?*
@@ -113,7 +113,7 @@ as you can see site hit this twice (during the periods where it was
trying to sync 80k users at once)
πŸ”Ž Five Why’s
-=============
+---------------------------
*Run a 5-whys analysis to understand the true cause of the incident.*
@@ -125,7 +125,7 @@ trying to sync 80k users at once)
resulting in 80k users needing updating.
🌱 Blameless root cause
-=======================
+-----------------------
*Note the final root cause and describe what needs to change to prevent
reoccurrance*
@@ -137,14 +137,14 @@ See the follow up tasks on exactly how we can avoid this in future, it’s
a relatively easy mitigation.
πŸ€” Lessons learned
-==================
+-----------------------
*What did we learn from this incident?*
- Django (or DRF) does not like huge update queries.
β˜‘οΈ Follow-up tasks
-==================
+------------------
*List any tasks we should complete that are relevant to this incident*
diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
index 6cd30f3..b2e5cdf 100644
--- a/docs/postmortems/2021-07-11-cascading-node-failures.rst
+++ b/docs/postmortems/2021-07-11-cascading-node-failures.rst
@@ -7,7 +7,7 @@ affected node volumes were placed into a state where they could not be
mounted.
⚠️ Leadup
-=========
+----------
*List the sequence of events that led to the incident*
@@ -32,7 +32,7 @@ mounted.
service was not healthy.
πŸ₯ Impact
-=========
+----------
*Describe how internal and external users were impacted during the
incident*
@@ -48,7 +48,7 @@ PostgreSQL was restored early on so for the most part Moderation could
continue.
πŸ‘οΈ Detection
-============
+---------------
*Report when the team detected the incident, and how we could improve
detection time*
@@ -61,7 +61,7 @@ conn delta vs.Β conn total), but for the most part alerting time was
satisfactory here.
πŸ™‹πŸΏβ€β™‚οΈ Response
-================
+-----------------
*Who responded to the incident, and what obstacles did they encounter?*
@@ -70,7 +70,7 @@ at Linode to remount the affected volumes, a support ticket has been
created.
πŸ™†πŸ½β€β™€οΈ Recovery
-================
+------------------
*How was the incident resolved? How can we improve future mitigation?*
@@ -262,7 +262,7 @@ Joe
</details>
πŸ”Ž Five Why’s
-=============
+---------------
*Run a 5-whys analysis to understand the true cause of the incident.*
@@ -305,7 +305,7 @@ have. I presume it’s some sort of worker race condition, but I’ve not
been able to reproduce it.
🌱 Blameless root cause
-=======================
+-----------------------
*Note the final root cause and describe what needs to change to prevent
reoccurrence*
@@ -315,7 +315,7 @@ mounting was difficult, it took multiple hours for this to be resolved
since we had to wait for the volumes to unlock so they could be cloned.
πŸ€” Lessons learned
-==================
+------------------
*What did we learn from this incident?*
@@ -325,7 +325,7 @@ We need to look at why Django is doing this and mitigations of the fault
to prevent this from occurring again.
β˜‘οΈ Follow-up tasks
-==================
+------------------
*List any tasks we should complete that are relevant to this incident*