author     2024-07-25 20:41:55 +0200
committer  2024-07-25 20:56:43 +0200
commit     c1551cbce8527beb07e8cb38bbcc1f57f4185210 (patch)
tree       2b5ff651bc9922af239c58c30d18d4aa40f63a4a /docs/postmortems
parent     Fix index page formatting (diff)
Update document titles
Diffstat (limited to 'docs/postmortems')
-rw-r--r--  docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst  18
-rw-r--r--  docs/postmortems/2021-07-11-cascading-node-failures.rst  18
2 files changed, 18 insertions, 18 deletions
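
The diff below demotes every section underline in the two postmortem documents from = to -. In reStructuredText, docutils derives the section hierarchy from the order in which underline styles first appear, so this change presumably reserves the = style for each document's top-level title (hence the commit subject "Update document titles"). A minimal sketch of the assumed layout, with a placeholder title line:

    Placeholder document title
    ==========================

    ⚠️ Leadup
    ---------

    *List the sequence of events that led to the incident*

With that ordering, a Sphinx/docutils build renders the first heading as the page title and each emoji-prefixed heading as a second-level section.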
diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
index 57f9fd8..f621782 100644
--- a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
+++ b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
@@ -5,7 +5,7 @@
 At 03:01 UTC on Tuesday 12th January we experienced a momentary outage of our PostgreSQL database, causing some very minor service downtime.

 ⚠️ Leadup
-=========
+---------

 *List the sequence of events that led to the incident*
@@ -35,7 +35,7 @@
 before 3:40 where no statistics were reported is the actual outage period where the Kubernetes node had some networking errors.

 💥 Impact
-=========
+---------

 *Describe how internal and external users were impacted during the incident*
@@ -48,7 +48,7 @@
 Most services stayed up that did not depend on PostgreSQL, and the site remained stable after the sync had been cancelled.

 👁️ Detection
-============
+---------------

 *Report when the team detected the incident, and how we could improve detection time*
@@ -69,7 +69,7 @@
 time of crashing by examining the logs which pointed us towards the user sync process.

 🙋🏿‍♂️ Response
-================
+-----------------------

 *Who responded to the incident, and what obstacles did they encounter?*
@@ -78,7 +78,7 @@
 encountered other than the node being less performant than we would like due to the CPU starvation.

 🙆🏽‍♀️ Recovery
-================
+---------------------------

 *How was the incident resolved? How can we improve future mitigation?*
@@ -113,7 +113,7 @@
 as you can see site hit this twice (during the periods where it was trying to sync 80k users at once)

 🔎 Five Why’s
-=============
+---------------------------

 *Run a 5-whys analysis to understand the true cause of the incident.*
@@ -125,7 +125,7 @@
 resulting in 80k users needing updating.

 🌱 Blameless root cause
-=======================
+-----------------------

 *Note the final root cause and describe what needs to change to prevent reoccurrance*
@@ -137,14 +137,14 @@
 See the follow up tasks on exactly how we can avoid this in future, it’s a relatively easy mitigation.

 🤔 Lessons learned
-==================
+-----------------------

 *What did we learn from this incident?*

 - Django (or DRF) does not like huge update queries.

 ☑️ Follow-up tasks
-==================
+------------------

 *List any tasks we should complete that are relevant to this incident*
diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
index 6cd30f3..b2e5cdf 100644
--- a/docs/postmortems/2021-07-11-cascading-node-failures.rst
+++ b/docs/postmortems/2021-07-11-cascading-node-failures.rst
@@ -7,7 +7,7 @@
 affected node volumes were placed into a state where they could not be mounted.

 ⚠️ Leadup
-=========
+----------

 *List the sequence of events that led to the incident*
@@ -32,7 +32,7 @@
 mounted.
 service was not healthy.

 💥 Impact
-=========
+----------

 *Describe how internal and external users were impacted during the incident*
@@ -48,7 +48,7 @@
 PostgreSQL was restored early on so for the most part Moderation could continue.

 👁️ Detection
-============
+---------------

 *Report when the team detected the incident, and how we could improve detection time*
@@ -61,7 +61,7 @@
 conn delta vs. conn total), but for the most part alerting time was satisfactory here.

 🙋🏿‍♂️ Response
-================
+-----------------

 *Who responded to the incident, and what obstacles did they encounter?*
@@ -70,7 +70,7 @@
 at Linode to remount the affected volumes, a support ticket has been created.

 🙆🏽‍♀️ Recovery
-================
+------------------

 *How was the incident resolved? How can we improve future mitigation?*
@@ -262,7 +262,7 @@
 Joe
 </details>

 🔎 Five Why’s
-=============
+---------------

 *Run a 5-whys analysis to understand the true cause of the incident.*
@@ -305,7 +305,7 @@
 have. I presume it’s some sort of worker race condition, but I’ve not been able to reproduce it.

 🌱 Blameless root cause
-=======================
+-----------------------

 *Note the final root cause and describe what needs to change to prevent reoccurrence*
@@ -315,7 +315,7 @@
 mounting was difficult, it took multiple hours for this to be resolved since we had to wait for the volumes to unlock so they could be cloned.

 🤔 Lessons learned
-==================
+------------------

 *What did we learn from this incident?*
@@ -325,7 +325,7 @@
 We need to look at why Django is doing this and mitigations of the fault to prevent this from occurring again.

 ☑️ Follow-up tasks
-==================
+------------------

 *List any tasks we should complete that are relevant to this incident*