Diffstat (limited to 'docs/postmortems')
-rw-r--r--   docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst   | 18
-rw-r--r--   docs/postmortems/2021-07-11-cascading-node-failures.rst   | 18
2 files changed, 18 insertions, 18 deletions
diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
index 57f9fd8..f621782 100644
--- a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
+++ b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
@@ -5,7 +5,7 @@
 At 03:01 UTC on Tuesday 12th January we experienced a momentary outage
 of our PostgreSQL database, causing some very minor service downtime.
 
 β οΈ Leadup
-=========
+---------
 
 *List the sequence of events that led to the incident*
@@ -35,7 +35,7 @@ before 3:40 where no statistics were reported is the actual outage
 period where the Kubernetes node had some networking errors.
 
 π₯ Impact
-=========
+---------
 
 *Describe how internal and external users were impacted during the
 incident*
@@ -48,7 +48,7 @@ Most services stayed up that did not depend on PostgreSQL, and the site
 remained stable after the sync had been cancelled.
 
 ποΈ Detection
-============
+---------------
 
 *Report when the team detected the incident, and how we could improve
 detection time*
@@ -69,7 +69,7 @@
 time of crashing by examining the logs which pointed us towards the user
 sync process.
 
 ππΏββοΈ Response
-================
+-----------------------
 
 *Who responded to the incident, and what obstacles did they encounter?*
@@ -78,7 +78,7 @@
 encountered other than the node being less performant than we would like
 due to the CPU starvation.
 
 ππ½ββοΈ Recovery
-================
+---------------------------
 
 *How was the incident resolved? How can we improve future mitigation?*
@@ -113,7 +113,7 @@
 as you can see site hit this twice (during the periods where it was
 trying to sync 80k users at once)
 
 π Five Why’s
-=============
+---------------------------
 
 *Run a 5-whys analysis to understand the true cause of the incident.*
@@ -125,7 +125,7 @@ trying to sync 80k users at once)
    resulting in 80k users needing updating.
 
 π± Blameless root cause
-=======================
+-----------------------
 
 *Note the final root cause and describe what needs to change to prevent
 reoccurrance*
@@ -137,14 +137,14 @@
 See the follow up tasks on exactly how we can avoid this in future, it’s
 a relatively easy mitigation.
 
 π€ Lessons learned
-==================
+-----------------------
 
 *What did we learn from this incident?*
 
 -  Django (or DRF) does not like huge update queries.
 
 βοΈ Follow-up tasks
-==================
+------------------
 
 *List any tasks we should complete that are relevant to this incident*
diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
index 6cd30f3..b2e5cdf 100644
--- a/docs/postmortems/2021-07-11-cascading-node-failures.rst
+++ b/docs/postmortems/2021-07-11-cascading-node-failures.rst
@@ -7,7 +7,7 @@
 affected node volumes were placed into a state where they could not be
 mounted.
 
 β οΈ Leadup
-=========
+----------
 
 *List the sequence of events that led to the incident*
@@ -32,7 +32,7 @@ mounted.
    service was not healthy.
 
 π₯ Impact
-=========
+----------
 
 *Describe how internal and external users were impacted during the
 incident*
@@ -48,7 +48,7 @@ PostgreSQL was restored early on so for the most part Moderation could
 continue.
 
 ποΈ Detection
-============
+---------------
 
 *Report when the team detected the incident, and how we could improve
 detection time*
@@ -61,7 +61,7 @@
 conn delta vs. conn total), but for the most part alerting time was
 satisfactory here.
 
 ππΏββοΈ Response
-================
+-----------------
 
 *Who responded to the incident, and what obstacles did they encounter?*
@@ -70,7 +70,7 @@
 at Linode to remount the affected volumes, a support ticket has been
 created.
 
 ππ½ββοΈ Recovery
-================
+------------------
 
 *How was the incident resolved? How can we improve future mitigation?*
@@ -262,7 +262,7 @@
 Joe
    </details>
 
 π Five Why’s
-=============
+---------------
 
 *Run a 5-whys analysis to understand the true cause of the incident.*
@@ -305,7 +305,7 @@ have. I presume it’s some sort of worker race condition, but I’ve not
 been able to reproduce it.
 
 π± Blameless root cause
-=======================
+-----------------------
 
 *Note the final root cause and describe what needs to change to prevent
 reoccurrence*
@@ -315,7 +315,7 @@
 mounting was difficult, it took multiple hours for this to be resolved
 since we had to wait for the volumes to unlock so they could be cloned.
 
 π€ Lessons learned
-==================
+------------------
 
 *What did we learn from this incident?*
@@ -325,7 +325,7 @@
 We need to look at why Django is doing this and mitigations of the fault
 to prevent this from occurring again.
 
 βοΈ Follow-up tasks
-==================
+------------------
 
 *List any tasks we should complete that are relevant to this incident*
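The only change in both postmortems is the reStructuredText underline style beneath each section heading: "=" becomes "-". In reST, section levels are assigned by the order in which each adornment style is first encountered, so sections underlined with "=" in the same style as the document title are parsed as additional top-level sections rather than children of it; demoting them to "-" nests them under the postmortem title. A minimal sketch of the resulting structure, using a hypothetical placeholder title rather than the real file contents::

    2021-01-12: example postmortem title
    ====================================

    Leadup
    ------

    Impact
    ------

Docutils also warns when an underline is shorter than its heading text, which would explain why several of the replacement "-" runs are longer than the "=" runs they replace.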
