author    Johannes Christ <[email protected]>  2024-04-27 21:21:51 +0200
committer Joe Banks <[email protected]>  2024-04-27 21:21:04 +0100
commit    de9307796340070c0b44e6325a902184ad65492a (patch)
tree      f7a873d1a4b14281580b0450ba77ee9290b22c3c /docs/postmortems
parent    Use same indent for all fail2ban options (diff)
Move documentation to Hugo
Shortly before merge, the repository settings need to be updated to set GitHub Actions as the deployment source, to prevent GitHub from trying to build with Jekyll.
Diffstat (limited to 'docs/postmortems')
-rw-r--r--  docs/postmortems/2020-12-11-all-services-outage.md            |  86
-rw-r--r--  docs/postmortems/2020-12-11-postgres-conn-surge.md             |  96
-rw-r--r--  docs/postmortems/2021-01-10-primary-kubernetes-node-outage.md  |  86
-rw-r--r--  docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.md         | 112
-rw-r--r--  docs/postmortems/2021-01-30-nodebalancer-fails-memory.md       | 101
-rw-r--r--  docs/postmortems/2021-07-11-cascading-node-failures.md         | 185
-rw-r--r--  docs/postmortems/index.md                                      |  10
7 files changed, 0 insertions, 676 deletions
diff --git a/docs/postmortems/2020-12-11-all-services-outage.md b/docs/postmortems/2020-12-11-all-services-outage.md
deleted file mode 100644
index 35c6d70..0000000
--- a/docs/postmortems/2020-12-11-all-services-outage.md
+++ /dev/null
@@ -1,86 +0,0 @@
----
-layout: default
-title: "2020-12-11: All services outage"
-parent: Postmortems
-nav_order: 2
----
-
-# 2020-12-11: All services outage
-
-At **19:55 UTC, all services became unresponsive**. The DevOps team was already in a call, and immediately started to investigate.
-
-Postgres was running at 100% CPU usage due to a **VACUUM**, which caused all services that depended on it to stop working. The high CPU usage left the host unresponsive and it shut down. Linode Lassie noticed this and triggered a restart.
-
-It did not recover gracefully from this restart, with numerous core services reporting an error, so we had to manually restart core system services using Lens in order to get things working again.
-
-## ⚠️ Leadup
-
-*List the sequence of events that led to the incident*
-
-Postgres triggered an **AUTOVACUUM**, which led to a CPU spike. This made Postgres run at 100% CPU and become unresponsive, which caused dependent services to stop responding. This led to a restart of the node, from which we did not recover gracefully.
-
-## 🥏 Impact
-
-*Describe how internal and external users were impacted during the incident*
-
-All services went down. Catastrophic failure. We did not pass go, we did not collect $200.
-
-- Help channel system unavailable, so people are not able to effectively ask for help.
-- Gates unavailable, so people can't successfully get into the community.
-- Moderation and raid prevention unavailable, which leaves us defenseless against attacks.
-
-## 👁️ Detection
-
-*Report when the team detected the incident, and how we could improve detection time*
-
-We noticed that all PyDis services had stopped responding. Coincidentally, our DevOps team was in a call at the time, so that was helpful.
-
-We may be able to improve detection time by adding monitoring of resource usage. To this end, we've added alerts for high CPU usage and low memory.
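-
-As a rough sketch, Prometheus alert rules along these lines would cover that (assuming node_exporter metrics; the thresholds, durations and labels here are illustrative, not our live configuration):
-
-```yaml
-groups:
-  - name: node-resources
-    rules:
-      # Page when a node's CPU has been mostly busy for a sustained period.
-      - alert: HighNodeCPU
-        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
-        for: 5m
-        labels:
-          severity: page
-        annotations:
-          summary: Node CPU usage has been above 90% for 5 minutes
-      # Page when less than 10% of node memory is available.
-      - alert: LowNodeMemory
-        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
-        for: 5m
-        labels:
-          severity: page
-        annotations:
-          summary: Less than 10% of node memory is available
-```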
-
-## 🙋🏿‍♂️ Response
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded to the incident.
-
-We noticed our node was entirely unresponsive and within minutes a restart had been triggered by Lassie after a high CPU shutdown occurred.
-
-The node came back and we saw a number of core services offline (e.g. Calico, CoreDNS, Linode CSI).
-
-**Obstacle: no recent database back-up available**{: .text-red-200 }
-
-## 🙆🏽‍♀️ Recovery
-
-*How was the incident resolved? How can we improve future mitigation times?*
-
-Through [Lens](https://k8slens.dev/) we restarted core services one by one until they stabilised; after these core services were up, other services began to come back online.
-
-We finally re-provisioned PostgreSQL, which had been removed as a component before the restart (but too late to prevent the CPU errors). Once PostgreSQL was up, we restarted any components that were acting buggy (e.g. site and bot).
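-
-We drove the restarts through the Lens UI, but the same thing can be done with kubectl; a rough equivalent (workload names are assumptions and vary by cluster) would be:
-
-```sh
-# Restart core components one at a time and wait for each to settle.
-kubectl -n kube-system rollout restart deployment/coredns
-kubectl -n kube-system rollout status deployment/coredns
-
-kubectl -n kube-system rollout restart daemonset/calico-node
-kubectl -n kube-system rollout status daemonset/calico-node
-```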
-
-## 🔎 Five Why's
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-- Major service outage
-- **Why?** Core service failures (e.g. Calico, CoreDNS, Linode CSI)
-- **Why?** Kubernetes worker node restart
-- **Why?** High CPU shutdown
-- **Why?** Intensive PostgreSQL AUTOVACUUM caused a CPU spike
-
-## 🌱 Blameless root cause
-
-*Note the final root cause and describe what needs to change to prevent reoccurrance*
-
-## 🤔 Lessons learned
-
-*What did we learn from this incident?*
-
-- We must ensure we have working database backups. We are lucky that we did not lose any data this time. If this problem had caused volume corruption, we would have been screwed.
-- Sentry is broken for the bot. It was missing a DSN secret, which we have now restored.
-- The [https://sentry.pydis.com](https://sentry.pydis.com) redirect was never migrated to the cluster. **We should do that.**
-
-## ☑️ Follow-up tasks
-
-*List any tasks we've created as a result of this incident*
-
-- [x] Push forward with backup plans
diff --git a/docs/postmortems/2020-12-11-postgres-conn-surge.md b/docs/postmortems/2020-12-11-postgres-conn-surge.md
deleted file mode 100644
index 3e5360c..0000000
--- a/docs/postmortems/2020-12-11-postgres-conn-surge.md
+++ /dev/null
@@ -1,96 +0,0 @@
----
-layout: default
-title: "2020-12-11: Postgres connection surge"
-parent: Postmortems
-nav_order: 1
----
-
-# 2020-12-11: Postgres connection surge
-
-At **13:24 UTC**, we noticed the bot was not able to infract, and [pythondiscord.com](http://pythondiscord.com) was unavailable. The DevOps team started to investigate.
-
-We discovered that Postgres was not accepting new connections because it had hit 100 clients. This made it unavailable to all services that depended on it.
-
-Ultimately this was resolved by taking down Postgres, remounting the associated volume, and bringing it back up again.
-
-## ⚠️ Leadup
-
-*List the sequence of events that led to the incident*
-
-The bot infractions stopped working, and we started investigating.
-
-## 🥏 Impact
-
-*Describe how internal and external users were impacted during the incident*
-
-Services were unavailable both for internal and external users.
-
-- The Help Channel System was unavailable.
-- Voice Gate and Server Gate were not working.
-- Moderation commands were unavailable.
-- Python Discord site & API were unavailable. CloudFlare automatically switched us to Always Online.
-
-## 👁️ Detection
-
-*Report when the team detected the incident, and how we could improve detection time*
-
-We noticed HTTP 524s coming from CloudFlare; upon attempting a database connection, we observed that the maximum client limit had been reached.
-
-We noticed this log in site:
-
-```
-django.db.utils.OperationalError: FATAL: sorry, too many clients already
-```
-
-We should be monitoring the number of clients, and the monitor should alert us when we're approaching the max. That would have allowed for earlier detection, and possibly allowed us to prevent the incident altogether.
-
-We will look at [wrouesnel/postgres_exporter](https://github.com/wrouesnel/postgres_exporter) for monitoring this.
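-
-As a sketch, an alert built on that exporter might look like this (`pg_stat_activity_count` is the connection-count metric the exporter derives from `pg_stat_activity`; the threshold is illustrative):
-
-```yaml
-- alert: PostgresClientsNearLimit
-  # Total connections across all databases and states, compared against the 100-client limit.
-  expr: sum(pg_stat_activity_count) > 90
-  for: 2m
-  labels:
-    severity: page
-  annotations:
-    summary: PostgreSQL is approaching its max connections limit
-```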
-
-## 🙋🏿‍♂️ Response
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded to the incident. The main obstacle was the lack of a clear response strategy.
-
-We should document our recovery procedure so that we're not so dependent on Joe Banks should this happen again while he's unavailable.
-
-## 🙆🏽‍♀️ Recovery
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-- Delete PostgreSQL deployment `kubectl delete deployment/postgres`
-- Delete any remaining pods, WITH force: `kubectl delete pod <pod name> --force --grace-period=0`
-- Unmount volume at Linode
-- Remount volume at Linode
-- Reapply deployment `kubectl apply -f postgres/deployment.yaml`
-
-## 🔎 Five Why's
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-- Postgres was unavailable, so our services died.
-- **Why?** Postgres hit max clients, and could not respond.
-- **Why?** Unknown, but we saw a number of connections from previous deployments of site. This indicates that database connections are not being terminated properly. Needs further investigation.
-
-## 🌱 Blameless root cause
-
-*Note the final root cause and describe what needs to change to prevent reoccurrance*
-
-We're not sure what the root cause is, but suspect site is not terminating database connections properly in some cases. We were unable to reproduce this problem.
-
-We've set up new telemetry on Grafana with alerts so that we can investigate this more closely. We will be notified if the number of connections from site exceeds 32, or if the total number of connections exceeds 90.
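-
-When digging into this, a query along these lines shows which clients hold the connections (assuming site sets a recognisable `application_name`; otherwise grouping by `client_addr` works too):
-
-```sql
--- Count open connections per client application and state.
-SELECT application_name, state, count(*)
-FROM pg_stat_activity
-GROUP BY application_name, state
-ORDER BY count(*) DESC;
-```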
-
-## 🤔 Lessons learned
-
-*What did we learn from this incident?*
-
-- We must ensure the DevOps team has access to Linode and other key services even if our Bitwarden is down.
-- We need to ensure we're alerted of any risk factors that have the potential to make Postgres unavailable, since this causes a catastrophic outage of practically all services.
-- We absolutely need backups for the databases, so that this sort of problem carries less of a risk.
-- We may need to consider something like [pg_bouncer](https://wiki.postgresql.org/wiki/PgBouncer) to manage a connection pool so that we don't exceed 100 *legitimate* clients connected as we connect more services to the postgres database.
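-
-For illustration, a minimal PgBouncer configuration could look like the following (the database name, host and pool sizes are assumptions, not a vetted setup):
-
-```ini
-[databases]
-; Route clients through the pooler to the real PostgreSQL service.
-pythondiscord = host=postgres port=5432 dbname=pythondiscord
-
-[pgbouncer]
-listen_addr = 0.0.0.0
-listen_port = 6432
-; Transaction pooling keeps the number of real server connections small.
-pool_mode = transaction
-max_client_conn = 500
-default_pool_size = 20
-```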
-
-## ☑️ Follow-up tasks
-
-*List any tasks we should complete that are relevant to this incident*
-
-- [x] Set up backups for all databases
diff --git a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.md b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.md
deleted file mode 100644
index a8fb815..0000000
--- a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.md
+++ /dev/null
@@ -1,86 +0,0 @@
----
-layout: default
-title: "2021-01-10: Primary Kubernetes node outage"
-parent: Postmortems
-nav_order: 3
----
-
-# 2021-01-10: Primary Kubernetes node outage
-
-
-We had an outage of our highest spec node due to CPU exhaustion. The outage lasted from around 20:20 to 20:46 UTC, but was not a full service outage.
-
-## ⚠️ Leadup
-
-*List the sequence of events that led to the incident*
-
-I ran a query on Prometheus to try to figure out some statistics on the number of metrics we are holding; this ended up scanning a lot of data in the TSDB database that Prometheus uses.
-
-This scan caused CPU exhaustion, which caused issues with the Kubernetes node status.
-
-## 🥏 Impact
-
-*Describe how internal and external users were impacted during the incident*
-
-This brought down the primary node, which meant there was some service outage. Most services transferred successfully to our secondary node, which kept up some key services such as the Moderation bot and Modmail bot, as well as MongoDB.
-
-## 👁️ Detection
-
-*Report when the team detected the incident, and how we could improve detection time*
-
-This was noticed when Discord services started having failures. The primary detection was through alerts though! I was paged 1 minute after we started encountering CPU exhaustion issues.
-
-## 🙋🏿‍♂️ Response
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded to the incident.
-
-No major obstacles were encountered during this.
-
-## 🙆🏽‍♀️ Recovery
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-It was noted that in the response to `kubectl get nodes` the primary node's status was reported as `NotReady`. Looking into the reason, we found the node had stopped responding.
-
-The quickest way to fix this was triggering a node restart. This shifted a lot of pods over to node 2, which encountered some capacity issues since it is not as highly specced as the first node.
-
-I brought the first node back by restarting it at Linode's end. Once this node was reporting as `Ready` again, I drained the second node by running `kubectl drain lke13311-20304-5ffa4d11faab`. This command cordons the node so it is unavailable for scheduling and moves existing pods onto other nodes.
-
-Services gradually recovered as the dependencies started. The incident lasted overall around 26 minutes, though this was not a complete outage for the whole time and the bot remained functional throughout (meaning systems like the help channels were still functional).
-
-## 🔎 Five Why's
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-**What?** Partial service outage
-
-**Why?** We had a node outage.
-
-**Why?** CPU exhaustion of our primary node.
-
-**Why?** A large Prometheus query used a lot of CPU.
-
-**Why?** Prometheus had to scan millions of TSDB records which consumed all cores.
-
-## 🌱 Blameless root cause
-
-*Note the final root cause and describe what needs to change to prevent reoccurrance*
-
-A large query was run on Prometheus, so the solution is just to not run said queries.
-
-To protect against this more precisely, though, we should write resource constraints for services like this that are vulnerable to CPU exhaustion or memory consumption, which were the causes of our two past outages as well.
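-
-As a sketch, a constraint on the Prometheus container could look like this (the figures are illustrative, not our tuned values):
-
-```yaml
-# Excerpt from a pod spec.
-containers:
-  - name: prometheus
-    image: prom/prometheus
-    resources:
-      requests:
-        cpu: 500m
-        memory: 1Gi
-      limits:
-        cpu: "2"      # CPU is throttled at the limit rather than starving the node.
-        memory: 2Gi   # Exceeding this OOM-kills the pod instead of the node.
-```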
-
-## 🤔 Lessons learned
-
-*What did we learn from this incident?*
-
-- Don't run large queries; they consume CPU!
-- Write resource constraints for our services.
-
-## ☑️ Follow-up tasks
-
-*List any tasks we should complete that are relevant to this incident*
-
-- [x] Write resource constraints for our services.
diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.md b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.md
deleted file mode 100644
index 6935f02..0000000
--- a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.md
+++ /dev/null
@@ -1,112 +0,0 @@
----
-layout: default
-title: "2021-01-12: Django site CPU/RAM exhaustion outage"
-parent: Postmortems
-nav_order: 4
----
-
-# 2021-01-12: Django site CPU/RAM exhaustion outage
-
-At 03:01 UTC on Tuesday 12th January we experienced a momentary outage of our PostgreSQL database, causing some very minor service downtime.
-
-# ⚠️ Leadup
-
-*List the sequence of events that led to the incident*
-
-We deleted the Developers role, which led to a large user diff because we had to update the roles of all affected users on the site.
-
-The bot had been trying to post this diff repeatedly for over 24 hours, retrying after every restart.
-
-We deployed the bot at 2:55 UTC on 12th January and the user sync process began once again.
-
-This caused a CPU & RAM spike on our Django site, which in turn triggered an OOM error on the server which killed the Postgres process, sending it into a recovery state where queries could not be executed.
-
-The Django site did not have any tooling in place to batch the requests, so it was trying to process all 80k user updates in a single query, something that PostgreSQL probably could handle, but not the Django ORM. During the incident, site jumped from its average RAM usage of 300-400MB to **1.5GB.**
-
-![{{site.baseurl}}/static/images/2021-01-12/site_resource_abnormal.png]({{site.baseurl}}/static/images/2021-01-12/site_resource_abnormal.png)
-
-RAM and CPU usage of site throughout the incident. The period just before 3:40 where no statistics were reported is the actual outage period where the Kubernetes node had some networking errors.
-
-# 🥏 Impact
-
-*Describe how internal and external users were impacted during the incident*
-
-This database outage lasted mere minutes, since Postgres recovered and healed itself and the sync process was aborted, but it did leave us with a large user diff and our database becoming further out of sync.
-
-Most services that did not depend on PostgreSQL stayed up, and the site remained stable after the sync had been cancelled.
-
-# 👁️ Detection
-
-*Report when the team detected the incident, and how we could improve detection time*
-
-We were immediately alerted to the PostgreSQL outage on Grafana and through Sentry, meaning our response time was under a minute.
-
-We reduced some alert thresholds in order to catch RAM & CPU spikes faster in the future.
-
-It was hard to immediately see the cause since there is minimal logging on the site, and the bot logs did not make it evident that anything was at fault; therefore our only detection was through machine metrics.
-
-We did manage to recover exactly what PostgreSQL was trying to do at the time of crashing by examining the logs which pointed us towards the user sync process.
-
-# 🙋🏿‍♂️ Response
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded to the issue. There were no real obstacles encountered, other than the node being less performant than we would like due to the CPU starvation.
-
-# 🙆🏽‍♀️ Recovery
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-The incident was resolved by stopping the sync process and writing a more efficient one through an internal eval script. We batched the updates into chunks of 1,000 users, and instead of one large update we did 80 smaller ones. This led to much higher efficiency at the cost of taking a little longer (~7 minutes).
-
-```python
-# Compute the pending user diff the same way the sync extension would.
-from bot.exts.backend.sync import _syncers
-syncer = _syncers.UserSyncer
-diff = await syncer._get_diff(ctx.guild)
-
-def chunks(lst, n):
-    """Yield successive n-sized chunks from lst."""
-    for i in range(0, len(lst), n):
-        yield lst[i:i + n]
-
-# PATCH the updates in batches of 1,000 users instead of one huge request.
-for chunk in chunks(diff.updated, 1000):
-    await bot.api_client.patch("bot/users/bulk_patch", json=chunk)
-```
-
-Resource limits were also put into place on site to prevent RAM and CPU spikes, and throttle the CPU usage in these situations. This can be seen in the below graph:
-
-![{{site.baseurl}}/static/images/2021-01-12/site_cpu_throttle.png]({{site.baseurl}}/static/images/2021-01-12/site_cpu_throttle.png)
-
-CPU throttling is where a container has hit its limits and we need to reel it in. Ideally this value stays as close to 0 as possible; however, as you can see, site hit this twice (during the periods where it was trying to sync 80k users at once).
-
-# 🔎 Five Why's
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-- We experienced a major PostgreSQL outage
-- PostgreSQL was killed by the system OOM due to the RAM spike on site.
-- The RAM spike on site was caused by a large query.
-- This was because we do not chunk queries on the bot.
-- The large query was caused by the removal of the Developers role resulting in 80k users needing updating.
-
-# 🌱 Blameless root cause
-
-*Note the final root cause and describe what needs to change to prevent reoccurrance*
-
-The removal of the Developers role created a large diff which could not be applied by Django in a single request.
-
-See the follow-up tasks for exactly how we can avoid this in future; it's a relatively easy mitigation.
-
-# 🤔 Lessons learned
-
-*What did we learn from this incident?*
-
-- Django (or DRF) does not like huge update queries.
-
-# ☑️ Follow-up tasks
-
-*List any tasks we should complete that are relevant to this incident*
-
-- [x] Make the bot syncer more efficient (batch requests)
-- [ ] Increase logging on the bot; state when an error has been hit (we had no indication of this inside Discord, and we need that)
-- [x] Adjust resource alerts to page DevOps members earlier.
-- [x] Apply resource limits to site to prevent major spikes
diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.md b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.md
deleted file mode 100644
index dd2d624..0000000
--- a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.md
+++ /dev/null
@@ -1,101 +0,0 @@
----
-layout: default
-title: "2021-01-30: NodeBalancer networking faults due to memory pressure"
-parent: Postmortems
-nav_order: 5
----
-
-# 2021-01-30: NodeBalancer networking faults due to memory pressure
-
-At around 14:30 UTC on Saturday 30th January we started experiencing networking issues at the LoadBalancer level between Cloudflare and our Kubernetes cluster. It seems that the misconfiguration was due to memory and CPU pressure.
-
-~~This post-mortem is preliminary, we are still awaiting word from Linode's SysAdmins on any problems they detected.~~
-
-**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a different machine.
-
-## ⚠️ Leadup
-
-*List the sequence of events that led to the incident*
-
-At 14:30 we started receiving alerts that services were becoming unreachable. We first experienced some momentary DNS errors which resolved themselves; however, traffic ingress was still degraded.
-
-Upon checking Linode, our NodeBalancer (the service which balances traffic between our Kubernetes nodes) was reporting the backends (the services it balances to) as down. It reported all 4 as down (two for port 80 + two for port 443). This status was fluctuating between up and down, meaning traffic was not reaching our cluster correctly. Scaleios correctly noted:
-
-![{{site.baseurl}}/static/images/2021-01-30/scaleios.png]({{site.baseurl}}/static/images/2021-01-30/scaleios.png)
-
-The config seems to have been set incorrectly due to memory and CPU pressure on one of our nodes. Here is the memory usage throughout the incident:
-
-![{{site.baseurl}}/static/images/2021-01-30/memory_charts.png]({{site.baseurl}}/static/images/2021-01-30/memory_charts.png)
-
-Here is the display from Linode:
-
-![{{site.baseurl}}/static/images/2021-01-30/linode_loadbalancers.png]({{site.baseurl}}/static/images/2021-01-30/linode_loadbalancers.png)
-
-## 🥏 Impact
-
-*Describe how internal and external users were impacted during the incident*
-
-Since traffic could not correctly enter our cluster, multiple web-based services were offline, including site, Grafana and Bitwarden. It appears that no inter-node communication was affected, as this uses a WireGuard tunnel between the nodes which does not go through the NodeBalancer.
-
-The lack of Grafana made diagnosis slightly more difficult, but even then it was only a short trip to the
-
-## 👁️ Detection
-
-*Report when the team detected the incident, and how we could improve detection time*
-
-We were alerted fairly promptly through statping which reported services as being down and posted a Discord notification. Subsequent alerts came in from Grafana but were limited since outbound communication was faulty.
-
-## 🙋🏿‍♂️ Response
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded!
-
-The primary obstacle was the DevOps tooling being unavailable due to the traffic ingress problems.
-
-## 🙆🏽‍♀️ Recovery
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-The incident resolved itself upstream at Linode. We've opened a ticket with Linode to let them know of the faults, which might give us a better indication of what caused the issues. Our Kubernetes cluster continued posting updates to Linode to refresh the NodeBalancer configuration; inspecting these payloads, the configuration looked correct.
-
-We've set up alerts for when Prometheus services stop responding, since this seems to be a fairly tell-tale symptom of networking problems. This was the Prometheus status graph throughout the incident:
-
-![{{site.baseurl}}/static/images/2021-01-30/prometheus_status.png]({{site.baseurl}}/static/images/2021-01-30/prometheus_status.png)
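-
-A minimal sketch of such an alert, using Prometheus's standard `up` metric per scrape target (labels are illustrative):
-
-```yaml
-- alert: TargetDown
-  # Fires when a scrape target has been unreachable for five minutes.
-  expr: up == 0
-  for: 5m
-  labels:
-    severity: page
-  annotations:
-    summary: A Prometheus scrape target has stopped responding
-```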
-
-## 🔎 Five Why's
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-**What?** Our service experienced an outage due to networking faults.
-
-**Why?** Incoming traffic could not reach our Kubernetes nodes
-
-**Why?** Our Linode NodeBalancers were not using correct configuration
-
-**Why?** Memory & CPU pressure seemed to cause invalid configuration errors upstream at Linode.
-
-**Why?** Unknown at this stage, NodeBalancer migrated.
-
-## 🌱 Blameless root cause
-
-*Note the final root cause and describe what needs to change to prevent reoccurrance*
-
-The configuration of our NodeBalancer was invalid. We cannot say why at this point, since we are awaiting contact back from Linode, but indicators point to it being an upstream fault, since memory & CPU pressure should **not** cause a load balancer misconfiguration.
-
-Linode are going to follow up with us at some point during the week with information from their System Administrators.
-
-**Update 2nd February 2021:** Linode have concluded investigations at their end, taken notes and migrated our NodeBalancer to a new machine. We haven't experienced problems since.
-
-## 🤔 Lessons learned
-
-*What did we learn from this incident?*
-
-We should be careful not to over-schedule onto nodes, since even while operating within reasonable constraints we risk sending invalid configuration upstream to Linode and therefore preventing traffic from entering our cluster.
-
-## ☑️ Follow-up tasks
-
-*List any tasks we should complete that are relevant to this incident*
-
-- [x] Monitor for follow up from Linode
-- [x] Carefully monitor the allocation rules for our services
diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.md b/docs/postmortems/2021-07-11-cascading-node-failures.md
deleted file mode 100644
index adf0d57..0000000
--- a/docs/postmortems/2021-07-11-cascading-node-failures.md
+++ /dev/null
@@ -1,185 +0,0 @@
----
-layout: default
-title: "2021-07-11: Cascading node failures and ensuing volume problems"
-parent: Postmortems
-nav_order: 6
----
-
-# 2021-07-11: Cascading node failures and ensuing volume problems
-
-A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node to an unresponsive state (00:55 UTC). Upon performing a recycle of the affected node, volumes were placed into a state where they could not be mounted.
-
-# ⚠️ Leadup
-
-*List the sequence of events that led to the incident*
-
-- **00:27 UTC:** Django starts rapidly using connections to our PostgreSQL database
-- **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated its 115 max connections limit. Joe is paged.
-- **00:33 UTC:** DevOps team is alerted that a service has claimed 34 dangerous table locks (it peaked at 61).
-- **00:42 UTC:** Status incident created and backdated to 00:25 UTC. [Status incident](https://status.pythondiscord.com/incident/92712)
-- **00:55 UTC:** It's clear that the node which PostgreSQL was on is no longer healthy after the Django connection surge, so it's recycled and a new one is to be added to the pool.
-- **01:01 UTC:** Node `lke13311-16405-5fafd1b46dcf` begins its restart
-- **01:13 UTC:** Node has been restored and regained healthy status, but volumes will not mount to the node. Support ticket opened at Linode for assistance.
-- **06:36 UTC:** DevOps team alerted that Python is offline. This is due to Redis being a dependency of the bot, which as a stateful service was not healthy.
-
-# 🥏 Impact
-
-*Describe how internal and external users were impacted during the incident*
-
-Initially, this manifested as a standard node outage where services on that node experienced some downtime as the node was restored.
-
-Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop) could not run due to the volume issues, and so any dependent services (e.g. Site, Bot, Hastebin) also had trouble starting.
-
-PostgreSQL was restored early on so for the most part Moderation could continue.
-
-# 👁️ Detection
-
-*Report when the team detected the incident, and how we could improve detection time*
-
-DevOps were initially alerted at 00:32 UTC due to the PostgreSQL connection surge, and acknowledged at the same time.
-
-Further alerting could be used to catch surges earlier on (looking at conn delta vs. conn total), but for the most part alerting time was satisfactory here.
-
-# 🙋🏿‍♂️ Response
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded. The primary issue encountered was a failure upstream at Linode to remount the affected volumes; a support ticket has been created.
-
-# 🙆🏽‍♀️ Recovery
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-Initial node restoration was performed by @Joe Banks by recycling the affected node.
-
-Subsequent volume restoration was also performed by @Joe Banks: once Linode had unlocked the volumes, affected pods were scaled down to 0, the volumes were unmounted at the Linode side, and then the deployments were recreated.
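-
-Roughly, the per-workload recovery looked like this (PostgreSQL is used as an illustrative example; other stateful workloads followed the same pattern):
-
-```sh
-# Scale the workload down so nothing holds the volume open...
-kubectl scale deployment/postgres --replicas=0
-
-# ...detach the volume at Linode's side, then recreate the deployment and wait for it.
-kubectl apply -f postgres/deployment.yaml
-kubectl rollout status deployment/postgres
-```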
-
-<details markdown="block">
-<summary>Support ticket sent</summary>
-
-<blockquote markdown="block">
-Good evening,
-
-We experienced a resource surge on one of our Kubernetes nodes at 00:32 UTC, causing a node to go unresponsive. To mitigate problems here the node was recycled and began restarting at 1:01 UTC.
-
-The node has now rejoined the ring and started picking up services, but volumes will not attach to it, meaning pods with stateful storage will not start.
-
-An example events log for one such pod:
-
-```
- Type Reason Age From Message
- ---- ------ ---- ---- -------
- Normal Scheduled 2m45s default-scheduler Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf
- Warning FailedMount 103s kubelet MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f]
- Warning FailedMount 43s kubelet Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition
-
-```
-
-I've been trying to manually resolve this through the Linode Web UI but get presented with attachment errors upon doing so. Please could you advise on the best way forward to restore Volumes & Nodes to a functioning state? As far as I can see there is something going on upstream since the Linode UI presents these nodes as mounted however as shown above LKE nodes are not locating them, there is also a few failed attachment logs in the Linode Audit Log.
-
-Thanks,
-
-Joe
-</blockquote>
-</details>
-
-<details markdown="block">
-<summary>Response received from Linode</summary>
-
-<blockquote markdown="block">
-Hi Joe,
-
-> Were there any known issues with Block Storage in Frankfurt today?
-
-Not today, though there were service issues reported for Block Storage and LKE in Frankfurt on July 8 and 9:
-
-- [Service Issue - Block Storage - EU-Central (Frankfurt)](https://status.linode.com/incidents/pqfxl884wbh4)
-- [Service Issue - Linode Kubernetes Engine - Frankfurt](https://status.linode.com/incidents/13fpkjd32sgz)
-
-There was also an API issue reported on the 10th (resolved on the 11th), mentioned here:
-
-- [Service Issue - Cloud Manager and API](https://status.linode.com/incidents/vhjm0xpwnnn5)
-
-Regarding the specific error you were receiving:
-
-> `Unable to find device path out of attempted paths`
-
-I'm not certain it's specifically related to those Service Issues, considering this isn't the first time a customer has reported this error in their LKE logs. In fact, if I recall correctly, I've run across this before too, since our volumes are RWO and I had too many replicas in my deployment that I was trying to attach to, for example.
-
-> is this a known bug/condition that occurs with Linode CSI/LKE?
-
-From what I understand, yes, this is a known condition that crops up from time to time, which we are tracking. However, since there is a workaround at the moment (e.g. - "After some more manual attempts to fix things, scaling down deployments, unmounting at Linode and then scaling up the deployments seems to have worked and all our services have now been restored."), there is no ETA for addressing this. With that said, I've let our Storage team know that you've run into this, so as to draw further attention to it.
-
-If you have any further questions or concerns regarding this, let us know.
-
-Best regards,
-[Redacted]
-
-Linode Support Team
-</blockquote>
-</details>
-
-<details markdown="block">
-<summary>Concluding response from Joe Banks</summary>
-
-<blockquote markdown="block">
-Hey [Redacted]!
-
-Thanks for the response. We ensure that stateful pods only ever have one volume assigned to them, either with a single replica deployment or a statefulset. It appears that the error generally manifests when a deployment is being migrated from one node to another during a redeploy, which makes sense if there is some delay on the unmount/remount.
-
-Confusion occurred because Linode was reporting the volume as attached when the node had been recycled, but I assume that was because the node did not cleanly shut down and therefore could not cleanly unmount volumes.
-
-We've not seen any resurgence of such issues, and we'll address the software fault which overloaded the node, which will hopefully mitigate such problems in the future.
-
-Thanks again for the response, have a great week!
-
-Best,
-
-Joe
-</blockquote>
-</details>
-
-# 🔎 Five Why's
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-### **What?**
-
-Several of our services became unavailable because their volumes could not be mounted.
-
-### Why?
-
-A node recycle left the node unable to mount volumes using the Linode CSI.
-
-### Why?
-
-A node recycle was used because PostgreSQL had a connection surge.
-
-### Why?
-
-A Django feature deadlocked a table 62 times and suddenly started using ~70 connections to the database, saturating the maximum connections limit.
-
-### Why?
-
-The root cause of why Django does this is unclear, and someone with more Django proficiency is absolutely welcome to share any knowledge they may have. I presume it's some sort of worker race condition, but I've not been able to reproduce it.
-
-# 🌱 Blameless root cause
-
-*Note the final root cause and describe what needs to change to prevent reoccurrence*
-
-A node being forcefully restarted left volumes in a limbo state where mounting was difficult. It took multiple hours for this to be resolved, since we had to wait for the volumes to unlock so they could be cloned.
-
-# 🤔 Lessons learned
-
-*What did we learn from this incident?*
-
-Volumes are painful.
-
-We need to look at why Django is doing this, and at mitigations for the fault, to prevent this from occurring again.
-
-# ☑️ Follow-up tasks
-
-*List any tasks we should complete that are relevant to this incident*
-
-- [x] [Follow up on ticket at Linode](https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001)
-- [ ] Investigate why Django could be connection surging and locking tables
diff --git a/docs/postmortems/index.md b/docs/postmortems/index.md
deleted file mode 100644
index 5e8b509..0000000
--- a/docs/postmortems/index.md
+++ /dev/null
@@ -1,10 +0,0 @@
----
-title: Postmortems
-layout: default
-has_children: true
-has_toc: false
----
-
-# Postmortems
-
-Browse the pages under this category to view historical postmortems for Python Discord outages.