Diffstat (limited to 'docs/postmortems')
-rw-r--r--  docs/postmortems/2020-12-11-all-services-outage.rst             121
-rw-r--r--  docs/postmortems/2020-12-11-postgres-conn-surge.rst              130
-rw-r--r--  docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst   117
-rw-r--r--  docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst          155
-rw-r--r--  docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst        146
-rw-r--r--  docs/postmortems/2021-07-11-cascading-node-failures.rst          335
-rw-r--r--  docs/postmortems/images/2021-01-12/site_cpu_throttle.png         bin 0 -> 227245 bytes
-rw-r--r--  docs/postmortems/images/2021-01-12/site_resource_abnormal.png    bin 0 -> 232260 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/linode_loadbalancers.png      bin 0 -> 50882 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/memory_charts.png             bin 0 -> 211053 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/prometheus_status.png         bin 0 -> 291122 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/scaleios.png                  bin 0 -> 18294 bytes
-rw-r--r--  docs/postmortems/index.rst                                        15
13 files changed, 1019 insertions, 0 deletions
diff --git a/docs/postmortems/2020-12-11-all-services-outage.rst b/docs/postmortems/2020-12-11-all-services-outage.rst
new file mode 100644
index 0000000..9c29303
--- /dev/null
+++ b/docs/postmortems/2020-12-11-all-services-outage.rst
@@ -0,0 +1,121 @@
+2020-12-11: All services outage
+===============================
+
+At **19:55 UTC, all services became unresponsive**. The DevOps team was
+already in a call, and immediately started to investigate.
+
+Postgres was running at 100% CPU usage due to a **VACUUM**, which caused
+all services that depended on it to stop working. The high CPU usage
+left the host unresponsive and it shut down. Linode Lassie noticed this
+and triggered a restart.
+
+The node did not recover gracefully from this restart, with numerous
+core services reporting errors, so we had to manually restart core
+system services using Lens in order to get things working again.
+
+⚠️ Leadup
+---------
+
+*List the sequence of events that led to the incident*
+
+Postgres triggered an **AUTOVACUUM**, which led to a CPU spike. This
+made Postgres run at 100% CPU and become unresponsive, which caused
+services to stop responding. This led to a restart of the node, from
+which we did not recover gracefully.
+
+🥏 Impact
+---------
+
+*Describe how internal and external users were impacted during the
+incident*
+
+All services went down. Catastrophic failure. We did not pass go, we did
+not collect $200.
+
+- Help channel system unavailable, so people are not able to
+ effectively ask for help.
+- Gates unavailable, so people can’t successfully get into the
+ community.
+- Moderation and raid prevention unavailable, which leaves us
+ defenseless against attacks.
+
+👁️ Detection
+------------
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+We noticed that all PyDis services had stopped responding;
+coincidentally, our DevOps team was in a call at the time, so that was
+helpful.
+
+We may be able to improve detection time by adding monitoring of
+resource usage. To this end, we’ve added alerts for high CPU usage and
+low memory.
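+
+For illustration, alert rules along these lines could page us on CPU and
+memory pressure. This is a minimal sketch: the metric names come from
+node_exporter, while the thresholds, durations and labels are
+assumptions rather than the exact rules we deployed.
+
+.. code:: yaml
+
+   groups:
+     - name: node-resources
+       rules:
+         - alert: HighNodeCPU
+           # Fraction of CPU time spent busy across all cores, per node
+           expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
+           for: 5m
+           labels:
+             severity: page
+         - alert: LowNodeMemory
+           # Less than 10% of memory still available on the node
+           expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
+           for: 5m
+           labels:
+             severity: page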
+
+🙋🏿‍♂️ Response
+----------------
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded to the incident.
+
+We noticed our node was entirely unresponsive and within minutes a
+restart had been triggered by Lassie after a high CPU shutdown occurred.
+
+The node came back and we saw a number of core services offline
+(e.g. Calico, CoreDNS, Linode CSI).
+
+**Obstacle: no recent database backup available**
+
+🙆🏽‍♀️ Recovery
+-----------------
+
+*How was the incident resolved? How can we improve future mitigation
+times?*
+
+Through `Lens <https://k8slens.dev/>`__ we restarted core services one
+by one until they stabilised; after these core services were up, other
+services began to come back online.
+
+We finally re-provisioned PostgreSQL, which had been removed as a
+component before the restart (but too late to prevent the CPU errors).
+Once PostgreSQL was up we restarted any components that were acting
+buggy (e.g. site and bot).
+
+🔎 Five Why’s
+-------------
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+- Major service outage
+- **Why?** Core service failures (e.g. Calico, CoreDNS, Linode CSI)
+- **Why?** Kubernetes worker node restart
+- **Why?** High CPU shutdown
+- **Why?** Intensive PostgreSQL AUTOVACUUM caused a CPU spike
+
+🌱 Blameless root cause
+-----------------------
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+🤔 Lessons learned
+------------------
+
+*What did we learn from this incident?*
+
+- We must ensure we have working database backups. We are lucky that we
+ did not lose any data this time. If this problem had caused volume
+ corruption, we would be screwed.
+- Sentry is broken for the bot. It was missing a DSN secret, which we
+ have now restored.
+- The https://sentry.pydis.com redirect was never migrated to the
+ cluster. **We should do that.**
+
+☑️ Follow-up tasks
+------------------
+
+*List any tasks we’ve created as a result of this incident*
+
+- ☒ Push forward with backup plans
diff --git a/docs/postmortems/2020-12-11-postgres-conn-surge.rst b/docs/postmortems/2020-12-11-postgres-conn-surge.rst
new file mode 100644
index 0000000..6ebcb01
--- /dev/null
+++ b/docs/postmortems/2020-12-11-postgres-conn-surge.rst
@@ -0,0 +1,130 @@
+2020-12-11: Postgres connection surge
+=====================================
+
+At **13:24 UTC,** we noticed the bot was not able to infract, and
+`pythondiscord.com <http://pythondiscord.com>`__ was unavailable. The
+DevOps team started to investigate.
+
+We discovered that Postgres was not accepting new connections because it
+had hit 100 clients. This made it unavailable to all services that
+depended on it.
+
+Ultimately this was resolved by taking down Postgres, remounting the
+associated volume, and bringing it back up again.
+
+⚠️ Leadup
+---------
+
+*List the sequence of events that led to the incident*
+
+The bot infractions stopped working, and we started investigating.
+
+🥏 Impact
+---------
+
+*Describe how internal and external users were impacted during the
+incident*
+
+Services were unavailable both for internal and external users.
+
+- The Help Channel System was unavailable.
+- Voice Gate and Server Gate were not working.
+- Moderation commands were unavailable.
+- Python Discord site & API were unavailable. CloudFlare automatically
+ switched us to Always Online.
+
+👁️ Detection
+------------
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+We noticed HTTP 524s coming from Cloudflare; upon attempting a database
+connection, we observed that the maximum client limit had been reached.
+
+We noticed this log in site:
+
+.. code:: text
+
+ django.db.utils.OperationalError: FATAL: sorry, too many clients already
+
+We should be monitoring the number of clients, and the monitor should
+alert us when we’re approaching the max. That would have allowed for
+earlier detection, and possibly allowed us to prevent the incident
+altogether.
+
+We will look at
+`wrouesnel/postgres_exporter <https://github.com/wrouesnel/postgres_exporter>`__
+for monitoring this.
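+
+A minimal Prometheus scrape configuration for that exporter might look
+like the following sketch; the job name and the in-cluster service
+address are assumptions, though 9187 is the exporter’s default port.
+
+.. code:: yaml
+
+   scrape_configs:
+     - job_name: postgres
+       static_configs:
+         # Assumed service name for the exporter inside the cluster
+         - targets: ["postgres-exporter:9187"]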
+
+🙋🏿‍♂️ Response
+----------------
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded to the incident. The main obstacle was the lack of a
+clear response strategy.
+
+We should document our recovery procedure so that we’re not so dependent
+on Joe Banks should this happen again while he’s unavailable.
+
+🙆🏽‍♀️ Recovery
+----------------
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+- Delete the PostgreSQL deployment: ``kubectl delete deployment/postgres``
+- Delete any remaining pods, WITH force:
+  ``kubectl delete pod <pod name> --force --grace-period=0``
+- Unmount the volume at Linode
+- Remount the volume at Linode
+- Reapply the deployment: ``kubectl apply -f postgres/deployment.yaml``
+
+🔎 Five Why’s
+-------------
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+- Postgres was unavailable, so our services died.
+- **Why?** Postgres hit max clients, and could not respond.
+- **Why?** Unknown, but we saw a number of connections from previous
+ deployments of site. This indicates that database connections are not
+ being terminated properly. Needs further investigation.
+
+🌱 Blameless root cause
+-----------------------
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+We’re not sure what the root cause is, but suspect site is not
+terminating database connections properly in some cases. We were unable
+to reproduce this problem.
+
+We’ve set up new telemetry on Grafana with alerts so that we can
+investigate this more closely. We will be alerted if the number of
+connections from site exceeds 32, or if the total number of connections
+exceeds 90.
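+
+A sketch of what those alert rules might look like is below. The metric
+comes from postgres_exporter’s ``pg_stat_activity`` collector; the
+``datname`` matcher and the rule timings are assumptions for
+illustration.
+
+.. code:: yaml
+
+   groups:
+     - name: postgres-connections
+       rules:
+         - alert: SiteConnectionCountHigh
+           # Connections open against the site's database
+           expr: sum(pg_stat_activity_count{datname="site"}) > 32
+           for: 5m
+           labels:
+             severity: page
+         - alert: TotalConnectionCountHigh
+           # Approaching Postgres' 100-client limit
+           expr: sum(pg_stat_activity_count) > 90
+           for: 5m
+           labels:
+             severity: page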
+
+🤔 Lessons learned
+------------------
+
+*What did we learn from this incident?*
+
+- We must ensure the DevOps team has access to Linode and other key
+ services even if our Bitwarden is down.
+- We need to ensure we’re alerted of any risk factors that have the
+ potential to make Postgres unavailable, since this causes a
+ catastrophic outage of practically all services.
+- We absolutely need backups for the databases, so that this sort of
+ problem carries less of a risk.
+- We may need to consider something like
+ `pg_bouncer <https://wiki.postgresql.org/wiki/PgBouncer>`__ to manage
+ a connection pool so that we don’t exceed 100 *legitimate* clients
+ connected as we connect more services to the postgres database.
+
+☑️ Follow-up tasks
+------------------
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ Set up backups for all databases
diff --git a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
new file mode 100644
index 0000000..5852c46
--- /dev/null
+++ b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
@@ -0,0 +1,117 @@
+2021-01-10: Primary Kubernetes node outage
+==========================================
+
+We had an outage of our highest spec node due to CPU exhaustion. The
+outage lasted from around 20:20 to 20:46 UTC, but was not a full service
+outage.
+
+⚠️ Leadup
+---------
+
+*List the sequence of events that led to the incident*
+
+I ran a query on Prometheus to try to figure out some statistics on the
+number of metrics we are holding; this ended up scanning a lot of data
+in the TSDB that Prometheus uses.
+
+This scan caused CPU exhaustion, which caused issues with the
+Kubernetes node’s status.
+
+🥏 Impact
+---------
+
+*Describe how internal and external users were impacted during the
+incident*
+
+This brought down the primary node, which meant there was some service
+outage. Most services transferred successfully to our secondary node,
+which kept up some key services such as the Moderation bot and Modmail
+bot, as well as MongoDB.
+
+👁️ Detection
+------------
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+This was noticed when Discord services started having failures. The
+primary detection was through alerts though! I was paged 1 minute after
+we started encountering CPU exhaustion issues.
+
+🙋🏿‍♂️ Response
+----------------
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded to the incident.
+
+No major obstacles were encountered during this.
+
+🙆🏽‍♀️ Recovery
+----------------
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+It was noted that in the response to ``kubectl get nodes`` the primary
+node’s status was reported as ``NotReady``. Looking into the reason, we
+found the node had stopped responding.
+
+The quickest way to fix this was triggering a node restart. This shifted
+a lot of pods over to node 2, which encountered some capacity issues
+since it’s not as highly specced as the first node.
+
+I brought the first node back by restarting it at Linode’s end. Once
+this node was reporting as ``Ready`` again, I drained the second node by
+running ``kubectl drain lke13311-20304-5ffa4d11faab``. This command
+marks the node as unschedulable and moves existing pods onto other
+nodes.
+
+Services gradually recovered as their dependencies started. The incident
+lasted around 26 minutes overall, though it was not a complete outage
+for the whole time and the bot remained functional throughout (meaning
+systems like the help channels kept working).
+
+🔎 Five Why’s
+-------------
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+**Why?** Partial service outage
+
+**Why?** We had a node outage.
+
+**Why?** CPU exhaustion of our primary node.
+
+**Why?** A large Prometheus query used a lot of CPU.
+
+**Why?** Prometheus had to scan millions of TSDB records which consumed
+all cores.
+
+🌱 Blameless root cause
+-----------------------
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+A large query was run on Prometheus, so the solution is just to not run
+said queries.
+
+To protect against this more robustly, though, we should write resource
+constraints for services like this that are vulnerable to CPU exhaustion
+or memory consumption, which were the causes of our two past outages as
+well. A sketch of what that could look like is shown below.
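+
+This is a minimal example of Kubernetes resource requests and limits on
+the Prometheus container; the values are illustrative assumptions, not
+the numbers we actually deployed.
+
+.. code:: yaml
+
+   # Excerpt from the pod spec of a Deployment or StatefulSet
+   containers:
+     - name: prometheus
+       resources:
+         requests:
+           cpu: 500m
+           memory: 1Gi
+         limits:
+           # Beyond these bounds the kubelet throttles CPU and the
+           # container becomes a candidate for the OOM killer, which
+           # protects the rest of the node.
+           cpu: "2"
+           memory: 2Gi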
+
+🤔 Lessons learned
+------------------
+
+*What did we learn from this incident?*
+
+- Don’t run large queries, it consumes CPU!
+- Write resource constraints for our services.
+
+☑️ Follow-up tasks
+------------------
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ Write resource constraints for our services.
diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
new file mode 100644
index 0000000..57f9fd8
--- /dev/null
+++ b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
@@ -0,0 +1,155 @@
+2021-01-12: Django site CPU/RAM exhaustion outage
+=================================================
+
+At 03:01 UTC on Tuesday 12th January we experienced a momentary outage
+of our PostgreSQL database, causing some very minor service downtime.
+
+⚠️ Leadup
+=========
+
+*List the sequence of events that led to the incident*
+
+We deleted the Developers role, which led to a large user diff: every
+affected user’s roles had to be updated on the site.
+
+The bot had been trying to post this diff repeatedly for over 24 hours,
+after every restart.
+
+We deployed the bot at 2:55 UTC on 12th January and the user sync
+process began once again.
+
+This caused a CPU & RAM spike on our Django site, which in turn
+triggered an OOM error on the server which killed the Postgres process,
+sending it into a recovery state where queries could not be executed.
+
+The Django site did not have anything in place to batch the requests, so
+it was trying to process all 80k user updates in a single query,
+something that PostgreSQL could probably handle, but not the Django ORM.
+During the incident site jumped from its average RAM usage of 300-400MB
+to **1.5GB**.
+
+.. image:: ./images/2021-01-12/site_resource_abnormal.png
+
+RAM and CPU usage of site throughout the incident. The period just
+before 3:40 where no statistics were reported is the actual outage
+period where the Kubernetes node had some networking errors.
+
+🥏 Impact
+=========
+
+*Describe how internal and external users were impacted during the
+incident*
+
+This database outage lasted mere minutes, since Postgres recovered and
+healed itself and the sync process was aborted, but it did leave us with
+a large user diff and our database becoming further out of sync.
+
+Most services that did not depend on PostgreSQL stayed up, and the site
+remained stable after the sync had been cancelled.
+
+👁️ Detection
+============
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+We were immediately alerted to the PostgreSQL outage on Grafana and
+through Sentry, meaning our response time was under a minute.
+
+We reduced some alert thresholds in order to catch RAM & CPU spikes
+faster in the future.
+
+It was hard to immediately see the cause of things, since there is
+minimal logging on the site and the bot logs gave no indication that
+anything was at fault; therefore our only detection was through machine
+metrics.
+
+We did manage to recover exactly what PostgreSQL was trying to do at the
+time of crashing by examining the logs, which pointed us towards the
+user sync process.
+
+🙋🏿‍♂️ Response
+================
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded to the issue. There were no real obstacles
+encountered other than the node being less performant than we would have
+liked due to the CPU starvation.
+
+🙆🏽‍♀️ Recovery
+================
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+The incident was resolved by stopping the sync process and writing a
+more efficient one through an internal eval script. We batched the
+updates into chunks of 1,000 users, and instead of one large update we
+did 80 smaller ones. This was much more efficient, at the cost of taking
+a little longer (~7 minutes).
+
+.. code:: python
+
+   # Run via the bot's internal eval command, where `bot` and `ctx` are
+   # provided by the eval context.
+   from bot.exts.backend.sync import _syncers
+
+   syncer = _syncers.UserSyncer
+   diff = await syncer._get_diff(ctx.guild)
+
+   def chunks(lst, n):
+       """Yield successive n-sized chunks from lst."""
+       for i in range(0, len(lst), n):
+           yield lst[i:i + n]
+
+   # Patch users in batches of 1,000 instead of one 80k-row request.
+   for chunk in chunks(diff.updated, 1000):
+       await bot.api_client.patch("bot/users/bulk_patch", json=chunk)
+
+Resource limits were also put into place on site to prevent RAM and CPU
+spikes, and throttle the CPU usage in these situations. This can be seen
+in the below graph:
+
+.. image:: ./images/2021-01-12/site_cpu_throttle.png
+
+CPU throttling is where a container has hit its limits and we need to
+reel it in. Ideally this value stays as close to 0 as possible; however,
+as you can see, site hit this twice (during the periods where it was
+trying to sync 80k users at once).
+
+🔎 Five Why’s
+=============
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+- We experienced a major PostgreSQL outage
+- PostgreSQL was killed by the system OOM due to the RAM spike on site.
+- The RAM spike on site was caused by a large query.
+- This was because we do not chunk queries on the bot.
+- The large query was caused by the removal of the Developers role
+ resulting in 80k users needing updating.
+
+🌱 Blameless root cause
+=======================
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrance*
+
+The removal of the Developers role created a large diff which could not
+be applied by Django in a single request.
+
+See the follow-up tasks for exactly how we can avoid this in future;
+it’s a relatively easy mitigation.
+
+🤔 Lessons learned
+==================
+
+*What did we learn from this incident?*
+
+- Django (or DRF) does not like huge update queries.
+
+☑️ Follow-up tasks
+==================
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ Make the bot syncer more efficient (batch requests)
+- ☐ Increase logging on bot, state when an error has been hit (we had
+ no indication of this inside Discord, we need that)
+- ☒ Adjust resource alerts to page DevOps members earlier.
+- ☒ Apply resource limits to site to prevent major spikes
diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
new file mode 100644
index 0000000..b13ecd7
--- /dev/null
+++ b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
@@ -0,0 +1,146 @@
+2021-01-30: NodeBalancer networking faults due to memory pressure
+=================================================================
+
+At around 14:30 UTC on Saturday 30th January we started experiencing
+networking issues at the LoadBalancer level between Cloudflare and our
+Kubernetes cluster. It seems that the misconfiguration was due to memory
+and CPU pressure.
+
+[STRIKEOUT:This post-mortem is preliminary, we are still awaiting word
+from Linode’s SysAdmins on any problems they detected.]
+
+**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a
+different machine.
+
+⚠️ Leadup
+---------
+
+*List the sequence of events that led to the incident*
+
+At 14:30 we started receiving alerts that services were becoming
+unreachable. We first experienced some momentary DNS errors which
+resolved themselves; however, traffic ingress was still degraded.
+
+Upon checking Linode, our NodeBalancer (the service which balances
+traffic between our Kubernetes nodes) was reporting its backends (the
+services it balances to) as down. It reported all 4 as down (two for
+port 80 + two for port 443). This status was fluctuating between up and
+down, meaning traffic was not reaching our cluster correctly. Scaleios
+correctly noted:
+
+.. image:: ./images/2021-01-30/scaleios.png
+
+The config seems to have been set incorrectly due to memory and CPU
+pressure on one of our nodes. Here is the memory usage throughout the
+incident:
+
+.. image:: ./images/2021-01-30/memory_charts.png
+
+Here is the display from Linode:
+
+.. image:: ./images/2021-01-30/linode_loadbalancers.png
+
+🥏 Impact
+---------
+
+*Describe how internal and external users were impacted during the
+incident*
+
+Since traffic could not correctly enter our cluster, multiple web-based
+services were offline, including site, Grafana and Bitwarden. It appears
+that no inter-node communication was affected, as this uses a WireGuard
+tunnel between the nodes which was not affected by the NodeBalancer.
+
+The lack of Grafana made diagnosis slightly more difficult, but even
+then it was only a short trip to the
+
+👁️ Detection
+------------
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+We were alerted fairly promptly through Statping, which reported
+services as being down and posted a Discord notification. Subsequent
+alerts came in from Grafana but were limited since outbound
+communication was faulty.
+
+🙋🏿‍♂️ Response
+----------------
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded!
+
+The primary obstacle was the DevOps tools being down due to the traffic
+ingress problems.
+
+🙆🏽‍♀️ Recovery
+----------------
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+The incident resolved itself upstream at Linode. We’ve opened a ticket
+with Linode to let them know of the faults, which might give us a better
+indication of what caused the issues. Our Kubernetes cluster continued
+posting updates to Linode to refresh the NodeBalancer configuration;
+inspecting these payloads, the configuration looked correct.
+
+We’ve set up alerts for when Prometheus services stop responding, since
+this seems to be a fairly tell-tale symptom of networking problems. This
+was the Prometheus status graph throughout the incident:
+
+.. image:: ./images/2021-01-30/prometheus_status.png
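+
+For illustration, such an alert can be as simple as firing when a scrape
+target stops reporting; this is a hedged sketch, and the group name and
+timings are assumptions.
+
+.. code:: yaml
+
+   groups:
+     - name: scrape-health
+       rules:
+         - alert: TargetDown
+           # `up` is 0 whenever Prometheus fails to scrape a target
+           expr: up == 0
+           for: 2m
+           labels:
+             severity: page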
+
+🔎 Five Why’s
+-------------
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+**What?** Our service experienced an outage due to networking faults.
+
+**Why?** Incoming traffic could not reach our Kubernetes nodes
+
+**Why?** Our Linode NodeBalancers were not using correct configuration
+
+**Why?** Memory & CPU pressure seemed to cause invalid configuration
+errors upstream at Linode.
+
+**Why?** Unknown at this stage, NodeBalancer migrated.
+
+🌱 Blameless root cause
+-----------------------
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+The configuration of our NodeBalancer was invalid. We cannot say why at
+this point, since we are awaiting contact back from Linode, but
+indicators point to it being an upstream fault, since memory & CPU
+pressure should **not** cause a load balancer misconfiguration.
+
+Linode are going to follow up with us at some point during the week with
+information from their System Administrators.
+
+**Update 2nd February 2021:** Linode have concluded investigations at
+their end, taken notes and migrated our NodeBalancer to a new machine.
+We haven’t experienced problems since.
+
+🤔 Lessons learned
+------------------
+
+*What did we learn from this incident?*
+
+We should be careful not to over-schedule onto nodes, since even while
+operating within reasonable constraints we risk sending invalid
+configuration upstream to Linode and therefore preventing traffic from
+entering our cluster.
+
+☑️ Follow-up tasks
+------------------
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ Monitor for follow up from Linode
+- ☒ Carefully monitor the allocation rules for our services
diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
new file mode 100644
index 0000000..6cd30f3
--- /dev/null
+++ b/docs/postmortems/2021-07-11-cascading-node-failures.rst
@@ -0,0 +1,335 @@
+2021-07-11: Cascading node failures and ensuing volume problems
+===============================================================
+
+A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node
+to an unresponsive state (00:55 UTC); upon recycling the affected node,
+volumes were placed into a state where they could not be mounted.
+
+⚠️ Leadup
+=========
+
+*List the sequence of events that led to the incident*
+
+- **00:27 UTC:** Django starts rapidly using connections to our
+ PostgreSQL database
+- **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated
+  its 115 max connections limit. Joe is paged.
+- **00:33 UTC:** DevOps team is alerted that a service has claimed 34
+ dangerous table locks (it peaked at 61).
+- **00:42 UTC:** Status incident created and backdated to 00:25 UTC.
+ `Status incident <https://status.pythondiscord.com/incident/92712>`__
+- **00:55 UTC:** It’s clear that the node which PostgreSQL was on is no
+ longer healthy after the Django connection surge, so it’s recycled
+ and a new one is to be added to the pool.
+- **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its
+  restart
+- **01:13 UTC:** The node has restored and regained healthy status, but
+  volumes will not mount to it. A support ticket is opened at Linode
+  for assistance.
+- **06:36 UTC:** DevOps team alerted that Python is offline. This is
+ due to Redis being a dependency of the bot, which as a stateful
+ service was not healthy.
+
+🥏 Impact
+=========
+
+*Describe how internal and external users were impacted during the
+incident*
+
+Initially, this manifested as a standard node outage where services on
+that node experienced some downtime as the node was restored.
+
+Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop)
+were unable to run due to the volume issues, and so any dependent
+services (e.g. Site, Bot, Hastebin) also had trouble starting.
+
+PostgreSQL was restored early on, so for the most part Moderation could
+continue.
+
+👁️ Detection
+============
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+DevOps were initially alerted at 00:32 UTC due to the PostgreSQL
+connection surge, and acknowledged at the same time.
+
+Further alerting could be used to catch surges earlier on (looking at
+conn delta vs. conn total), but for the most part alerting time was
+satisfactory here.
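+
+As an illustration, a rule could watch the rate of change of connections
+rather than just the absolute count. This is a hedged sketch: the metric
+name comes from postgres_exporter, while the threshold and timings are
+assumptions.
+
+.. code:: yaml
+
+   groups:
+     - name: postgres-connection-surges
+       rules:
+         - alert: PostgresConnectionSurge
+           # Fire on a rapid rise in connections, before the limit is hit
+           expr: sum(delta(pg_stat_activity_count[5m])) > 30
+           for: 1m
+           labels:
+             severity: page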
+
+🙋🏿‍♂️ Response
+================
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded. The primary issue encountered was a failure
+upstream at Linode to remount the affected volumes; a support ticket was
+created.
+
+🙆🏽‍♀️ Recovery
+================
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+Initial node restoration was performed by @Joe Banks by recycling the
+affected node.
+
+Subsequent volume restoration was also performed by @Joe Banks: once
+Linode had unlocked the volumes, affected pods were scaled down to 0,
+the volumes were unmounted at the Linode side, and then the deployments
+were recreated.
+
+.. raw:: html
+
+ <details>
+
+.. raw:: html
+
+ <summary>
+
+Support ticket sent
+
+.. raw:: html
+
+ </summary>
+
+.. raw:: html
+
+ <blockquote>
+
+Good evening,
+
+We experienced a resource surge on one of our Kubernetes nodes at 00:32
+UTC, causing a node to go unresponsive. To mitigate problems here the
+node was recycled and began restarting at 1:01 UTC.
+
+The node has now rejoined the ring and started picking up services, but
+volumes will not attach to it, meaning pods with stateful storage will
+not start.
+
+An example events log for one such pod:
+
+::
+
+ Type Reason Age From Message
+ ---- ------ ---- ---- -------
+ Normal Scheduled 2m45s default-scheduler Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf
+ Warning FailedMount 103s kubelet MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f]
+ Warning FailedMount 43s kubelet Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition
+
+I’ve been trying to manually resolve this through the Linode Web UI but
+get presented with attachment errors upon doing so. Please could you
+advise on the best way forward to restore Volumes & Nodes to a
+functioning state? As far as I can see there is something going on
+upstream since the Linode UI presents these nodes as mounted however as
+shown above LKE nodes are not locating them, there is also a few failed
+attachment logs in the Linode Audit Log.
+
+Thanks,
+
+Joe
+
+.. raw:: html
+
+ </blockquote>
+
+.. raw:: html
+
+ </details>
+
+.. raw:: html
+
+ <details>
+
+.. raw:: html
+
+ <summary>
+
+Response received from Linode
+
+.. raw:: html
+
+ </summary>
+
+.. raw:: html
+
+ <blockquote>
+
+Hi Joe,
+
+ Were there any known issues with Block Storage in Frankfurt today?
+
+Not today, though there were service issues reported for Block Storage
+and LKE in Frankfurt on July 8 and 9:
+
+- `Service Issue - Block Storage - EU-Central
+ (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__
+- `Service Issue - Linode Kubernetes Engine -
+ Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__
+
+There was also an API issue reported on the 10th (resolved on the 11th),
+mentioned here:
+
+- `Service Issue - Cloud Manager and
+ API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__
+
+Regarding the specific error you were receiving:
+
+ ``Unable to find device path out of attempted paths``
+
+I’m not certain it’s specifically related to those Service Issues,
+considering this isn’t the first time a customer has reported this error
+in their LKE logs. In fact, if I recall correctly, I’ve run across this
+before too, since our volumes are RWO and I had too many replicas in my
+deployment that I was trying to attach to, for example.
+
+ is this a known bug/condition that occurs with Linode CSI/LKE?
+
+From what I understand, yes, this is a known condition that crops up
+from time to time, which we are tracking. However, since there is a
+workaround at the moment (e.g. - “After some more manual attempts to fix
+things, scaling down deployments, unmounting at Linode and then scaling
+up the deployments seems to have worked and all our services have now
+been restored.”), there is no ETA for addressing this. With that said,
+I’ve let our Storage team know that you’ve run into this, so as to draw
+further attention to it.
+
+If you have any further questions or concerns regarding this, let us
+know.
+
+Best regards, [Redacted]
+
+Linode Support Team
+
+.. raw:: html
+
+ </blockquote>
+
+.. raw:: html
+
+ </details>
+
+.. raw:: html
+
+ <details>
+
+.. raw:: html
+
+ <summary>
+
+Concluding response from Joe Banks
+
+.. raw:: html
+
+ </summary>
+
+.. raw:: html
+
+ <blockquote>
+
+Hey [Redacted]!
+
+Thanks for the response. We ensure that stateful pods only ever have one
+volume assigned to them, either with a single replica deployment or a
+statefulset. It appears that the error generally manifests when a
+deployment is being migrated from one node to another during a redeploy,
+which makes sense if there is some delay on the unmount/remount.
+
+Confusion occurred because Linode was reporting the volume as attached
+when the node had been recycled, but I assume that was because the node
+did not cleanly shutdown and therefore could not cleanly unmount
+volumes.
+
+We’ve not seen any resurgence of such issues, and we’ll address the
+software fault which overloaded the node which will helpfully mitigate
+such problems in the future.
+
+Thanks again for the response, have a great week!
+
+Best,
+
+Joe
+
+.. raw:: html
+
+ </blockquote>
+
+.. raw:: html
+
+ </details>
+
+🔎 Five Why’s
+=============
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+**What?**
+~~~~~~~~~
+
+Several of our services became unavailable because their volumes could
+not be mounted.
+
+Why?
+~~~~
+
+A node recycle left the node unable to mount volumes using the Linode
+CSI.
+
+.. _why-1:
+
+Why?
+~~~~
+
+A node recycle was used because PostgreSQL had a connection surge.
+
+.. _why-2:
+
+Why?
+~~~~
+
+A Django feature deadlocked a table 62 times and suddenly started using
+~70 connections to the database, saturating the maximum connections
+limit.
+
+.. _why-3:
+
+Why?
+~~~~
+
+The root cause of why Django does this is unclear, and someone with more
+Django proficiency is absolutely welcome to share any knowledge they may
+have. I presume it’s some sort of worker race condition, but I’ve not
+been able to reproduce it.
+
+🌱 Blameless root cause
+=======================
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+A node being forcefully restarted left volumes in a limbo state where
+mounting was difficult; it took multiple hours for this to be resolved,
+since we had to wait for the volumes to unlock so they could be cloned.
+
+🤔 Lessons learned
+==================
+
+*What did we learn from this incident?*
+
+Volumes are painful.
+
+We need to look at why Django is doing this, and at mitigations for the
+fault, to prevent this from occurring again.
+
+☑️ Follow-up tasks
+==================
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ `Follow up on ticket at
+ Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__
+- ☐ Investigate why Django could be connection surging and locking
+ tables
diff --git a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png b/docs/postmortems/images/2021-01-12/site_cpu_throttle.png
new file mode 100644
index 0000000..b530ec6
--- /dev/null
+++ b/docs/postmortems/images/2021-01-12/site_cpu_throttle.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png b/docs/postmortems/images/2021-01-12/site_resource_abnormal.png
new file mode 100644
index 0000000..e1e07af
--- /dev/null
+++ b/docs/postmortems/images/2021-01-12/site_resource_abnormal.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png b/docs/postmortems/images/2021-01-30/linode_loadbalancers.png
new file mode 100644
index 0000000..f0eae1f
--- /dev/null
+++ b/docs/postmortems/images/2021-01-30/linode_loadbalancers.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/memory_charts.png b/docs/postmortems/images/2021-01-30/memory_charts.png
new file mode 100644
index 0000000..370d19e
--- /dev/null
+++ b/docs/postmortems/images/2021-01-30/memory_charts.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/prometheus_status.png b/docs/postmortems/images/2021-01-30/prometheus_status.png
new file mode 100644
index 0000000..e95b8d7
--- /dev/null
+++ b/docs/postmortems/images/2021-01-30/prometheus_status.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/scaleios.png b/docs/postmortems/images/2021-01-30/scaleios.png
new file mode 100644
index 0000000..584d74d
--- /dev/null
+++ b/docs/postmortems/images/2021-01-30/scaleios.png
Binary files differ
diff --git a/docs/postmortems/index.rst b/docs/postmortems/index.rst
new file mode 100644
index 0000000..43994a2
--- /dev/null
+++ b/docs/postmortems/index.rst
@@ -0,0 +1,15 @@
+Postmortems
+===========
+
+Browse the pages under this category to view historical postmortems for
+Python Discord outages.
+
+.. toctree::
+ :maxdepth: 2
+
+ 2020-12-11-all-services-outage
+ 2020-12-11-postgres-conn-surge
+ 2021-01-10-primary-kubernetes-node-outage
+ 2021-01-12-site-cpu-ram-exhaustion
+ 2021-01-30-nodebalancer-fails-memory
+ 2021-07-11-cascading-node-failures