author    Johannes Christ <[email protected]>    2024-07-24 20:09:42 +0200
committer Johannes Christ <[email protected]>    2024-07-25 20:06:54 +0200
commit    a4d7e92d544aeb43dbe1fcd8648d97e0dbf7b9d3 (patch)
tree      183318852234388654c99514e45f095af8c21676 /docs/postmortems
parent    Add link to DevOps Kanban board in meeting template (#420)
Improve documentation
This commit ports our documentation to Sphinx. The reason for this is straightforward. We need to improve both the quality and the accessibility of our documentation. Hugo is not capable of doing this, as its primary output format is HTML. Sphinx builds plenty of high-quality output formats out of the box, and incentivizes writing good documentation.
Diffstat (limited to 'docs/postmortems')
-rw-r--r--  docs/postmortems/2020-12-11-all-services-outage.rst             121
-rw-r--r--  docs/postmortems/2020-12-11-postgres-conn-surge.rst              130
-rw-r--r--  docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst   117
-rw-r--r--  docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst          155
-rw-r--r--  docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst        146
-rw-r--r--  docs/postmortems/2021-07-11-cascading-node-failures.rst          335
-rw-r--r--  docs/postmortems/images/2021-01-12/site_cpu_throttle.png         bin 0 -> 227245 bytes
-rw-r--r--  docs/postmortems/images/2021-01-12/site_resource_abnormal.png    bin 0 -> 232260 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/linode_loadbalancers.png      bin 0 -> 50882 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/memory_charts.png             bin 0 -> 211053 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/prometheus_status.png         bin 0 -> 291122 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/scaleios.png                  bin 0 -> 18294 bytes
-rw-r--r--  docs/postmortems/index.rst                                       15
13 files changed, 1019 insertions, 0 deletions
diff --git a/docs/postmortems/2020-12-11-all-services-outage.rst b/docs/postmortems/2020-12-11-all-services-outage.rst
new file mode 100644
index 0000000..9c29303
--- /dev/null
+++ b/docs/postmortems/2020-12-11-all-services-outage.rst
@@ -0,0 +1,121 @@
+2020-12-11: All services outage
+===============================
+
+At **19:55 UTC, all services became unresponsive**. The DevOps team was
+already in a call and immediately started to investigate.
+
+Postgres was running at 100% CPU usage due to a **VACUUM**, which caused
+all services that depended on it to stop working. The high CPU left the
+host unresponsive and it shut down. Linode Lassie noticed this and
+triggered a restart.
+
+The node did not recover gracefully from this restart, with numerous core
+services reporting an error, so we had to manually restart core system
+services using Lens in order to get things working again.
+
+⚠️ Leadup
+---------
+
+*List the sequence of events that led to the incident*
+
+Postgres triggered an **AUTOVACUUM**, which led to a CPU spike. This
+pushed Postgres to 100% CPU and left it unresponsive, which caused
+dependent services to stop responding. This led to a restart of the
+node, from which we did not recover gracefully.
+
+🥏 Impact
+---------
+
+*Describe how internal and external users were impacted during the
+incident*
+
+All services went down. Catastrophic failure. We did not pass go, we did
+not collect $200.
+
+- Help channel system unavailable, so people are not able to
+ effectively ask for help.
+- Gates unavailable, so people can’t successfully get into the
+ community.
+- Moderation and raid prevention unavailable, which leaves us
+ defenseless against attacks.
+
+👁️ Detection
+------------
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+We noticed that all PyDis services had stopped responding.
+Coincidentally, our DevOps team was in a call at the time, so that was
+helpful.
+
+We may be able to improve detection time by adding monitoring of
+resource usage. To this end, we’ve added alerts for high CPU usage and
+low memory.
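+
+Alongside those alerts, a quick way to confirm whether an autovacuum is
+eating CPU is to look for autovacuum workers in ``pg_stat_activity``.
+A minimal sketch, assuming ``psycopg2`` is installed and a ``PG_DSN``
+environment variable points at the database (both are assumptions for
+illustration, not part of our current tooling):
+
+.. code:: python
+
+   # Sketch: list running autovacuum workers and how long they have run.
+   # Assumes psycopg2 is installed and PG_DSN holds a connection string.
+   import os
+
+   import psycopg2
+
+   with psycopg2.connect(os.environ["PG_DSN"]) as conn, conn.cursor() as cur:
+       cur.execute(
+           "SELECT pid, now() - xact_start AS runtime, query"
+           " FROM pg_stat_activity"
+           " WHERE query LIKE 'autovacuum:%'"
+           " ORDER BY runtime DESC NULLS LAST;"
+       )
+       for pid, runtime, query in cur.fetchall():
+           print(f"{pid}: running for {runtime} - {query}")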
+
+🙋🏿‍♂️ Response
+----------------
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded to the incident.
+
+We noticed our node was entirely unresponsive and within minutes a
+restart had been triggered by Lassie after a high CPU shutdown occurred.
+
+The node came back and we saw a number of core services offline
+(e.g. Calico, CoreDNS, Linode CSI).
+
+**Obstacle: no recent database back-up available**
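+
+As a rough illustration of the kind of scheduled logical dump that would
+have removed this obstacle (a sketch only; it assumes ``pg_dump`` is on
+the PATH and a ``PG_DSN`` environment variable is set, neither of which
+reflects our current tooling):
+
+.. code:: python
+
+   # Sketch of a scheduled logical backup, not our actual backup tooling.
+   # Assumes pg_dump is installed and PG_DSN holds a connection string.
+   import datetime
+   import os
+   import subprocess
+
+   def dump_database(out_dir: str = "/backups") -> str:
+       stamp = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H%M%SZ")
+       out_file = f"{out_dir}/pydis-{stamp}.dump"
+       # -Fc writes a compressed archive that pg_restore can replay later.
+       subprocess.run(
+           ["pg_dump", "--dbname", os.environ["PG_DSN"], "-Fc", "--file", out_file],
+           check=True,
+       )
+       return out_file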
+
+🙆🏽‍♀️ Recovery
+-----------------
+
+*How was the incident resolved? How can we improve future mitigation
+times?*
+
+Through `Lens <https://k8slens.dev/>`__ we restarted core services one
+by one until they stabilised; after these core services were up, other
+services began to come back online.
+
+We finally re-provisioned PostgreSQL, which had been removed as a
+component before the restart (but too late to prevent the CPU errors).
+Once PostgreSQL was up, we restarted any components that were acting
+buggy (e.g. site and bot).
+
+🔎 Five Why’s
+-------------
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+- Major service outage
+- **Why?** Core service failures (e.g. Calico, CoreDNS, Linode CSI)
+- **Why?** Kubernetes worker node restart
+- **Why?** High CPU shutdown
+- **Why?** Intensive PostgreSQL AUTOVACUUM caused a CPU spike
+
+🌱 Blameless root cause
+-----------------------
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+🤔 Lessons learned
+------------------
+
+*What did we learn from this incident?*
+
+- We must ensure we have working database backups. We are lucky that we
+ did not lose any data this time. If this problem had caused volume
+ corruption, we would be screwed.
+- Sentry is broken for the bot. It was missing a DSN secret, which we
+ have now restored.
+- The https://sentry.pydis.com redirect was never migrated to the
+ cluster. **We should do that.**
+
+☑️ Follow-up tasks
+------------------
+
+*List any tasks we’ve created as a result of this incident*
+
+- ☒ Push forward with backup plans
diff --git a/docs/postmortems/2020-12-11-postgres-conn-surge.rst b/docs/postmortems/2020-12-11-postgres-conn-surge.rst
new file mode 100644
index 0000000..6ebcb01
--- /dev/null
+++ b/docs/postmortems/2020-12-11-postgres-conn-surge.rst
@@ -0,0 +1,130 @@
+2020-12-11: Postgres connection surge
+=====================================
+
+At **13:24 UTC,** we noticed the bot was not able to infract, and
+`pythondiscord.com <http://pythondiscord.com>`__ was unavailable. The
+DevOps team started to investigate.
+
+We discovered that Postgres was not accepting new connections because it
+had hit 100 clients. This made it unavailable to all services that
+depended on it.
+
+Ultimately this was resolved by taking down Postgres, remounting the
+associated volume, and bringing it back up again.
+
+⚠️ Leadup
+---------
+
+*List the sequence of events that led to the incident*
+
+The bot infractions stopped working, and we started investigating.
+
+🥏 Impact
+---------
+
+*Describe how internal and external users were impacted during the
+incident*
+
+Services were unavailable both for internal and external users.
+
+- The Help Channel System was unavailable.
+- Voice Gate and Server Gate were not working.
+- Moderation commands were unavailable.
+- Python Discord site & API were unavailable. CloudFlare automatically
+ switched us to Always Online.
+
+👁️ Detection
+------------
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+We noticed HTTP 524s coming from Cloudflare; upon attempting a database
+connection, we observed that the maximum client limit had been reached.
+
+We noticed this log in site:
+
+.. code:: text
+
+ django.db.utils.OperationalError: FATAL: sorry, too many clients already
+
+We should be monitoring the number of clients, and the monitor should alert
+us when we’re approaching the max. That would have allowed for earlier
+detection, and possibly allowed us to prevent the incident altogether.
+
+We will look at
+`wrouesnel/postgres_exporter <https://github.com/wrouesnel/postgres_exporter>`__
+for monitoring this.
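+
+As an interim measure until that exporter is in place, a check along
+these lines would surface the client count (a sketch, assuming
+``psycopg2`` is installed and a ``PG_DSN`` environment variable points
+at the database; the threshold is illustrative):
+
+.. code:: python
+
+   # Sketch: warn when the Postgres client count approaches the 100 cap.
+   # Assumes psycopg2 is installed and PG_DSN holds a connection string.
+   import os
+
+   import psycopg2
+
+   WARN_THRESHOLD = 90  # illustrative; alert well before max_connections
+
+   def connection_count() -> int:
+       with psycopg2.connect(os.environ["PG_DSN"]) as conn, conn.cursor() as cur:
+           cur.execute("SELECT count(*) FROM pg_stat_activity;")
+           (count,) = cur.fetchone()
+       return count
+
+   if __name__ == "__main__":
+       total = connection_count()
+       if total >= WARN_THRESHOLD:
+           print(f"WARNING: {total} clients connected to Postgres")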
+
+🙋🏿‍♂️ Response
+----------------
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded to the incident. The main obstacle was the lack of
+a clear response strategy.
+
+We should document our recovery procedure so that we’re not so dependent
+on Joe Banks should this happen again while he’s unavailable.
+
+🙆🏽‍♀️ Recovery
+----------------
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+- Delete PostgreSQL deployment ``kubectl delete deployment/postgres``
+- Delete any remaining pods, WITH force.
+  ``kubectl delete pod <pod name> --force --grace-period=0``
+- Unmount volume at Linode
+- Remount volume at Linode
+- Reapply deployment ``kubectl apply -f postgres/deployment.yaml``
+
+🔎 Five Why’s
+-------------
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+- Postgres was unavailable, so our services died.
+- **Why?** Postgres hit max clients, and could not respond.
+- **Why?** Unknown, but we saw a number of connections from previous
+ deployments of site. This indicates that database connections are not
+ being terminated properly. Needs further investigation.
+
+🌱 Blameless root cause
+-----------------------
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+We’re not sure what the root cause is, but suspect site is not
+terminating database connections properly in some cases. We were unable
+to reproduce this problem.
+
+We’ve set up new telemetry on Grafana with alerts so that we can
+investigate this more closely. We will be notified if the number of
+connections from site exceeds 32, or if the total number of connections
+exceeds 90.
+
+🤔 Lessons learned
+------------------
+
+*What did we learn from this incident?*
+
+- We must ensure the DevOps team has access to Linode and other key
+ services even if our Bitwarden is down.
+- We need to ensure we’re alerted of any risk factors that have the
+ potential to make Postgres unavailable, since this causes a
+ catastrophic outage of practically all services.
+- We absolutely need backups for the databases, so that this sort of
+ problem carries less of a risk.
+- We may need to consider something like
+  `pg_bouncer <https://wiki.postgresql.org/wiki/PgBouncer>`__ to manage
+  a connection pool so that we don’t exceed 100 *legitimate* clients
+  connected as we connect more services to the postgres database (see
+  the pooling sketch after this list).
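+
+For reference, the pooling idea can be sketched application-side with
+``psycopg2``’s built-in pool. This illustrates pooling in general rather
+than pg_bouncer itself (which is a separate proxy with its own
+configuration); the pool sizes and DSN source are made up:
+
+.. code:: python
+
+   # Illustration of connection pooling in general, not pg_bouncer itself.
+   # Pool sizes are made up; PG_DSN is assumed to hold a connection string.
+   import os
+
+   from psycopg2 import pool
+
+   db_pool = pool.SimpleConnectionPool(
+       minconn=1,
+       maxconn=10,  # hard cap on connections this service may hold open
+       dsn=os.environ["PG_DSN"],
+   )
+
+   conn = db_pool.getconn()
+   try:
+       with conn.cursor() as cur:
+           cur.execute("SELECT 1;")
+   finally:
+       db_pool.putconn(conn)  # return the connection instead of closing it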
+
+☑️ Follow-up tasks
+------------------
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ Set up backups for all databases
diff --git a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
new file mode 100644
index 0000000..5852c46
--- /dev/null
+++ b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
@@ -0,0 +1,117 @@
+2021-01-10: Primary Kubernetes node outage
+==========================================
+
+We had an outage of our highest spec node due to CPU exhaustion. The
+outage lasted from around 20:20 to 20:46 UTC, but was not a full service
+outage.
+
+⚠️ Leadup
+---------
+
+*List the sequence of events that led to the incident*
+
+I ran a query on Prometheus to try to figure out some statistics on the
+number of metrics we are holding; this ended up scanning a lot of data
+in the TSDB that Prometheus uses.
+
+This scan exhausted the CPU, which caused issues with the Kubernetes
+node status.
+
+🥏 Impact
+---------
+
+*Describe how internal and external users were impacted during the
+incident*
+
+This brought down the primary node, which meant there was some service
+outage. Most services transferred successfully to our secondary node,
+which kept up some key services such as the Moderation bot and Modmail
+bot, as well as MongoDB.
+
+👁️ Detection
+------------
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+This was noticed when Discord services started having failures. The
+primary detection was through alerts though! I was paged 1 minute after
+we started encountering CPU exhaustion issues.
+
+🙋🏿‍♂️ Response
+----------------
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded to the incident.
+
+No major obstacles were encountered during this.
+
+🙆🏽‍♀️ Recovery
+----------------
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+It was noted that in the response to ``kubectl get nodes`` the primary
+node’s status was reported as ``NotReady``. The reason was that the node
+had stopped responding.
+
+The quickest way to fix this was triggering a node restart. This shifted
+a lot of pods over to node 2, which encountered some capacity issues
+since it’s not as highly specced as the first node.
+
+I brought the first node back by restarting it at Linode’s end.
+Once this node was reporting as ``Ready`` again, I drained the second
+node by running ``kubectl drain lke13311-20304-5ffa4d11faab``. This
+command stops the node from being available for scheduling and moves
+existing pods onto other nodes.
+
+Services gradually recovered as the dependencies started. The incident
+lasted overall around 26 minutes, though this was not a complete outage
+for the whole time and the bot remained functional throughout (meaning
+systems like the help channels were still functional).
+
+🔎 Five Why’s
+-------------
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+**What?** Partial service outage
+
+**Why?** We had a node outage.
+
+**Why?** CPU exhaustion of our primary node.
+
+**Why?** Large prometheus query using a lot of CPU.
+
+**Why?** Prometheus had to scan millions of TSDB records which consumed
+all cores.
+
+🌱 Blameless root cause
+-----------------------
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+A large query was run on Prometheus, so the solution is just to not run
+said queries.
+
+To protect against this more precisely, though, we should write resource
+constraints for services like this that are vulnerable to CPU exhaustion
+or memory exhaustion, which are the causes of our two past outages as
+well.
+
+🤔 Lessons learned
+------------------
+
+*What did we learn from this incident?*
+
+- Don’t run large queries; they consume CPU!
+- Write resource constraints for our services.
+
+☑️ Follow-up tasks
+------------------
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ Write resource constraints for our services.
diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
new file mode 100644
index 0000000..57f9fd8
--- /dev/null
+++ b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
@@ -0,0 +1,155 @@
+2021-01-12: Django site CPU/RAM exhaustion outage
+=================================================
+
+At 03:01 UTC on Tuesday 12th January we experienced a momentary outage
+of our PostgreSQL database, causing some very minor service downtime.
+
+⚠️ Leadup
+=========
+
+*List the sequence of events that led to the incident*
+
+We deleted the Developers role, which led to a large user diff, since we
+had to update the roles of all affected users on the site.
+
+The bot had been repeatedly trying to post this diff for over 24 hours,
+after every restart.
+
+We deployed the bot at 2:55 UTC on 12th January and the user sync
+process began once again.
+
+This caused a CPU & RAM spike on our Django site, which in turn
+triggered an OOM error on the server which killed the Postgres process,
+sending it into a recovery state where queries could not be executed.
+
+The Django site did not have any tools in place to batch the requests,
+so it was trying to process all 80k user updates in a single query,
+something that PostgreSQL could probably handle, but not the Django ORM.
+During the incident, site jumped from its average RAM usage of 300-400MB
+to **1.5GB.**
+
+.. image:: ./images/2021-01-12/site_resource_abnormal.png
+
+RAM and CPU usage of site throughout the incident. The period just
+before 3:40 where no statistics were reported is the actual outage
+period where the Kubernetes node had some networking errors.
+
+🥏 Impact
+=========
+
+*Describe how internal and external users were impacted during the
+incident*
+
+This database outage lasted mere minutes, since Postgres recovered and
+healed itself and the sync process was aborted, but it did leave us with
+a large user diff and a database that was further out of sync.
+
+Most services that did not depend on PostgreSQL stayed up, and the site
+remained stable after the sync had been cancelled.
+
+👁️ Detection
+============
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+We were immediately alerted to the PostgreSQL outage on Grafana and
+through Sentry, meaning our response time was under a minute.
+
+We reduced some alert thresholds in order to catch RAM & CPU spikes
+faster in the future.
+
+It was hard to immediately see the cause since there is minimal logging
+on the site, and the bot logs did not make it evident that anything was
+at fault; therefore our only detection was through machine metrics.
+
+We did manage to recover exactly what PostgreSQL was trying to do at the
+time of crashing by examining the logs which pointed us towards the user
+sync process.
+
+🙋🏿‍♂️ Response
+================
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded to the issue. There were no real obstacles
+encountered, other than the node being less performant than we would
+have liked due to the CPU starvation.
+
+🙆🏽‍♀️ Recovery
+================
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+The incident was resolved by stopping the sync process and writing a
+more efficient one through an internal eval script. We batched the
+updates into chunks of 1,000 users and, instead of doing one large
+update, did 80 smaller ones. This was much more efficient, at the cost
+of taking a little longer (~7 minutes).
+
+.. code:: python
+
+   # Run from the bot's internal eval command; `ctx` and `bot` are
+   # provided by that context.
+   from bot.exts.backend.sync import _syncers
+   syncer = _syncers.UserSyncer
+   diff = await syncer._get_diff(ctx.guild)
+
+   # Yield successive n-sized chunks of lst.
+   def chunks(lst, n):
+       for i in range(0, len(lst), n):
+           yield lst[i:i + n]
+
+   # PATCH the updates in batches of 1,000 users rather than all at once.
+   for chunk in chunks(diff.updated, 1000):
+       await bot.api_client.patch("bot/users/bulk_patch", json=chunk)
+
+Resource limits were also put into place on site to prevent RAM and CPU
+spikes, and throttle the CPU usage in these situations. This can be seen
+in the below graph:
+
+.. image:: ./images/2021-01-12/site_cpu_throttle.png
+
+CPU throttling is where a container has hit its limits and we need to
+reel it in. Ideally this value stays as close to 0 as possible; however,
+as you can see, site hit this twice (during the periods where it was
+trying to sync 80k users at once).
+
+🔎 Five Why’s
+=============
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+- We experienced a major PostgreSQL outage
+- PostgreSQL was killed by the system OOM due to the RAM spike on site.
+- The RAM spike on site was caused by a large query.
+- This was because we do not chunk queries on the bot.
+- The large query was caused by the removal of the Developers role
+ resulting in 80k users needing updating.
+
+🌱 Blameless root cause
+=======================
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+The removal of the Developers role created a large diff which could not
+be applied by Django in a single request.
+
+See the follow-up tasks for exactly how we can avoid this in future;
+it’s a relatively easy mitigation.
+
+🤔 Lessons learned
+==================
+
+*What did we learn from this incident?*
+
+- Django (or DRF) does not like huge update queries.
+
+☑️ Follow-up tasks
+==================
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ Make the bot syncer more efficient (batch requests)
+- ☐ Increase logging on bot, state when an error has been hit (we had
+ no indication of this inside Discord, we need that)
+- ☒ Adjust resource alerts to page DevOps members earlier.
+- ☒ Apply resource limits to site to prevent major spikes
diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
new file mode 100644
index 0000000..b13ecd7
--- /dev/null
+++ b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
@@ -0,0 +1,146 @@
+2021-01-30: NodeBalancer networking faults due to memory pressure
+=================================================================
+
+At around 14:30 UTC on Saturday 30th January we started experiencing
+networking issues at the LoadBalancer level between Cloudflare and our
+Kubernetes cluster. It seems that the misconfiguration was due to memory
+and CPU pressure.
+
+[STRIKEOUT:This post-mortem is preliminary; we are still awaiting word
+from Linode’s SysAdmins on any problems they detected.]
+
+**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a
+different machine.
+
+⚠️ Leadup
+---------
+
+*List the sequence of events that led to the incident*
+
+At 14:30 we started receiving alerts that services were becoming
+unreachable. We first experienced some momentary DNS errors which
+resolved themselves; however, traffic ingress was still degraded.
+
+Upon checking Linode, our NodeBalancer (the service which balances
+traffic between our Kubernetes nodes) was reporting the backends (the
+services it balances to) as down. It reported all four as down (two for
+port 80 and two for port 443). This status fluctuated between up and
+down, meaning traffic was not reaching our cluster correctly. Scaleios
+correctly noted:
+
+.. image:: ./images/2021-01-30/scaleios.png
+
+The config seems to have been set incorrectly due to memory and CPU
+pressure on one of our nodes. Here is the memory usage throughout the
+incident:
+
+.. image:: ./images/2021-01-30/memory_charts.png
+
+Here is the display from Linode:
+
+.. image:: ./images/2021-01-30/linode_loadbalancers.png
+
+🥏 Impact
+---------
+
+*Describe how internal and external users were impacted during the
+incident*
+
+Since traffic could not correctly enter our cluster, multiple web-based
+services were offline, including site, Grafana and Bitwarden. It appears
+that no inter-node communication was affected, as this uses a WireGuard
+tunnel between the nodes which was not affected by the NodeBalancer.
+
+The lack of Grafana made diagnosis slightly more difficult, but even
+then it was only a short trip to the
+
+👁️ Detection
+------------
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+We were alerted fairly promptly through statping which reported services
+as being down and posted a Discord notification. Subsequent alerts came
+in from Grafana but were limited since outbound communication was
+faulty.
+
+🙋🏿‍♂️ Response
+----------------
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded!
+
+The primary obstacle was the DevOps tools being unavailable due to the
+traffic ingress problems.
+
+🙆🏽‍♀️ Recovery
+----------------
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+The incident resolved itself upstream at Linode. We’ve opened a ticket
+with Linode to let them know of the faults, which might give us a better
+indication of what caused the issues. Our Kubernetes cluster continued
+posting updates to Linode to refresh the NodeBalancer configuration;
+inspecting these payloads, the configuration looked correct.
+
+We’ve set up alerts for when Prometheus services stop responding, since
+this seems to be a fairly tell-tale symptom of networking problems. This
+was the Prometheus status graph throughout the incident:
+
+.. image:: ./images/2021-01-30/prometheus_status.png
+
+🔎 Five Why’s
+-------------
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+**What?** Our service experienced an outage due to networking faults.
+
+**Why?** Incoming traffic could not reach our Kubernetes nodes
+
+**Why?** Our Linode NodeBalancers were not using correct configuration
+
+**Why?** Memory & CPU pressure seemed to cause invalid configuration
+errors upstream at Linode.
+
+**Why?** Unknown at this stage; the NodeBalancer has been migrated.
+
+🌱 Blameless root cause
+-----------------------
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+The configuration of our NodeBalancer was invalid. We cannot say why at
+this point, since we are awaiting contact back from Linode, but
+indicators point to an upstream fault, since memory & CPU pressure
+should **not** cause a load balancer misconfiguration.
+
+Linode are going to follow up with us at some point during the week with
+information from their System Administrators.
+
+**Update 2nd February 2021:** Linode have concluded investigations at
+their end, taken notes and migrated our NodeBalancer to a new machine.
+We haven’t experienced problems since.
+
+🤔 Lessons learned
+------------------
+
+*What did we learn from this incident?*
+
+We should be careful not to over-schedule onto nodes, since even while
+operating within reasonable constraints we risk sending invalid
+configuration upstream to Linode and therefore preventing traffic from
+entering our cluster.
+
+☑️ Follow-up tasks
+------------------
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ Monitor for follow up from Linode
+- ☒ Carefully monitor the allocation rules for our services
diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
new file mode 100644
index 0000000..6cd30f3
--- /dev/null
+++ b/docs/postmortems/2021-07-11-cascading-node-failures.rst
@@ -0,0 +1,335 @@
+2021-07-11: Cascading node failures and ensuing volume problems
+===============================================================
+
+A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node
+to an unresponsive state (00:55 UTC). Upon recycling the affected node,
+volumes were placed into a state where they could not be mounted.
+
+⚠️ Leadup
+=========
+
+*List the sequence of events that led to the incident*
+
+- **00:27 UTC:** Django starts rapidly using connections to our
+ PostgreSQL database
+- **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated
+  its maximum of 115 connections. Joe is paged.
+- **00:33 UTC:** DevOps team is alerted that a service has claimed 34
+ dangerous table locks (it peaked at 61).
+- **00:42 UTC:** Status incident created and backdated to 00:25 UTC.
+ `Status incident <https://status.pythondiscord.com/incident/92712>`__
+- **00:55 UTC:** It’s clear that the node which PostgreSQL was on is no
+ longer healthy after the Django connection surge, so it’s recycled
+ and a new one is to be added to the pool.
+- **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its
+  restart.
+- **01:13 UTC:** Node has been restored and regained healthy status, but
+ volumes will not mount to the node. Support ticket opened at Linode
+ for assistance.
+- **06:36 UTC:** DevOps team alerted that Python is offline. This is
+ due to Redis being a dependency of the bot, which as a stateful
+ service was not healthy.
+
+🥏 Impact
+=========
+
+*Describe how internal and external users were impacted during the
+incident*
+
+Initially, this manifested as a standard node outage where services on
+that node experienced some downtime as the node was restored.
+
+Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop)
+were unable to run due to the volume issues, and so any dependent
+services (e.g. Site, Bot, Hastebin) also had trouble starting.
+
+PostgreSQL was restored early on, so for the most part Moderation could
+continue.
+
+👁️ Detection
+============
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+DevOps were initially alerted at 00:32 UTC due to the PostgreSQL
+connection surge, and acknowledged at the same time.
+
+Further alerting could be used to catch surges earlier on (looking at
+conn delta vs. conn total), but for the most part alerting time was
+satisfactory here.
+
+🙋🏿‍♂️ Response
+================
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded. The primary issue encountered was failure upstream
+at Linode to remount the affected volumes; a support ticket was
+created.
+
+🙆🏽‍♀️ Recovery
+================
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+Initial node restoration was performed by @Joe Banks by recycling the
+affected node.
+
+Subsequent volume restoration was also performed by @Joe Banks: once
+Linode had unlocked the volumes, affected pods were scaled down to 0,
+the volumes were unmounted at the Linode side, and then the deployments
+were recreated.
+
+.. raw:: html
+
+ <details>
+
+.. raw:: html
+
+ <summary>
+
+Support ticket sent
+
+.. raw:: html
+
+ </summary>
+
+.. raw:: html
+
+ <blockquote>
+
+Good evening,
+
+We experienced a resource surge on one of our Kubernetes nodes at 00:32
+UTC, causing a node to go unresponsive. To mitigate problems here the
+node was recycled and began restarting at 1:01 UTC.
+
+The node has now rejoined the ring and started picking up services, but
+volumes will not attach to it, meaning pods with stateful storage will
+not start.
+
+An example events log for one such pod:
+
+::
+
+ Type Reason Age From Message
+ ---- ------ ---- ---- -------
+ Normal Scheduled 2m45s default-scheduler Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf
+ Warning FailedMount 103s kubelet MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f]
+ Warning FailedMount 43s kubelet Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition
+
+I’ve been trying to manually resolve this through the Linode Web UI but
+get presented with attachment errors upon doing so. Please could you
+advise on the best way forward to restore Volumes & Nodes to a
+functioning state? As far as I can see there is something going on
+upstream since the Linode UI presents these nodes as mounted however as
+shown above LKE nodes are not locating them, there is also a few failed
+attachment logs in the Linode Audit Log.
+
+Thanks,
+
+Joe
+
+.. raw:: html
+
+ </blockquote>
+
+.. raw:: html
+
+ </details>
+
+.. raw:: html
+
+ <details>
+
+.. raw:: html
+
+ <summary>
+
+Response received from Linode
+
+.. raw:: html
+
+ </summary>
+
+.. raw:: html
+
+ <blockquote>
+
+Hi Joe,
+
+ Were there any known issues with Block Storage in Frankfurt today?
+
+Not today, though there were service issues reported for Block Storage
+and LKE in Frankfurt on July 8 and 9:
+
+- `Service Issue - Block Storage - EU-Central
+ (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__
+- `Service Issue - Linode Kubernetes Engine -
+ Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__
+
+There was also an API issue reported on the 10th (resolved on the 11th),
+mentioned here:
+
+- `Service Issue - Cloud Manager and
+ API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__
+
+Regarding the specific error you were receiving:
+
+ ``Unable to find device path out of attempted paths``
+
+I’m not certain it’s specifically related to those Service Issues,
+considering this isn’t the first time a customer has reported this error
+in their LKE logs. In fact, if I recall correctly, I’ve run across this
+before too, since our volumes are RWO and I had too many replicas in my
+deployment that I was trying to attach to, for example.
+
+ is this a known bug/condition that occurs with Linode CSI/LKE?
+
+From what I understand, yes, this is a known condition that crops up
+from time to time, which we are tracking. However, since there is a
+workaround at the moment (e.g. - “After some more manual attempts to fix
+things, scaling down deployments, unmounting at Linode and then scaling
+up the deployments seems to have worked and all our services have now
+been restored.”), there is no ETA for addressing this. With that said,
+I’ve let our Storage team know that you’ve run into this, so as to draw
+further attention to it.
+
+If you have any further questions or concerns regarding this, let us
+know.
+
+Best regards, [Redacted]
+
+Linode Support Team
+
+.. raw:: html
+
+ </blockquote>
+
+.. raw:: html
+
+ </details>
+
+.. raw:: html
+
+ <details>
+
+.. raw:: html
+
+ <summary>
+
+Concluding response from Joe Banks
+
+.. raw:: html
+
+ </summary>
+
+.. raw:: html
+
+ <blockquote>
+
+Hey [Redacted]!
+
+Thanks for the response. We ensure that stateful pods only ever have one
+volume assigned to them, either with a single replica deployment or a
+statefulset. It appears that the error generally manifests when a
+deployment is being migrated from one node to another during a redeploy,
+which makes sense if there is some delay on the unmount/remount.
+
+Confusion occurred because Linode was reporting the volume as attached
+when the node had been recycled, but I assume that was because the node
+did not cleanly shutdown and therefore could not cleanly unmount
+volumes.
+
+We’ve not seen any resurgence of such issues, and we’ll address the
+software fault which overloaded the node which will helpfully mitigate
+such problems in the future.
+
+Thanks again for the response, have a great week!
+
+Best,
+
+Joe
+
+.. raw:: html
+
+ </blockquote>
+
+.. raw:: html
+
+ </details>
+
+🔎 Five Why’s
+=============
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+**What?**
+~~~~~~~~~
+
+Several of our services became unavailable because their volumes could
+not be mounted.
+
+Why?
+~~~~
+
+A node recycle left the node unable to mount volumes using the Linode
+CSI.
+
+.. _why-1:
+
+Why?
+~~~~
+
+A node recycle was used because PostgreSQL had a connection surge.
+
+.. _why-2:
+
+Why?
+~~~~
+
+A Django feature deadlocked a table 62 times and suddenly started using
+~70 connections to the database, saturating the maximum connections
+limit.
+
+.. _why-3:
+
+Why?
+~~~~
+
+The root cause of why Django does this is unclear, and someone with more
+Django proficiency is absolutely welcome to share any knowledge they may
+have. I presume it’s some sort of worker race condition, but I’ve not
+been able to reproduce it.
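+
+For whoever picks up that investigation, a query along these lines shows
+how many connections and exclusive table locks each client holds (a
+sketch, assuming ``psycopg2`` is installed and a ``PG_DSN`` environment
+variable points at the database; the lock-mode filter is illustrative):
+
+.. code:: python
+
+   # Sketch for the follow-up investigation: per-application connection
+   # counts and exclusive table locks. Assumes psycopg2 and a PG_DSN env var.
+   import os
+
+   import psycopg2
+
+   CONNECTIONS = (
+       "SELECT application_name, count(*) FROM pg_stat_activity"
+       " GROUP BY application_name ORDER BY count(*) DESC;"
+   )
+   EXCLUSIVE_LOCKS = (
+       "SELECT a.application_name, count(*)"
+       " FROM pg_locks l JOIN pg_stat_activity a ON a.pid = l.pid"
+       " WHERE l.mode = 'AccessExclusiveLock'"
+       " GROUP BY a.application_name;"
+   )
+
+   with psycopg2.connect(os.environ["PG_DSN"]) as conn, conn.cursor() as cur:
+       for label, query in (("connections", CONNECTIONS), ("locks", EXCLUSIVE_LOCKS)):
+           cur.execute(query)
+           for app, total in cur.fetchall():
+               print(f"{label}: {app or '<unknown>'} = {total}")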
+
+🌱 Blameless root cause
+=======================
+
+*Note the final root cause and describe what needs to change to prevent
+reoccurrence*
+
+A node being forcefully restarted left volumes in a limbo state where
+mounting was difficult; it took multiple hours to resolve this since we
+had to wait for the volumes to unlock so they could be cloned.
+
+🤔 Lessons learned
+==================
+
+*What did we learn from this incident?*
+
+Volumes are painful.
+
+We need to look at why Django is doing this, and at mitigations for the
+fault, to prevent this from occurring again.
+
+☑️ Follow-up tasks
+==================
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ `Follow up on ticket at
+ Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__
+- ☐ Investigate why Django could be connection surging and locking
+ tables
diff --git a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png b/docs/postmortems/images/2021-01-12/site_cpu_throttle.png
new file mode 100644
index 0000000..b530ec6
--- /dev/null
+++ b/docs/postmortems/images/2021-01-12/site_cpu_throttle.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png b/docs/postmortems/images/2021-01-12/site_resource_abnormal.png
new file mode 100644
index 0000000..e1e07af
--- /dev/null
+++ b/docs/postmortems/images/2021-01-12/site_resource_abnormal.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png b/docs/postmortems/images/2021-01-30/linode_loadbalancers.png
new file mode 100644
index 0000000..f0eae1f
--- /dev/null
+++ b/docs/postmortems/images/2021-01-30/linode_loadbalancers.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/memory_charts.png b/docs/postmortems/images/2021-01-30/memory_charts.png
new file mode 100644
index 0000000..370d19e
--- /dev/null
+++ b/docs/postmortems/images/2021-01-30/memory_charts.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/prometheus_status.png b/docs/postmortems/images/2021-01-30/prometheus_status.png
new file mode 100644
index 0000000..e95b8d7
--- /dev/null
+++ b/docs/postmortems/images/2021-01-30/prometheus_status.png
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/scaleios.png b/docs/postmortems/images/2021-01-30/scaleios.png
new file mode 100644
index 0000000..584d74d
--- /dev/null
+++ b/docs/postmortems/images/2021-01-30/scaleios.png
Binary files differ
diff --git a/docs/postmortems/index.rst b/docs/postmortems/index.rst
new file mode 100644
index 0000000..43994a2
--- /dev/null
+++ b/docs/postmortems/index.rst
@@ -0,0 +1,15 @@
+Postmortems
+===========
+
+Browse the pages under this category to view historical postmortems for
+Python Discord outages.
+
+.. toctree::
+ :maxdepth: 2
+
+ 2020-12-11-all-services-outage
+ 2020-12-11-postgres-conn-surge
+ 2021-01-10-primary-kubernetes-node-outage
+ 2021-01-12-site-cpu-ram-exhaustion
+ 2021-01-30-nodebalancer-fails-memory
+ 2021-07-11-cascading-node-failures