author	Joe Banks <[email protected]>	2024-08-07 18:41:02 +0100
committer	Joe Banks <[email protected]>	2024-08-07 18:41:02 +0100
commit	dcbb78959177537cf1fdda813380996a4b2daf8f (patch)
tree	0a53ded19896aaddf93cc8f1e4ff34ac3f70464e /docs/postmortems
parent	Revert "Enable fail2ban jails for postfix" (diff)
Remove old documentation
Diffstat (limited to 'docs/postmortems')
-rw-r--r--  docs/postmortems/2020-12-11-all-services-outage.rst             121
-rw-r--r--  docs/postmortems/2020-12-11-postgres-conn-surge.rst             130
-rw-r--r--  docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst  117
-rw-r--r--  docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst         155
-rw-r--r--  docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst       146
-rw-r--r--  docs/postmortems/2021-07-11-cascading-node-failures.rst         335
-rw-r--r--  docs/postmortems/images/2021-01-12/site_cpu_throttle.png        bin  227245 -> 0 bytes
-rw-r--r--  docs/postmortems/images/2021-01-12/site_resource_abnormal.png   bin  232260 -> 0 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/linode_loadbalancers.png     bin  50882 -> 0 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/memory_charts.png            bin  211053 -> 0 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/prometheus_status.png        bin  291122 -> 0 bytes
-rw-r--r--  docs/postmortems/images/2021-01-30/scaleios.png                 bin  18294 -> 0 bytes
-rw-r--r--  docs/postmortems/index.rst                                      15
13 files changed, 0 insertions, 1019 deletions
diff --git a/docs/postmortems/2020-12-11-all-services-outage.rst b/docs/postmortems/2020-12-11-all-services-outage.rst
deleted file mode 100644
index 9c29303..0000000
--- a/docs/postmortems/2020-12-11-all-services-outage.rst
+++ /dev/null
@@ -1,121 +0,0 @@
-2020-12-11: All services outage
-===============================
-
-At **19:55 UTC, all services became unresponsive**. The DevOps were
-already in a call, and immediately started to investigate.
-
-Postgres was running at 100% CPU usage due to a **VACUUM**, which caused
-all services that depended on it to stop working. The high CPU left the
-host unresponsive, and it shut down. Linode Lassie noticed this and
-triggered a restart.
-
-It did not recover gracefully from this restart, with numerous core
-services reporting an error, so we had to manually restart core system
-services using Lens in order to get things working again.
-
-⚠️ Leadup
----------
-
-*List the sequence of events that led to the incident*
-
-Postgres triggered an **AUTOVACUUM**, which led to a CPU spike. This
-made Postgres run at 100% CPU and become unresponsive, which caused
-services to stop responding. This led to a restart of the node, from
-which we did not recover gracefully.
-
-🥏 Impact
----------
-
-*Describe how internal and external users were impacted during the
-incident*
-
-All services went down. Catastrophic failure. We did not pass go, we did
-not collect $200.
-
-- Help channel system unavailable, so people are not able to
- effectively ask for help.
-- Gates unavailable, so people can’t successfully get into the
- community.
-- Moderation and raid prevention unavailable, which leaves us
- defenseless against attacks.
-
-👁️ Detection
-------------
-
-*Report when the team detected the incident, and how we could improve
-detection time*
-
-We noticed that all PyDis services had stopped responding,
-coincidentally our DevOps team were in a call at the time, so that was
-helpful.
-
-We may be able to improve detection time by adding monitoring of
-resource usage. To this end, we’ve added alerts for high CPU usage and
-low memory.
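-
-As a rough sketch, Prometheus-style rules of this shape would cover the
-two resources involved (the expressions and thresholds here are
-illustrative rather than our exact alerting configuration, and assume
-the standard node_exporter metrics):
-
-.. code:: yaml
-
-   groups:
-     - name: node-resources
-       rules:
-         - alert: HighNodeCPU
-           # node CPU usage above 90% for five minutes (threshold illustrative)
-           expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
-           for: 5m
-           labels:
-             severity: page
-         - alert: LowNodeMemory
-           # less than 10% of memory still available (threshold illustrative)
-           expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
-           for: 5m
-           labels:
-             severity: page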
-
-🙋🏿‍♂️ Response
-----------------
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded to the incident.
-
-We noticed our node was entirely unresponsive and within minutes a
-restart had been triggered by Lassie after a high CPU shutdown occurred.
-
-The node came back and we saw a number of core services offline
-(e.g. Calico, CoreDNS, Linode CSI).
-
-**Obstacle: no recent database back-up available**
-
-🙆🏽‍♀️ Recovery
------------------
-
-*How was the incident resolved? How can we improve future mitigation
-times?*
-
-Through `Lens <https://k8slens.dev/>`__ we restarted core services one
-by one until they stabilised; after these core services were up, other
-services began to come back online.
-
-We finally provisioned PostgreSQL which had been removed as a component
-before the restart (but too late to prevent the CPU errors). Once
-PostgreSQL was up we restarted any components that were acting buggy
-(e.g. site and bot).
-
-🔎 Five Why’s
--------------
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-- Major service outage
-- **Why?** Core service failures (e.g. Calico, CoreDNS, Linode CSI)
-- **Why?** Kubernetes worker node restart
-- **Why?** High CPU shutdown
-- **Why?** Intensive PostgreSQL AUTOVACUUM caused a CPU spike
-
-🌱 Blameless root cause
------------------------
-
-*Note the final root cause and describe what needs to change to prevent
-reoccurrence*
-
-🤔 Lessons learned
-------------------
-
-*What did we learn from this incident?*
-
-- We must ensure we have working database backups. We are lucky that we
- did not lose any data this time. If this problem had caused volume
- corruption, we would be screwed.
-- Sentry is broken for the bot. It was missing a DSN secret, which we
- have now restored.
-- The https://sentry.pydis.com redirect was never migrated to the
- cluster. **We should do that.**
-
-☑️ Follow-up tasks
-------------------
-
-*List any tasks we’ve created as a result of this incident*
-
-- ☒ Push forward with backup plans
diff --git a/docs/postmortems/2020-12-11-postgres-conn-surge.rst b/docs/postmortems/2020-12-11-postgres-conn-surge.rst
deleted file mode 100644
index 6ebcb01..0000000
--- a/docs/postmortems/2020-12-11-postgres-conn-surge.rst
+++ /dev/null
@@ -1,130 +0,0 @@
-2020-12-11: Postgres connection surge
-=====================================
-
-At **13:24 UTC,** we noticed the bot was not able to infract, and
-`pythondiscord.com <http://pythondiscord.com>`__ was unavailable. The
-DevOps team started to investigate.
-
-We discovered that Postgres was not accepting new connections because it
-had hit 100 clients. This made it unavailable to all services that
-depended on it.
-
-Ultimately this was resolved by taking down Postgres, remounting the
-associated volume, and bringing it back up again.
-
-⚠️ Leadup
----------
-
-*List the sequence of events that led to the incident*
-
-The bot infractions stopped working, and we started investigating.
-
-🥏 Impact
----------
-
-*Describe how internal and external users were impacted during the
-incident*
-
-Services were unavailable both for internal and external users.
-
-- The Help Channel System was unavailable.
-- Voice Gate and Server Gate were not working.
-- Moderation commands were unavailable.
-- Python Discord site & API were unavailable. CloudFlare automatically
- switched us to Always Online.
-
-👁️ Detection
-------------
-
-*Report when the team detected the incident, and how we could improve
-detection time*
-
-We noticed HTTP 524s coming from CloudFlare; upon attempting a database
-connection we observed that the maximum client limit had been hit.
-
-We noticed this log in site:
-
-.. code:: text
-
- django.db.utils.OperationalError: FATAL: sorry, too many clients already
-
-We should be monitoring the number of clients, and the monitor should
-alert us when we’re approaching the max. That would have allowed for
-earlier detection, and possibly allowed us to prevent the incident altogether.
-
-We will look at
-`wrouesnel/postgres_exporter <https://github.com/wrouesnel/postgres_exporter>`__
-for monitoring this.
-
-🙋🏿‍♂️ Response
-----------------
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded to the incident. The obstacles were mostly a lack of
-a clear response strategy.
-
-We should document our recovery procedure so that we’re not so dependent
-on Joe Banks should this happen again while he’s unavailable.
-
-🙆🏽‍♀️ Recovery
-----------------
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-- Delete PostgreSQL deployment ``kubectl delete deployment/postgres``
-- Delete any remaining pods, WITH force:
-  ``kubectl delete pod <pod name> --force --grace-period=0``
-- Unmount volume at Linode
-- Remount volume at Linode
-- Reapply deployment ``kubectl apply -f postgres/deployment.yaml``
-
-🔎 Five Why’s
--------------
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-- Postgres was unavailable, so our services died.
-- **Why?** Postgres hit max clients, and could not respond.
-- **Why?** Unknown, but we saw a number of connections from previous
- deployments of site. This indicates that database connections are not
- being terminated properly. Needs further investigation.
-
-🌱 Blameless root cause
------------------------
-
-*Note the final root cause and describe what needs to change to prevent
-reoccurrence*
-
-We’re not sure what the root cause is, but suspect site is not
-terminating database connections properly in some cases. We were unable
-to reproduce this problem.
-
-We’ve set up new telemetry on Grafana with alerts so that we can
-investigate this more closely. We will be alerted if the number of
-connections from site exceeds 32, or if the total number of connections
-exceeds 90.
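-
-For illustration, rules along these lines express those two thresholds
-(a sketch only, assuming postgres_exporter’s ``pg_stat_activity_count``
-metric; the ``site`` database name is a placeholder, not necessarily our
-actual one):
-
-.. code:: yaml
-
-   - alert: PostgresTotalConnectionsHigh
-     # total connections approaching the 100-client server limit
-     expr: sum(pg_stat_activity_count) > 90
-     for: 2m
-   - alert: SiteConnectionsHigh
-     # connections to the site's database exceeding its expected share
-     # ("site" is a placeholder for the actual database name)
-     expr: sum(pg_stat_activity_count{datname="site"}) > 32
-     for: 2m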
-
-🤔 Lessons learned
-------------------
-
-*What did we learn from this incident?*
-
-- We must ensure the DevOps team has access to Linode and other key
- services even if our Bitwarden is down.
-- We need to ensure we’re alerted of any risk factors that have the
- potential to make Postgres unavailable, since this causes a
- catastrophic outage of practically all services.
-- We absolutely need backups for the databases, so that this sort of
- problem carries less of a risk.
-- We may need to consider something like
- `pg_bouncer <https://wiki.postgresql.org/wiki/PgBouncer>`__ to manage
- a connection pool so that we don’t exceed 100 *legitimate* clients
- connected as we connect more services to the postgres database.
-
-☑️ Follow-up tasks
-------------------
-
-*List any tasks we should complete that are relevant to this incident*
-
-- ☒ Set up backups for all databases
diff --git a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
deleted file mode 100644
index 5852c46..0000000
--- a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
+++ /dev/null
@@ -1,117 +0,0 @@
-2021-01-10: Primary Kubernetes node outage
-==========================================
-
-We had an outage of our highest spec node due to CPU exhaustion. The
-outage lasted from around 20:20 to 20:46 UTC, but was not a full service
-outage.
-
-⚠️ Leadup
----------
-
-*List the sequence of events that led to the incident*
-
-I ran a query on Prometheus to try to figure out some statistics on the
-number of metrics we are holding. This ended up scanning a lot of data
-in the TSDB database that Prometheus uses.
-
-This scan caused a CPU exhaustion which caused issues with the
-Kubernetes node status.
-
-🥏 Impact
----------
-
-*Describe how internal and external users were impacted during the
-incident*
-
-This brought down the primary node which meant there was some service
-outage. Most services transferred successfully to our secondary node
-which kept up some key services such as the Moderation bot and Modmail
-bot, as well as MongoDB.
-
-👁️ Detection
-------------
-
-*Report when the team detected the incident, and how we could improve
-detection time*
-
-This was noticed when Discord services started having failures. The
-primary detection was through alerts though! I was paged 1 minute after
-we started encountering CPU exhaustion issues.
-
-🙋🏿‍♂️ Response
-----------------
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded to the incident.
-
-No major obstacles were encountered during this.
-
-🙆🏽‍♀️ Recovery
-----------------
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-It was noted that in the response to ``kubectl get nodes`` the primary
-node’s status was reported as ``NotReady``; the reason was that the node
-had stopped responding.
-
-The quickest way to fix this was triggering a node restart. This shifted
-a lot of pods over to node 2, which encountered some capacity issues
-since it’s not as high-spec as the first node.
-
-I brought the first node back by restarting it at Linode’s end.
-Once this node was reporting as ``Ready`` again I drained the second
-node by running ``kubectl drain lke13311-20304-5ffa4d11faab``. This
-command cordons the node so nothing new is scheduled onto it and
-evicts the existing pods onto other nodes.
-
-Services gradually recovered as the dependencies started. The incident
-lasted overall around 26 minutes, though this was not a complete outage
-for the whole time and the bot remained functional throughout (meaning
-systems like the help channels were still functional).
-
-🔎 Five Why’s
--------------
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-**Why?** Partial service outage
-
-**Why?** We had a node outage.
-
-**Why?** CPU exhaustion of our primary node.
-
-**Why?** Large prometheus query using a lot of CPU.
-
-**Why?** Prometheus had to scan millions of TSDB records which consumed
-all cores.
-
-🌱 Blameless root cause
------------------------
-
-*Note the final root cause and describe what needs to change to prevent
-reoccurrence*
-
-A large query was run on Prometheus, so the solution is just to not run
-said queries.
-
-To protect against this more precisely, though, we should write resource
-constraints for services like this that are vulnerable to CPU exhaustion
-or memory consumption, which are the causes of our two past outages as
-well.
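-
-As a sketch of what such a constraint could look like on a container
-spec in a Deployment manifest (the values here are illustrative, not our
-tuned settings):
-
-.. code:: yaml
-
-   # part of a container spec; requests reserve capacity, limits cap it
-   resources:
-     requests:
-       cpu: 250m
-       memory: 512Mi
-     limits:
-       cpu: "1"
-       memory: 1Gi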
-
-🤔 Lessons learned
-------------------
-
-*What did we learn from this incident?*
-
-- Don’t run large queries, it consumes CPU!
-- Write resource constraints for our services.
-
-☑️ Follow-up tasks
-------------------
-
-*List any tasks we should complete that are relevant to this incident*
-
-- ☒ Write resource constraints for our services.
diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
deleted file mode 100644
index f621782..0000000
--- a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
+++ /dev/null
@@ -1,155 +0,0 @@
-2021-01-12: Django site CPU/RAM exhaustion outage
-=================================================
-
-At 03:01 UTC on Tuesday 12th January we experienced a momentary outage
-of our PostgreSQL database, causing some very minor service downtime.
-
-⚠️ Leadup
----------
-
-*List the sequence of events that led to the incident*
-
-We deleted the Developers role, which led to a large user diff: every
-affected user needed their roles updated on the site.
-
-The bot was trying to post this for over 24 hours repeatedly after every
-restart.
-
-We deployed the bot at 2:55 UTC on 12th January and the user sync
-process began once again.
-
-This caused a CPU & RAM spike on our Django site, which in turn
-triggered an OOM error on the server which killed the Postgres process,
-sending it into a recovery state where queries could not be executed.
-
-The Django site did not have any tools in place to batch the requests,
-so it was trying to process all 80k user updates in a single query,
-something that PostgreSQL probably could handle, but not the Django ORM.
-During the incident, site jumped from its average RAM usage of 300-400MB
-to **1.5GB.**
-
-.. image:: ./images/2021-01-12/site_resource_abnormal.png
-
-RAM and CPU usage of site throughout the incident. The period just
-before 3:40 where no statistics were reported is the actual outage
-period where the Kubernetes node had some networking errors.
-
-🥏 Impact
----------
-
-*Describe how internal and external users were impacted during the
-incident*
-
-This database outage lasted mere minutes, since Postgres recovered and
-healed itself and the sync process was aborted, but it did leave us with
-a large user diff and our database becoming further out of sync.
-
-Most services that did not depend on PostgreSQL stayed up, and the site
-remained stable after the sync had been cancelled.
-
-👁️ Detection
----------------
-
-*Report when the team detected the incident, and how we could improve
-detection time*
-
-We were immediately alerted to the PostgreSQL outage on Grafana and
-through Sentry, meaning our response time was under a minute.
-
-We reduced some alert thresholds in order to catch RAM & CPU spikes
-faster in the future.
-
-It was hard to immediately see the cause of things since there is
-minimal logging on the site and the bot logs gave no indication that
-anything was at fault, so our only detection was through machine
-metrics.
-
-We did manage to recover exactly what PostgreSQL was trying to do at the
-time of crashing by examining the logs which pointed us towards the user
-sync process.
-
-🙋🏿‍♂️ Response
------------------------
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded to the issue. There were no real obstacles
-encountered other than the node being less performant than we would like
-due to the CPU starvation.
-
-🙆🏽‍♀️ Recovery
----------------------------
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-The incident was resolved by stopping the sync process and writing a
-more efficient one through an internal eval script. We batched the
-updates into groups of 1,000 users, and instead of one large update we
-did 80 smaller ones. This led to much higher efficiency at the cost of
-taking a little longer (~7 minutes).
-
-.. code:: python
-
-    # Run via the bot's internal eval command, so `ctx` and `bot` are
-    # already in scope.
-    from bot.exts.backend.sync import _syncers
-
-    syncer = _syncers.UserSyncer
-    diff = await syncer._get_diff(ctx.guild)
-
-    def chunks(lst, n):
-        # Yield successive n-sized chunks from lst.
-        for i in range(0, len(lst), n):
-            yield lst[i:i + n]
-
-    # PATCH the user diff in batches of 1,000 instead of one huge request.
-    for chunk in chunks(diff.updated, 1000):
-        await bot.api_client.patch("bot/users/bulk_patch", json=chunk)
-
-Resource limits were also put into place on site to prevent RAM and CPU
-spikes, and throttle the CPU usage in these situations. This can be seen
-in the below graph:
-
-.. image:: ./images/2021-01-12/site_cpu_throttle.png
-
-CPU throttling is where a container has hit its limits and we need to
-reel it in. Ideally this value stays as close to 0 as possible; however,
-as you can see, site hit this twice (during the periods where it was
-trying to sync 80k users at once).
-
-🔎 Five Why’s
----------------------------
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-- We experienced a major PostgreSQL outage
-- PostgreSQL was killed by the system OOM due to the RAM spike on site.
-- The RAM spike on site was caused by a large query.
-- This was because we do not chunk queries on the bot.
-- The large query was caused by the removal of the Developers role
- resulting in 80k users needing updating.
-
-🌱 Blameless root cause
------------------------
-
-*Note the final root cause and describe what needs to change to prevent
-reoccurrence*
-
-The removal of the Developers role created a large diff which could not
-be applied by Django in a single request.
-
-See the follow-up tasks on exactly how we can avoid this in future; it’s
-a relatively easy mitigation.
-
-🤔 Lessons learned
------------------------
-
-*What did we learn from this incident?*
-
-- Django (or DRF) does not like huge update queries.
-
-☑️ Follow-up tasks
-------------------
-
-*List any tasks we should complete that are relevant to this incident*
-
-- ☒ Make the bot syncer more efficient (batch requests)
-- ☐ Increase logging on bot, state when an error has been hit (we had
- no indication of this inside Discord, we need that)
-- ☒ Adjust resource alerts to page DevOps members earlier.
-- ☒ Apply resource limits to site to prevent major spikes
diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
deleted file mode 100644
index b13ecd7..0000000
--- a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
+++ /dev/null
@@ -1,146 +0,0 @@
-2021-01-30: NodeBalancer networking faults due to memory pressure
-=================================================================
-
-At around 14:30 UTC on Saturday 30th January we started experiencing
-networking issues at the LoadBalancer level between Cloudflare and our
-Kubernetes cluster. It seems that the misconfiguration was due to memory
-and CPU pressure.
-
-*(Struck out following the update below.)* This post-mortem is
-preliminary; we are still awaiting word from Linode’s SysAdmins on any
-problems they detected.
-
-**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a
-different machine.
-
-⚠️ Leadup
----------
-
-*List the sequence of events that led to the incident*
-
-At 14:30 we started receiving alerts that services were becoming
-unreachable. We first experienced some momentary DNS errors which
-resolved themselves; however, traffic ingress was still degraded.
-
-Upon checking Linode, our NodeBalancer, the service which balances
-traffic between our Kubernetes nodes, was reporting its backends (the
-services it balances to) as down. It reported all 4 as down (two for
-port 80 + two for port 443). This status was fluctuating between up and
-down, meaning traffic was not reaching our cluster correctly. Scaleios
-correctly noted:
-
-.. image:: ./images/2021-01-30/scaleios.png
-
-The config seems to have been set incorrectly due to memory and CPU
-pressure on one of our nodes. Here is the memory usage throughout the
-incident:
-
-.. image:: ./images/2021-01-30/memory_charts.png
-
-Here is the display from Linode:
-
-.. image:: ./images/2021-01-30/linode_loadbalancers.png
-
-🥏 Impact
----------
-
-*Describe how internal and external users were impacted during the
-incident*
-
-Since traffic could not correctly enter our cluster, multiple web-based
-services were offline, including site, Grafana and Bitwarden. It appears
-that no inter-node communication was affected, as this uses a WireGuard
-tunnel between the nodes which was not affected by the NodeBalancer.
-
-The lack of Grafana made diagnosis slightly more difficult, but even
-then it was only a short trip to the
-
-👁️ Detection
-------------
-
-*Report when the team detected the incident, and how we could improve
-detection time*
-
-We were alerted fairly promptly through statping which reported services
-as being down and posted a Discord notification. Subsequent alerts came
-in from Grafana but were limited since outbound communication was
-faulty.
-
-🙋🏿‍♂️ Response
-----------------
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded!
-
-The primary obstacle was the DevOps tools being unavailable due to the
-traffic ingress problems.
-
-🙆🏽‍♀️ Recovery
-----------------
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-The incident resolved itself upstream at Linode. We’ve opened a ticket
-with Linode to let them know of the faults, which might give us a better
-indication of what caused the issues. Our Kubernetes cluster continued
-posting updates to Linode to refresh the NodeBalancer configuration;
-inspecting these payloads, the configuration looked correct.
-
-We’ve set up alerts for when Prometheus services stop responding, since
-this seems to be a fairly tell-tale symptom of networking problems. This
-was the Prometheus status graph throughout the incident:
-
-.. image:: ./images/2021-01-30/prometheus_status.png
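-
-A minimal sketch of such an alert (illustrative only; it relies on
-Prometheus’s built-in ``up`` metric, which is 0 for any target that
-fails a scrape):
-
-.. code:: yaml
-
-   - alert: TargetDown
-     # a scrape target has stopped responding to Prometheus
-     expr: up == 0
-     for: 2m
-     labels:
-       severity: page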
-
-🔎 Five Why’s
--------------
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-**What?** Our service experienced an outage due to networking faults.
-
-**Why?** Incoming traffic could not reach our Kubernetes nodes
-
-**Why?** Our Linode NodeBalancers were not using correct configuration
-
-**Why?** Memory & CPU pressure seemed to cause invalid configuration
-errors upstream at Linode.
-
-**Why?** Unknown at this stage, NodeBalancer migrated.
-
-🌱 Blameless root cause
------------------------
-
-*Note the final root cause and describe what needs to change to prevent
-reoccurrence*
-
-The configuration of our NodeBalancer was invalid. We cannot say why at
-this point since we are awaiting contact back from Linode, but
-indicators point to it being an upstream fault, since memory & CPU
-pressure should **not** cause a load balancer misconfiguration.
-
-Linode are going to follow up with us at some point during the week with
-information from their System Administrators.
-
-**Update 2nd February 2021:** Linode have concluded investigations at
-their end, taken notes and migrated our NodeBalancer to a new machine.
-We haven’t experienced problems since.
-
-🤔 Lessons learned
-------------------
-
-*What did we learn from this incident?*
-
-We should be careful not to over-schedule onto nodes, since even while
-operating within reasonable constraints we risk sending invalid
-configuration upstream to Linode and therefore preventing traffic from
-entering our cluster.
-
-☑️ Follow-up tasks
-------------------
-
-*List any tasks we should complete that are relevant to this incident*
-
-- ☒ Monitor for follow up from Linode
-- ☒ Carefully monitor the allocation rules for our services
diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
deleted file mode 100644
index b2e5cdf..0000000
--- a/docs/postmortems/2021-07-11-cascading-node-failures.rst
+++ /dev/null
@@ -1,335 +0,0 @@
-2021-07-11: Cascading node failures and ensuing volume problems
-===============================================================
-
-A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node
-to an unresponsive state (00:55 UTC); upon recycling the affected node,
-volumes were placed into a state where they could not be mounted.
-
-⚠️ Leadup
-----------
-
-*List the sequence of events that led to the incident*
-
-- **00:27 UTC:** Django starts rapidly using connections to our
- PostgreSQL database
-- **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated
-  its 115 max connections limit. Joe is paged.
-- **00:33 UTC:** DevOps team is alerted that a service has claimed 34
- dangerous table locks (it peaked at 61).
-- **00:42 UTC:** Status incident created and backdated to 00:25 UTC.
- `Status incident <https://status.pythondiscord.com/incident/92712>`__
-- **00:55 UTC:** It’s clear that the node which PostgreSQL was on is no
- longer healthy after the Django connection surge, so it’s recycled
- and a new one is to be added to the pool.
-- **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its
-  restart
-- **01:13 UTC:** Node has restored and regained healthy status, but
- volumes will not mount to the node. Support ticket opened at Linode
- for assistance.
-- **06:36 UTC:** DevOps team alerted that Python is offline. This is
- due to Redis being a dependency of the bot, which as a stateful
- service was not healthy.
-
-🥏 Impact
-----------
-
-*Describe how internal and external users were impacted during the
-incident*
-
-Initially, this manifested as a standard node outage where services on
-that node experienced some downtime as the node was restored.
-
-Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop)
-were unable to start due to the volume issues, and so any dependent
-services (e.g. Site, Bot, Hastebin) also had trouble starting.
-
-PostgreSQL was restored early on so for the most part Moderation could
-continue.
-
-👁️ Detection
----------------
-
-*Report when the team detected the incident, and how we could improve
-detection time*
-
-DevOps were initially alerted at 00:32 UTC due to the PostgreSQL
-connection surge, and acknowledged at the same time.
-
-Further alerting could be used to catch surges earlier on (looking at
-conn delta vs. conn total), but for the most part alerting time was
-satisfactory here.
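-
-As an illustration of what a delta-based rule could look like (a sketch
-only, assuming postgres_exporter’s ``pg_stat_activity_count`` gauge; the
-window and threshold are placeholders):
-
-.. code:: yaml
-
-   - alert: PostgresConnectionSurge
-     # fires on a rapid rise in connections rather than on the total
-     expr: sum(delta(pg_stat_activity_count[5m])) > 40
-     for: 1m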
-
-🙋🏿‍♂️ Response
------------------
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded. The primary issue encountered was failure upstream
-at Linode to remount the affected volumes; a support ticket has been
-created.
-
-🙆🏽‍♀️ Recovery
-------------------
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-Initial node restoration was performed by @Joe Banks by recycling the
-affected node.
-
-Subsequent volume restoration was also handled by @Joe Banks: once
-Linode had unlocked the volumes, the affected pods were scaled down to
-0, the volumes were unmounted at the Linode side, and then the
-deployments were recreated.
-
-.. raw:: html
-
- <details>
-
-.. raw:: html
-
- <summary>
-
-Support ticket sent
-
-.. raw:: html
-
- </summary>
-
-.. raw:: html
-
- <blockquote>
-
-Good evening,
-
-We experienced a resource surge on one of our Kubernetes nodes at 00:32
-UTC, causing a node to go unresponsive. To mitigate problems here the
-node was recycled and began restarting at 1:01 UTC.
-
-The node has now rejoined the ring and started picking up services, but
-volumes will not attach to it, meaning pods with stateful storage will
-not start.
-
-An example events log for one such pod:
-
-::
-
- Type Reason Age From Message
- ---- ------ ---- ---- -------
- Normal Scheduled 2m45s default-scheduler Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf
- Warning FailedMount 103s kubelet MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f]
- Warning FailedMount 43s kubelet Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition
-
-I’ve been trying to manually resolve this through the Linode Web UI but
-get presented with attachment errors upon doing so. Please could you
-advise on the best way forward to restore Volumes & Nodes to a
-functioning state? As far as I can see there is something going on
-upstream since the Linode UI presents these nodes as mounted however as
-shown above LKE nodes are not locating them, there is also a few failed
-attachment logs in the Linode Audit Log.
-
-Thanks,
-
-Joe
-
-.. raw:: html
-
- </blockquote>
-
-.. raw:: html
-
- </details>
-
-.. raw:: html
-
- <details>
-
-.. raw:: html
-
- <summary>
-
-Response received from Linode
-
-.. raw:: html
-
- </summary>
-
-.. raw:: html
-
- <blockquote>
-
-Hi Joe,
-
- Were there any known issues with Block Storage in Frankfurt today?
-
-Not today, though there were service issues reported for Block Storage
-and LKE in Frankfurt on July 8 and 9:
-
-- `Service Issue - Block Storage - EU-Central
- (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__
-- `Service Issue - Linode Kubernetes Engine -
- Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__
-
-There was also an API issue reported on the 10th (resolved on the 11th),
-mentioned here:
-
-- `Service Issue - Cloud Manager and
- API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__
-
-Regarding the specific error you were receiving:
-
- ``Unable to find device path out of attempted paths``
-
-I’m not certain it’s specifically related to those Service Issues,
-considering this isn’t the first time a customer has reported this error
-in their LKE logs. In fact, if I recall correctly, I’ve run across this
-before too, since our volumes are RWO and I had too many replicas in my
-deployment that I was trying to attach to, for example.
-
- is this a known bug/condition that occurs with Linode CSI/LKE?
-
-From what I understand, yes, this is a known condition that crops up
-from time to time, which we are tracking. However, since there is a
-workaround at the moment (e.g. - “After some more manual attempts to fix
-things, scaling down deployments, unmounting at Linode and then scaling
-up the deployments seems to have worked and all our services have now
-been restored.”), there is no ETA for addressing this. With that said,
-I’ve let our Storage team know that you’ve run into this, so as to draw
-further attention to it.
-
-If you have any further questions or concerns regarding this, let us
-know.
-
-Best regards, [Redacted]
-
-Linode Support Team
-
-.. raw:: html
-
- </blockquote>
-
-.. raw:: html
-
- </details>
-
-.. raw:: html
-
- <details>
-
-.. raw:: html
-
- <summary>
-
-Concluding response from Joe Banks
-
-.. raw:: html
-
- </summary>
-
-.. raw:: html
-
- <blockquote>
-
-Hey [Redacted]!
-
-Thanks for the response. We ensure that stateful pods only ever have one
-volume assigned to them, either with a single replica deployment or a
-statefulset. It appears that the error generally manifests when a
-deployment is being migrated from one node to another during a redeploy,
-which makes sense if there is some delay on the unmount/remount.
-
-Confusion occurred because Linode was reporting the volume as attached
-when the node had been recycled, but I assume that was because the node
-did not cleanly shutdown and therefore could not cleanly unmount
-volumes.
-
-We’ve not seen any resurgence of such issues, and we’ll address the
-software fault which overloaded the node which will helpfully mitigate
-such problems in the future.
-
-Thanks again for the response, have a great week!
-
-Best,
-
-Joe
-
-.. raw:: html
-
- </blockquote>
-
-.. raw:: html
-
- </details>
-
-🔎 Five Why’s
----------------
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-**What?**
-~~~~~~~~~
-
-Several of our services became unavailable because their volumes could
-not be mounted.
-
-Why?
-~~~~
-
-A node recycle left the node unable to mount volumes using the Linode
-CSI.
-
-.. _why-1:
-
-Why?
-~~~~
-
-A node recycle was used because PostgreSQL had a connection surge.
-
-.. _why-2:
-
-Why?
-~~~~
-
-A Django feature deadlocked a table 62 times and suddenly started using
-~70 connections to the database, saturating the maximum connections
-limit.
-
-.. _why-3:
-
-Why?
-~~~~
-
-The root cause of why Django does this is unclear, and someone with more
-Django proficiency is absolutely welcome to share any knowledge they may
-have. I presume it’s some sort of worker race condition, but I’ve not
-been able to reproduce it.
-
-🌱 Blameless root cause
------------------------
-
-*Note the final root cause and describe what needs to change to prevent
-reoccurrence*
-
-A node being forcefully restarted left volumes in a limbo state where
-mounting was difficult, it took multiple hours for this to be resolved
-since we had to wait for the volumes to unlock so they could be cloned.
-
-🤔 Lessons learned
-------------------
-
-*What did we learn from this incident?*
-
-Volumes are painful.
-
-We need to look at why Django is doing this, and at mitigations for the
-fault, to prevent this from occurring again.
-
-☑️ Follow-up tasks
-------------------
-
-*List any tasks we should complete that are relevant to this incident*
-
-- ☒ `Follow up on ticket at
- Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__
-- ☐ Investigate why Django could be connection surging and locking
- tables
diff --git a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png b/docs/postmortems/images/2021-01-12/site_cpu_throttle.png
deleted file mode 100644
index b530ec6..0000000
--- a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png
+++ /dev/null
Binary files differ
diff --git a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png b/docs/postmortems/images/2021-01-12/site_resource_abnormal.png
deleted file mode 100644
index e1e07af..0000000
--- a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png
+++ /dev/null
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png b/docs/postmortems/images/2021-01-30/linode_loadbalancers.png
deleted file mode 100644
index f0eae1f..0000000
--- a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png
+++ /dev/null
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/memory_charts.png b/docs/postmortems/images/2021-01-30/memory_charts.png
deleted file mode 100644
index 370d19e..0000000
--- a/docs/postmortems/images/2021-01-30/memory_charts.png
+++ /dev/null
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/prometheus_status.png b/docs/postmortems/images/2021-01-30/prometheus_status.png
deleted file mode 100644
index e95b8d7..0000000
--- a/docs/postmortems/images/2021-01-30/prometheus_status.png
+++ /dev/null
Binary files differ
diff --git a/docs/postmortems/images/2021-01-30/scaleios.png b/docs/postmortems/images/2021-01-30/scaleios.png
deleted file mode 100644
index 584d74d..0000000
--- a/docs/postmortems/images/2021-01-30/scaleios.png
+++ /dev/null
Binary files differ
diff --git a/docs/postmortems/index.rst b/docs/postmortems/index.rst
deleted file mode 100644
index e28dc7a..0000000
--- a/docs/postmortems/index.rst
+++ /dev/null
@@ -1,15 +0,0 @@
-Postmortems
-===========
-
-Browse the pages under this category to view historical postmortems for
-Python Discord outages.
-
-.. toctree::
- :maxdepth: 1
-
- 2020-12-11-all-services-outage
- 2020-12-11-postgres-conn-surge
- 2021-01-10-primary-kubernetes-node-outage
- 2021-01-12-site-cpu-ram-exhaustion
- 2021-01-30-nodebalancer-fails-memory
- 2021-07-11-cascading-node-failures