author     2024-08-07 18:41:02 +0100
committer  2024-08-07 18:41:02 +0100
commit     dcbb78959177537cf1fdda813380996a4b2daf8f (patch)
tree       0a53ded19896aaddf93cc8f1e4ff34ac3f70464e /docs/postmortems
parent     Revert "Enable fail2ban jails for postfix" (diff)
Remove old documentation
Diffstat (limited to 'docs/postmortems')

 -rw-r--r--  docs/postmortems/2020-12-11-all-services-outage.rst             |  121 -
 -rw-r--r--  docs/postmortems/2020-12-11-postgres-conn-surge.rst             |  130 -
 -rw-r--r--  docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst  |  117 -
 -rw-r--r--  docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst         |  155 -
 -rw-r--r--  docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst       |  146 -
 -rw-r--r--  docs/postmortems/2021-07-11-cascading-node-failures.rst         |  335 -
 -rw-r--r--  docs/postmortems/images/2021-01-12/site_cpu_throttle.png        |  bin 227245 -> 0 bytes
 -rw-r--r--  docs/postmortems/images/2021-01-12/site_resource_abnormal.png   |  bin 232260 -> 0 bytes
 -rw-r--r--  docs/postmortems/images/2021-01-30/linode_loadbalancers.png     |  bin 50882 -> 0 bytes
 -rw-r--r--  docs/postmortems/images/2021-01-30/memory_charts.png            |  bin 211053 -> 0 bytes
 -rw-r--r--  docs/postmortems/images/2021-01-30/prometheus_status.png        |  bin 291122 -> 0 bytes
 -rw-r--r--  docs/postmortems/images/2021-01-30/scaleios.png                 |  bin 18294 -> 0 bytes
 -rw-r--r--  docs/postmortems/index.rst                                      |   15 -
13 files changed, 0 insertions, 1019 deletions
diff --git a/docs/postmortems/2020-12-11-all-services-outage.rst b/docs/postmortems/2020-12-11-all-services-outage.rst
deleted file mode 100644
index 9c29303..0000000
--- a/docs/postmortems/2020-12-11-all-services-outage.rst
+++ /dev/null
@@ -1,121 +0,0 @@

2020-12-11: All services outage
===============================

At **19:55 UTC, all services became unresponsive**. The DevOps team were
already in a call, and immediately started to investigate.

Postgres was running at 100% CPU usage due to a **VACUUM**, which caused
all services that depended on it to stop working. The high CPU left the
host unresponsive and it shut down. Linode Lassie noticed this and
triggered a restart.

It did not recover gracefully from this restart, with numerous core
services reporting an error, so we had to manually restart core system
services using Lens in order to get things working again.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

Postgres triggered an **AUTOVACUUM**, which led to a CPU spike. This
made Postgres run at 100% CPU and become unresponsive, which caused
services to stop responding. This led to a restart of the node, from
which we did not recover gracefully.

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

All services went down. Catastrophic failure. We did not pass go, we did
not collect $200.

- Help channel system unavailable, so people are not able to
  effectively ask for help.
- Gates unavailable, so people can’t successfully get into the
  community.
- Moderation and raid prevention unavailable, which leaves us
  defenseless against attacks.

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

We noticed that all PyDis services had stopped responding;
coincidentally, our DevOps team were in a call at the time, so that was
helpful.

We may be able to improve detection time by adding monitoring of
resource usage. To this end, we’ve added alerts for high CPU usage and
low memory.
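
The alerts referred to here are Prometheus rules. Below is a minimal
sketch of what they could look like, assuming node-level metrics from a
standard node_exporter setup; the metric names are standard, but the
thresholds and routing shown are illustrative rather than our actual
values:

.. code:: yaml

    groups:
      - name: node-resources
        rules:
          # Page when average CPU usage on a node stays above 90% for 10 minutes.
          - alert: NodeHighCpuUsage
            expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
            for: 10m
            labels:
              severity: page
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
          # Page when less than 10% of a node's memory remains available.
          - alert: NodeLowMemory
            expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
            for: 10m
            labels:
              severity: page
            annotations:
              summary: "Low available memory on {{ $labels.instance }}"
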
🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the incident.

We noticed our node was entirely unresponsive, and within minutes a
restart had been triggered by Lassie after a high CPU shutdown occurred.

The node came back and we saw a number of core services offline
(e.g. Calico, CoreDNS, Linode CSI).

**Obstacle: no recent database back-up available**

🙆🏽♀️ Recovery
-----------------

*How was the incident resolved? How can we improve future mitigation
times?*

Through `Lens <https://k8slens.dev/>`__ we restarted core services one
by one until they stabilised; after these core services were up, other
services began to come back online.

We finally provisioned PostgreSQL, which had been removed as a component
before the restart (but too late to prevent the CPU errors). Once
PostgreSQL was up we restarted any components that were acting buggy
(e.g. site and bot).

🔎 Five Why’s
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

- Major service outage
- **Why?** Core service failures (e.g. Calico, CoreDNS, Linode CSI)
- **Why?** Kubernetes worker node restart
- **Why?** High CPU shutdown
- **Why?** Intensive PostgreSQL AUTOVACUUM caused a CPU spike

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
reoccurrence*

🤔 Lessons learned
------------------

*What did we learn from this incident?*

- We must ensure we have working database backups. We are lucky that we
  did not lose any data this time. If this problem had caused volume
  corruption, we would be screwed.
- Sentry is broken for the bot. It was missing a DSN secret, which we
  have now restored.
- The https://sentry.pydis.com redirect was never migrated to the
  cluster. **We should do that.**

☑️ Follow-up tasks
------------------

*List any tasks we’ve created as a result of this incident*

- ☒ Push forward with backup plans

diff --git a/docs/postmortems/2020-12-11-postgres-conn-surge.rst b/docs/postmortems/2020-12-11-postgres-conn-surge.rst
deleted file mode 100644
index 6ebcb01..0000000
--- a/docs/postmortems/2020-12-11-postgres-conn-surge.rst
+++ /dev/null
@@ -1,130 +0,0 @@

2020-12-11: Postgres connection surge
=====================================

At **13:24 UTC,** we noticed the bot was not able to infract, and
`pythondiscord.com <http://pythondiscord.com>`__ was unavailable. The
DevOps team started to investigate.

We discovered that Postgres was not accepting new connections because it
had hit 100 clients. This made it unavailable to all services that
depended on it.

Ultimately this was resolved by taking down Postgres, remounting the
associated volume, and bringing it back up again.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

The bot infractions stopped working, and we started investigating.

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

Services were unavailable both for internal and external users.

- The Help Channel System was unavailable.
- Voice Gate and Server Gate were not working.
- Moderation commands were unavailable.
- Python Discord site & API were unavailable. CloudFlare automatically
  switched us to Always Online.

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

We noticed HTTP 524s coming from CloudFlare; upon attempting a database
connection we observed the maximum client limit.

We noticed this log in site:

.. code:: text

    django.db.utils.OperationalError: FATAL: sorry, too many clients already

We should be monitoring the number of clients, and the monitor should
alert us when we’re approaching the max. That would have allowed for
earlier detection, and possibly allowed us to prevent the incident
altogether.

We will look at
`wrouesnel/postgres_exporter <https://github.com/wrouesnel/postgres_exporter>`__
for monitoring this.

🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the incident. The obstacles were mostly a lack of
a clear response strategy.

We should document our recovery procedure so that we’re not so dependent
on Joe Banks should this happen again while he’s unavailable.

🙆🏽♀️ Recovery
----------------

*How was the incident resolved? How can we improve future mitigation?*

- Delete the PostgreSQL deployment: ``kubectl delete deployment/postgres``
- Delete any remaining pods, WITH force:
  ``kubectl delete <pod name> --force --grace-period=0``
- Unmount the volume at Linode
- Remount the volume at Linode
- Reapply the deployment: ``kubectl apply -f postgres/deployment.yaml``

🔎 Five Why’s
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

- Postgres was unavailable, so our services died.
- **Why?** Postgres hit max clients, and could not respond.
- **Why?** Unknown, but we saw a number of connections from previous
  deployments of site. This indicates that database connections are not
  being terminated properly. Needs further investigation.

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
reoccurrence*

We’re not sure what the root cause is, but we suspect site is not
terminating database connections properly in some cases. We were unable
to reproduce this problem.

We’ve set up new telemetry on Grafana with alerts so that we can
investigate this more closely. We will be notified if the number of
connections from site exceeds 32, or if the total number of connections
exceeds 90.
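
A minimal sketch of what such an alert could look like as a Prometheus
rule, assuming the ``pg_stat_activity_count`` metric exposed by
postgres_exporter (the exact metric name, labels and thresholds depend
on the exporter version and our scrape configuration):

.. code:: yaml

    groups:
      - name: postgres-connections
        rules:
          # Warn well before Postgres hits its ~100 client limit so we can
          # intervene instead of finding out via "too many clients already".
          - alert: PostgresConnectionsNearLimit
            # pg_stat_activity_count is reported per database/state; sum it
            # to get the total number of clients currently connected.
            expr: sum(pg_stat_activity_count) > 90
            for: 2m
            labels:
              severity: page
            annotations:
              summary: "Postgres is using {{ $value }} of its ~100 allowed connections"
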
🤔 Lessons learned
------------------

*What did we learn from this incident?*

- We must ensure the DevOps team has access to Linode and other key
  services even if our Bitwarden is down.
- We need to ensure we’re alerted of any risk factors that have the
  potential to make Postgres unavailable, since this causes a
  catastrophic outage of practically all services.
- We absolutely need backups for the databases, so that this sort of
  problem carries less of a risk.
- We may need to consider something like
  `pg_bouncer <https://wiki.postgresql.org/wiki/PgBouncer>`__ to manage
  a connection pool so that we don’t exceed 100 *legitimate* clients
  connected as we connect more services to the postgres database.

☑️ Follow-up tasks
------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ Back up all databases

diff --git a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
deleted file mode 100644
index 5852c46..0000000
--- a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
+++ /dev/null
@@ -1,117 +0,0 @@

2021-01-10: Primary Kubernetes node outage
==========================================

We had an outage of our highest-spec node due to CPU exhaustion. The
outage lasted from around 20:20 to 20:46 UTC, but was not a full service
outage.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

I ran a query on Prometheus to try to figure out some statistics on the
number of metrics we are holding; this ended up scanning a lot of data
in the TSDB database that Prometheus uses.

This scan caused CPU exhaustion, which caused issues with the Kubernetes
node status.

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

This brought down the primary node, which meant there was some service
outage. Most services transferred successfully to our secondary node,
which kept up some key services such as the Moderation bot and Modmail
bot, as well as MongoDB.

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

This was noticed when Discord services started having failures. The
primary detection was through alerts though! I was paged 1 minute after
we started encountering CPU exhaustion issues.

🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the incident.

No major obstacles were encountered during this.

🙆🏽♀️ Recovery
----------------

*How was the incident resolved? How can we improve future mitigation?*

It was noted that in the response to ``kubectl get nodes`` the primary
node’s status was reported as ``NotReady``. Looking into the reason, it
was because the node had stopped responding.

The quickest way to fix this was triggering a node restart. This shifted
a lot of pods over to node 2, which encountered some capacity issues
since it’s not as highly specified as the first node.

I brought the first node back by restarting it at Linode’s end. Once
this node was reporting as ``Ready`` again I drained the second node by
running ``kubectl drain lke13311-20304-5ffa4d11faab``. This command
stops the node from being available for scheduling and moves existing
pods onto other nodes.

Services gradually recovered as the dependencies started. The incident
lasted around 26 minutes overall, though this was not a complete outage
for the whole time and the bot remained functional throughout (meaning
systems like the help channels were still functional).

🔎 Five Why’s
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

**Why?** Partial service outage

**Why?** We had a node outage.

**Why?** CPU exhaustion of our primary node.

**Why?** Large Prometheus query using a lot of CPU.

**Why?** Prometheus had to scan millions of TSDB records, which consumed
all cores.

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
reoccurrence*

A large query was run on Prometheus, so the solution is simply to not
run said queries.

To protect against this more precisely, though, we should write resource
constraints for services like this that are vulnerable to CPU exhaustion
or memory consumption, which were the causes of our two past outages as
well.
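
For reference, resource constraints in Kubernetes are declared per
container. The snippet below is a sketch of what that could look like
for a deployment such as Prometheus; the image tag and numbers are
illustrative, not the values we actually applied:

.. code:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: prometheus
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          containers:
            - name: prometheus
              image: prom/prometheus:v2.26.0
              resources:
                # Requests inform scheduling decisions.
                requests:
                  cpu: 500m
                  memory: 1Gi
                # Limits cap usage so one workload cannot exhaust the node.
                limits:
                  cpu: "2"
                  memory: 2Gi
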
🤔 Lessons learned
------------------

*What did we learn from this incident?*

- Don’t run large queries, it consumes CPU!
- Write resource constraints for our services.

☑️ Follow-up tasks
------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ Write resource constraints for our services.

diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
deleted file mode 100644
index f621782..0000000
--- a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
+++ /dev/null
@@ -1,155 +0,0 @@

2021-01-12: Django site CPU/RAM exhaustion outage
=================================================

At 03:01 UTC on Tuesday 12th January we experienced a momentary outage
of our PostgreSQL database, causing some very minor service downtime.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

We deleted the Developers role, which led to a large user diff for all
the users where we had to update their roles on the site.

The bot had been trying to post this diff repeatedly for over 24 hours,
after every restart.

We deployed the bot at 2:55 UTC on 12th January and the user sync
process began once again.

This caused a CPU & RAM spike on our Django site, which in turn
triggered an OOM error on the server that killed the Postgres process,
sending it into a recovery state where queries could not be executed.

The Django site did not have any tools in place to batch the requests,
so it was trying to process all 80k user updates in a single query,
something that PostgreSQL probably could handle, but not the Django ORM.
During the incident, site jumped from its average RAM usage of 300-400MB
to **1.5GB.**

.. image:: ./images/2021-01-12/site_resource_abnormal.png

RAM and CPU usage of site throughout the incident. The period just
before 3:40 where no statistics were reported is the actual outage
period, where the Kubernetes node had some networking errors.

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

This database outage lasted mere minutes, since Postgres recovered and
healed itself and the sync process was aborted, but it did leave us with
a large user diff and our database becoming further out of sync.

Most services that did not depend on PostgreSQL stayed up, and the site
remained stable after the sync had been cancelled.

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

We were immediately alerted to the PostgreSQL outage on Grafana and
through Sentry, meaning our response time was under a minute.

We reduced some alert thresholds in order to catch RAM & CPU spikes
faster in the future.

It was hard to immediately see the cause of things since there is
minimal logging on the site, and the bot logs did not make it evident
that anything was at fault, so our only detection was through machine
metrics.

We did manage to recover exactly what PostgreSQL was trying to do at the
time of crashing by examining the logs, which pointed us towards the
user sync process.

🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the issue. There were no real obstacles
encountered, other than the node being less performant than we would
like due to the CPU starvation.

🙆🏽♀️ Recovery
----------------

*How was the incident resolved? How can we improve future mitigation?*

The incident was resolved by stopping the sync process and writing a
more efficient one through an internal eval script. We batched the
updates into chunks of 1,000 users and, instead of doing one large
update, did 80 smaller ones. This led to much higher efficiency at the
cost of taking a little longer (~7 minutes).

.. code:: python

    from bot.exts.backend.sync import _syncers
    syncer = _syncers.UserSyncer
    diff = await syncer._get_diff(ctx.guild)

    def chunks(lst, n):
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    for chunk in chunks(diff.updated, 1000):
        await bot.api_client.patch("bot/users/bulk_patch", json=chunk)

Resource limits were also put into place on site to prevent RAM and CPU
spikes, and to throttle the CPU usage in these situations. This can be
seen in the graph below:

.. image:: ./images/2021-01-12/site_cpu_throttle.png

CPU throttling is where a container has hit its limits and we need to
reel it in. Ideally this value stays as close to 0 as possible; however,
as you can see, site hit this twice (during the periods where it was
trying to sync 80k users at once).

🔎 Five Why’s
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

- We experienced a major PostgreSQL outage.
- PostgreSQL was killed by the system OOM killer due to the RAM spike on
  site.
- The RAM spike on site was caused by a large query.
- This was because we do not chunk queries on the bot.
- The large query was caused by the removal of the Developers role,
  resulting in 80k users needing updating.

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
reoccurrence*

The removal of the Developers role created a large diff which could not
be applied by Django in a single request.

See the follow-up tasks for exactly how we can avoid this in future;
it’s a relatively easy mitigation.

🤔 Lessons learned
------------------

*What did we learn from this incident?*

- Django (or DRF) does not like huge update queries.

☑️ Follow-up tasks
------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ Make the bot syncer more efficient (batch requests)
- ☐ Increase logging on the bot; state when an error has been hit (we
  had no indication of this inside Discord, we need that)
- ☒ Adjust resource alerts to page DevOps members earlier.
- ☒ Apply resource limits to site to prevent major spikes

diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
deleted file mode 100644
index b13ecd7..0000000
--- a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
+++ /dev/null
@@ -1,146 +0,0 @@

2021-01-30: NodeBalancer networking faults due to memory pressure
=================================================================

At around 14:30 UTC on Saturday 30th January we started experiencing
networking issues at the LoadBalancer level between Cloudflare and our
Kubernetes cluster. It seems that the misconfiguration was due to memory
and CPU pressure.

[STRIKEOUT:This post-mortem is preliminary, we are still awaiting word
from Linode’s SysAdmins on any problems they detected.]

**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a
different machine.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

At 14:30 we started receiving alerts that services were becoming
unreachable. We first experienced some momentary DNS errors which
resolved themselves; however, traffic ingress was still degraded.

Upon checking Linode, our NodeBalancer (the service which balances
traffic between our Kubernetes nodes) was reporting the backends (the
services it balances to) as down. It reported all 4 as down (two for
port 80 + two for port 443). This status was fluctuating between up and
down, meaning traffic was not reaching our cluster correctly. Scaleios
correctly noted:

.. image:: ./images/2021-01-30/scaleios.png

The config seems to have been set incorrectly due to memory and CPU
pressure on one of our nodes. Here is the memory throughout the
incident:

.. image:: ./images/2021-01-30/memory_charts.png

Here is the display from Linode:

.. image:: ./images/2021-01-30/linode_loadbalancers.png

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

Since traffic could not correctly enter our cluster, multiple web-based
services were offline, including site, Grafana and Bitwarden. It appears
that no inter-node communication was affected, as this uses a WireGuard
tunnel between the nodes which was not affected by the NodeBalancer.

The lack of Grafana made diagnosis slightly more difficult, but even
then it was only a short trip to the

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

We were alerted fairly promptly through statping, which reported
services as being down and posted a Discord notification. Subsequent
alerts came in from Grafana but were limited since outbound
communication was faulty.

🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded!

The primary obstacle was the DevOps tools being out due to the traffic
ingress problems.

🙆🏽♀️ Recovery
----------------

*How was the incident resolved? How can we improve future mitigation?*

The incident resolved itself upstream at Linode. We’ve opened a ticket
with Linode to let them know of the faults, which might give us a better
indication of what caused the issues. Our Kubernetes cluster continued
posting updates to Linode to refresh the NodeBalancer configuration;
inspecting these payloads, the configuration looked correct.

We’ve set up alerts for when Prometheus scrape targets stop responding,
since this seems to be a fairly tell-tale symptom of networking
problems. This was the Prometheus status graph throughout the incident:

.. image:: ./images/2021-01-30/prometheus_status.png
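
The alert described above boils down to a check on Prometheus’s own
``up`` metric. A minimal sketch, with an illustrative threshold and
labels:

.. code:: yaml

    groups:
      - name: scrape-health
        rules:
          # Fire when any scrape target has been unreachable for 5 minutes.
          # Widespread failures of this alert usually point at a networking
          # problem rather than at the individual services themselves.
          - alert: TargetDown
            expr: up == 0
            for: 5m
            labels:
              severity: page
            annotations:
              summary: "{{ $labels.job }} target {{ $labels.instance }} is not responding to scrapes"
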
🔎 Five Why’s
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

**What?** Our service experienced an outage due to networking faults.

**Why?** Incoming traffic could not reach our Kubernetes nodes.

**Why?** Our Linode NodeBalancers were not using correct configuration.

**Why?** Memory & CPU pressure seemed to cause invalid configuration
errors upstream at Linode.

**Why?** Unknown at this stage; NodeBalancer migrated.

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
reoccurrence*

The configuration of our NodeBalancer was invalid. We cannot say why at
this point, since we are awaiting contact back from Linode, but
indicators point to it being an upstream fault, since memory & CPU
pressure should **not** cause a load balancer misconfiguration.

Linode are going to follow up with us at some point during the week with
information from their System Administrators.

**Update 2nd February 2021:** Linode have concluded investigations at
their end, taken notes and migrated our NodeBalancer to a new machine.
We haven’t experienced problems since.

🤔 Lessons learned
------------------

*What did we learn from this incident?*

We should be careful about over-scheduling onto nodes, since even while
operating within reasonable constraints we risk sending invalid
configuration upstream to Linode and therefore preventing traffic from
entering our cluster.

☑️ Follow-up tasks
------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ Monitor for follow-up from Linode
- ☒ Carefully monitor the allocation rules for our services

diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
deleted file mode 100644
index b2e5cdf..0000000
--- a/docs/postmortems/2021-07-11-cascading-node-failures.rst
+++ /dev/null
@@ -1,335 +0,0 @@

2021-07-11: Cascading node failures and ensuing volume problems
===============================================================

A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node
to an unresponsive state (00:55 UTC); upon performing a recycle of the
affected node, volumes were placed into a state where they could not be
mounted.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

- **00:27 UTC:** Django starts rapidly using connections to our
  PostgreSQL database.
- **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated
  its 115 max connections limit. Joe is paged.
- **00:33 UTC:** DevOps team is alerted that a service has claimed 34
  dangerous table locks (it peaked at 61).
- **00:42 UTC:** Status incident created and backdated to 00:25 UTC.
  `Status incident <https://status.pythondiscord.com/incident/92712>`__
- **00:55 UTC:** It’s clear that the node which PostgreSQL was on is no
  longer healthy after the Django connection surge, so it’s recycled
  and a new one is to be added to the pool.
- **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its
  restart.
- **01:13 UTC:** Node has restored and regained healthy status, but
  volumes will not mount to the node. Support ticket opened at Linode
  for assistance.
- **06:36 UTC:** DevOps team alerted that Python is offline. This is
  due to Redis being a dependency of the bot, which as a stateful
  service was not healthy.

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

Initially, this manifested as a standard node outage where services on
that node experienced some downtime as the node was restored.

Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop)
could not run due to the volume issues, and so any dependent services
(e.g. Site, Bot, Hastebin) also had trouble starting.

PostgreSQL was restored early on, so for the most part Moderation could
continue.

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

DevOps were initially alerted at 00:32 UTC due to the PostgreSQL
connection surge, and acknowledged at the same time.

Further alerting could be used to catch surges earlier on (looking at
conn delta vs. conn total), but for the most part alerting time was
satisfactory here.
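
For illustration, a surge-oriented rule of that kind could compare the
recent change in connection count rather than the absolute total. This
sketch again assumes the ``pg_stat_activity_count`` metric from
postgres_exporter, with an arbitrary threshold:

.. code:: yaml

    groups:
      - name: postgres-connection-surge
        rules:
          # Fire on a rapid rise in connections, even if the total is still
          # below the max_connections limit.
          - alert: PostgresConnectionSurge
            expr: sum(delta(pg_stat_activity_count[5m])) > 30
            for: 1m
            labels:
              severity: page
            annotations:
              summary: "Postgres connection count rose by {{ $value }} in the last 5 minutes"
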
🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded. The primary issue encountered was failure upstream
at Linode to remount the affected volumes; a support ticket has been
created.

🙆🏽♀️ Recovery
----------------

*How was the incident resolved? How can we improve future mitigation?*

Initial node restoration was performed by @Joe Banks by recycling the
affected node.

Subsequent volume restoration was also performed by @Joe Banks: once
Linode had unlocked the volumes, affected pods were scaled down to 0,
the volumes were unmounted at the Linode side and then the deployments
were recreated.

.. raw:: html

   <details>

.. raw:: html

   <summary>

Support ticket sent

.. raw:: html

   </summary>

.. raw:: html

   <blockquote>

Good evening,

We experienced a resource surge on one of our Kubernetes nodes at 00:32
UTC, causing a node to go unresponsive. To mitigate problems here the
node was recycled and began restarting at 1:01 UTC.

The node has now rejoined the ring and started picking up services, but
volumes will not attach to it, meaning pods with stateful storage will
not start.

An example events log for one such pod:

::

   Type     Reason       Age    From               Message
   ----     ------       ----   ----               -------
   Normal   Scheduled    2m45s  default-scheduler  Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf
   Warning  FailedMount  103s   kubelet            MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f]
   Warning  FailedMount  43s    kubelet            Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition

I’ve been trying to manually resolve this through the Linode Web UI but
get presented with attachment errors upon doing so. Please could you
advise on the best way forward to restore Volumes & Nodes to a
functioning state? As far as I can see there is something going on
upstream since the Linode UI presents these nodes as mounted however as
shown above LKE nodes are not locating them, there is also a few failed
attachment logs in the Linode Audit Log.

Thanks,

Joe

.. raw:: html

   </blockquote>

.. raw:: html

   </details>

.. raw:: html

   <details>

.. raw:: html

   <summary>

Response received from Linode

.. raw:: html

   </summary>

.. raw:: html

   <blockquote>

Hi Joe,

   Were there any known issues with Block Storage in Frankfurt today?

Not today, though there were service issues reported for Block Storage
and LKE in Frankfurt on July 8 and 9:

- `Service Issue - Block Storage - EU-Central
  (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__
- `Service Issue - Linode Kubernetes Engine -
  Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__

There was also an API issue reported on the 10th (resolved on the 11th),
mentioned here:

- `Service Issue - Cloud Manager and
  API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__

Regarding the specific error you were receiving:

   ``Unable to find device path out of attempted paths``

I’m not certain it’s specifically related to those Service Issues,
considering this isn’t the first time a customer has reported this error
in their LKE logs. In fact, if I recall correctly, I’ve run across this
before too, since our volumes are RWO and I had too many replicas in my
deployment that I was trying to attach to, for example.

   is this a known bug/condition that occurs with Linode CSI/LKE?

From what I understand, yes, this is a known condition that crops up
from time to time, which we are tracking. However, since there is a
workaround at the moment (e.g. “After some more manual attempts to fix
things, scaling down deployments, unmounting at Linode and then scaling
up the deployments seems to have worked and all our services have now
been restored.”), there is no ETA for addressing this. With that said,
I’ve let our Storage team know that you’ve run into this, so as to draw
further attention to it.

If you have any further questions or concerns regarding this, let us
know.

Best regards, [Redacted]

Linode Support Team

.. raw:: html

   </blockquote>

.. raw:: html

   </details>

.. raw:: html

   <details>

.. raw:: html

   <summary>

Concluding response from Joe Banks

.. raw:: html

   </summary>

.. raw:: html

   <blockquote>

Hey [Redacted]!

Thanks for the response. We ensure that stateful pods only ever have one
volume assigned to them, either with a single replica deployment or a
statefulset. It appears that the error generally manifests when a
deployment is being migrated from one node to another during a redeploy,
which makes sense if there is some delay on the unmount/remount.

Confusion occurred because Linode was reporting the volume as attached
when the node had been recycled, but I assume that was because the node
did not cleanly shutdown and therefore could not cleanly unmount
volumes.

We’ve not seen any resurgence of such issues, and we’ll address the
software fault which overloaded the node which will helpfully mitigate
such problems in the future.

Thanks again for the response, have a great week!

Best,

Joe

.. raw:: html

   </blockquote>

.. raw:: html

   </details>

🔎 Five Why’s
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

**What?**
~~~~~~~~~

Several of our services became unavailable because their volumes could
not be mounted.

Why?
~~~~

A node recycle left the node unable to mount volumes using the Linode
CSI.

.. _why-1:

Why?
~~~~

A node recycle was used because PostgreSQL had a connection surge.

.. _why-2:

Why?
~~~~

A Django feature deadlocked a table 62 times and suddenly started using
~70 connections to the database, saturating the maximum connections
limit.

.. _why-3:

Why?
~~~~

The root cause of why Django does this is unclear, and someone with more
Django proficiency is absolutely welcome to share any knowledge they may
have. I presume it’s some sort of worker race condition, but I’ve not
been able to reproduce it.

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
reoccurrence*

A node being forcefully restarted left volumes in a limbo state where
mounting was difficult. It took multiple hours for this to be resolved,
since we had to wait for the volumes to unlock so they could be cloned.

🤔 Lessons learned
------------------

*What did we learn from this incident?*

Volumes are painful.

We need to look at why Django is doing this and at mitigations of the
fault to prevent this from occurring again.
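
On the storage side, one mitigation consistent with the RWO behaviour
discussed in the support ticket above is to ensure single-replica
stateful deployments use the ``Recreate`` strategy, so the old pod
releases its volume before the replacement tries to mount it. A minimal
sketch; the names and claim are illustrative rather than taken from our
actual manifests:

.. code:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: redis
    spec:
      replicas: 1
      # Recreate tears down the old pod (and releases its RWO volume)
      # before the new pod is scheduled, avoiding a mount race on redeploys.
      strategy:
        type: Recreate
      selector:
        matchLabels:
          app: redis
      template:
        metadata:
          labels:
            app: redis
        spec:
          containers:
            - name: redis
              image: redis:6
              volumeMounts:
                - name: redis-data-volume
                  mountPath: /data
          volumes:
            - name: redis-data-volume
              persistentVolumeClaim:
                claimName: redis-data
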
☑️ Follow-up tasks
------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ `Follow up on ticket at
  Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__
- ☐ Investigate why Django could be connection surging and locking
  tables

diff --git a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png b/docs/postmortems/images/2021-01-12/site_cpu_throttle.png
deleted file mode 100644
index b530ec6..0000000
--- a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png
+++ /dev/null
Binary files differ

diff --git a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png b/docs/postmortems/images/2021-01-12/site_resource_abnormal.png
deleted file mode 100644
index e1e07af..0000000
--- a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png
+++ /dev/null
Binary files differ

diff --git a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png b/docs/postmortems/images/2021-01-30/linode_loadbalancers.png
deleted file mode 100644
index f0eae1f..0000000
--- a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png
+++ /dev/null
Binary files differ

diff --git a/docs/postmortems/images/2021-01-30/memory_charts.png b/docs/postmortems/images/2021-01-30/memory_charts.png
deleted file mode 100644
index 370d19e..0000000
--- a/docs/postmortems/images/2021-01-30/memory_charts.png
+++ /dev/null
Binary files differ

diff --git a/docs/postmortems/images/2021-01-30/prometheus_status.png b/docs/postmortems/images/2021-01-30/prometheus_status.png
deleted file mode 100644
index e95b8d7..0000000
--- a/docs/postmortems/images/2021-01-30/prometheus_status.png
+++ /dev/null
Binary files differ

diff --git a/docs/postmortems/images/2021-01-30/scaleios.png b/docs/postmortems/images/2021-01-30/scaleios.png
deleted file mode 100644
index 584d74d..0000000
--- a/docs/postmortems/images/2021-01-30/scaleios.png
+++ /dev/null
Binary files differ

diff --git a/docs/postmortems/index.rst b/docs/postmortems/index.rst
deleted file mode 100644
index e28dc7a..0000000
--- a/docs/postmortems/index.rst
+++ /dev/null
@@ -1,15 +0,0 @@

Postmortems
===========

Browse the pages under this category to view historical postmortems for
Python Discord outages.

.. toctree::
   :maxdepth: 1

   2020-12-11-all-services-outage
   2020-12-11-postgres-conn-surge
   2021-01-10-primary-kubernetes-node-outage
   2021-01-12-site-cpu-ram-exhaustion
   2021-01-30-nodebalancer-fails-memory
   2021-07-11-cascading-node-failures