Diffstat (limited to 'docs/postmortems')

 docs/postmortems/2020-12-11-all-services-outage.rst            | 121
 docs/postmortems/2020-12-11-postgres-conn-surge.rst            | 130
 docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst | 117
 docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst        | 155
 docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst      | 146
 docs/postmortems/2021-07-11-cascading-node-failures.rst        | 335
 docs/postmortems/images/2021-01-12/site_cpu_throttle.png       | bin 0 -> 227245 bytes
 docs/postmortems/images/2021-01-12/site_resource_abnormal.png  | bin 0 -> 232260 bytes
 docs/postmortems/images/2021-01-30/linode_loadbalancers.png    | bin 0 -> 50882 bytes
 docs/postmortems/images/2021-01-30/memory_charts.png           | bin 0 -> 211053 bytes
 docs/postmortems/images/2021-01-30/prometheus_status.png       | bin 0 -> 291122 bytes
 docs/postmortems/images/2021-01-30/scaleios.png                | bin 0 -> 18294 bytes
 docs/postmortems/index.rst                                     |  15
 13 files changed, 1019 insertions, 0 deletions

diff --git a/docs/postmortems/2020-12-11-all-services-outage.rst b/docs/postmortems/2020-12-11-all-services-outage.rst
new file mode 100644
index 0000000..9c29303
--- /dev/null
+++ b/docs/postmortems/2020-12-11-all-services-outage.rst
@@ -0,0 +1,121 @@

2020-12-11: All services outage
===============================

At **19:55 UTC, all services became unresponsive**. The DevOps team was
already in a call, and immediately started to investigate.

Postgres was running at 100% CPU usage due to a **VACUUM**, which caused
all services that depended on it to stop working. The high CPU left the
host unresponsive and it shut down. Linode Lassie noticed this and
triggered a restart.

The node did not recover gracefully from this restart, with numerous core
services reporting an error, so we had to manually restart core system
services using Lens in order to get things working again.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

Postgres triggered an **AUTOVACUUM**, which led to a CPU spike. This
drove Postgres to 100% CPU and left it unresponsive, which caused
dependent services to stop responding. This led to a restart of the
node, from which we did not recover gracefully.

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

All services went down. Catastrophic failure. We did not pass go, we did
not collect $200.

- Help channel system unavailable, so people are not able to
  effectively ask for help.
- Gates unavailable, so people can't successfully get into the
  community.
- Moderation and raid prevention unavailable, which leaves us
  defenseless against attacks.

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

We noticed that all PyDis services had stopped responding;
coincidentally our DevOps team were in a call at the time, so that was
helpful.

We may be able to improve detection time by adding monitoring of
resource usage. To this end, we've added alerts for high CPU usage and
low memory.

🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the incident.

We noticed our node was entirely unresponsive, and within minutes a
restart had been triggered by Lassie after a high CPU shutdown occurred.

The node came back and we saw a number of core services offline
(e.g. Calico, CoreDNS, Linode CSI).

**Obstacle: no recent database back-up available**

🙆🏽♀️ Recovery
-----------------

*How was the incident resolved? How can we improve future mitigation
times?*

Through `Lens <https://k8slens.dev/>`__ we restarted core services one
by one until they stabilised; after these core services were up, other
services began to come back online.

We finally provisioned PostgreSQL, which had been removed as a component
before the restart (but too late to prevent the CPU errors). Once
PostgreSQL was up we restarted any components that were acting buggy
(e.g. site and bot).

🔎 Five Why's
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

- Major service outage
- **Why?** Core service failures (e.g. Calico, CoreDNS, Linode CSI)
- **Why?** Kubernetes worker node restart
- **Why?** High CPU shutdown
- **Why?** Intensive PostgreSQL AUTOVACUUM caused a CPU spike

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
recurrence*

🤔 Lessons learned
------------------

*What did we learn from this incident?*

- We must ensure we have working database backups (see the sketch
  below). We are lucky that we did not lose any data this time. If this
  problem had caused volume corruption, we would be screwed.
- Sentry is broken for the bot. It was missing a DSN secret, which we
  have now restored.
- The https://sentry.pydis.com redirect was never migrated to the
  cluster. **We should do that.**

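The backup gap is the clearest action item here. As a rough sketch of the
direction, not an implemented manifest (the schedule, image, secret name
and PVC name below are all placeholders), a scheduled logical backup
could be as simple as a Kubernetes CronJob running ``pg_dump``:

.. code:: yaml

   apiVersion: batch/v1
   kind: CronJob
   metadata:
     name: postgres-backup            # placeholder name
   spec:
     schedule: "0 4 * * *"            # daily at 04:00 UTC
     jobTemplate:
       spec:
         template:
           spec:
             restartPolicy: OnFailure
             containers:
               - name: pg-dump
                 image: postgres:13   # should match the server's major version
                 command: ["/bin/sh", "-c"]
                 args:
                   # custom-format dump, one file per day
                   - pg_dump "$DATABASE_URL" -Fc -f /backups/pydis-$(date +%F).dump
                 env:
                   - name: DATABASE_URL
                     valueFrom:
                       secretKeyRef:
                         name: postgres-credentials   # assumed secret
                         key: url
                 volumeMounts:
                   - name: backups
                     mountPath: /backups
             volumes:
               - name: backups
                 persistentVolumeClaim:
                   claimName: postgres-backups        # assumed PVC

Anything along these lines (or an off-cluster equivalent) would have
removed the "no recent database back-up available" obstacle noted above.
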
☑️ Follow-up tasks
------------------

*List any tasks we've created as a result of this incident*

- ☒ Push forward with backup plans

diff --git a/docs/postmortems/2020-12-11-postgres-conn-surge.rst b/docs/postmortems/2020-12-11-postgres-conn-surge.rst
new file mode 100644
index 0000000..6ebcb01
--- /dev/null
+++ b/docs/postmortems/2020-12-11-postgres-conn-surge.rst
@@ -0,0 +1,130 @@

2020-12-11: Postgres connection surge
=====================================

At **13:24 UTC**, we noticed the bot was not able to infract, and
`pythondiscord.com <http://pythondiscord.com>`__ was unavailable. The
DevOps team started to investigate.

We discovered that Postgres was not accepting new connections because it
had hit 100 clients. This made it unavailable to all services that
depended on it.

Ultimately this was resolved by taking down Postgres, remounting the
associated volume, and bringing it back up again.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

The bot infractions stopped working, and we started investigating.

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

Services were unavailable both for internal and external users.

- The Help Channel System was unavailable.
- Voice Gate and Server Gate were not working.
- Moderation commands were unavailable.
- Python Discord site & API were unavailable. CloudFlare automatically
  switched us to Always Online.

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

We noticed HTTP 524s coming from CloudFlare; upon attempting a database
connection we observed that the maximum client limit had been reached.

We noticed this log in site:

.. code:: text

   django.db.utils.OperationalError: FATAL: sorry, too many clients already

We should be monitoring the number of clients, and the monitor should
alert us when we're approaching the max. That would have allowed for
earlier detection, and possibly allowed us to prevent the incident
altogether.

We will look at
`wrouesnel/postgres_exporter <https://github.com/wrouesnel/postgres_exporter>`__
for monitoring this.

🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the incident. The obstacles were mostly a lack of
a clear response strategy.

We should document our recovery procedure so that we're not so dependent
on Joe Banks should this happen again while he's unavailable.

🙆🏽♀️ Recovery
----------------

*How was the incident resolved? How can we improve future mitigation?*

- Delete the PostgreSQL deployment: ``kubectl delete deployment/postgres``
- Delete any remaining pods, WITH force:
  ``kubectl delete <pod name> --force --grace-period=0``
- Unmount the volume at Linode
- Remount the volume at Linode
- Reapply the deployment: ``kubectl apply -f postgres/deployment.yaml``

🔎 Five Why's
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

- Postgres was unavailable, so our services died.
- **Why?** Postgres hit max clients, and could not respond.
- **Why?** Unknown, but we saw a number of connections from previous
  deployments of site. This indicates that database connections are not
  being terminated properly. Needs further investigation.

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
recurrence*

We're not sure what the root cause is, but suspect site is not
terminating database connections properly in some cases. We were unable
to reproduce this problem.

We've set up new telemetry on Grafana with alerts so that we can
investigate this more closely. We will be notified if the number of
connections from site exceeds 32, or if the total number of connections
exceeds 90.

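As a rough illustration of those thresholds in Prometheus alerting-rule
form, assuming the ``pg_stat_activity_count`` metric exposed by
postgres_exporter (the label used to single out site's connections is an
assumption and depends on how the exporter is configured):

.. code:: yaml

   groups:
     - name: postgres-connections
       rules:
         - alert: PostgresTotalConnectionsHigh
           expr: sum(pg_stat_activity_count) > 90
           for: 2m
           labels:
             severity: page
           annotations:
             summary: "Postgres has {{ $value }} open connections (hard limit 100)"
         - alert: PostgresSiteConnectionsHigh
           # the datname label value here is illustrative, not our real database name
           expr: sum(pg_stat_activity_count{datname="site"}) > 32
           for: 2m
           labels:
             severity: warning
           annotations:
             summary: "Site is holding {{ $value }} Postgres connections"
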
🤔 Lessons learned
------------------

*What did we learn from this incident?*

- We must ensure the DevOps team has access to Linode and other key
  services even if our Bitwarden is down.
- We need to ensure we're alerted of any risk factors that have the
  potential to make Postgres unavailable, since this causes a
  catastrophic outage of practically all services.
- We absolutely need backups for the databases, so that this sort of
  problem carries less of a risk.
- We may need to consider something like
  `pg_bouncer <https://wiki.postgresql.org/wiki/PgBouncer>`__ to manage
  a connection pool so that we don't exceed 100 *legitimate* clients
  connected as we connect more services to the postgres database.

☑️ Follow-up tasks
------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ Back up all databases

diff --git a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
new file mode 100644
index 0000000..5852c46
--- /dev/null
+++ b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst
@@ -0,0 +1,117 @@

2021-01-10: Primary Kubernetes node outage
==========================================

We had an outage of our highest-spec node due to CPU exhaustion. The
outage lasted from around 20:20 to 20:46 UTC, but was not a full service
outage.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

I ran a query on Prometheus to try to figure out some statistics on the
number of metrics we are holding. This ended up scanning a lot of data
in the TSDB that Prometheus uses.

This scan caused CPU exhaustion, which caused issues with the Kubernetes
node status.

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

This brought down the primary node, which meant there was some service
outage. Most services transferred successfully to our secondary node,
which kept up some key services such as the Moderation bot and Modmail
bot, as well as MongoDB.

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

This was noticed when Discord services started having failures. The
primary detection was through alerts though! I was paged 1 minute after
we started encountering CPU exhaustion issues.

🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the incident.

No major obstacles were encountered during this.

🙆🏽♀️ Recovery
----------------

*How was the incident resolved? How can we improve future mitigation?*

It was noted that in the response to ``kubectl get nodes`` the primary
node's status was reported as ``NotReady``; the reason was that the node
had stopped responding.

The quickest way to fix this was triggering a node restart. This shifted
a lot of pods over to node 2, which encountered some capacity issues
since it's not as highly specced as the first node.

I brought the first node back by restarting it at Linode's end. Once
this node was reporting as ``Ready`` again I drained the second node by
running ``kubectl drain lke13311-20304-5ffa4d11faab``. This command
stops the node from being available for scheduling and moves existing
pods onto other nodes.

Services gradually recovered as the dependencies started. The incident
lasted around 26 minutes overall, though this was not a complete outage
for the whole time and the bot remained functional throughout (meaning
systems like the help channels were still functional).

🔎 Five Why's
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

**What?** Partial service outage.

**Why?** We had a node outage.

**Why?** CPU exhaustion of our primary node.

**Why?** A large Prometheus query used a lot of CPU.

**Why?** Prometheus had to scan millions of TSDB records, which consumed
all cores.

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
recurrence*

A large query was run on Prometheus, so the immediate solution is simply
not to run such queries.

To protect against this more precisely, though, we should write resource
constraints for services like this that are vulnerable to CPU exhaustion
or memory consumption, which are the causes of our two past outages as
well.

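What "resource constraints" means in practice is a standard Kubernetes
``resources`` stanza on the Prometheus container, roughly like the
sketch below (the numbers are placeholders to show the shape, not tuned
values); with a CPU limit in place a runaway query gets throttled
instead of starving the whole node:

.. code:: yaml

   # excerpt from a Prometheus deployment spec (illustrative values)
   containers:
     - name: prometheus
       image: prom/prometheus
       resources:
         requests:
           cpu: 500m
           memory: 1Gi
         limits:
           cpu: "2"        # throttle heavy queries instead of consuming every core
           memory: 2Gi     # OOM-kill the pod rather than the node
       # Prometheus' own --query.timeout / --query.max-samples flags can also
       # bound the cost of a single query; worth evaluating alongside limits.
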
🤔 Lessons learned
------------------

*What did we learn from this incident?*

- Don't run large queries, it consumes CPU!
- Write resource constraints for our services.

☑️ Follow-up tasks
------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ Write resource constraints for our services.

diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
new file mode 100644
index 0000000..57f9fd8
--- /dev/null
+++ b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst
@@ -0,0 +1,155 @@

2021-01-12: Django site CPU/RAM exhaustion outage
=================================================

At 03:01 UTC on Tuesday 12th January we experienced a momentary outage
of our PostgreSQL database, causing some very minor service downtime.

⚠️ Leadup
=========

*List the sequence of events that led to the incident*

We deleted the Developers role, which led to a large user diff: all of
those users needed their roles updated on the site.

The bot had been trying to post this diff for over 24 hours, repeatedly,
after every restart.

We deployed the bot at 02:55 UTC on 12th January and the user sync
process began once again.

This caused a CPU & RAM spike on our Django site, which in turn
triggered an OOM error on the server that killed the Postgres process,
sending it into a recovery state where queries could not be executed.

The Django site did not have any tooling in place to batch the requests,
so it was trying to process all 80k user updates in a single query,
something that PostgreSQL probably could handle, but not the Django ORM.
During the incident site jumped from its average RAM usage of 300-400MB
to **1.5GB**.

.. image:: ./images/2021-01-12/site_resource_abnormal.png

RAM and CPU usage of site throughout the incident. The period just
before 03:40 where no statistics were reported is the actual outage
period, where the Kubernetes node had some networking errors.

🥏 Impact
=========

*Describe how internal and external users were impacted during the
incident*

This database outage lasted mere minutes, since Postgres recovered and
healed itself and the sync process was aborted, but it did leave us with
a large user diff and a database that was drifting further out of sync.

Most services that did not depend on PostgreSQL stayed up, and the site
remained stable after the sync had been cancelled.

👁️ Detection
============

*Report when the team detected the incident, and how we could improve
detection time*

We were immediately alerted to the PostgreSQL outage on Grafana and
through Sentry, meaning our response time was under a minute.

We reduced some alert thresholds in order to catch RAM & CPU spikes
faster in the future.

It was hard to immediately see the cause of things since there is
minimal logging on the site, and the bot logs gave no indication that
anything was at fault, so our only detection was through machine
metrics.

We did manage to recover exactly what PostgreSQL was trying to do at the
time of crashing by examining the logs, which pointed us towards the
user sync process.

🙋🏿♂️ Response
================

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the issue. There were no real obstacles
encountered, other than the node being less performant than we would
like due to the CPU starvation.

🙆🏽♀️ Recovery
================

*How was the incident resolved? How can we improve future mitigation?*

The incident was resolved by stopping the sync process and writing a
more efficient one through an internal eval script. We batched the
updates into groups of 1,000 users: instead of doing one large update we
did 80 smaller ones. This led to much higher efficiency at the cost of
taking a little longer (~7 minutes).

.. code:: python

    from bot.exts.backend.sync import _syncers
    syncer = _syncers.UserSyncer
    diff = await syncer._get_diff(ctx.guild)

    def chunks(lst, n):
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    for chunk in chunks(diff.updated, 1000):
        await bot.api_client.patch("bot/users/bulk_patch", json=chunk)

Resource limits were also put into place on site to prevent RAM and CPU
spikes, and to throttle the CPU usage in these situations. This can be
seen in the below graph:

.. image:: ./images/2021-01-12/site_cpu_throttle.png

CPU throttling is where a container has hit its limits and we need to
reel it in. Ideally this value stays as close to 0 as possible; however,
as you can see, site hit this twice (during the periods where it was
trying to sync 80k users at once).

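For reference, the limits mentioned above take the form of a
``resources`` stanza on the site container; the sketch below shows the
shape of it, with illustrative numbers rather than the exact values we
deployed:

.. code:: yaml

   # excerpt from the site deployment spec (illustrative values)
   containers:
     - name: site
       resources:
         requests:
           cpu: 250m
           memory: 512Mi
         limits:
           cpu: "1"        # the ceiling the throttling graph above refers to
           memory: 1Gi     # keeps a runaway sync from OOM-killing Postgres on the node
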
🔎 Five Why's
=============

*Run a 5-whys analysis to understand the true cause of the incident.*

- We experienced a major PostgreSQL outage.
- PostgreSQL was killed by the system OOM killer due to the RAM spike on
  site.
- The RAM spike on site was caused by a large query.
- This was because we do not chunk queries on the bot.
- The large query was caused by the removal of the Developers role,
  resulting in 80k users needing updating.

🌱 Blameless root cause
=======================

*Note the final root cause and describe what needs to change to prevent
recurrence*

The removal of the Developers role created a large diff which could not
be applied by Django in a single request.

See the follow-up tasks for exactly how we can avoid this in future;
it's a relatively easy mitigation.

🤔 Lessons learned
==================

*What did we learn from this incident?*

- Django (or DRF) does not like huge update queries.

☑️ Follow-up tasks
==================

*List any tasks we should complete that are relevant to this incident*

- ☒ Make the bot syncer more efficient (batch requests)
- ☐ Increase logging on the bot: state when an error has been hit (we
  had no indication of this inside Discord, and we need that)
- ☒ Adjust resource alerts to page DevOps members earlier.
- ☒ Apply resource limits to site to prevent major spikes

diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
new file mode 100644
index 0000000..b13ecd7
--- /dev/null
+++ b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
@@ -0,0 +1,146 @@

2021-01-30: NodeBalancer networking faults due to memory pressure
=================================================================

At around 14:30 UTC on Saturday 30th January we started experiencing
networking issues at the LoadBalancer level, between Cloudflare and our
Kubernetes cluster. It seems that the misconfiguration was due to memory
and CPU pressure.

[STRIKEOUT:This post-mortem is preliminary, we are still awaiting word
from Linode's SysAdmins on any problems they detected.]

**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a
different machine.

⚠️ Leadup
---------

*List the sequence of events that led to the incident*

At 14:30 we started receiving alerts that services were becoming
unreachable. We first experienced some momentary DNS errors which
resolved themselves, however traffic ingress was still degraded.

Upon checking Linode, our NodeBalancer (the service which balances
traffic between our Kubernetes nodes) was reporting the backends (the
services it balances to) as down. It reported all 4 as down (two for
port 80 + two for port 443). This status was fluctuating between up and
down, meaning traffic was not reaching our cluster correctly. Scaleios
correctly noted:

.. image:: ./images/2021-01-30/scaleios.png

The config seems to have been set incorrectly due to memory and CPU
pressure on one of our nodes. Here is the memory throughout the
incident:

.. image:: ./images/2021-01-30/memory_charts.png

Here is the display from Linode:

.. image:: ./images/2021-01-30/linode_loadbalancers.png

🥏 Impact
---------

*Describe how internal and external users were impacted during the
incident*

Since traffic could not correctly enter our cluster, multiple web-based
services were offline, including site, Grafana and Bitwarden. It appears
that no inter-node communication was affected, as this uses a WireGuard
tunnel between the nodes which was not affected by the NodeBalancer.

The lack of Grafana made diagnosis slightly more difficult, but even
then it was only a short trip to the

👁️ Detection
------------

*Report when the team detected the incident, and how we could improve
detection time*

We were alerted fairly promptly through statping, which reported
services as being down and posted a Discord notification. Subsequent
alerts came in from Grafana but were limited since outbound
communication was faulty.

🙋🏿♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded!

The primary obstacle was the DevOps tools being out due to the traffic
ingress problems.

🙆🏽♀️ Recovery
----------------

*How was the incident resolved? How can we improve future mitigation?*

The incident resolved itself upstream at Linode. We've opened a ticket
with Linode to let them know of the faults; this might give us a better
indication of what caused the issues. Our Kubernetes cluster continued
posting updates to Linode to refresh the NodeBalancer configuration, and
upon inspecting these payloads the configuration looked correct.

We've set up alerts for when Prometheus services stop responding, since
this seems to be a fairly tell-tale symptom of networking problems. This
was the Prometheus status graph throughout the incident:

.. image:: ./images/2021-01-30/prometheus_status.png

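A minimal sketch of that alert, assuming the standard ``up`` metric that
Prometheus records for every scrape target (job and instance labels are
generic here):

.. code:: yaml

   groups:
     - name: scrape-health
       rules:
         - alert: TargetScrapeFailing
           expr: up == 0
           for: 2m
           labels:
             severity: page
           annotations:
             summary: "{{ $labels.job }} on {{ $labels.instance }} is not responding to scrapes"
             description: "Many targets failing at once usually indicates an ingress or networking fault."
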
🔎 Five Why's
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

**What?** Our service experienced an outage due to networking faults.

**Why?** Incoming traffic could not reach our Kubernetes nodes.

**Why?** Our Linode NodeBalancers were not using correct configuration.

**Why?** Memory & CPU pressure seemed to cause invalid configuration
errors upstream at Linode.

**Why?** Unknown at this stage, NodeBalancer migrated.

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
recurrence*

The configuration of our NodeBalancer was invalid. We cannot say why at
this point, since we are awaiting contact back from Linode, but
indicators point to it being an upstream fault, since memory & CPU
pressure should **not** cause a load balancer misconfiguration.

Linode are going to follow up with us at some point during the week with
information from their System Administrators.

**Update 2nd February 2021:** Linode have concluded investigations at
their end, taken notes and migrated our NodeBalancer to a new machine.
We haven't experienced problems since.

🤔 Lessons learned
------------------

*What did we learn from this incident?*

We should be careful not to over-schedule onto nodes, since even while
operating within reasonable constraints we risk sending invalid
configuration upstream to Linode and therefore preventing traffic from
entering our cluster.

☑️ Follow-up tasks
------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ Monitor for follow-up from Linode
- ☒ Carefully monitor the allocation rules for our services

diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
new file mode 100644
index 0000000..6cd30f3
--- /dev/null
+++ b/docs/postmortems/2021-07-11-cascading-node-failures.rst
@@ -0,0 +1,335 @@

2021-07-11: Cascading node failures and ensuing volume problems
===============================================================

A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node
into an unresponsive state (00:55 UTC). Upon recycling the affected
node, volumes were placed into a state where they could not be mounted.

⚠️ Leadup
=========

*List the sequence of events that led to the incident*

- **00:27 UTC:** Django starts rapidly using connections to our
  PostgreSQL database.
- **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated
  its 115 max connections limit. Joe is paged.
- **00:33 UTC:** DevOps team is alerted that a service has claimed 34
  dangerous table locks (it peaked at 61).
- **00:42 UTC:** Status incident created and backdated to 00:25 UTC.
  `Status incident <https://status.pythondiscord.com/incident/92712>`__
- **00:55 UTC:** It's clear that the node which PostgreSQL was on is no
  longer healthy after the Django connection surge, so it's recycled
  and a new one is to be added to the pool.
- **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its
  restart.
- **01:13 UTC:** The node has been restored and regained healthy
  status, but volumes will not mount to it. Support ticket opened at
  Linode for assistance.
- **06:36 UTC:** DevOps team alerted that Python is offline. This is
  due to Redis being a dependency of the bot, which as a stateful
  service was not healthy.

🥏 Impact
=========

*Describe how internal and external users were impacted during the
incident*

Initially, this manifested as a standard node outage where services on
that node experienced some downtime as the node was restored.

Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop)
could not be started due to the volume issues, and so any dependent
services (e.g. Site, Bot, Hastebin) also had trouble starting.

PostgreSQL was restored early on, so for the most part Moderation could
continue.

👁️ Detection
============

*Report when the team detected the incident, and how we could improve
detection time*

DevOps were initially alerted at 00:32 UTC due to the PostgreSQL
connection surge, and acknowledged at the same time.

Further alerting could be used to catch surges earlier on (looking at
conn delta vs. conn total), but for the most part alerting time was
satisfactory here.

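A hedged sketch of what a delta-based rule could look like, again
assuming the ``pg_stat_activity_count`` metric from postgres_exporter
(the growth threshold is a guess that would need tuning against normal
traffic):

.. code:: yaml

   groups:
     - name: postgres-connection-surge
       rules:
         - alert: PostgresConnectionSurge
           # fires on rapid growth even while the absolute count still looks healthy
           expr: sum(delta(pg_stat_activity_count[5m])) > 30
           for: 1m
           labels:
             severity: page
           annotations:
             summary: "Postgres connection count grew by {{ $value }} in the last 5 minutes"
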
🙋🏿♂️ Response
================

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded. The primary issue encountered was a failure
upstream at Linode to remount the affected volumes; a support ticket was
created.

🙆🏽♀️ Recovery
================

*How was the incident resolved? How can we improve future mitigation?*

Initial node restoration was performed by @Joe Banks by recycling the
affected node.

Subsequent volume restoration was also performed by @Joe Banks: once
Linode had unlocked the volumes, the affected pods were scaled down to
0, the volumes were unmounted at the Linode side, and then the
deployments were recreated.

.. raw:: html

   <details>
   <summary>Support ticket sent</summary>
   <blockquote>

Good evening,

We experienced a resource surge on one of our Kubernetes nodes at 00:32
UTC, causing a node to go unresponsive. To mitigate problems here the
node was recycled and began restarting at 1:01 UTC.

The node has now rejoined the ring and started picking up services, but
volumes will not attach to it, meaning pods with stateful storage will
not start.

An example events log for one such pod:

::

   Type     Reason       Age    From               Message
   ----     ------       ----   ----               -------
   Normal   Scheduled    2m45s  default-scheduler  Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf
   Warning  FailedMount  103s   kubelet            MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f]
   Warning  FailedMount  43s    kubelet            Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition

I've been trying to manually resolve this through the Linode Web UI but
get presented with attachment errors upon doing so. Please could you
advise on the best way forward to restore Volumes & Nodes to a
functioning state? As far as I can see there is something going on
upstream, since the Linode UI presents these nodes as mounted, however
as shown above LKE nodes are not locating them; there are also a few
failed attachment logs in the Linode Audit Log.

Thanks,

Joe

.. raw:: html

   </blockquote>
   </details>

.. raw:: html

   <details>
   <summary>Response received from Linode</summary>
   <blockquote>

Hi Joe,

   Were there any known issues with Block Storage in Frankfurt today?

Not today, though there were service issues reported for Block Storage
and LKE in Frankfurt on July 8 and 9:

- `Service Issue - Block Storage - EU-Central
  (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__
- `Service Issue - Linode Kubernetes Engine -
  Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__

There was also an API issue reported on the 10th (resolved on the 11th),
mentioned here:

- `Service Issue - Cloud Manager and
  API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__

Regarding the specific error you were receiving:

   ``Unable to find device path out of attempted paths``

I'm not certain it's specifically related to those Service Issues,
considering this isn't the first time a customer has reported this error
in their LKE logs. In fact, if I recall correctly, I've run across this
before too, since our volumes are RWO and I had too many replicas in my
deployment that I was trying to attach to, for example.

   is this a known bug/condition that occurs with Linode CSI/LKE?

From what I understand, yes, this is a known condition that crops up
from time to time, which we are tracking.

However, since there is a workaround at the moment (e.g. “After some
more manual attempts to fix things, scaling down deployments, unmounting
at Linode and then scaling up the deployments seems to have worked and
all our services have now been restored.”), there is no ETA for
addressing this. With that said, I've let our Storage team know that
you've run into this, so as to draw further attention to it.

If you have any further questions or concerns regarding this, let us
know.

Best regards, [Redacted]

Linode Support Team

.. raw:: html

   </blockquote>
   </details>

.. raw:: html

   <details>
   <summary>Concluding response from Joe Banks</summary>
   <blockquote>

Hey [Redacted]!

Thanks for the response. We ensure that stateful pods only ever have one
volume assigned to them, either with a single replica deployment or a
statefulset. It appears that the error generally manifests when a
deployment is being migrated from one node to another during a redeploy,
which makes sense if there is some delay on the unmount/remount.

Confusion occurred because Linode was reporting the volume as attached
when the node had been recycled, but I assume that was because the node
did not cleanly shut down and therefore could not cleanly unmount
volumes.

We've not seen any resurgence of such issues, and we'll address the
software fault which overloaded the node, which will hopefully mitigate
such problems in the future.

Thanks again for the response, have a great week!

Best,

Joe

.. raw:: html

   </blockquote>
   </details>

🔎 Five Why's
=============

*Run a 5-whys analysis to understand the true cause of the incident.*

**What?**
~~~~~~~~~

Several of our services became unavailable because their volumes could
not be mounted.

Why?
~~~~

A node recycle left the node unable to mount volumes using the Linode
CSI.

.. _why-1:

Why?
~~~~

A node recycle was used because PostgreSQL had a connection surge.

.. _why-2:

Why?
~~~~

A Django feature deadlocked a table 62 times and suddenly started using
~70 connections to the database, saturating the maximum connections
limit.

.. _why-3:

Why?
~~~~

The root cause of why Django does this is unclear, and someone with more
Django proficiency is absolutely welcome to share any knowledge they may
have. I presume it's some sort of worker race condition, but I've not
been able to reproduce it.

🌱 Blameless root cause
=======================

*Note the final root cause and describe what needs to change to prevent
recurrence*

A node being forcefully restarted left volumes in a limbo state where
mounting was difficult. It took multiple hours for this to be resolved,
since we had to wait for the volumes to unlock so they could be cloned.

🤔 Lessons learned
==================

*What did we learn from this incident?*

Volumes are painful.

We need to look at why Django is doing this, and at mitigations for the
fault, to prevent this from occurring again.

☑️ Follow-up tasks
==================

*List any tasks we should complete that are relevant to this incident*

- ☒ `Follow up on ticket at
  Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__
- ☐ Investigate why Django could be connection surging and locking
  tables

diff --git a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png b/docs/postmortems/images/2021-01-12/site_cpu_throttle.png
new file mode 100644
index 0000000..b530ec6
Binary files differ

diff --git a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png b/docs/postmortems/images/2021-01-12/site_resource_abnormal.png
new file mode 100644
index 0000000..e1e07af
Binary files differ

diff --git a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png b/docs/postmortems/images/2021-01-30/linode_loadbalancers.png
new file mode 100644
index 0000000..f0eae1f
Binary files differ

diff --git a/docs/postmortems/images/2021-01-30/memory_charts.png b/docs/postmortems/images/2021-01-30/memory_charts.png
new file mode 100644
index 0000000..370d19e
Binary files differ

diff --git a/docs/postmortems/images/2021-01-30/prometheus_status.png b/docs/postmortems/images/2021-01-30/prometheus_status.png
new file mode 100644
index 0000000..e95b8d7
Binary files differ

diff --git a/docs/postmortems/images/2021-01-30/scaleios.png b/docs/postmortems/images/2021-01-30/scaleios.png
new file mode 100644
index 0000000..584d74d
Binary files differ

diff --git a/docs/postmortems/index.rst b/docs/postmortems/index.rst
new file mode 100644
index 0000000..43994a2
--- /dev/null
+++ b/docs/postmortems/index.rst
@@ -0,0 +1,15 @@

Postmortems
===========

Browse the pages under this category to view historical postmortems for
Python Discord outages.

.. toctree::
   :maxdepth: 2

   2020-12-11-all-services-outage
   2020-12-11-postgres-conn-surge
   2021-01-10-primary-kubernetes-node-outage
   2021-01-12-site-cpu-ram-exhaustion
   2021-01-30-nodebalancer-fails-memory
   2021-07-11-cascading-node-failures