2021-07-11: Cascading node failures and ensuing volume problems
================================================================

A PostgreSQL connection spike (00:27 UTC) caused by Django left a node
unresponsive (00:55 UTC). When the affected node was recycled, its
volumes were placed into a state where they could not be mounted.

⚠️ Leadup
=========

*List the sequence of events that led to the incident*

- **00:27 UTC:** Django starts rapidly using connections to our
  PostgreSQL database
- **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated
  its 115 max connections limit. Joe is paged.
- **00:33 UTC:** DevOps team is alerted that a service has claimed 34
  dangerous table locks (it peaked at 61).
- **00:42 UTC:** Status incident created and backdated to 00:25 UTC.
  `Status incident <https://status.pythondiscord.com/incident/92712>`__
- **00:55 UTC:** It’s clear that the node which PostgreSQL was on is no
  longer healthy after the Django connection surge, so it’s recycled
  and a new one is to be added to the pool.
- **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its
  restart
- **01:13 UTC:** Node has been restored and regained healthy status,
  but volumes will not mount to the node. Support ticket opened at
  Linode for assistance.
- **06:36 UTC:** DevOps team alerted that Python is offline. This is
  due to Redis being a dependency of the bot, which as a stateful
  service was not healthy.

🥏 Impact
=========

*Describe how internal and external users were impacted during the
incident*

Initially, this manifested as a standard node outage where services on
that node experienced some downtime as the node was restored.

Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop)
could not run due to the volume issues, and so any dependent services
(e.g. Site, Bot, Hastebin) also had trouble starting.

PostgreSQL was restored early on, so for the most part Moderation could
continue.

👁️ Detection
============

*Report when the team detected the incident, and how we could improve
detection time*

DevOps were initially alerted at 00:32 UTC due to the PostgreSQL
connection surge, and the alert was acknowledged at the same time.

Further alerting could be used to catch surges earlier on (alerting on
connection delta as well as connection total), but for the most part
alerting time was satisfactory here.
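As an illustration of the delta-based check suggested above, the sketch
below polls ``pg_stat_activity`` and flags sharp jumps between samples.
It is purely hypothetical: the DSN, thresholds, and interval are
placeholders, not our actual monitoring configuration.

::

   # Hypothetical delta-based connection monitor (sketch only): alert when the
   # number of PostgreSQL connections jumps sharply between samples, instead of
   # waiting for the absolute max_connections limit (115 here) to be saturated.
   import time

   import psycopg2  # assumed driver; the DSN below is a placeholder

   DSN = "host=postgres dbname=postgres user=monitor"  # hypothetical
   MAX_CONNECTIONS = 115   # the limit saturated during this incident
   DELTA_THRESHOLD = 20    # alert if we gain this many connections in one interval
   INTERVAL = 30           # seconds between samples


   def connection_count(cur) -> int:
       """Return the current number of backends connected to the cluster."""
       cur.execute("SELECT count(*) FROM pg_stat_activity;")
       return cur.fetchone()[0]


   conn = psycopg2.connect(DSN)
   conn.autocommit = True  # keep the monitor itself out of idle transactions
   previous = None
   with conn.cursor() as cur:
       while True:
           current = connection_count(cur)
           if previous is not None and current - previous >= DELTA_THRESHOLD:
               print(f"ALERT: connection surge {previous} -> {current} "
                     f"(limit {MAX_CONNECTIONS})")
           previous = current
           time.sleep(INTERVAL)

In practice a check like this would live in the existing alerting stack
rather than a standalone script.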
🙋🏿‍♂️ Response
================

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded. The primary issue encountered was a failure
upstream at Linode to remount the affected volumes; a support ticket
was created with Linode.

🙆🏽‍♀️ Recovery
================

*How was the incident resolved? How can we improve future mitigation?*

Initial node restoration was performed by @Joe Banks by recycling the
affected node.

Subsequent volume restoration was also performed by @Joe Banks: once
Linode had unlocked the volumes, the affected pods were scaled down to
0, the volumes were unmounted on the Linode side, and the deployments
were then recreated.
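For reference, the scale-down/scale-up step described above can be
expressed with the official ``kubernetes`` Python client. This is a
sketch only (the deployment name and namespace are illustrative); the
actual recovery was carried out manually, with the volume detach and
re-attach done on the Linode side between the two calls.

::

   # Sketch only: scale a stateful deployment to zero so its volume is released,
   # then scale it back up once the volume has been detached/re-attached at Linode.
   # The deployment name and namespace are illustrative, not our actual manifests.
   from kubernetes import client, config

   config.load_kube_config()  # use load_incluster_config() when running in-cluster
   apps = client.AppsV1Api()


   def scale(deployment: str, namespace: str, replicas: int) -> None:
       """Patch the deployment's replica count via the Scale subresource."""
       apps.patch_namespaced_deployment_scale(
           name=deployment,
           namespace=namespace,
           body={"spec": {"replicas": replicas}},
       )


   scale("redis", "default", 0)   # release the volume held by the pod
   # ...detach/re-attach the volume at Linode, then bring the service back:
   scale("redis", "default", 1)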
.. raw:: html

   <details>

.. raw:: html

   <summary>

Support ticket sent

.. raw:: html

   </summary>

.. raw:: html

   <blockquote>

Good evening,

We experienced a resource surge on one of our Kubernetes nodes at 00:32
UTC, causing a node to go unresponsive. To mitigate problems here the
node was recycled and began restarting at 1:01 UTC.

The node has now rejoined the ring and started picking up services, but
volumes will not attach to it, meaning pods with stateful storage will
not start.

An example events log for one such pod:

::

   Type     Reason       Age    From               Message
   ----     ------       ----   ----               -------
   Normal   Scheduled    2m45s  default-scheduler  Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf
   Warning  FailedMount  103s   kubelet            MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f]
   Warning  FailedMount  43s    kubelet            Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition

I’ve been trying to manually resolve this through the Linode Web UI but
get presented with attachment errors upon doing so. Please could you
advise on the best way forward to restore Volumes & Nodes to a
functioning state? As far as I can see there is something going on
upstream, since the Linode UI presents these nodes as mounted, however
as shown above LKE nodes are not locating them; there are also a few
failed attachment logs in the Linode Audit Log.

Thanks,

Joe

.. raw:: html

   </blockquote>

.. raw:: html

   </details>

.. raw:: html

   <details>

.. raw:: html

   <summary>

Response received from Linode

.. raw:: html

   </summary>

.. raw:: html

   <blockquote>

Hi Joe,

   Were there any known issues with Block Storage in Frankfurt today?

Not today, though there were service issues reported for Block Storage
and LKE in Frankfurt on July 8 and 9:

- `Service Issue - Block Storage - EU-Central
  (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__
- `Service Issue - Linode Kubernetes Engine -
  Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__

There was also an API issue reported on the 10th (resolved on the 11th),
mentioned here:

- `Service Issue - Cloud Manager and
  API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__

Regarding the specific error you were receiving:

   ``Unable to find device path out of attempted paths``

I’m not certain it’s specifically related to those Service Issues,
considering this isn’t the first time a customer has reported this error
in their LKE logs. In fact, if I recall correctly, I’ve run across this
before too, since our volumes are RWO and I had too many replicas in my
deployment that I was trying to attach to, for example.

   is this a known bug/condition that occurs with Linode CSI/LKE?

From what I understand, yes, this is a known condition that crops up
from time to time, which we are tracking. However, since there is a
workaround at the moment (e.g. - “After some more manual attempts to fix
things, scaling down deployments, unmounting at Linode and then scaling
up the deployments seems to have worked and all our services have now
been restored.”), there is no ETA for addressing this. With that said,
I’ve let our Storage team know that you’ve run into this, so as to draw
further attention to it.

If you have any further questions or concerns regarding this, let us
know.

Best regards, [Redacted]

Linode Support Team

.. raw:: html

   </blockquote>

.. raw:: html

   </details>

.. raw:: html

   <details>

.. raw:: html

   <summary>

Concluding response from Joe Banks

.. raw:: html

   </summary>

.. raw:: html

   <blockquote>

Hey [Redacted]!

Thanks for the response. We ensure that stateful pods only ever have one
volume assigned to them, either with a single replica deployment or a
statefulset. It appears that the error generally manifests when a
deployment is being migrated from one node to another during a redeploy,
which makes sense if there is some delay on the unmount/remount.

Confusion occurred because Linode was reporting the volume as attached
when the node had been recycled, but I assume that was because the node
did not cleanly shut down and therefore could not cleanly unmount
volumes.

We’ve not seen any resurgence of such issues, and we’ll address the
software fault which overloaded the node, which will hopefully mitigate
such problems in the future.

Thanks again for the response, have a great week!

Best,

Joe

.. raw:: html

   </blockquote>

.. raw:: html

   </details>

🔎 Five Why’s
=============

*Run a 5-whys analysis to understand the true cause of the incident.*

**What?**
~~~~~~~~~

Several of our services became unavailable because their volumes could
not be mounted.

Why?
~~~~

A node recycle left the node unable to mount volumes using the Linode
CSI.

.. _why-1:

Why?
~~~~

A node recycle was used because PostgreSQL had a connection surge.

.. _why-2:

Why?
~~~~

A Django feature deadlocked a table 62 times and suddenly started using
~70 connections to the database, saturating the maximum connections
limit.

.. _why-3:

Why?
~~~~

The root cause of why Django does this is unclear, and someone with more
Django proficiency is absolutely welcome to share any knowledge they may
have. I presume it’s some sort of worker race condition, but I’ve not
been able to reproduce it.

🌱 Blameless root cause
=======================

*Note the final root cause and describe what needs to change to prevent
reoccurrence*

A node being forcefully restarted left volumes in a limbo state where
mounting was difficult. It took multiple hours for this to be resolved,
since we had to wait for the volumes to unlock so they could be cloned.

🤔 Lessons learned
==================

*What did we learn from this incident?*

Volumes are painful.

We need to look at why Django is doing this and at mitigations of the
fault to prevent this from occurring again.

☑️ Follow-up tasks
==================

*List any tasks we should complete that are relevant to this incident*

- ☒ `Follow up on ticket at
  Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__
- ☐ Investigate why Django could be connection surging and locking
  tables (a possible settings-level mitigation is sketched below)
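One possible direction for that investigation is sketched below. It is
purely illustrative (the database name, host, and timeout values are
placeholders, not our production configuration): bounding connection
lifetime on the Django side and statement/idle-transaction time on the
PostgreSQL side would limit how long a misbehaving worker can hold
connections and table locks.

::

   # Hypothetical Django DATABASES settings (values are illustrative only).
   DATABASES = {
       "default": {
           "ENGINE": "django.db.backends.postgresql",
           "NAME": "pysite",        # placeholder database name
           "HOST": "postgres",      # placeholder host
           "CONN_MAX_AGE": 60,      # bound the lifetime of persistent connections (seconds)
           "OPTIONS": {
               # Server-side guards passed through to psycopg2: bound how long a
               # statement may run and how long a connection may sit idle inside
               # a transaction, limiting how long table locks can be held.
               "options": (
                   "-c statement_timeout=30000 "
                   "-c idle_in_transaction_session_timeout=30000"
               ),
           },
       },
   }

For the connection count itself, a server-side pooler such as PgBouncer
would be the more complete guard against saturating the 115-connection
limit, but that is a deployment change rather than a settings tweak.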