Diffstat (limited to 'docs/postmortems/2021-07-11-cascading-node-failures.rst')
-rw-r--r--  docs/postmortems/2021-07-11-cascading-node-failures.rst  335
1 file changed, 0 insertions(+), 335 deletions(-)
diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst
deleted file mode 100644
index b2e5cdf..0000000
--- a/docs/postmortems/2021-07-11-cascading-node-failures.rst
+++ /dev/null
@@ -1,335 +0,0 @@
-2021-07-11: Cascading node failures and ensuing volume problems
-===============================================================
-
-A PostgreSQL connection spike (00:27 UTC) caused by Django left a node
-unresponsive (00:55 UTC). Recycling the affected node then left its
-volumes in a state where they could not be mounted.
-
-⚠️ Leadup
-----------
-
-*List the sequence of events that led to the incident*
-
-- **00:27 UTC:** Django starts rapidly using connections to our
- PostgreSQL database
-- **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated
- its limit of 115 connections. Joe is paged.
-- **00:33 UTC:** DevOps team is alerted that a service has claimed 34
- dangerous table locks (peaking at 61); see the query sketch after
- this timeline.
-- **00:42 UTC:** Status incident created and backdated to 00:25 UTC.
- `Status incident <https://status.pythondiscord.com/incident/92712>`__
-- **00:55 UTC:** It’s clear that the node PostgreSQL was on is no
- longer healthy after the Django connection surge, so it’s recycled
- and a new node is added to the pool.
-- **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its
- restart.
-- **01:13 UTC:** Node has been restored and regained healthy status,
- but volumes will not mount to it. Support ticket opened with Linode
- for assistance.
-- **06:36 UTC:** DevOps team is alerted that Python is offline. The
- bot depends on Redis which, as a stateful service, was still
- unhealthy.
-
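-For reference, the connection saturation and table locks above can be
-inspected directly with ``psql``. A minimal sketch, assuming shell
-access with database credentials already configured:
-
-::
-
-   # Count active connections per application (the surge shows up here).
-   psql -c "SELECT application_name, state, count(*)
-            FROM pg_stat_activity
-            GROUP BY application_name, state
-            ORDER BY count(*) DESC;"
-
-   # List granted locks per relation to spot a table-lock buildup.
-   psql -c "SELECT relation::regclass AS locked_table, mode, count(*)
-            FROM pg_locks
-            WHERE granted
-            GROUP BY relation, mode
-            ORDER BY count(*) DESC;"
-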
-🥏 Impact
-----------
-
-*Describe how internal and external users were impacted during the
-incident*
-
-Initially, this manifested as a standard node outage where services on
-that node experienced some downtime as the node was restored.
-
-Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop)
-could not run due to the volume issues, and so any dependent services
-(e.g. Site, Bot, Hastebin) also had trouble starting.
-
-PostgreSQL was restored early on, so for the most part Moderation could
-continue.
-
-👁️ Detection
----------------
-
-*Report when the team detected the incident, and how we could improve
-detection time*
-
-DevOps were initially alerted at 00:32 UTC due to the PostgreSQL
-connection surge, and acknowledged the alert at the same time.
-
-Further alerting on the connection delta, rather than only the
-connection total, could catch surges earlier; for the most part,
-though, alerting time was satisfactory here. A sketch of such a check
-follows.
-
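-A minimal, hypothetical sketch of such a check (not our actual alerting
-rule), assuming ``psql`` connectivity is already configured:
-
-::
-
-   # Sample total connections twice, 30 seconds apart, and report the delta.
-   BEFORE=$(psql -tAc "SELECT count(*) FROM pg_stat_activity;")
-   sleep 30
-   AFTER=$(psql -tAc "SELECT count(*) FROM pg_stat_activity;")
-   echo "connections: total=$AFTER delta=$((AFTER - BEFORE)) over 30s"
-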
-🙋🏿‍♂️ Response
------------------
-
-*Who responded to the incident, and what obstacles did they encounter?*
-
-Joe Banks responded. The primary obstacle was an upstream failure at
-Linode to remount the affected volumes, for which a support ticket was
-created.
-
-🙆🏽‍♀️ Recovery
-------------------
-
-*How was the incident resolved? How can we improve future mitigation?*
-
-Initial node restoration was performed by @Joe Banks by recycling the
-affected node.
-
-Subsequent volume restoration was also handled by @Joe Banks: once
-Linode had unlocked the volumes, the affected pods were scaled down to
-0, the volumes were unmounted on the Linode side, and the deployments
-were recreated, as sketched below.
-
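-A minimal sketch of that sequence, using Redis as the example workload
-(the deployment name, volume ID and manifest path are illustrative, not
-the exact ones used):
-
-::
-
-   # Scale the stateful workload down so the volume claim is released.
-   kubectl scale deployment/redis --replicas=0
-
-   # Detach the backing volume on the Linode side once it unlocks.
-   linode-cli volumes detach $VOLUME_ID
-
-   # Recreate the deployment; the volume reattaches when the pod is scheduled.
-   kubectl apply -f redis-deployment.yaml
-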
-.. raw:: html
-
- <details>
-
-.. raw:: html
-
- <summary>
-
-Support ticket sent
-
-.. raw:: html
-
- </summary>
-
-.. raw:: html
-
- <blockquote>
-
-Good evening,
-
-We experienced a resource surge on one of our Kubernetes nodes at 00:32
-UTC, causing a node to go unresponsive. To mitigate problems here the
-node was recycled and began restarting at 1:01 UTC.
-
-The node has now rejoined the ring and started picking up services, but
-volumes will not attach to it, meaning pods with stateful storage will
-not start.
-
-An example events log for one such pod:
-
-::
-
- Type Reason Age From Message
- ---- ------ ---- ---- -------
- Normal Scheduled 2m45s default-scheduler Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf
- Warning FailedMount 103s kubelet MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f]
- Warning FailedMount 43s kubelet Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition
-
-I’ve been trying to manually resolve this through the Linode Web UI but
-get presented with attachment errors upon doing so. Please could you
-advise on the best way forward to restore Volumes & Nodes to a
-functioning state? As far as I can see there is something going on
-upstream since the Linode UI presents these nodes as mounted however as
-shown above LKE nodes are not locating them, there is also a few failed
-attachment logs in the Linode Audit Log.
-
-Thanks,
-
-Joe
-
-.. raw:: html
-
- </blockquote>
-
-.. raw:: html
-
- </details>
-
-.. raw:: html
-
- <details>
-
-.. raw:: html
-
- <summary>
-
-Response received from Linode
-
-.. raw:: html
-
- </summary>
-
-.. raw:: html
-
- <blockquote>
-
-Hi Joe,
-
- Were there any known issues with Block Storage in Frankfurt today?
-
-Not today, though there were service issues reported for Block Storage
-and LKE in Frankfurt on July 8 and 9:
-
-- `Service Issue - Block Storage - EU-Central
- (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__
-- `Service Issue - Linode Kubernetes Engine -
- Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__
-
-There was also an API issue reported on the 10th (resolved on the 11th),
-mentioned here:
-
-- `Service Issue - Cloud Manager and
- API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__
-
-Regarding the specific error you were receiving:
-
- ``Unable to find device path out of attempted paths``
-
-I’m not certain it’s specifically related to those Service Issues,
-considering this isn’t the first time a customer has reported this error
-in their LKE logs. In fact, if I recall correctly, I’ve run across this
-before too, since our volumes are RWO and I had too many replicas in my
-deployment that I was trying to attach to, for example.
-
- is this a known bug/condition that occurs with Linode CSI/LKE?
-
-From what I understand, yes, this is a known condition that crops up
-from time to time, which we are tracking. However, since there is a
-workaround at the moment (e.g. - “After some more manual attempts to fix
-things, scaling down deployments, unmounting at Linode and then scaling
-up the deployments seems to have worked and all our services have now
-been restored.”), there is no ETA for addressing this. With that said,
-I’ve let our Storage team know that you’ve run into this, so as to draw
-further attention to it.
-
-If you have any further questions or concerns regarding this, let us
-know.
-
-Best regards, [Redacted]
-
-Linode Support Team
-
-.. raw:: html
-
- </blockquote>
-
-.. raw:: html
-
- </details>
-
-.. raw:: html
-
- <details>
-
-.. raw:: html
-
- <summary>
-
-Concluding response from Joe Banks
-
-.. raw:: html
-
- </summary>
-
-.. raw:: html
-
- <blockquote>
-
-Hey [Redacted]!
-
-Thanks for the response. We ensure that stateful pods only ever have one
-volume assigned to them, either with a single replica deployment or a
-statefulset. It appears that the error generally manifests when a
-deployment is being migrated from one node to another during a redeploy,
-which makes sense if there is some delay on the unmount/remount.
-
-Confusion occurred because Linode was reporting the volume as attached
-when the node had been recycled, but I assume that was because the node
-did not cleanly shutdown and therefore could not cleanly unmount
-volumes.
-
-We’ve not seen any resurgence of such issues, and we’ll address the
-software fault which overloaded the node, which will hopefully mitigate
-such problems in the future.
-
-Thanks again for the response, have a great week!
-
-Best,
-
-Joe
-
-.. raw:: html
-
- </blockquote>
-
-.. raw:: html
-
- </details>
-
-🔎 Five Whys
----------------
-
-*Run a 5-whys analysis to understand the true cause of the incident.*
-
-**What?**
-~~~~~~~~~
-
-Several of our services became unavailable because their volumes could
-not be mounted.
-
-Why?
-~~~~
-
-A node recycle left the node unable to mount volumes using the Linode
-CSI.
-
-.. _why-1:
-
-Why?
-~~~~
-
-A node recycle was used because PostgreSQL had a connection surge.
-
-.. _why-2:
-
-Why?
-~~~~
-
-A Django feature deadlocked a table 62 times and suddenly started using
-~70 connections to the database, saturating the maximum connections
-limit.
-
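-If this recurs, the blocking backends can be identified directly in
-PostgreSQL. A minimal sketch, assuming ``psql`` access and PostgreSQL
-9.6+ for ``pg_blocking_pids``:
-
-::
-
-   # Show blocked backends, who blocks them, and the queries involved.
-   psql -c "SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
-            FROM pg_stat_activity
-            WHERE cardinality(pg_blocking_pids(pid)) > 0;"
-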
-.. _why-3:
-
-Why?
-~~~~
-
-Why Django does this is unclear, and anyone with more Django
-proficiency is welcome to share any knowledge they may have. I presume
-it’s some sort of worker race condition, but I’ve not been able to
-reproduce it.
-
-🌱 Blameless root cause
------------------------
-
-*Note the final root cause and describe what needs to change to prevent
-reoccurrence*
-
-A node being forcefully restarted left volumes in a limbo state where
-mounting was difficult. It took multiple hours to resolve since we had
-to wait for the volumes to unlock so they could be cloned.
-
-🤔 Lessons learned
-------------------
-
-*What did we learn from this incident?*
-
-Volumes are painful.
-
-We need to look at why Django is doing this, and at mitigations for the
-fault, to prevent it from occurring again.
-
-☑️ Follow-up tasks
-------------------
-
-*List any tasks we should complete that are relevant to this incident*
-
-- ☒ `Follow up on ticket at
- Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__
-- ☐ Investigate why Django could be connection surging and locking
- tables