2021-07-11: Cascading node failures and ensuing volume problems
===============================================================

A PostgreSQL connection spike (00:27 UTC) caused by Django left a node
unresponsive (00:55 UTC). When the affected node was recycled, its
volumes were placed into a state where they could not be mounted.

⚠️ Leadup
----------

*List the sequence of events that led to the incident*

-  **00:27 UTC:** Django starts rapidly using connections to our
   PostgreSQL database
-  **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated
   its limit of 115 maximum connections. Joe is paged.
-  **00:33 UTC:** DevOps team is alerted that a service has claimed 34
   dangerous table locks (it peaked at 61).
-  **00:42 UTC:** Status incident created and backdated to 00:25 UTC.
   `Status incident <https://status.pythondiscord.com/incident/92712>`__
-  **00:55 UTC:** It’s clear that the node which PostgreSQL was on is no
   longer healthy after the Django connection surge, so it’s recycled
   and a new one is to be added to the pool.
-  **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its
   restart.
-  **01:13 UTC:** The node has been restored and regained healthy
   status, but volumes will not mount to it. A support ticket is opened
   with Linode for assistance.
-  **06:36 UTC:** DevOps team is alerted that Python is offline. This is
   because Redis, a dependency of the bot and a stateful service, was
   not healthy.

🥏 Impact
----------

*Describe how internal and external users were impacted during the
incident*

Initially, this manifested as a standard node outage where services on
that node experienced some downtime as the node was restored.

Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop)
were unable to start due to the volume issues, so any dependent
services (e.g. Site, Bot, Hastebin) also had trouble starting.

PostgreSQL was restored early on, so for the most part Moderation could
continue.

👁️ Detection
---------------

*Report when the team detected the incident, and how we could improve
detection time*

DevOps were initially alerted at 00:32 UTC due to the PostgreSQL
connection surge, and acknowledged at the same time.

Further alerting could be used to catch surges earlier (for example,
alerting on the connection delta rather than only the connection
total), but for the most part alerting time was satisfactory here.
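
A minimal sketch of the delta-based check described above, assuming a
small Python watcher using ``psycopg2`` with read access to
``pg_stat_activity`` (the DSN, thresholds, and sample interval are
illustrative, not our production values):

.. code-block:: python

   # Hedged monitoring sketch: sample pg_stat_activity and alert on a sharp
   # jump in connection count (the delta), not only on the absolute total
   # approaching the 115-connection limit. DSN and thresholds are examples.
   import time

   import psycopg2

   DSN = "host=postgres.default.svc dbname=postgres user=monitor"  # illustrative
   MAX_CONNECTIONS = 115
   DELTA_THRESHOLD = 20   # alert if we gain this many connections in one sample
   SAMPLE_INTERVAL = 30   # seconds between samples


   def connection_count(cur) -> int:
       """Return the current number of backend connections."""
       cur.execute("SELECT count(*) FROM pg_stat_activity;")
       return cur.fetchone()[0]


   def watch() -> None:
       conn = psycopg2.connect(DSN)
       conn.autocommit = True  # each sample runs in its own transaction
       previous = None
       with conn.cursor() as cur:
           while True:
               current = connection_count(cur)
               if previous is not None and current - previous >= DELTA_THRESHOLD:
                   print(f"ALERT: connection surge ({previous} -> {current})")
               if current >= MAX_CONNECTIONS * 0.9:
                   print(f"ALERT: nearing max_connections ({current}/{MAX_CONNECTIONS})")
               previous = current
               time.sleep(SAMPLE_INTERVAL)


   if __name__ == "__main__":
       watch()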

🙋🏿‍♂️ Response
-----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded. The primary obstacle encountered was a failure
upstream at Linode to remount the affected volumes; a support ticket
was created for this.

🙆🏽‍♀️ Recovery
------------------

*How was the incident resolved? How can we improve future mitigation?*

Initial node restoration was performed by @Joe Banks by recycling the
affected node.

Subsequent volume restoration was also performed by @Joe Banks: once
Linode had unlocked the volumes, the affected pods were scaled down to
0, the volumes were unmounted on the Linode side, and then the
deployments were recreated.
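
The scale-down/unmount/scale-up cycle above can also be driven through
the Kubernetes API rather than by hand. A rough sketch using the
official ``kubernetes`` Python client, assuming the affected workloads
are plain Deployments in the ``default`` namespace (the deployment
names are illustrative, and the Linode-side unmount still happens in
their UI or API between the two steps):

.. code-block:: python

   # Hypothetical sketch of the recovery steps: scale the stateful Deployments
   # to zero so their volumes detach, then (after unmounting on the Linode
   # side) scale back up. Deployment names and namespace are illustrative.
   from kubernetes import client, config

   STATEFUL_DEPLOYMENTS = ["redis", "postgres"]  # example names
   NAMESPACE = "default"


   def scale(apps: client.AppsV1Api, name: str, replicas: int) -> None:
       """Patch the Deployment's scale subresource to the given replica count."""
       apps.patch_namespaced_deployment_scale(
           name=name,
           namespace=NAMESPACE,
           body={"spec": {"replicas": replicas}},
       )


   def main() -> None:
       config.load_kube_config()  # or load_incluster_config() inside the cluster
       apps = client.AppsV1Api()

       # Step 1: scale the stateful workloads down so their PVCs are released.
       for name in STATEFUL_DEPLOYMENTS:
           scale(apps, name, 0)

       input("Unmount the affected volumes at Linode, then press Enter...")

       # Step 2: scale back up; fresh pods should reattach the volumes cleanly.
       for name in STATEFUL_DEPLOYMENTS:
           scale(apps, name, 1)


   if __name__ == "__main__":
       main()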

.. raw:: html

   <details>

.. raw:: html

   <summary>

Support ticket sent

.. raw:: html

   </summary>

.. raw:: html

   <blockquote>

Good evening,

We experienced a resource surge on one of our Kubernetes nodes at 00:32
UTC, causing a node to go unresponsive. To mitigate problems here the
node was recycled and began restarting at 1:01 UTC.

The node has now rejoined the ring and started picking up services, but
volumes will not attach to it, meaning pods with stateful storage will
not start.

An example events log for one such pod:

::

     Type     Reason       Age    From               Message
     ----     ------       ----   ----               -------
     Normal   Scheduled    2m45s  default-scheduler  Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf
     Warning  FailedMount  103s   kubelet            MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f]
     Warning  FailedMount  43s    kubelet            Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition

I’ve been trying to manually resolve this through the Linode Web UI but
get presented with attachment errors upon doing so. Please could you
advise on the best way forward to restore Volumes & Nodes to a
functioning state? As far as I can see there is something going on
upstream, since the Linode UI presents these nodes as mounted; however,
as shown above, LKE nodes are not locating them. There are also a few
failed attachment logs in the Linode Audit Log.

Thanks,

Joe

.. raw:: html

   </blockquote>

.. raw:: html

   </details>

.. raw:: html

   <details>

.. raw:: html

   <summary>

Response received from Linode

.. raw:: html

   </summary>

.. raw:: html

   <blockquote>

Hi Joe,

   Were there any known issues with Block Storage in Frankfurt today?

Not today, though there were service issues reported for Block Storage
and LKE in Frankfurt on July 8 and 9:

-  `Service Issue - Block Storage - EU-Central
   (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__
-  `Service Issue - Linode Kubernetes Engine -
   Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__

There was also an API issue reported on the 10th (resolved on the 11th),
mentioned here:

-  `Service Issue - Cloud Manager and
   API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__

Regarding the specific error you were receiving:

   ``Unable to find device path out of attempted paths``

I’m not certain it’s specifically related to those Service Issues,
considering this isn’t the first time a customer has reported this error
in their LKE logs. In fact, if I recall correctly, I’ve run across this
before too, since our volumes are RWO and I had too many replicas in my
deployment that I was trying to attach to, for example.

   is this a known bug/condition that occurs with Linode CSI/LKE?

From what I understand, yes, this is a known condition that crops up
from time to time, which we are tracking. However, since there is a
workaround at the moment (e.g. - “After some more manual attempts to fix
things, scaling down deployments, unmounting at Linode and then scaling
up the deployments seems to have worked and all our services have now
been restored.”), there is no ETA for addressing this. With that said,
I’ve let our Storage team know that you’ve run into this, so as to draw
further attention to it.

If you have any further questions or concerns regarding this, let us
know.

Best regards, [Redacted]

Linode Support Team

.. raw:: html

   </blockquote>

.. raw:: html

   </details>

.. raw:: html

   <details>

.. raw:: html

   <summary>

Concluding response from Joe Banks

.. raw:: html

   </summary>

.. raw:: html

   <blockquote>

Hey [Redacted]!

Thanks for the response. We ensure that stateful pods only ever have one
volume assigned to them, either with a single replica deployment or a
statefulset. It appears that the error generally manifests when a
deployment is being migrated from one node to another during a redeploy,
which makes sense if there is some delay on the unmount/remount.

Confusion occurred because Linode was reporting the volume as attached
when the node had been recycled, but I assume that was because the node
did not cleanly shutdown and therefore could not cleanly unmount
volumes.

We’ve not seen any resurgence of such issues, and we’ll address the
software fault which overloaded the node, which will hopefully mitigate
such problems in the future.

Thanks again for the response, have a great week!

Best,

Joe

.. raw:: html

   </blockquote>

.. raw:: html

   </details>

🔎 Five Whys
---------------

*Run a 5-whys analysis to understand the true cause of the incident.*

**What?**
~~~~~~~~~

Several of our services became unavailable because their volumes could
not be mounted.

Why?
~~~~

A node recycle left the node unable to mount volumes using the Linode
CSI.

.. _why-1:

Why?
~~~~

A node recycle was used because PostgreSQL had a connection surge.

.. _why-2:

Why?
~~~~

A Django feature deadlocked a table 62 times and suddenly started using
~70 connections to the database, saturating the maximum connections
limit.
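
The actual trigger here is still unclear (see the next why), but for
background, a Django deployment can hold a surprisingly large number of
connections even without a bug: with a non-zero ``CONN_MAX_AGE`` every
worker thread keeps its own persistent connection between requests, so
the total scales with workers × threads across all replicas. A hedged
illustration of the relevant setting (values are examples, not our
production configuration):

.. code-block:: python

   # settings.py -- illustrative values, not our production configuration.
   DATABASES = {
       "default": {
           "ENGINE": "django.db.backends.postgresql",
           "NAME": "site",
           "HOST": "postgres.default.svc.cluster.local",
           # With a non-zero CONN_MAX_AGE each worker thread keeps its
           # connection open between requests. With e.g. 8 workers x 4
           # threads per replica, a single deployment can quietly hold
           # 32 connections against a 115-connection server limit.
           "CONN_MAX_AGE": 60,
       }
   }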

.. _why-3:

Why?
~~~~

The root cause of why Django does this is unclear, and someone with more
Django proficiency is absolutely welcome to share any knowledge they may
have. I presume it’s some sort of worker race condition, but I’ve not
been able to reproduce it.

🌱 Blameless root cause
-----------------------

*Note the final root cause and describe what needs to change to prevent
reoccurrence*

A node being forcefully restarted left volumes in a limbo state where
mounting was difficult. It took multiple hours for this to be resolved,
since we had to wait for the volumes to unlock so they could be cloned.

🤔 Lessons learned
------------------

*What did we learn from this incident?*

Volumes are painful.

We need to look at why Django is doing this and at mitigations for the
fault to prevent this from occurring again.
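
Until the Django fault is understood, one candidate mitigation is a
database-side cap so that a single misbehaving service cannot exhaust
the shared connection budget on its own. A sketch of applying a
per-role ``CONNECTION LIMIT`` with ``psycopg2`` (the DSN, role name,
and limit are illustrative assumptions, not our current configuration):

.. code-block:: python

   # Hypothetical guard: cap how many connections one application role can
   # hold so a runaway service cannot exhaust max_connections for everyone.
   # The DSN, role name, and limit below are illustrative.
   import psycopg2

   ADMIN_DSN = "host=postgres.default.svc dbname=postgres user=postgres"
   APP_ROLE = "django_site"   # example role used by the Django deployment
   CONNECTION_LIMIT = 40      # well below the server-wide limit of 115


   def apply_limit() -> None:
       conn = psycopg2.connect(ADMIN_DSN)
       conn.autocommit = True  # ALTER ROLE is a one-off utility statement
       with conn.cursor() as cur:
           # Both values are trusted constants, so plain interpolation is fine.
           cur.execute(f"ALTER ROLE {APP_ROLE} CONNECTION LIMIT {CONNECTION_LIMIT};")
       conn.close()


   if __name__ == "__main__":
       apply_limit()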

☑️ Follow-up tasks
------------------

*List any tasks we should complete that are relevant to this incident*

-  ☒ `Follow up on ticket at
   Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__
-  ☐ Investigate why Django could be connection surging and locking
   tables