author     Johannes Christ <[email protected]>     2024-07-24 20:09:42 +0200
committer  Johannes Christ <[email protected]>     2024-07-25 20:06:54 +0200
commit     a4d7e92d544aeb43dbe1fcd8648d97e0dbf7b9d3 (patch)
tree       183318852234388654c99514e45f095af8c21676 /docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
parent     Add link to DevOps Kanban board in meeting template (#420)
Improve documentation
This commit ports our documentation to Sphinx. The reason for this is straightforward. We need to improve both the quality and the accessibility of our documentation. Hugo is not capable of doing this, as its primary output format is HTML. Sphinx builds plenty of high-quality output formats out of the box, and incentivizes writing good documentation.
Diffstat (limited to 'docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst')
-rw-r--r--   docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst   146
1 file changed, 146 insertions, 0 deletions
diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
new file mode 100644
index 0000000..b13ecd7
--- /dev/null
+++ b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
@@ -0,0 +1,146 @@
+2021-01-30: NodeBalancer networking faults due to memory pressure
+=================================================================
+
+At around 14:30 UTC on Saturday, 30th January, we started experiencing
+networking issues at the load balancer level between Cloudflare and our
+Kubernetes cluster. The underlying misconfiguration appears to have
+been caused by memory and CPU pressure.
+
+*This post-mortem was preliminary; we were still awaiting word from
+Linode’s sysadmins on any problems they detected.*
+
+**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a
+different machine.
+
+⚠️ Leadup
+---------
+
+*List the sequence of events that led to the incident*
+
+At 14:30 we started receiving alerts that services were becoming
+unreachable. We first experienced some momentary DNS errors which
+resolved themselves; however, traffic ingress remained degraded.
+
+Upon checking Linode, our NodeBalancer (the service which balances
+traffic between our Kubernetes nodes) was reporting its backends (the
+services it balances to) as down. All four backends were reported as
+down (two for port 80 and two for port 443), and the status fluctuated
+between up and down, meaning traffic was not reaching our cluster
+correctly. Scaleios correctly noted:
+
+.. image:: ./images/2021-01-30/scaleios.png
+
+The config seems to have been set incorrectly due to memory and CPU
+pressure on one of our nodes. Here is the memory usage throughout the
+incident:
+
+.. image:: ./images/2021-01-30/memory_charts.png
+
+Here is the display from Linode:
+
+.. image:: ./images/2021-01-30/linode_loadbalancers.png
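+
+For context, these NodeBalancer backends are not configured by hand:
+they are generated from a Kubernetes ``Service`` of type
+``LoadBalancer`` (presumably via the Linode Cloud Controller Manager),
+with each node listed once per exposed port, which is where the four
+backends come from. A minimal sketch of such a service, with names and
+namespace that are illustrative assumptions rather than our actual
+manifest:
+
+.. code-block:: yaml
+
+   apiVersion: v1
+   kind: Service
+   metadata:
+     name: ingress-nginx          # hypothetical name
+     namespace: ingress           # hypothetical namespace
+   spec:
+     type: LoadBalancer           # provisions the Linode NodeBalancer
+     selector:
+       app: ingress-nginx
+     ports:
+       - name: http
+         port: 80                 # one backend per node for port 80
+         targetPort: 80
+       - name: https
+         port: 443                # one backend per node for port 443
+         targetPort: 443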
+
+🥏 Impact
+---------
+
+*Describe how internal and external users were impacted during the
+incident*
+
+Since traffic could not correctly enter our cluster, multiple web-based
+services were offline, including the site, Grafana and Bitwarden.
+Inter-node communication appears to have been unaffected, as it uses a
+WireGuard tunnel between the nodes which does not depend on the
+NodeBalancer.
+
+The lack of Grafana made diagnosis slightly more difficult, but even
+then it was only a short trip to the
+
+👁️ Detection
+------------
+
+*Report when the team detected the incident, and how we could improve
+detection time*
+
+We were alerted fairly promptly through statping, which reported
+services as being down and posted a Discord notification. Subsequent
+alerts came in from Grafana but were limited, since outbound
+communication was faulty.
+
+🙋🏿‍♂️ Response
+----------------
+
+*Who responded to the incident, and what obstacles did they encounter?*
+
+Joe Banks responded!
+
+The primary obstacle was that the DevOps tools were unavailable due to
+the traffic ingress problems.
+
+🙆🏽‍♀️ Recovery
+----------------
+
+*How was the incident resolved? How can we improve future mitigation?*
+
+The incident resolved itself upstream at Linode. We’ve opened a ticket
+with Linode to let them know of the faults, which might give us a
+better indication of what caused the issues. Our Kubernetes cluster
+continued posting updates to Linode to refresh the NodeBalancer
+configuration; inspecting these payloads showed that the configuration
+looked correct.
+
+We’ve set up alerts for when Prometheus services stop responding, since
+this seems to be a fairly tell-tale symptom of networking problems.
+This was the Prometheus status graph throughout the incident:
+
+.. image:: ./images/2021-01-30/prometheus_status.png
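+
+Below is a minimal sketch of what such an alert rule can look like in
+Prometheus; the group name, grace period and labels here are
+assumptions rather than our production configuration:
+
+.. code-block:: yaml
+
+   groups:
+     - name: target-availability      # assumed group name
+       rules:
+         - alert: TargetDown
+           # `up` is 0 whenever Prometheus fails to scrape a target
+           expr: up == 0
+           for: 2m                    # assumed grace period
+           labels:
+             severity: critical
+           annotations:
+             summary: "{{ $labels.job }} on {{ $labels.instance }} is not responding"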
+
+🔎 Five Whys
+-------------
+
+*Run a 5-whys analysis to understand the true cause of the incident.*
+
+**What?** Our service experienced an outage due to networking faults.
+
+**Why?** Incoming traffic could not reach our Kubernetes nodes.
+
+**Why?** Our Linode NodeBalancers were not using the correct
+configuration.
+
+**Why?** Memory & CPU pressure seemed to cause invalid configuration
+errors upstream at Linode.
+
+**Why?** Unknown at this stage; the NodeBalancer has been migrated.
+
+🌱 Blameless root cause
+-----------------------
+
+*Note the final root cause and describe what needs to change to prevent
+recurrence*
+
+The configuration of our NodeBalancer was invalid. We cannot say why at
+this point, since we are awaiting contact back from Linode, but
+indicators point to an upstream fault: memory & CPU pressure should
+**not** cause a load balancer misconfiguration.
+
+Linode are going to follow up with us at some point during the week with
+information from their System Administrators.
+
+**Update 2nd February 2021:** Linode have concluded investigations at
+their end, taken notes and migrated our NodeBalancer to a new machine.
+We haven’t experienced problems since.
+
+🤔 Lessons learned
+------------------
+
+*What did we learn from this incident?*
+
+We should be careful not to over-schedule onto nodes: even while
+operating within reasonable constraints, we risk sending invalid
+configuration upstream to Linode and therefore preventing traffic from
+entering our cluster.
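+
+One way to keep scheduling within those constraints is to give every
+workload explicit resource requests and limits, so the scheduler cannot
+pack a node to the point of memory pressure. A minimal sketch, with
+hypothetical names and values chosen purely for illustration:
+
+.. code-block:: yaml
+
+   apiVersion: apps/v1
+   kind: Deployment
+   metadata:
+     name: example-service              # hypothetical workload
+   spec:
+     replicas: 1
+     selector:
+       matchLabels:
+         app: example-service
+     template:
+       metadata:
+         labels:
+           app: example-service
+       spec:
+         containers:
+           - name: app
+             image: ghcr.io/example/app:latest   # placeholder image
+             resources:
+               requests:              # what the scheduler reserves on the node
+                 cpu: 100m
+                 memory: 128Mi
+               limits:                # hard ceiling before throttling / OOM kill
+                 cpu: 250m
+                 memory: 256Mi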
+
+☑️ Follow-up tasks
+------------------
+
+*List any tasks we should complete that are relevant to this incident*
+
+- ☒ Monitor for follow-up from Linode
+- ☒ Carefully monitor the allocation rules for our services