author     2024-08-07 18:41:02 +0100
committer  2024-08-07 18:41:02 +0100
commit     dcbb78959177537cf1fdda813380996a4b2daf8f (patch)
tree       0a53ded19896aaddf93cc8f1e4ff34ac3f70464e /docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
parent     Revert "Enable fail2ban jails for postfix" (diff)
Remove old documentation
Diffstat (limited to 'docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst')
-rw-r--r--   docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst   146

1 file changed, 0 insertions, 146 deletions
diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
deleted file mode 100644
index b13ecd7..0000000
--- a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst
+++ /dev/null
@@ -1,146 +0,0 @@

2021-01-30: NodeBalancer networking faults due to memory pressure
==================================================================

At around 14:30 UTC on Saturday 30th January, we started experiencing
networking issues at the LoadBalancer level between Cloudflare and our
Kubernetes cluster. It seems that the misconfiguration was due to memory
and CPU pressure.

[STRIKEOUT: This post-mortem is preliminary; we are still awaiting word
from Linode’s SysAdmins on any problems they detected.]

**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a
different machine.

⚠️ Leadup
----------

*List the sequence of events that led to the incident*

At 14:30 we started receiving alerts that services were becoming
unreachable. We first experienced some momentary DNS errors which
resolved themselves; however, traffic ingress was still degraded.

Upon checking Linode, our NodeBalancer (the service which balances
traffic between our Kubernetes nodes) was reporting its backends (the
services it balances to) as down. It reported all four backends as down
(two for port 80 and two for port 443). This status was fluctuating
between up and down, meaning traffic was not reaching our cluster
correctly. Scaleios correctly noted:

.. image:: ./images/2021-01-30/scaleios.png

The config seems to have been set incorrectly due to memory and CPU
pressure on one of our nodes. Here is the memory usage throughout the
incident:

.. image:: ./images/2021-01-30/memory_charts.png

Here is the display from Linode:

.. image:: ./images/2021-01-30/linode_loadbalancers.png

🥏 Impact
----------

*Describe how internal and external users were impacted during the
incident*

Since traffic could not correctly enter our cluster, multiple web-based
services were offline, including the site, Grafana and Bitwarden. It
appears that no inter-node communication was affected, as this uses a
WireGuard tunnel between the nodes which does not pass through the
NodeBalancer.

The lack of Grafana made diagnosis slightly more difficult, but even
then it was only a short trip to the

👁️ Detection
-------------

*Report when the team detected the incident, and how we could improve
detection time*

We were alerted fairly promptly through statping, which reported
services as being down and posted a Discord notification. Subsequent
alerts came in from Grafana but were limited since outbound
communication was faulty.

🙋🏿‍♂️ Response
----------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded!

The primary obstacle was the DevOps tooling being unavailable due to the
traffic ingress problems.

🙆🏽‍♀️ Recovery
----------------

*How was the incident resolved? How can we improve future mitigation?*

The incident resolved itself upstream at Linode. We’ve opened a ticket
with Linode to let them know of the faults, which might give us a better
indication of what caused the issues. Our Kubernetes cluster continued
posting updates to Linode to refresh the NodeBalancer configuration;
inspecting these payloads, the configuration looked correct.
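For future incidents, the sketch below shows one way to pull the backend
statuses straight from the Linode API instead of relying on the Cloud
Manager dashboard. It is illustrative only: the ``LINODE_TOKEN``
environment variable and the NodeBalancer ID are placeholders, and the
endpoint paths follow the public Linode API v4 documentation as we
understand it, so verify them before using this for real.

.. code-block:: python

   """Illustrative sketch: list NodeBalancer backend statuses via the Linode API v4."""
   import os

   import requests

   API = "https://api.linode.com/v4"
   HEADERS = {"Authorization": f"Bearer {os.environ['LINODE_TOKEN']}"}
   NODEBALANCER_ID = 12345  # placeholder, not our real NodeBalancer ID


   def backend_statuses(nb_id: int) -> list[tuple[str, str]]:
       """Return (address, status) pairs for every backend of every config."""
       configs = requests.get(
           f"{API}/nodebalancers/{nb_id}/configs", headers=HEADERS, timeout=10
       ).json()["data"]
       statuses = []
       for cfg in configs:
           nodes = requests.get(
               f"{API}/nodebalancers/{nb_id}/configs/{cfg['id']}/nodes",
               headers=HEADERS,
               timeout=10,
           ).json()["data"]
           statuses.extend((node["address"], node["status"]) for node in nodes)
       return statuses


   if __name__ == "__main__":
       for address, status in backend_statuses(NODEBALANCER_ID):
           print(f"{address}: {status}")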
We’ve set up alerts for when Prometheus services stop responding, since
this seems to be a fairly tell-tale symptom of networking problems. This
was the Prometheus status graph throughout the incident:

.. image:: ./images/2021-01-30/prometheus_status.png

🔎 Five Whys
-------------

*Run a 5-whys analysis to understand the true cause of the incident.*

**What?** Our service experienced an outage due to networking faults.

**Why?** Incoming traffic could not reach our Kubernetes nodes.

**Why?** Our Linode NodeBalancers were not using correct configuration.

**Why?** Memory & CPU pressure seemed to cause invalid configuration
errors upstream at Linode.

**Why?** Unknown at this stage; the NodeBalancer has been migrated.

🌱 Blameless root cause
------------------------

*Note the final root cause and describe what needs to change to prevent
recurrence*

The configuration of our NodeBalancer was invalid. We cannot say why at
this point, since we are awaiting contact back from Linode, but
indicators point to an upstream fault, because memory & CPU pressure
should **not** cause a load balancer misconfiguration.

Linode are going to follow up with us at some point during the week with
information from their System Administrators.

**Update 2nd February 2021:** Linode have concluded investigations at
their end, taken notes and migrated our NodeBalancer to a new machine.
We haven’t experienced problems since.

🤔 Lessons learned
-------------------

*What did we learn from this incident?*

We should be careful not to over-schedule onto nodes: even while
operating within reasonable constraints, we risk sending invalid
configuration upstream to Linode and therefore preventing traffic from
entering our cluster.

☑️ Follow-up tasks
-------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ Monitor for follow-up from Linode
- ☒ Carefully monitor the allocation rules for our services (see the
  sketch below)
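As a starting point for that allocation monitoring, the sketch below
uses the official ``kubernetes`` Python client to flag containers that
were scheduled without a memory limit. This is not an existing tool of
ours, just an example of what the check could look like; it assumes a
working kubeconfig, and the exact policy (limits vs. requests, which
namespaces to include) is still ours to decide.

.. code-block:: python

   """Illustrative sketch: flag containers running without memory limits."""
   from kubernetes import client, config


   def containers_without_memory_limits() -> list[str]:
       """Return "namespace/pod/container" names that have no memory limit set."""
       config.load_kube_config()  # assumes a kubeconfig with read access to the cluster
       v1 = client.CoreV1Api()
       offenders = []
       for pod in v1.list_pod_for_all_namespaces().items:
           for container in pod.spec.containers:
               limits = (container.resources.limits or {}) if container.resources else {}
               if "memory" not in limits:
                   offenders.append(
                       f"{pod.metadata.namespace}/{pod.metadata.name}/{container.name}"
                   )
       return offenders


   if __name__ == "__main__":
       for name in containers_without_memory_limits():
           print(f"missing memory limit: {name}")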