From 661f49409e69f5cfafbef4cd41411a72ebc5418d Mon Sep 17 00:00:00 2001
From: Chris Lovering
Date: Sun, 13 Aug 2023 20:01:42 +0100
Subject: Copy all files from kubernetes repo into this one
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This commit is a like-for-like copy of the
[kubernetes repo](https://github.com/python-discord/kubernetes);
check that repo for commit history prior to this commit.

Co-authored-by: Amrou Bellalouna
Co-authored-by: Bradley Reynolds
Co-authored-by: Chris
Co-authored-by: Chris Lovering
Co-authored-by: ChrisJL
Co-authored-by: Den4200
Co-authored-by: GDWR
Co-authored-by: Hassan Abouelela
Co-authored-by: Hassan Abouelela
Co-authored-by: jchristgit
Co-authored-by: Joe Banks <20439493+jb3@users.noreply.github.com>
Co-authored-by: Joe Banks
Co-authored-by: Joe Banks
Co-authored-by: Johannes Christ
Co-authored-by: Kieran Siek
Co-authored-by: kosayoda
Co-authored-by: ks129 <45097959+ks129@users.noreply.github.com>
Co-authored-by: Leon Sandøy
Co-authored-by: Leon Sandøy
Co-authored-by: MarkKoz
Co-authored-by: Matteo Bertucci
Co-authored-by: Sebastiaan Zeeff <33516116+SebastiaanZ@users.noreply.github.com>
Co-authored-by: Sebastiaan Zeeff
Co-authored-by: vcokltfre
---
 .../postmortems/2020-12-11-postgres-conn-surge.md | 96 ++++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 kubernetes/docs/postmortems/2020-12-11-postgres-conn-surge.md

diff --git a/kubernetes/docs/postmortems/2020-12-11-postgres-conn-surge.md b/kubernetes/docs/postmortems/2020-12-11-postgres-conn-surge.md
new file mode 100644
index 0000000..3e5360c
--- /dev/null
+++ b/kubernetes/docs/postmortems/2020-12-11-postgres-conn-surge.md
@@ -0,0 +1,96 @@
---
layout: default
title: "2020-12-11: Postgres connection surge"
parent: Postmortems
nav_order: 1
---

# 2020-12-11: Postgres connection surge

At **13:24 UTC**, we noticed that the bot was unable to infract and that [pythondiscord.com](http://pythondiscord.com) was unavailable. The DevOps team started to investigate.

We discovered that Postgres was not accepting new connections because it had hit its limit of 100 clients. This made it unavailable to all services that depended on it.

Ultimately this was resolved by taking down Postgres, remounting the associated volume, and bringing it back up again.

## ⚠️ Leadup

*List the sequence of events that led to the incident*

Bot infractions stopped working, and we started investigating.

## 🥏 Impact

*Describe how internal and external users were impacted during the incident*

Services were unavailable for both internal and external users.

- The Help Channel System was unavailable.
- Voice Gate and Server Gate were not working.
- Moderation commands were unavailable.
- The Python Discord site & API were unavailable. Cloudflare automatically switched us to Always Online.

## 👁️ Detection

*Report when the team detected the incident, and how we could improve detection time*

We noticed HTTP 524s coming from Cloudflare; upon attempting a database connection, we saw that the maximum client limit had been reached.

We noticed this log line from site:

```
django.db.utils.OperationalError: FATAL: sorry, too many clients already
```

We should be monitoring the number of connected clients, and that monitoring should alert us when we approach the maximum. That would have allowed for earlier detection, and might have let us prevent the incident altogether.

We will look at [wrouesnel/postgres_exporter](https://github.com/wrouesnel/postgres_exporter) for monitoring this.
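To make the detection goal concrete, here is a minimal sketch of a Prometheus alerting rule built on that exporter's connection metrics. The metric name `pg_stat_activity_count` and the 90-connection threshold (matching the alert level described under the root cause below) are assumptions to be verified against whichever exporter we actually deploy:

```yaml
# Sketch only: alert when Postgres approaches its 100-client limit.
# Assumes postgres_exporter exposes pg_stat_activity_count; verify the
# metric name and the threshold before relying on this.
groups:
  - name: postgres-connections
    rules:
      - alert: PostgresTooManyClients
        # Total connections across all databases and states.
        expr: sum(pg_stat_activity_count) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Postgres is approaching max_connections
          description: "{{ $value }} clients are currently connected to Postgres."
```

Until an exporter is in place, the same check can be run by hand with `SELECT count(*) FROM pg_stat_activity;`.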
## 🙋🏿‍♂️ Response

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the incident. The main obstacle was the lack of a clear response strategy.

We should document our recovery procedure so that we are not so dependent on Joe Banks should this happen again while he is unavailable.

## 🙆🏽‍♀️ Recovery

*How was the incident resolved? How can we improve future mitigation?*

- Delete the PostgreSQL deployment: `kubectl delete deployment/postgres`
- Force-delete any remaining pods: `kubectl delete pod <pod-name> --force --grace-period=0`
- Unmount the volume at Linode
- Remount the volume at Linode
- Reapply the deployment: `kubectl apply -f postgres/deployment.yaml`

## 🔎 Five Whys

*Run a 5-whys analysis to understand the true cause of the incident.*

- Postgres was unavailable, so our services died.
- **Why?** Postgres hit its maximum client count and could not accept new connections.
- **Why?** Unknown, but we saw a number of connections from previous deployments of site. This indicates that database connections are not being terminated properly. Needs further investigation.

## 🌱 Blameless root cause

*Note the final root cause and describe what needs to change to prevent recurrence*

We are not sure what the root cause is, but we suspect that site is not terminating database connections properly in some cases. We were unable to reproduce the problem.

We've set up new telemetry on Grafana with alerts so that we can investigate this more closely. We will be notified if the number of connections from site exceeds 32, or if the total number of connections exceeds 90.

## 🤔 Lessons learned

*What did we learn from this incident?*

- We must ensure the DevOps team has access to Linode and other key services even if our Bitwarden is down.
- We need to be alerted to any risk factors that could make Postgres unavailable, since that causes a catastrophic outage of practically all services.
- We absolutely need backups for the databases, so that this sort of problem carries less risk.
- We may need to consider something like [pg_bouncer](https://wiki.postgresql.org/wiki/PgBouncer) to manage a connection pool, so that we don't exceed 100 *legitimate* connected clients as we attach more services to the Postgres database.

## ☑️ Follow-up tasks

*List any tasks we should complete that are relevant to this incident*

- [x] Back up all databases (see the scheduled-backup sketch below)
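For the backup task, the sketch below shows one possible shape for a scheduled logical backup: a Kubernetes CronJob that runs `pg_dumpall` nightly and writes the dump to a mounted volume. Every name in it (secret, service, PVC, image tag, schedule) is a placeholder rather than a description of our actual cluster, and the `apiVersion` may need adjusting for the cluster's Kubernetes version.

```yaml
# Sketch only: nightly logical backup of Postgres with pg_dumpall.
# All names below are placeholders; adjust to the real cluster layout.
apiVersion: batch/v1            # older clusters may need batch/v1beta1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 4 * * *"         # 04:00 UTC daily
  concurrencyPolicy: Forbid     # never run two backups at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:13              # placeholder version
              env:
                - name: PGHOST
                  value: postgres             # placeholder service name
                - name: PGUSER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials   # placeholder secret
                      key: user
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
              command:
                - /bin/sh
                - -c
                # Dump everything to the mounted backup volume, dated by day.
                - pg_dumpall --clean > /backups/backup-$(date +%F).sql
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: postgres-backups   # placeholder PVC
```

Dumps written to a cluster volume are only half of the job; copying them off-cluster (for example to object storage) would be the natural next step, so that a volume failure does not take the backups down with the database.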