2020-12-11: Postgres connection surge
=====================================

At **13:24 UTC**, we noticed the bot was unable to apply infractions, and
`pythondiscord.com <http://pythondiscord.com>`__ was unavailable. The
DevOps team started to investigate.

We discovered that Postgres was not accepting new connections because it
had hit its limit of 100 clients. This made it unavailable to all services
that depended on it.

Ultimately this was resolved by taking down Postgres, remounting the
associated volume, and bringing it back up again.

⚠️ Leadup
----------

*List the sequence of events that led to the incident*

Bot infractions stopped working, and we started investigating.

🥏 Impact
----------

*Describe how internal and external users were impacted during the incident*

Services were unavailable for both internal and external users.

- The Help Channel System was unavailable.
- Voice Gate and Server Gate were not working.
- Moderation commands were unavailable.
- The Python Discord site and API were unavailable. Cloudflare
  automatically switched us to Always Online.

👁️ Detection
--------------

*Report when the team detected the incident, and how we could improve detection time*

We noticed HTTP 524s coming from Cloudflare; upon attempting a database
connection, we saw that Postgres had hit its maximum client limit.

We noticed this log line in site:

.. code:: text

   django.db.utils.OperationalError: FATAL: sorry, too many clients already

We should be monitoring the number of connected clients, and the monitor
should alert us when we approach the maximum. That would have allowed for
earlier detection, and possibly allowed us to prevent the incident
altogether.

We will look at
`wrouesnel/postgres_exporter <https://github.com/wrouesnel/postgres_exporter>`__
for monitoring this.

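Until that monitoring is in place, the client count can be checked by hand
against ``max_connections``. A minimal sketch, assuming ``psql`` is
available inside the Postgres pod and the superuser is named ``postgres``:

.. code:: sh

   # Compare the number of connected clients against the configured limit.
   # deployment/postgres matches the deployment name used elsewhere in this
   # document; adjust if the resource is named differently.
   kubectl exec deployment/postgres -- psql -U postgres -c \
     "SELECT count(*) AS clients,
             current_setting('max_connections') AS max_clients
        FROM pg_stat_activity;"
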
🙋🏿‍♂️ Response
------------------

*Who responded to the incident, and what obstacles did they encounter?*

Joe Banks responded to the incident. The main obstacle was the lack of a
clear response strategy.

We should document our recovery procedure so that we’re not dependent on
Joe Banks should this happen again while he’s unavailable.

🙆🏽‍♀️ Recovery
------------------

*How was the incident resolved? How can we improve future mitigation?*

- Delete the PostgreSQL deployment: ``kubectl delete deployment/postgres``
- Delete any remaining pods, with force:
  ``kubectl delete pod <pod name> --force --grace-period=0``
- Unmount the volume at Linode.
- Remount the volume at Linode.
- Reapply the deployment: ``kubectl apply -f postgres/deployment.yaml``

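Collected into a single place, the procedure looks roughly like the sketch
below. The pod name is a placeholder, the Linode step is a manual one noted
as a comment, and the manifest path is the one referenced above:

.. code:: sh

   # Tear down the Postgres deployment and force-remove any lingering pods.
   kubectl delete deployment/postgres
   kubectl delete pod <pod name> --force --grace-period=0

   # Unmount and then remount the associated volume in the Linode dashboard
   # (manual step), then bring Postgres back up.
   kubectl apply -f postgres/deployment.yaml
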
🔎 Five Whys
--------------

*Run a 5-whys analysis to understand the true cause of the incident.*

- Postgres was unavailable, so our services died.
- **Why?** Postgres hit its maximum client limit and could not accept new
  connections.
- **Why?** Unknown, but we saw a number of connections from previous
  deployments of site. This indicates that database connections are not
  being terminated properly. Needs further investigation.

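A quick way to attribute connections is to group ``pg_stat_activity`` by
user and client address. A sketch along these lines, under the same
assumptions as the check above, can show whether old deployments of site
are still holding connections open:

.. code:: sh

   # Group live connections so that stale clients stand out.
   kubectl exec deployment/postgres -- psql -U postgres -c \
     "SELECT datname, usename, client_addr, state, count(*)
        FROM pg_stat_activity
       GROUP BY datname, usename, client_addr, state
       ORDER BY count(*) DESC;"
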
🌱 Blameless root cause
------------------------

*Note the final root cause and describe what needs to change to prevent recurrence*

We’re not sure what the root cause is, but we suspect that site is not
terminating database connections properly in some cases. We were unable
to reproduce the problem.

We’ve set up new telemetry on Grafana with alerts so that we can
investigate this more closely. We will be alerted if the number of
connections from site exceeds 32, or if the total number of connections
exceeds 90.

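The same thresholds can also be checked ad hoc from the command line while
the Grafana alerts are being tuned. A rough sketch, assuming site connects
with a database role named ``site``:

.. code:: sh

   # Ad-hoc approximation of the alert thresholds: more than 32 connections
   # from site, or more than 90 connections in total. The "site" role name
   # is an assumption about how site connects.
   TOTAL=$(kubectl exec deployment/postgres -- psql -U postgres -Atc \
     "SELECT count(*) FROM pg_stat_activity;")
   SITE=$(kubectl exec deployment/postgres -- psql -U postgres -Atc \
     "SELECT count(*) FROM pg_stat_activity WHERE usename = 'site';")

   if [ "$SITE" -gt 32 ] || [ "$TOTAL" -gt 90 ]; then
       echo "Connection count approaching the limit: site=$SITE total=$TOTAL"
   fi
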
🤔 Lessons learned
--------------------

*What did we learn from this incident?*

- We must ensure the DevOps team has access to Linode and other key
  services even if our Bitwarden is down.
- We need to ensure we’re alerted of any risk factors that have the
  potential to make Postgres unavailable, since this causes a
  catastrophic outage of practically all services.
- We absolutely need backups for the databases, so that this sort of
  problem carries less of a risk.
- We may need to consider something like
  `PgBouncer <https://wiki.postgresql.org/wiki/PgBouncer>`__ to manage a
  connection pool, so that we don’t exceed 100 *legitimate* clients
  connected as we connect more services to the Postgres database. A rough
  sketch of what that could look like follows this list.

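A minimal PgBouncer sketch, wrapped in a shell heredoc for illustration.
The ``site`` database name, the ``postgres`` host, and the pool sizes are
assumptions, not settled configuration:

.. code:: sh

   # Hypothetical PgBouncer configuration: clients connect to port 6432 and
   # share a small pool of real server connections to Postgres, keeping us
   # well under max_connections.
   cat > pgbouncer.ini <<'EOF'
   [databases]
   site = host=postgres port=5432 dbname=site

   [pgbouncer]
   listen_addr = 0.0.0.0
   listen_port = 6432
   auth_type = md5
   auth_file = userlist.txt
   pool_mode = transaction
   max_client_conn = 200
   default_pool_size = 20
   EOF

   pgbouncer pgbouncer.ini
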
☑️ Follow-up tasks
--------------------

*List any tasks we should complete that are relevant to this incident*

- ☒ Back up all databases