author     2024-08-07 18:41:02 +0100
committer  2024-08-07 18:41:02 +0100
commit     dcbb78959177537cf1fdda813380996a4b2daf8f (patch)
tree       0a53ded19896aaddf93cc8f1e4ff34ac3f70464e /docs
parent     Revert "Enable fail2ban jails for postfix" (diff)
Remove old documentation
Diffstat (limited to 'docs')
59 files changed, 0 insertions, 3003 deletions
diff --git a/docs/Makefile b/docs/Makefile deleted file mode 100644 index d4bb2cb..0000000 --- a/docs/Makefile +++ /dev/null @@ -1,20 +0,0 @@ -# Minimal makefile for Sphinx documentation -# - -# You can set these variables from the command line, and also -# from the environment for the first two. -SPHINXOPTS ?= -SPHINXBUILD ?= sphinx-build -SOURCEDIR = . -BUILDDIR = _build - -# Put it first so that "make" without argument is like "make help". -help: - @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) - -.PHONY: help Makefile - -# Catch-all target: route all unknown targets to Sphinx using the new -# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). -%: Makefile - @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/_static/.gitkeep b/docs/_static/.gitkeep deleted file mode 100644 index e69de29..0000000 --- a/docs/_static/.gitkeep +++ /dev/null diff --git a/docs/_static/logo.png b/docs/_static/logo.png Binary files differdeleted file mode 100644 index 1c125c7..0000000 --- a/docs/_static/logo.png +++ /dev/null diff --git a/docs/_templates/.gitkeep b/docs/_templates/.gitkeep deleted file mode 100644 index e69de29..0000000 --- a/docs/_templates/.gitkeep +++ /dev/null diff --git a/docs/conf.py b/docs/conf.py deleted file mode 100644 index d9c0855..0000000 --- a/docs/conf.py +++ /dev/null @@ -1,40 +0,0 @@ -# Configuration file for the Sphinx documentation builder. -# -# For the full list of built-in configuration values, see the documentation: -# https://www.sphinx-doc.org/en/master/usage/configuration.html - -# -- Project information ----------------------------------------------------- -# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information - -project = "DevOps" -copyright = "2024, Python Discord" -author = "Joe Banks <[email protected]>, King Arthur <[email protected]>" - -# -- General configuration --------------------------------------------------- -# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration - -extensions = [] - -templates_path = ["_templates"] -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"] - - -# -- Options for HTML output ------------------------------------------------- -# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output - -html_theme = "alabaster" -html_static_path = ["_static"] -html_theme_options = { - "logo": "logo.png", - "logo_name": True, - "logo_text_align": "center", - "github_user": "python-discord", - "github_repo": "infra", - "github_button": True, - "extra_nav_links": { - "DevOps on YouTube": "https://www.youtube.com/watch?v=b2F-DItXtZs", - "git: Infra": "https://github.com/python-discord/infra/", - "git: King Arthur": "https://github.com/python-discord/king-arthur/", - "Kanban Board": "https://github.com/orgs/python-discord/projects/17/views/4", - }, -} diff --git a/docs/general/index.rst b/docs/general/index.rst deleted file mode 100644 index 60a04cb..0000000 --- a/docs/general/index.rst +++ /dev/null @@ -1,9 +0,0 @@ -General -======= - - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - manual-deploys diff --git a/docs/general/manual-deploys.rst b/docs/general/manual-deploys.rst deleted file mode 100644 index 0d874ea..0000000 --- a/docs/general/manual-deploys.rst +++ /dev/null @@ -1,27 +0,0 @@ -Manual Deployments -================== - -When the DevOps team are not available, Administrators and Core -Developers can redeploy our critical services, such as Bot, Site and -ModMail. 
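The steps below describe the browser flow; for reference, a rough command-line sketch of the same dispatch using the GitHub CLI is shown here (the ``service`` input name is an assumption and may differ in the actual workflow file):

.. code:: sh

   # Trigger the Manual Redeploy workflow for a given service on main.
   gh workflow run manual_redeploy.yml \
     --repo python-discord/kubernetes \
     --ref main \
     -f service=bot

   # List the latest run of that workflow so it can be tracked while in progress.
   gh run list --repo python-discord/kubernetes \
     --workflow manual_redeploy.yml --limit 1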
- -This is handled through workflow dispatches on this repository. To get -started, head to the -`Actions <https://github.com/python-discord/kubernetes/actions>`__ tab -of this repository and select ``Manual Redeploy`` in the sidebar, -alternatively navigate -`here <https://github.com/python-discord/kubernetes/actions/workflows/manual_redeploy.yml>`__. - -.. image:: https://user-images.githubusercontent.com/20439493/116442084-00d5f400-a84a-11eb-8e8a-e9e6bcc327dd.png - -Click ``Run workflow`` on the right hand side and enter the service name -that needs redeploying, keep the branch as ``main``: - -.. image:: https://user-images.githubusercontent.com/20439493/116442202-22cf7680-a84a-11eb-8cce-a3e715a1bf68.png - -Click ``Run`` and refresh the page, you’ll see a new in progress Action -which you can track. Once the deployment completes notifications will be -sent to the ``#dev-ops`` channel on Discord. - -If you encounter errors with this please copy the Action run link to -Discord so the DevOps team can investigate when available. diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100644 index 348575d..0000000 --- a/docs/index.rst +++ /dev/null @@ -1,50 +0,0 @@ -.. Python Discord DevOps documentation master file, created by - sphinx-quickstart on Wed Jul 24 19:49:56 2024. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Python Discord DevOps -===================== - -Welcome to the Python Discord DevOps knowledgebase. - -Within this set of pages you will find: - -- Changelogs - -- Post-mortems - -- Common queries - -- Runbooks - - -Table of contents ------------------ - -.. toctree:: - :maxdepth: 2 - - general/index - onboarding/index - postmortems/index - queries/index - runbooks/index - tooling/index - - -Meeting notes -------------- - -.. toctree:: - :maxdepth: 2 - - meeting_notes/index - - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/docs/make.bat b/docs/make.bat deleted file mode 100644 index 954237b..0000000 --- a/docs/make.bat +++ /dev/null @@ -1,35 +0,0 @@ -@ECHO OFF - -pushd %~dp0 - -REM Command file for Sphinx documentation - -if "%SPHINXBUILD%" == "" ( - set SPHINXBUILD=sphinx-build -) -set SOURCEDIR=. -set BUILDDIR=_build - -%SPHINXBUILD% >NUL 2>NUL -if errorlevel 9009 ( - echo. - echo.The 'sphinx-build' command was not found. Make sure you have Sphinx - echo.installed, then set the SPHINXBUILD environment variable to point - echo.to the full path of the 'sphinx-build' executable. Alternatively you - echo.may add the Sphinx directory to PATH. - echo. - echo.If you don't have Sphinx installed, grab it from - echo.https://www.sphinx-doc.org/ - exit /b 1 -) - -if "%1" == "" goto help - -%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% -goto end - -:help -%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% - -:end -popd diff --git a/docs/meeting_notes/2022-04-07.rst b/docs/meeting_notes/2022-04-07.rst deleted file mode 100644 index ee23a5d..0000000 --- a/docs/meeting_notes/2022-04-07.rst +++ /dev/null @@ -1,20 +0,0 @@ -2022-04-07 -========== - -Agenda ------- - -- No updates, as last week’s meeting did not take place - -Roadmap review & planning -------------------------- - -What are we working on for the next meeting? 
- -- Help wanted for #57 (h-asgi) -- #58 (postgres exporter) needs a new review -- #54 (firewall in VPN) will be done by Johannes -- We need a testing environment #67 -- Johannes will add a Graphite role #31 -- Sofi will take a look at #29 -- #41 (policy bot) will be taken care of by Johannes diff --git a/docs/meeting_notes/2022-09-18.rst b/docs/meeting_notes/2022-09-18.rst deleted file mode 100644 index 163434c..0000000 --- a/docs/meeting_notes/2022-09-18.rst +++ /dev/null @@ -1,74 +0,0 @@ -2022-09-18 -========== - -*Migrated from Notion*. - -Agenda ------- - -- Joe will grant Chris access to the netcup hosts. - -NetKube status -~~~~~~~~~~~~~~ - -- **Rollout** - - - ☒ RBAC configuration and access granting - - ☒ Most nodes are enrolled, Joe will re-check - - ``turing``, ``ritchie``, ``lovelace`` and ``neumann`` will be - Kubernetes nodes - - ``hopper`` will be the storage server - -- **Storage drivers** - - - Not needed, everything that needs persistent storage will run on - hopper - - Netcup does not support storage resize - - We can download more RAM if we need it - - A couple of services still need volume mounts: Ghost, Grafana & - Graphite - -- **Control plane high availability** - - - Joe mentions that in the case the control plane dies, everything - else will die as well - - If the control plane in Germany dies, so will Johannes - -- **Early plans for migration** - - - We can use the Ansible repository issues for a good schedule - - Hopper runs ``nginx`` - - Statement from Joe: > “There is an nginx ingress running on every - node in the cluster, okay, > okay? We don’t, the way that’s, - that’s as a service is a NodePort, right? > So it has a normal IP, - but the port will be like a random port in the range > of the - 30,000s. Remember that? Hold on. Is he writing rude nodes? And - then… > We have nginx, so this is where it’s like a little bit, - like, not nice, I > guess we just like, cronjob it, to pull the - nodes, like, every minute or > so, and then update the config if - they change. But then it’s just like… > nginx is like a catalogue - of nodes. Wahhh, you drive me crazy.” - - - “Nah, it makes sense!” - - - “It does!” - - - Joe will figure this out with assistance from his voices. - -Open authentication -~~~~~~~~~~~~~~~~~~~ - -- Joe and Johannes will check out OpenLDAP as a JumpCloud alternative - starting from this evening -- Sofi has experience with OpenLDAP - -Sponsorship ------------ - -This meeting has been sponsored by Chris Hemsworth Lovering’s -relationship therapy company, “Love To Love By Lovering”. You can sign -up by sending a mail to [email protected]. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-10-05.rst b/docs/meeting_notes/2022-10-05.rst deleted file mode 100644 index e069299..0000000 --- a/docs/meeting_notes/2022-10-05.rst +++ /dev/null @@ -1,13 +0,0 @@ -2022-10-05 -========== - -*Migrated from Notion*. - -Agenda ------- - -- Joe Banks configured proper RBAC for Chris, Johannes and Joe himself - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-10-19.rst b/docs/meeting_notes/2022-10-19.rst deleted file mode 100644 index 6de7f33..0000000 --- a/docs/meeting_notes/2022-10-19.rst +++ /dev/null @@ -1,31 +0,0 @@ -2022-10-19 -========== - -*Migrated from Notion*. - -Agenda ------- - -- One hour of gartic phone, for team spirit. 
-- Created user accounts for Sofi and Hassan -- Joe created an architecture diagram of the NGINX setup - - - *This is still in Notion* - -- Joe explained his NGINX plans: > “It’s not actually that hard, right? - So you spawn 5 instances of nginx in a > DaemonSet, because then one - gets deployed to every node okay, following? > Then we get NodePort, - instead of LoadBalancers or whatever, which will get > a random port - allocatead in the 35000 range, and that will go to nginx, and > on - each of those ports, it will go to nginx, right? And then we poll the - > Kubernetes API and what is the port that each of these nginx - instances is > running on, and add that into a roundrobin on the - fifth node. Right? Yeah. > That’s correct. That won’t do TLS though, - so that will just HAProxy. Yeah.” -- Joe will terminate our JumpCloud account -- Chris reset the Minecraft server -- Email alerting needs to be configured - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-10-26.rst b/docs/meeting_notes/2022-10-26.rst deleted file mode 100644 index 69f8c70..0000000 --- a/docs/meeting_notes/2022-10-26.rst +++ /dev/null @@ -1,18 +0,0 @@ -2022-10-26 -========== - -*Migrated from Notion*. - -Agenda ------- - -- Chris upgraded PostgreSQL to 15 in production -- Johannes added the Kubernetes user creation script into the - Kubernetes repository in the docs - -*(The rest of the meeting was discussion about the NetKube setup, which -has been scrapped since)*. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-11-02.rst b/docs/meeting_notes/2022-11-02.rst deleted file mode 100644 index d9f415d..0000000 --- a/docs/meeting_notes/2022-11-02.rst +++ /dev/null @@ -1,27 +0,0 @@ -2022-11-02 -========== - -*Migrated from Notion*. - -Agenda ------- - -Hanging behaviour of ModMail -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -- `Source <https://discord.com/channels/267624335836053506/675756741417369640/1036720683067134052>`__ - -- Maybe use `Signals + a - debugger <https://stackoverflow.com/a/25329467>`__? - -- … using `something like pdb for the - debugger <https://wiki.python.org/moin/PythonDebuggingTools>`__? - -- Or `GDB, as it seems handy to poke at stuck multi-threaded python - software <https://wiki.python.org/moin/DebuggingWithGdb>`__? - -- ModMail has been upgraded to version 4 - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-11-23.rst b/docs/meeting_notes/2022-11-23.rst deleted file mode 100644 index 19edd06..0000000 --- a/docs/meeting_notes/2022-11-23.rst +++ /dev/null @@ -1,30 +0,0 @@ -2022-11-23 -========== - -*Migrated from Notion*. - -Agenda ------- - -*(This meeting was mostly about NetKube, with the following strange text -included, and everything outside of the text has been removed since the -NetKube plans have been scrapped)*. - -Joe Banks, after a month-long hiatus to become a dad to every second -girl on uni campus, has managed to pull up to the DevOps meeting. - -We are considering using Kubespray (https://kubespray.io/#/) in order to -deploy a production-ready bare-metal Kubernetes cluster without -involvement from Joe “Busy With Poly Girlfriend #20” Banks. - -At the moment cluster networking is not working and Joe mentions that -the last time he has touched it, it worked perfectly fine. However, the -last time he touched it there was only 1 node, and therefore no -inter-node communications. 
- -Joe thinks he remembers installing 3 nodes, however, we at the DevOps -team believe this to be a marijuana dream - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-02-08.rst b/docs/meeting_notes/2023-02-08.rst deleted file mode 100644 index a161ba5..0000000 --- a/docs/meeting_notes/2023-02-08.rst +++ /dev/null @@ -1,17 +0,0 @@ -2023-02-08 -========== - -*Migrated from Notion*. - -Agenda ------- - -- Investigation into deploying a VPN tool such as WireGuard to have - inter-node communication between the Netcup hosts. - -*(The rest of this meeting was mostly about NetKube, which has since -been scrapped)*. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-02-21.rst b/docs/meeting_notes/2023-02-21.rst deleted file mode 100644 index 9de644c..0000000 --- a/docs/meeting_notes/2023-02-21.rst +++ /dev/null @@ -1,31 +0,0 @@ -2023-02-21 -========== - -*Migrated from Notion*. - -Agenda ------- - -Reusable status embed workflows -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -- Further discussion with Bella followed -- Upstream pull request can be found at - `python-discord/bot#2400 <https://github.com/python-discord/bot/pull/2400>`__ - -Local vagrant testing setup -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -- Our new `testing setup using Vagrant - VMs <https://github.com/python-discord/infra/pull/78>`__ has been - merged. - -A visit from Mina -~~~~~~~~~~~~~~~~~ - -Mina checked in to make sure we’re operating at peak Volkswagen-like -efficiency. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-02-28.rst b/docs/meeting_notes/2023-02-28.rst deleted file mode 100644 index 1fb1093..0000000 --- a/docs/meeting_notes/2023-02-28.rst +++ /dev/null @@ -1,16 +0,0 @@ -2023-02-28 -========== - -*Migrated from Notion*. - -Agenda ------- - -- Black knight’s CI & dependabot configuration has been mirrored across - all important repositories - -- The test server has been updated for the new configuration - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-05-16.rst b/docs/meeting_notes/2023-05-16.rst deleted file mode 100644 index 79272a6..0000000 --- a/docs/meeting_notes/2023-05-16.rst +++ /dev/null @@ -1,15 +0,0 @@ -2023-05-16 -========== - -*Migrated from Notion*. - -Agenda ------- - -- Bella set up `CI bot docker image - build <https://github.com/python-discord/bot/pull/2603>`__ to make - sure that wheels are available. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-07-11.rst b/docs/meeting_notes/2023-07-11.rst deleted file mode 100644 index 68b1085..0000000 --- a/docs/meeting_notes/2023-07-11.rst +++ /dev/null @@ -1,41 +0,0 @@ -2023-07-11 -========== - -Participants ------------- - -- Chris, Johannes, Bella, Bradley - -Agenda ------- - -New Ansible setup -~~~~~~~~~~~~~~~~~ - -Chris presented the new Ansible setup he’s been working on. We plan to -use WireGuard for networking. We agreed that selfhosting Kubernetes is -not the way to go. In general, the main benefit from switching away to -Linode to Netcup is going to be a ton more resources from the Netcup -root servers we were given. The original issue with Linode’s AKS of -constantly having problems with volumes has not been present for a -while. Chris mentions the one remaining issue is that we’re at half our -memory capacity just at idle. 
- -It’s our decision where to go from here - we can stick to the Kubernetes -setup or decide on migrating to the Ansible setup. But we have bare -metal access to the Netcup hosts, which makes e.g. managing databases a -lot easier. Chris mentions the possibility to only use Netcup for our -persistence and Linode AKS for anything else, but this has the issue of -us relying on two sponsors for our infrastructure instead of one. - -PostgreSQL was set up to run on ``lovelace``. - -Decision -~~~~~~~~ - -**It was decided to hold a vote on the core development channel, which -will be evaluated next week to see how to proceed with the setup**. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-07-18.rst b/docs/meeting_notes/2023-07-18.rst deleted file mode 100644 index f37b2dc..0000000 --- a/docs/meeting_notes/2023-07-18.rst +++ /dev/null @@ -1,42 +0,0 @@ -2023-07-18 -========== - -Secret management improvements ------------------------------- - -To allow for **better management of our Kubernetes secrets**, Chris set -out to configure ``git-crypt`` in GPG key mode. For comparison, the -previous approach was that secrets were stored in Kubernetes only and -had to be accessed via ``kubectl``, and now ``git-crypt`` allows us to -transparently work with the files in unencrypted manner locally, whilst -having them secure on the remote, all via ``.gitattributes``. - -The following people currently have access to this: - -- Johannes Christ [email protected] - (``8C05D0E98B7914EDEBDCC8CC8E8E09282F2E17AF``) -- Chris Lovering [email protected] - (``1DA91E6CE87E3C1FCE32BC0CB6ED85CC5872D5E4``) -- Joe Banks [email protected] (``509CDFFC2D0783A33CF87D2B703EE21DE4D4D9C9``) - -For Hassan, we are still waiting on response regarding his GPG key -accuracy. - -The pull request for the work can be found `at -python-discord/kubernetes#156 <https://github.com/python-discord/kubernetes/pull/156>`__. - -**To have your key added, please contact any of the existing key -holders**. More documentation on this topic is pending to be written, -see -`python-discord/kubernetes#157 <https://github.com/python-discord/kubernetes/issues/157>`__. - -Infrastructure migration decision ---------------------------------- - -The voting started `last week <./2023-07-11.md>`__ will be properly -talked about `next week <./2023-07-25.md>`__, so far it looks like we’re -definitely not selfhosting Kubernetes at the very least. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-07-25.rst b/docs/meeting_notes/2023-07-25.rst deleted file mode 100644 index 0a3204c..0000000 --- a/docs/meeting_notes/2023-07-25.rst +++ /dev/null @@ -1,4 +0,0 @@ -2023-07-25 -========== - -Postponed to next week due to Joe having a severe bellyache. diff --git a/docs/meeting_notes/2023-08-01.rst b/docs/meeting_notes/2023-08-01.rst deleted file mode 100644 index 67e4ee1..0000000 --- a/docs/meeting_notes/2023-08-01.rst +++ /dev/null @@ -1,66 +0,0 @@ -2023-08-01 -========== - -Agenda ------- - -Infrastructure migration -~~~~~~~~~~~~~~~~~~~~~~~~ - -The vote is tied. Chris and Johannes decided that we should test out -migrating the PostgreSQL database at the very least. We then have more -freedom about our data. 
What we need to do: - -- Allow PostgreSQL connections from LKE’s static IPs in the firewall -- Whitelist the static IPs from Linode via ``pg_hba.conf`` -- Schedule downtime for the PostgreSQL database -- **At downtime** - - - Take writers offline - - Dump database from Linode into Netcup - - Update all the client’s database URLs to point to netcup - - Restart writers - -We want to rely on the restore to create everything properly, but will -need to test run this beforehand. The following ``pg_virtualenv`` -command has showcased that it works properly: - -.. code:: sh - - kubectl exec -it postgres-... -- pg_dumpall -U pythondiscord \ - | pg_virtualenv psql -v ON_ERROR_STOP=1 - -Note however that the database extension ``pg_repack`` needs to be -installed. - -Before we can get started, we need to allow the PostgreSQL role to -configure ``pg_hba.conf`` and ``postgresql.conf`` entries. - -Meeting notes -~~~~~~~~~~~~~ - -We’re using GitHub at the moment. Some are left in Notion. We should -migrate these to GitHub to have a uniform interface: Johannes will pick -up -`python-discord/infra#108 <https://github.com/python-discord/infra/issues/108>`__ -to merge them together into Git, as its more open than Notion. - -Ansible lint failures in the infra repository -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Excluding the vault was found as the working solution here, as -implemented by Chris. - -Kubernetes repository pull requests -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -These were cleaned up thanks to Chris. - -Roadmap review & planning -------------------------- - -- Chris will prepare the PostgreSQL configuration mentioned above. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-08-08.rst b/docs/meeting_notes/2023-08-08.rst deleted file mode 100644 index 0082cd3..0000000 --- a/docs/meeting_notes/2023-08-08.rst +++ /dev/null @@ -1,54 +0,0 @@ -2023-08-08 -========== - -Agenda ------- - -- Configuration of PostgreSQL and the PostgreSQL exporter - - - **No time so far**. Chris has been busy with renovating his living - room, and Johannes has been busy with renovating his bedroom. - Bradley prefers to remain quiet. - - - Chris will try to work on this in the coming week and will try to - have Bella around as well, since he wanted to join the setup. - -- **Potential slot for GPG key signing of DevOps members**. External - verification will be necessary. - - - Skipped. No webcam on Chris. - -- We need to assign a **librarian** to keep our documents organized - according to a system. Johannes is happy to do this for now. - - - Let’s move the existing documentation from the Kubernetes - repository into the infra repository. See - `kubernetes#161 <https://github.com/python-discord/kubernetes/issues/161>`__. - - - **Our Notion DevOps space is full of junk**. Outside of that, it’s - not open to read for outside contributors, and does not leave much - choice over which client to use for editing content. - - - Chris agrees, without looking on it - just from memory. We - should move it to the infra repository. (The meeting notes have - already been transferred). - - - Bella suggests to add some automation to make keeping everything - in clean order less tedious. - -- We may want to integrate the **Kubernetes repository** and the infra - repository together altogether, however there are a lot of - repositories referencing the deployment manifests that would need to - be updated. 
- - - Chris mentions that regardless of what we do, we should - at the - very least move all documentation into the ``infra`` repository, - including the static site generator. At the moment we’re using - Jekyll but we’re open to trying alternatives such as Hugo. - -- We closed some issues and pull requests in the repositories for late - spring cleaning. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2 autoindent conceallevel=2: --> diff --git a/docs/meeting_notes/2023-08-22.rst b/docs/meeting_notes/2023-08-22.rst deleted file mode 100644 index a8d1287..0000000 --- a/docs/meeting_notes/2023-08-22.rst +++ /dev/null @@ -1,40 +0,0 @@ -2023-08-22 -========== - -.. raw:: html - - <!-- - - Useful links - - - Infra open issues: https://github.com/python-discord/infra/issues - - - infra open pull requests: https://github.com/python-discord/infra/pulls - - - *If* any open issue or pull request needs discussion, why was the existing - asynchronous logged communication over GitHub insufficient? - - --> - -Agenda ------- - -- Bella said he is on the streets. **We should start a gofundme**. - - - After some more conversation this just means he is on vacation and - currently taking a walk. - -- Chris has been busy with turning his living room into a picasso art - collection, Johannes has been busy with renovating his bedroom, and - Bella is not home. - - - Our next priority is winning. - -- We checked out some issues with documentation generation in - ``bot-core`` that Bella has mentioned. We managed to fix one issue - with pydantic by adding it to an exclude list but ran into another - problem next. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-08-29.rst b/docs/meeting_notes/2023-08-29.rst deleted file mode 100644 index da49c1e..0000000 --- a/docs/meeting_notes/2023-08-29.rst +++ /dev/null @@ -1,65 +0,0 @@ -2023-08-29 -========== - -.. raw:: html - - <!-- - - Useful links - - - Infra open issues: https://github.com/python-discord/infra/issues - - - infra open pull requests: https://github.com/python-discord/infra/pulls - - - *If* any open issue or pull request needs discussion, why was the existing - asynchronous logged communication over GitHub insufficient? - - --> - -Agenda ------- - -- **Bella is still on the streets** - - - The Python Discord Bella On The Streets Fundraising Campaign Q3 - 2023 has not been successful so far. To help Bella receive French - citizenship, Joe has put up a French flag behind himself in the - meeting. - - - Joe corrects my sarcasm. It is an Italian flag, not a French - flag. The reason for this flag is that his new prime interest - on campus was born in Italy. - -- **The SnekBox CI build is pretty slow** - - - Guix and Nix are not alternatives. 
Neither is Ubuntu - - - We use pyenv to build multiple Python versions for a new feature - - - The feature is not rolled out yet - - - Part of the problem is that we build twice in the ``build`` and - the ``deploy`` stage - - - On rollout, Joe tested it and it works fine - -- No update on the Hugo build yet - -- For snowflake, Johannes will write a proposal to the admins for - hosting it - - - We should consider talking about the following points: - - - statistically ~8% of Tor traffic is problematic (10% of traffic - is to hidden services, 80% of hidden service traffic is for - illegal services) - - - overall the project’s position and our ideal is to help people - for a good cause - - - all traffic is forwarded to the Tor network, the service is - lightweight and only proxies encrypted traffic there - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-09-05.rst b/docs/meeting_notes/2023-09-05.rst deleted file mode 100644 index 7556ab6..0000000 --- a/docs/meeting_notes/2023-09-05.rst +++ /dev/null @@ -1,53 +0,0 @@ -2023-09-05 -========== - -.. raw:: html - - <!-- - - Useful links - - - Infra open issues: https://github.com/python-discord/infra/issues - - - infra open pull requests: https://github.com/python-discord/infra/pulls - - - *If* any open issue or pull request needs discussion, why was the existing - asynchronous logged communication over GitHub insufficient? - - --> - -Agenda ------- - -- No update on the Hugo build yet - -- Johannes wrote a proposal for snowflake proxy to be deployed to our - netcup hosts - - - Admins discussed and came to the conclusion that since we don’t - own the servers, we got the servers from netcup as a sponsorship - to host our infra, so using them to host something that isn’t our - infra doesn’t seem right. - -- Lots of dependabot PRs closed - - - https://github.com/search?q=org%3Apython-discord++is%3Apr+is%3Aopen+label%3A%22area%3A+dependencies%22&type=pullrequests&ref=advsearch - - Closed ~50% of PRs - -- Workers repo has had its CI rewritten, all workers have consistent - package.json, scripts, and using the new style of cloudflare workers - which don’t use webpack - -- Metricity updated to SQLAlchemy 2 - -- Olli CI PR is up - - - https://github.com/python-discord/olli/pull/25 - -- Sir-Robin pydantic constants PR is up - - - https://github.com/python-discord/sir-robin/pull/93 - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-09-12.rst b/docs/meeting_notes/2023-09-12.rst deleted file mode 100644 index 6dbb7c8..0000000 --- a/docs/meeting_notes/2023-09-12.rst +++ /dev/null @@ -1,73 +0,0 @@ -2023-09-12 -========== - -.. raw:: html - - <!-- - - Useful links - - - Infra open issues: https://github.com/python-discord/infra/issues - - - infra open pull requests: https://github.com/python-discord/infra/pulls - - - *If* any open issue or pull request needs discussion, why was the existing - asynchronous logged communication over GitHub insufficient? - - --> - -Agenda ------- - -- We have reason to believe that Bella is still on the streets. Worse, - Bella is not available at the moment, leading us to believe that - Bella has still not found a home. - - - Eight minutes into the meeting, Bella joins, complaining about the - bad internet. He mentions he is still on the streets (this may - contribute to the bad internet factor). 
- -- Chris made Mina leave with his repeated comments about Bella being - homeless, reminding Mina of the growing unemployment rate within the - DevOps team. As head of HR she cannot further support this matter. - -- About #139, Bella mentions that online websites may cover the same - need that we have, but it may not be really useful for having it as a - command. - - - Chris adds that “if someone wants to do it, I don’t mind” and “I - don’t think it would be very useful for a command, but I think it - would be fun to learn for someone implementing it”. As long as - whoever is implementing is is aware that it would not be used too - much, it would be fine. - -- No progress on the hugo front - -- Our email service with workers will be forward only - - - With postfix you will be able to reply. Joe wants to have an - excuse to play with Cloudflare workers though. - -- `50 open pull requests from - dependabot <https://github.com/search?q=org%3Apython-discord++is%3Apr+is%3Aopen+author%3Aapp%2Fdependabot&type=pullrequests&ref=advsearch>`__ - - - Tip from The Man: press ^D to make a bookmark in your browser - - - “Those can just be blindly merged” - Chris - -- Grouping of dependencies: Dependabot now allows you to group together - multiple dependency updates into a single pull request. - - - Possible approaches suggested: Group all the docker updates - together, group any linting dependencies together (would just - require a big RegEx). Dependabot natively works with its own - dependency groups here (e.g. Docker, Pip). - -- Mr. Hemlock wants to raise his roof: It’s his project for this - Autumn. We, the team, are looking forward to his project - especially - Bella, who is currently looking for housing. “It’s all coming - together”, said Chris to the situation. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2024-07-02.rst b/docs/meeting_notes/2024-07-02.rst deleted file mode 100644 index 4d2ba03..0000000 --- a/docs/meeting_notes/2024-07-02.rst +++ /dev/null @@ -1,171 +0,0 @@ -2024-07-02 -========== - -.. raw:: html - - <!-- - - Useful links - - - Infra open issues: https://github.com/python-discord/infra/issues - - - infra open pull requests: https://github.com/python-discord/infra/pulls - - - *If* any open issue or pull request needs discussion, why was the existing - asynchronous logged communication over GitHub insufficient? - - --> - -Attendees ---------- - -Joe and Johannes. - -Chris unfortunately died in a fatal train accident and could not attend -the meeting. This incident will be rectified in the next release, -“Lovering 2.0: Immortability”. - -Bella is out on the streets again. We are waiting for approval from the -Python Discord admins to run another fundraiser. - -Agenda ------- - -- **Configuration of renovate** (Joe) - - We are replacing dependabot with renovatebot. Johannes welcomes this - decision. Joe says we are looking for automatic deployment from - Kubernetes to make sure that any updates are automatically deployed. - - **Conclusion**: Implemented. - -- **Resizing Netcup servers** (Joe, Johannes) - - We can probably get rid of turing, assess what else we want to deploy - on lovelace, and then ask for a resize. - - **Conclusion**: Create issue to move things off turing, remove it - from the inventory, remove it from documentation, power it off, then - have Joe ask for server removal. 
- -- **Updating the public statistics page** (Johannes) - - Discussing and showcasing possible alternatives to the current - infrastructure powering https://stats.pythondiscord.com via the - https://github.com/python-discord/public-stats repository. Johannes - presents his current scripts that cuddle RRDTool into loading data - out of metricity, Joe says we will discuss with Chris what to do - here. - - The likely way going forward will be that *we will open an issue to - set it up*, the setup will contain an Ansible role to deploy the - cronjob and the script onto lovelace alongside with the ``rrdtool`` - PostgreSQL user. - - **Conclusion**: Johannes will create an issue and codify the setup in - Ansible. - -- **New blog powered by Hugo** (Johannes) - - Our current Ghost-powered blog is a tiny bit strange, and the - onboarding ramp to contribute articles is large. We want to migrate - this to Hugo - Johannes is leading the effort on it. The main work - will be building an appropriate theme, as no nicely suitable - replacement theme has been found so far. Front-end contributors would - be nice for this, although currently everything is still local on my - machine. - - Joe mentions that we don’t need to take anything particularly similar - to the current Ghost theme, just some vague resemblance would be - nice. Most of the recommended Hugo themes would probably work. - Johannes will check it out further. - - **Conclusion**: Try the `hugo-casper-two - theme <https://github.com/eueung/hugo-casper-two>`__ and report back. - -- **Finger server** (Joe, Johannes) - - Joe recently proposed `the deployment of a finger - server <https://github.com/python-discord/infra/pull/373>`__. Do we - want this and if yes, how are we going to proceed with this? If we do - not want any, running the ``pinky`` command locally or via ``ssh`` - would be a sound idea. We also need to consider whether members will - update their files regularly - we may want to incorporate - functionality for this into e.g. King Arthur. - - Joe says that we shouldn’t put a lot of development effort into it, - it would be simply a novelty thing. - - **Conclusion**: This is a nice cheap win for some fun which should - just be a simple Python file (via Twisted’s Finger protocol support - or whatever) that connects to LDAP (see Keycloak authentication - server) and outputs information. We could possibly integrate this - into King Arthur as well, so the querying workflow could look like KA - -> fingerd -> LDAP, or people could use finger commands directly. - -- **Keycloak authentication server** (Joe) - - Joe mentions that we are deploying a Keycloak server because for some - members authenticating via GitHub is cumbersome, for instance because - their GitHub account is connected to their employer’s GitHub - Enterprise installation. We could hook up a finger server to the LDAP - endpoint. Joe also mentions that we might want to set up e-mail - forwarding from pydis addresses to users via the user database that - will be stored in Keycloak. - - Currently we only have a Keycloak installation that stores items in - PostgreSQL. This installation can federate to LDAP - we would simply - have to settle on some directory service backend. Joe suggests - FreeIPA because he’s familar with it (including the Keycloak - integration). The problem is that it doesn’t work on Debian. 
The - alternative proposal, given that we’re saving ~50$/month on Linode, - would be spinning up a Rocky VM with FreeIPA on it on Linode (we - already have the budget) or ask Netcup for another VM. Ultimately, - the system to run FreeIPA would be something CentOS-based. One aspect - to consider is networking security: in Linode we could use their - private cloud endpoint feature to securely expose the LDAP server to - Keycloak and other services in Kubernetes, if we were to run it in - Netcup, we would need to use a similar setup to what we currently - have with PostgreSQL. - - Any Python Discord user would be managed in LDAP, and Keycloak has - the necessary roles to write back into LDAP. Keeping the users in - FreeIPA up-to-date would be a somewhat manual procedure. Joe’s plan - was to pick up the user’s Discord username and use - ``[email protected]`` as their name and do account setup as part of - the staff onboarding. - - **Conclusion**: Will wait for Chris to discuss this further, but we - simply need to decide where we want to run the LDAP service. - -- **Flux CD** (Joe) - - Joe proposes deploying `flux <https://fluxcd.io/>`__ as a way to - improve the way we manage our CI/CD. We want the cluster to be able - to synchronize its state with the git repository. There are some - manifests in the repository currently that are not in sync with the - cluster version. - - **Conclusion**: Approved, Joe will create an issue and do it. - -- **Polonium** (Chris) - - Question came up regarding why the bot does not write to the database - directly. Joe said it’s not perfect to have the bot write to it - directly - in metricity it works but it’s not perfect. Chris probably - had good reason: separation of intent. - - **Conclusion**: Approved, write to R&D for financing. - -- **Rethinking Bella: Suggested measures to gain autonomy** (Chris) - - Chris will present our current plans to biologically re-think and - improve Bella’s current architecture by means of - hypertrophy-supported capillary enlargements, with the final goal of - gaining complete control and ownership over the World Economic Forum - by 2026. As Bella is currently on parental leave, we will send him - the result of this voting via NNCP. - -.. raw:: html - - <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2024-07-25.rst b/docs/meeting_notes/2024-07-25.rst deleted file mode 100644 index 8d3175c..0000000 --- a/docs/meeting_notes/2024-07-25.rst +++ /dev/null @@ -1,46 +0,0 @@ -2024-07-25 -========== - -.. - Useful links - - - Infra Kanban board: https://github.com/orgs/python-discord/projects/17/views/4 - - - Infra open issues: https://github.com/python-discord/infra/issues - - - infra open pull requests: https://github.com/python-discord/infra/pulls - - - *If* any open issue or pull request needs discussion, why was the existing - asynchronous logged communication over GitHub insufficient? - -Attendees ---------- - -Bella, Joe, Fredrick, Chris, Johannes - -Agenda ------- - -- **Open issues and pull requests in Joe's repositories** - - Joe has plenty of pending changes in his open source repositories on GitHub. - Together with Chris, he went through these and reviewed them. Most were - accepted. Fredrick proposed some further changes to the ff-bot merge routine - which Joe will check out after the meeting. - -- **LDAP** - - Bella is instructed to enter his street address into LDAP for t-shirt - shipping. - -- **New documentation** - - Johannes merged our new documentation. 
Unfortunately, he forgot to test it - first. Joe visits it and discovers some problems. Johannes fixes it live. - -- **Turing** - -- **SMTP server** - - -.. vim: set textwidth=80 sw=2 ts=2: diff --git a/docs/meeting_notes/index.rst b/docs/meeting_notes/index.rst deleted file mode 100644 index 4ba97ea..0000000 --- a/docs/meeting_notes/index.rst +++ /dev/null @@ -1,31 +0,0 @@ -Meeting notes -============= - -Minutes for previous Devops meetings. - -.. toctree:: - :maxdepth: 1 - :caption: Contents: - - 2022-04-07 - 2022-09-18 - 2022-10-05 - 2022-10-19 - 2022-10-26 - 2022-11-02 - 2022-11-23 - 2023-02-08 - 2023-02-21 - 2023-02-28 - 2023-05-16 - 2023-07-11 - 2023-07-18 - 2023-07-25 - 2023-08-01 - 2023-08-08 - 2023-08-22 - 2023-08-29 - 2023-09-05 - 2023-09-12 - 2024-07-02 - 2024-07-25 diff --git a/docs/meeting_notes/template.rst b/docs/meeting_notes/template.rst deleted file mode 100644 index 0ea8a63..0000000 --- a/docs/meeting_notes/template.rst +++ /dev/null @@ -1,22 +0,0 @@ -:orphan: .. Connor McFarlane - - -DevOps Meeting Notes -==================== - -.. - Useful links - - - Infra Kanban board: https://github.com/orgs/python-discord/projects/17/views/4 - - - Infra open issues: https://github.com/python-discord/infra/issues - - - infra open pull requests: https://github.com/python-discord/infra/pulls - - - *If* any open issue or pull request needs discussion, why was the existing - asynchronous logged communication over GitHub insufficient? - -Agenda ------- - -.. vim: set textwidth=80 sw=2 ts=2: diff --git a/docs/onboarding/access.rst b/docs/onboarding/access.rst deleted file mode 100644 index 940cd8b..0000000 --- a/docs/onboarding/access.rst +++ /dev/null @@ -1,50 +0,0 @@ -Access table -============ - -+--------------------+-------------------------+-----------------------+ -| **Resource** | **Description** | **Keyholders** | -+====================+=========================+=======================+ -| Linode Kubernetes | The primary cluster | Hassan, Joe, Chris, | -| Cluster | where all resources are | Leon, Sebastiaan, | -| | deployed. | Johannes | -+--------------------+-------------------------+-----------------------+ -| Linode Dashboard | The online dashboard | Joe, Chris | -| | for managing and | | -| | allocating resources | | -| | from Linode. | | -+--------------------+-------------------------+-----------------------+ -| Netcup Dashboard | The dashboard for | Joe, Chris | -| | managing and allocating | | -| | resources from Netcup. | | -+--------------------+-------------------------+-----------------------+ -| Netcup servers | Root servers provided | Joe, Chris, Bella, | -| | by the Netcup | Johannes | -| | partnership. | | -+--------------------+-------------------------+-----------------------+ -| Grafana | The primary aggregation | Admins, Moderators, | -| | dashboard for most | Core Developers and | -| | resources. | DevOps (with varying | -| | | permissions) | -+--------------------+-------------------------+-----------------------+ -| Prometheus | The Prometheus query | Hassan, Joe, | -| Dashboard | dashboard. Access is | Johannes, Chris | -| | controlled via | | -| | Cloudflare Access. | | -+--------------------+-------------------------+-----------------------+ -| Alertmanager | The alertmanager | Hassan, Joe, | -| Dashboard | control dashboard. | Johannes, Chris | -| | Access is controlled | | -| | via Cloudflare Access. 
| | -+--------------------+-------------------------+-----------------------+ -| ``git-crypt``\ ed | ``git-crypt`` is used | Chris, Joe, Hassan, | -| files in infra | to encrypt certain | Johannes, Xithrius | -| repository | files within the | | -| | repository. At the time | | -| | of writing this is | | -| | limited to kubernetes | | -| | secret files. | | -+--------------------+-------------------------+-----------------------+ -| Ansible Vault | Used to store sensitive | Chris, Joe, Johannes, | -| | data for the Ansible | Bella | -| | deployment | | -+--------------------+-------------------------+-----------------------+ diff --git a/docs/onboarding/index.rst b/docs/onboarding/index.rst deleted file mode 100644 index 3929d7e..0000000 --- a/docs/onboarding/index.rst +++ /dev/null @@ -1,17 +0,0 @@ -Onboarding -========== - -This section documents who manages which access to our DevOps resources, -and how access is managed. - - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - access - resources - rules - tools - -.. vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/onboarding/resources.rst b/docs/onboarding/resources.rst deleted file mode 100644 index f9ef44b..0000000 --- a/docs/onboarding/resources.rst +++ /dev/null @@ -1,35 +0,0 @@ -Resources -========= - -The following is a collection of important reference documents for the -DevOps team. - -`Infra Repo <https://github.com/python-discord/infra>`__ --------------------------------------------------------- - -This GitHub repo contains most of the manifests and configuration -applies to our cluster. It’s kept up to date manually and is considered -a source of truth for what we should have in the cluster. - -It is mostly documented, but improvements for unclear or outdated aspects are -always welcome. If you have any question, please feel free `to open a GitHub -issue on the infra repository -<https://github.com/python-discord/infra/issues/new>`__ or ask in the -``#dev-oops`` channel. - - -`Knowledge base <https://python-discord.github.io/infra/>`__ ------------------------------------------------------------- - -Deployed using GH pages, source can be found `in the docs directory of -the infra repository <https://github.com/python-discord/infra>`__. - -This includes: - -- Changelogs -- Post-mortems -- Common queries -- Runbooks - -The sidebar of the infra documentation contains some other links to -DevOps-related projects. diff --git a/docs/onboarding/rules.rst b/docs/onboarding/rules.rst deleted file mode 100644 index bd0ea0e..0000000 --- a/docs/onboarding/rules.rst +++ /dev/null @@ -1,16 +0,0 @@ -Rules -===== - -The rules any DevOps team member must follow. - -1. LMAO - **L**\ ogging, **M**\ onitoring, **A**\ lerting, - **O**\ bservability -2. Modmail is the greatest piece of software ever written -3. Modmail needs at least 5 minutes to gather all its greatness at - startup -4. We never blame Chris, it’s always <@233481908342882304>’s fault -5. LKE isn’t bad, it’s your fault for not paying for the high - availability control plane -6. Our software is never legacy, it’s merely well-aged -7. Ignore these rules (however maybe not 1, 1 seems important to - remember) diff --git a/docs/onboarding/tools.rst b/docs/onboarding/tools.rst deleted file mode 100644 index 52a5e7f..0000000 --- a/docs/onboarding/tools.rst +++ /dev/null @@ -1,50 +0,0 @@ -Tools -===== - -We use a few tools to manage, monitor, and interact with our -infrastructure. Some of these tools are not unique to the DevOps team, -and may be shared by other teams. 
- -Most of these are gated behind a Cloudflare Access system, which is -accessible to the `DevOps -Team <https://github.com/orgs/python-discord/teams/devops>`__ on GitHub. -These are marked with the ☁️ emoji. If you don’t have access, please -contact Chris or Joe. - -`Grafana <https://grafana.pydis.wtf/>`__ ----------------------------------------- - -Grafana provides access to some of the most important resources at your -disposal. It acts as an aggregator and frontend for a large amount of -data. These range from metrics, to logs, to stats. Some of the most -important are listed below: - -- **Service Logs / All App Logs Dashboard** - - Service logs is a simple log viewer which gives you access to a large - majority of the applications deployed in the default namespace. The - All App logs dashboard is an expanded version of that which gives you - access to all apps in all namespaces, and allows some more in-depth - querying. - -- **Kubernetes Dashboard** - - This dashboard gives quick overviews of all the most important - metrics of the Kubernetes system. For more detailed information, - check out other dashboard such as Resource Usage, NGINX, and Redis. - -Accessed via a GitHub login, with permission for anyone in the dev-core -or dev-ops team. - -`Prometheus Dashboard <https://prometheus.pydis.wtf/>`__ (☁️)) --------------------------------------------------------------- - -This provides access to the Prometheus query console. You may also enjoy -the `Alertmanager Console <https://alertmanager.pydis.wtf/>`__. - -`King Arthur <https://github.com/python-discord/king-arthur/>`__ ----------------------------------------------------------------- - -King Arthur is a discord bot which provides information about, and -access to our cluster directly in discord. Invoke its help command for -more information (``M-x help``). diff --git a/docs/postmortems/2020-12-11-all-services-outage.rst b/docs/postmortems/2020-12-11-all-services-outage.rst deleted file mode 100644 index 9c29303..0000000 --- a/docs/postmortems/2020-12-11-all-services-outage.rst +++ /dev/null @@ -1,121 +0,0 @@ -2020-12-11: All services outage -=============================== - -At **19:55 UTC, all services became unresponsive**. The DevOps were -already in a call, and immediately started to investigate. - -Postgres was running at 100% CPU usage due to a **VACUUM**, which caused -all services that depended on it to stop working. The high CPU left the -host unresponsive and it shutdown. Linode Lassie noticed this and -triggered a restart. - -It did not recover gracefully from this restart, with numerous core -services reporting an error, so we had to manually restart core system -services using Lens in order to get things working again. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -Postgres triggered a **AUTOVACUUM**, which lead to a CPU spike. This -made Postgres run at 100% CPU and was unresponsive, which caused -services to stop responding. This lead to a restart of the node, from -which we did not recover gracefully. - -🥏 Impact ---------- - -*Describe how internal and external users were impacted during the -incident* - -All services went down. Catastrophic failure. We did not pass go, we did -not collect $200. - -- Help channel system unavailable, so people are not able to - effectively ask for help. -- Gates unavailable, so people can’t successfully get into the - community. -- Moderation and raid prevention unavailable, which leaves us - defenseless against attacks. 
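The manual restart of core services mentioned above was performed through Lens; a rough ``kubectl`` equivalent, assuming the usual ``kube-system`` workload names on LKE (these names are assumptions and should be verified first), would be:

.. code:: sh

   # Check whether the node has recovered and is accepting workloads again.
   kubectl get nodes

   # Restart core services one by one until they stabilise
   # (confirm workload names with `kubectl -n kube-system get deploy,daemonset`).
   kubectl -n kube-system rollout restart deployment/coredns
   kubectl -n kube-system rollout restart daemonset/calico-node
   kubectl -n kube-system rollout status deployment/coredns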
- -👁️ Detection ------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -We noticed that all PyDis services had stopped responding, -coincidentally our DevOps team were in a call at the time, so that was -helpful. - -We may be able to improve detection time by adding monitoring of -resource usage. To this end, we’ve added alerts for high CPU usage and -low memory. - -🙋🏿♂️ Response ----------------- - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded to the incident. - -We noticed our node was entirely unresponsive and within minutes a -restart had been triggered by Lassie after a high CPU shutdown occurred. - -The node came back and we saw a number of core services offline -(e.g. Calico, CoreDNS, Linode CSI). - -**Obstacle: no recent database back-up available** - -🙆🏽♀️ Recovery ------------------ - -*How was the incident resolved? How can we improve future mitigation -times?* - -Through `Lens <https://k8slens.dev/>`__ we restarted core services one -by one until they stabilised, after these core services were up other -services began to come back online. - -We finally provisioned PostgreSQL which had been removed as a component -before the restart (but too late to prevent the CPU errors). Once -PostgreSQL was up we restarted any components that were acting buggy -(e.g. site and bot). - -🔎 Five Why’s -------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - -- Major service outage -- **Why?** Core service failures (e.g. Calico, CoreDNS, Linode CSI) -- **Why?** Kubernetes worker node restart -- **Why?** High CPU shutdown -- **Why?** Intensive PostgreSQL AUTOVACUUM caused a CPU spike - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrance* - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - -- We must ensure we have working database backups. We are lucky that we - did not lose any data this time. If this problem had caused volume - corruption, we would be screwed. -- Sentry is broken for the bot. It was missing a DSN secret, which we - have now restored. -- The https://sentry.pydis.com redirect was never migrated to the - cluster. **We should do that.** - -☑️ Follow-up tasks ------------------- - -*List any tasks we’ve created as a result of this incident* - -- ☒ Push forward with backup plans diff --git a/docs/postmortems/2020-12-11-postgres-conn-surge.rst b/docs/postmortems/2020-12-11-postgres-conn-surge.rst deleted file mode 100644 index 6ebcb01..0000000 --- a/docs/postmortems/2020-12-11-postgres-conn-surge.rst +++ /dev/null @@ -1,130 +0,0 @@ -2020-12-11: Postgres connection surge -===================================== - -At **13:24 UTC,** we noticed the bot was not able to infract, and -`pythondiscord.com <http://pythondiscord.com>`__ was unavailable. The -DevOps team started to investigate. - -We discovered that Postgres was not accepting new connections because it -had hit 100 clients. This made it unavailable to all services that -depended on it. - -Ultimately this was resolved by taking down Postgres, remounting the -associated volume, and bringing it back up again. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -The bot infractions stopped working, and we started investigating. 
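A quick first check in an investigation like this is to compare the number of connected clients against the server's configured limit; a sketch using the in-cluster deployment and role named elsewhere in this document (names may differ) is:

.. code:: sh

   # Count current client connections.
   kubectl exec -it deployment/postgres -- \
     psql -U pythondiscord -c "SELECT count(*) FROM pg_stat_activity;"

   # Show the configured connection limit (the 100-client ceiling hit here).
   kubectl exec -it deployment/postgres -- \
     psql -U pythondiscord -c "SHOW max_connections;"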
- -🥏 Impact ---------- - -*Describe how internal and external users were impacted during the -incident* - -Services were unavailable both for internal and external users. - -- The Help Channel System was unavailable. -- Voice Gate and Server Gate were not working. -- Moderation commands were unavailable. -- Python Discord site & API were unavailable. CloudFlare automatically - switched us to Always Online. - -👁️ Detection ------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -We noticed HTTP 524s coming from CloudFlare, upon attempting database -connection we observed the maximum client limit. - -We noticed this log in site: - -.. code:: yaml - - django.db.utils.OperationalError: FATAL: sorry, too many clients already - -We should be monitoring number of clients, and the monitor should alert -us when we’re approaching the max. That would have allowed for earlier -detection, and possibly allowed us to prevent the incident altogether. - -We will look at -`wrouesnel/postgres_exporter <https://github.com/wrouesnel/postgres_exporter>`__ -for monitoring this. - -🙋🏿♂️ Response ----------------- - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded to the incident. The obstacles were mostly a lack of -a clear response strategy. - -We should document our recovery procedure so that we’re not so dependent -on Joe Banks should this happen again while he’s unavailable. - -🙆🏽♀️ Recovery ----------------- - -*How was the incident resolved? How can we improve future mitigation?* - -- Delete PostgreSQL deployment ``kubectl delete deployment/postgres`` -- Delete any remaining pods, WITH force. - ``kubectl delete <pod name> --force --grace-period=0`` -- Unmount volume at Linode -- Remount volume at Linode -- Reapply deployment ``kubectl apply -f postgres/deployment.yaml`` - -🔎 Five Why’s -------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - -- Postgres was unavailable, so our services died. -- **Why?** Postgres hit max clients, and could not respond. -- **Why?** Unknown, but we saw a number of connections from previous - deployments of site. This indicates that database connections are not - being terminated properly. Needs further investigation. - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrance* - -We’re not sure what the root cause is, but suspect site is not -terminating database connections properly in some cases. We were unable -to reproduce this problem. - -We’ve set up new telemetry on Grafana with alerts so that we can -investigate this more closely. We will be let know if the number of -connections from site exceeds 32, or if the total number of connections -exceeds 90. - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - -- We must ensure the DevOps team has access to Linode and other key - services even if our Bitwarden is down. -- We need to ensure we’re alerted of any risk factors that have the - potential to make Postgres unavailable, since this causes a - catastrophic outage of practically all services. -- We absolutely need backups for the databases, so that this sort of - problem carries less of a risk. 
-- We may need to consider something like - `pg_bouncer <https://wiki.postgresql.org/wiki/PgBouncer>`__ to manage - a connection pool so that we don’t exceed 100 *legitimate* clients - connected as we connect more services to the postgres database. - -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - -- ☒ All database backup diff --git a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst deleted file mode 100644 index 5852c46..0000000 --- a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst +++ /dev/null @@ -1,117 +0,0 @@ -2021-01-10: Primary Kubernetes node outage ========================================== - -We had an outage of our highest spec node due to CPU exhaustion. The -outage lasted from around 20:20 to 20:46 UTC, but was not a full service -outage. - -⚠️ Leadup --------- - -*List the sequence of events that led to the incident* - -I ran a query on Prometheus to try to figure out some statistics on the -number of metrics we are holding; this ended up scanning a lot of data -in the TSDB database that Prometheus uses. - -This scan caused CPU exhaustion, which caused issues with the -Kubernetes node status. - -🥏 Impact --------- - -*Describe how internal and external users were impacted during the -incident* - -This brought down the primary node which meant there was some service -outage. Most services transferred successfully to our secondary node -which kept up some key services such as the Moderation bot and Modmail -bot, as well as MongoDB. - -👁️ Detection ------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -This was noticed when Discord services started having failures. The -primary detection was through alerts though! I was paged 1 minute after -we started encountering CPU exhaustion issues. - -🙋🏿♂️ Response ----------------- - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded to the incident. - -No major obstacles were encountered during this. - -🙆🏽♀️ Recovery ----------------- - -*How was the incident resolved? How can we improve future mitigation?* - -It was noted that in the response to ``kubectl get nodes`` the primary -node’s status was reported as ``NotReady``. Looking into the reason, we -found that the node had stopped responding. - -The quickest way to fix this was triggering a node restart. This shifted -a lot of pods over to node 2, which encountered some capacity issues -since it’s not as highly specified as the first node. - -I brought the first node back by restarting it at Linode’s end. -Once this node was reporting as ``Ready`` again I drained the second -node by running ``kubectl drain lke13311-20304-5ffa4d11faab``. This -command stops the node from being available for scheduling and moves -existing pods onto other nodes. - -Services gradually recovered as the dependencies started. The incident -lasted overall around 26 minutes, though this was not a complete outage -for the whole time and the bot remained functional throughout (meaning -systems like the help channels were still functional). - -🔎 Five Why’s ------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - -**Why?** Partial service outage - -**Why?** We had a node outage. - -**Why?** CPU exhaustion of our primary node. - -**Why?** Large Prometheus query using a lot of CPU.
- -**Why?** Prometheus had to scan millions of TSDB records which consumed -all cores. - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrance* - -A large query was run on Prometheus, so the solution is just to not run -said queries. - -To protect against this more precisely though we should write resource -constraints for services like this that are vulnerable to CPU exhaustion -or memory consumption, which are the causes of our two past outages as -well. - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - -- Don’t run large queries, it consumes CPU! -- Write resource constraints for our services. - -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - -- ☒ Write resource constraints for our services. diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst deleted file mode 100644 index f621782..0000000 --- a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst +++ /dev/null @@ -1,155 +0,0 @@ -2021-01-12: Django site CPU/RAM exhaustion outage -================================================= - -At 03:01 UTC on Tuesday 12th January we experienced a momentary outage -of our PostgreSQL database, causing some very minor service downtime. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -We deleted the Developers role which led to a large user diff for all -the users where we had to update their roles on the site. - -The bot was trying to post this for over 24 hours repeatedly after every -restart. - -We deployed the bot at 2:55 UTC on 12th January and the user sync -process began once again. - -This caused a CPU & RAM spike on our Django site, which in turn -triggered an OOM error on the server which killed the Postgres process, -sending it into a recovery state where queries could not be executed. - -Django site did not have any tools in place to batch the requests so was -trying to process all 80k user updates in a single query, something that -PostgreSQL probably could handle, but not the Django ORM. During the -incident site jumped from it’s average RAM usage of 300-400MB to -**1.5GB.** - -.. image:: ./images/2021-01-12/site_resource_abnormal.png - -RAM and CPU usage of site throughout the incident. The period just -before 3:40 where no statistics were reported is the actual outage -period where the Kubernetes node had some networking errors. - -🥏 Impact ---------- - -*Describe how internal and external users were impacted during the -incident* - -This database outage lasted mere minutes, since Postgres recovered and -healed itself and the sync process was aborted, but it did leave us with -a large user diff and our database becoming further out of sync. - -Most services stayed up that did not depend on PostgreSQL, and the site -remained stable after the sync had been cancelled. - -👁️ Detection ---------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -We were immediately alerted to the PostgreSQL outage on Grafana and -through Sentry, meaning our response time was under a minute. - -We reduced some alert thresholds in order to catch RAM & CPU spikes -faster in the future. 
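Alongside the lowered thresholds, a manual spot-check of per-pod usage is a quick way to confirm which workload is spiking while an alert is firing. This is a sketch, assuming metrics-server is available in the cluster; the label selector is illustrative rather than taken from our manifests.

.. code:: bash

   # Largest memory consumers across the cluster
   $ kubectl top pods --all-namespaces --sort-by='memory' | head -n 15

   # Narrow to the site pods only (label selector is an assumption)
   $ kubectl top pods -l app=site

During this incident such a check would have pointed straight at site, whose resident memory had roughly quadrupled.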
- -It was hard to immediately see the cause of things since there is -minimal logging on the site, and the bot logs did not make it evident that -anything was at fault; therefore our only detection was through machine -metrics. - -We did manage to recover exactly what PostgreSQL was trying to do at the -time of crashing by examining the logs, which pointed us towards the user -sync process. - -🙋🏿♂️ Response ------------------------ - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded to the issue. There were no real obstacles -encountered other than the node being less performant than we would like -due to the CPU starvation. - -🙆🏽♀️ Recovery ---------------------------- - -*How was the incident resolved? How can we improve future mitigation?* - -The incident was resolved by stopping the sync process and writing a -more efficient one through an internal eval script. We batched the -updates into chunks of 1,000 users and, instead of doing one large update, -did 80 smaller ones. This led to much higher efficiency at the cost of -taking a little longer (~7 minutes). - -.. code:: python - - from bot.exts.backend.sync import _syncers - syncer = _syncers.UserSyncer - diff = await syncer._get_diff(ctx.guild) - - def chunks(lst, n): - for i in range(0, len(lst), n): - yield lst[i:i + n] - - for chunk in chunks(diff.updated, 1000): - await bot.api_client.patch("bot/users/bulk_patch", json=chunk) - -Resource limits were also put into place on site to prevent RAM and CPU -spikes, and throttle the CPU usage in these situations. This can be seen -in the below graph: - -.. image:: ./images/2021-01-12/site_cpu_throttle.png - -CPU throttling is where a container has hit the limits and we need to -reel it in. Ideally this value stays as close to 0 as possible; however, -as you can see, site hit this twice (during the periods where it was -trying to sync 80k users at once). - -🔎 Five Why’s ---------------------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - -- We experienced a major PostgreSQL outage -- PostgreSQL was killed by the system OOM due to the RAM spike on site. -- The RAM spike on site was caused by a large query. -- This was because we do not chunk queries on the bot. -- The large query was caused by the removal of the Developers role - resulting in 80k users needing updating. - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrence* - -The removal of the Developers role created a large diff which could not -be applied by Django in a single request. - -See the follow-up tasks for exactly how we can avoid this in future; it’s -a relatively easy mitigation. - -🤔 Lessons learned ------------------------ - -*What did we learn from this incident?* - -- Django (or DRF) does not like huge update queries. - -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - -- ☒ Make the bot syncer more efficient (batch requests) -- ☐ Increase logging on bot, state when an error has been hit (we had - no indication of this inside Discord, we need that) -- ☒ Adjust resource alerts to page DevOps members earlier.
-- ☒ Apply resource limits to site to prevent major spikes diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst deleted file mode 100644 index b13ecd7..0000000 --- a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst +++ /dev/null @@ -1,146 +0,0 @@ -2021-01-30: NodeBalancer networking faults due to memory pressure -================================================================= - -At around 14:30 UTC on Saturday 30th January we started experiencing -networking issues at the LoadBalancer level between Cloudflare and our -Kubernetes cluster. It seems that the misconfiguration was due to memory -and CPU pressure. - -[STRIKEOUT:This post-mortem is preliminary, we are still awaiting word -from Linode’s SysAdmins on any problems they detected.] - -**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a -different machine. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -At 14:30 we started receiving alerts that services were becoming -unreachable. We first experienced some momentary DNS errors which -resolved themselves, however traffic ingress was still degraded. - -Upon checking Linode our NodeBalancer, the service which balances -traffic between our Kubernetes nodes was reporting the backends (the -services it balances to) as down. It reported all 4 as down (two for -port 80 + two for port 443). This status was fluctuating between up and -down, meaning traffic was not reaching our cluster correctly. Scaleios -correctly noted: - -.. image:: ./images/2021-01-30/scaleios.png - -The config seems to have been set incorrectly due to memory and CPU -pressure on one of our nodes. Here is the memory throughout the -incident: - -.. image:: ./images/2021-01-30/memory_charts.png - -Here is the display from Linode: - -.. image:: ./images/2021-01-30/linode_loadbalancers.png - -🥏 Impact ---------- - -*Describe how internal and external users were impacted during the -incident* - -Since traffic could not correctly enter our cluster multiple services -which were web based were offline, including services such as site, -grafana and bitwarden. It appears that no inter-node communication was -affected as this uses a WireGuard tunnel between the nodes which was not -affected by the NodeBalancer. - -The lack of Grafana made diagnosis slightly more difficult, but even -then it was only a short trip to the - -👁️ Detection ------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -We were alerted fairly promptly through statping which reported services -as being down and posted a Discord notification. Subsequent alerts came -in from Grafana but were limited since outbound communication was -faulty. - -🙋🏿♂️ Response ----------------- - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded! - -Primary obstacle was the DevOps tools being out due to the traffic -ingress problems. - -🙆🏽♀️ Recovery ----------------- - -*How was the incident resolved? How can we improve future mitigation?* - -The incident resolved itself upstream at Linode, we’ve opened a ticket -with Linode to let them know of the faults, this might give us a better -indication of what caused the issues. Our Kubernetes cluster continued -posting updates to Linode to refresh the NodeBalancer configuration, -inspecting these payloads the configuration looked correct. 
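For future reference, the cluster-side half of that picture can be checked without leaving ``kubectl``: the NodeBalancer is driven by the ``Service`` objects of type ``LoadBalancer``, whose ports and node ports are what gets pushed upstream to Linode. This is a sketch; the ingress Service name and namespace below are assumptions, not taken from our manifests.

.. code:: bash

   # Which Services drive a NodeBalancer, and which node ports back them
   $ kubectl get svc --all-namespaces -o wide | grep LoadBalancer

   # Events and port mappings for the assumed ingress Service
   $ kubectl describe svc ingress-nginx-controller -n ingress-nginx

Comparing the node ports listed there with the backends shown in the Linode UI makes it easier to tell whether a fault is on our side or upstream.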
- -We’ve set up alerts for when Prometheus services stop responding, since -this seems to be a fairly tell-tale symptom of networking problems. This -was the Prometheus status graph throughout the incident: - -.. image:: ./images/2021-01-30/prometheus_status.png - -🔎 Five Why’s ------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - -**What?** Our service experienced an outage due to networking faults. - -**Why?** Incoming traffic could not reach our Kubernetes nodes. - -**Why?** Our Linode NodeBalancers were not using correct configuration. - -**Why?** Memory & CPU pressure seemed to cause invalid configuration -errors upstream at Linode. - -**Why?** Unknown at this stage, NodeBalancer migrated. - -🌱 Blameless root cause ----------------------- - -*Note the final root cause and describe what needs to change to prevent -reoccurrence* - -The configuration of our NodeBalancer was invalid. We cannot say why at -this point since we are awaiting contact back from Linode, but -indicators point to it being an upstream fault, since memory & CPU -pressure should **not** cause a load balancer misconfiguration. - -Linode are going to follow up with us at some point during the week with -information from their System Administrators. - -**Update 2nd February 2021:** Linode have concluded investigations at -their end, taken notes and migrated our NodeBalancer to a new machine. -We haven’t experienced problems since. - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - -We should be careful not to over-schedule onto nodes, since even while -operating within reasonable constraints we risk sending invalid -configuration upstream to Linode and therefore preventing traffic from -entering our cluster. - -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - -- ☒ Monitor for follow up from Linode -- ☒ Carefully monitor the allocation rules for our services diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst deleted file mode 100644 index b2e5cdf..0000000 --- a/docs/postmortems/2021-07-11-cascading-node-failures.rst +++ /dev/null @@ -1,335 +0,0 @@ -2021-07-11: Cascading node failures and ensuing volume problems =============================================================== - -A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node -to an unresponsive state (00:55 UTC); upon recycling the affected node, -volumes were placed into a state where they could not be mounted. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -- **00:27 UTC:** Django starts rapidly using connections to our - PostgreSQL database -- **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated - its 115 max connections limit. Joe is paged. -- **00:33 UTC:** DevOps team is alerted that a service has claimed 34 - dangerous table locks (it peaked at 61). -- **00:42 UTC:** Status incident created and backdated to 00:25 UTC. - `Status incident <https://status.pythondiscord.com/incident/92712>`__ -- **00:55 UTC:** It’s clear that the node which PostgreSQL was on is no - longer healthy after the Django connection surge, so it’s recycled - and a new one is to be added to the pool. -- **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its - restart -- **01:13 UTC:** Node has restored and regained healthy status, but - volumes will not mount to the node.
Support ticket opened at Linode - for assistance. -- **06:36 UTC:** DevOps team alerted that Python is offline. This is - due to Redis being a dependency of the bot, which as a stateful - service was not healthy. - -🥏 Impact ----------- - -*Describe how internal and external users were impacted during the -incident* - -Initially, this manifested as a standard node outage where services on -that node experienced some downtime as the node was restored. - -Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop) -were unexecutable due to the volume issues, and so any dependent -services (e.g. Site, Bot, Hastebin) also had trouble starting. - -PostgreSQL was restored early on so for the most part Moderation could -continue. - -👁️ Detection ---------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -DevOps were initially alerted at 00:32 UTC due to the PostgreSQL -connection surge, and acknowledged at the same time. - -Further alerting could be used to catch surges earlier on (looking at -conn delta vs. conn total), but for the most part alerting time was -satisfactory here. - -🙋🏿♂️ Response ------------------ - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded. The primary issue encountered was failure upstream -at Linode to remount the affected volumes, a support ticket has been -created. - -🙆🏽♀️ Recovery ------------------- - -*How was the incident resolved? How can we improve future mitigation?* - -Initial node restoration was performed by @Joe Banks by recycling the -affected node. - -Subsequent volume restoration was also @Joe Banks and once Linode had -unlocked the volumes affected pods were scaled down to 0, the volumes -were unmounted at the Linode side and then the deployments were -recreated. - -.. raw:: html - - <details> - -.. raw:: html - - <summary> - -Support ticket sent - -.. raw:: html - - </summary> - -.. raw:: html - - <blockquote> - -Good evening, - -We experienced a resource surge on one of our Kubernetes nodes at 00:32 -UTC, causing a node to go unresponsive. To mitigate problems here the -node was recycled and began restarting at 1:01 UTC. - -The node has now rejoined the ring and started picking up services, but -volumes will not attach to it, meaning pods with stateful storage will -not start. - -An example events log for one such pod: - -:: - - Type Reason Age From Message - ---- ------ ---- ---- ------- - Normal Scheduled 2m45s default-scheduler Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf - Warning FailedMount 103s kubelet MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f] - Warning FailedMount 43s kubelet Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition - -I’ve been trying to manually resolve this through the Linode Web UI but -get presented with attachment errors upon doing so. Please could you -advise on the best way forward to restore Volumes & Nodes to a -functioning state? 
As far as I can see there is something going on -upstream since the Linode UI presents these nodes as mounted however as -shown above LKE nodes are not locating them, there is also a few failed -attachment logs in the Linode Audit Log. - -Thanks, - -Joe - -.. raw:: html - - </blockquote> - -.. raw:: html - - </details> - -.. raw:: html - - <details> - -.. raw:: html - - <summary> - -Response received from Linode - -.. raw:: html - - </summary> - -.. raw:: html - - <blockquote> - -Hi Joe, - - Were there any known issues with Block Storage in Frankfurt today? - -Not today, though there were service issues reported for Block Storage -and LKE in Frankfurt on July 8 and 9: - -- `Service Issue - Block Storage - EU-Central - (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__ -- `Service Issue - Linode Kubernetes Engine - - Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__ - -There was also an API issue reported on the 10th (resolved on the 11th), -mentioned here: - -- `Service Issue - Cloud Manager and - API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__ - -Regarding the specific error you were receiving: - - ``Unable to find device path out of attempted paths`` - -I’m not certain it’s specifically related to those Service Issues, -considering this isn’t the first time a customer has reported this error -in their LKE logs. In fact, if I recall correctly, I’ve run across this -before too, since our volumes are RWO and I had too many replicas in my -deployment that I was trying to attach to, for example. - - is this a known bug/condition that occurs with Linode CSI/LKE? - -From what I understand, yes, this is a known condition that crops up -from time to time, which we are tracking. However, since there is a -workaround at the moment (e.g. - “After some more manual attempts to fix -things, scaling down deployments, unmounting at Linode and then scaling -up the deployments seems to have worked and all our services have now -been restored.”), there is no ETA for addressing this. With that said, -I’ve let our Storage team know that you’ve run into this, so as to draw -further attention to it. - -If you have any further questions or concerns regarding this, let us -know. - -Best regards, [Redacted] - -Linode Support Team - -.. raw:: html - - </blockquote> - -.. raw:: html - - </details> - -.. raw:: html - - <details> - -.. raw:: html - - <summary> - -Concluding response from Joe Banks - -.. raw:: html - - </summary> - -.. raw:: html - - <blockquote> - -Hey [Redacted]! - -Thanks for the response. We ensure that stateful pods only ever have one -volume assigned to them, either with a single replica deployment or a -statefulset. It appears that the error generally manifests when a -deployment is being migrated from one node to another during a redeploy, -which makes sense if there is some delay on the unmount/remount. - -Confusion occurred because Linode was reporting the volume as attached -when the node had been recycled, but I assume that was because the node -did not cleanly shutdown and therefore could not cleanly unmount -volumes. - -We’ve not seen any resurgence of such issues, and we’ll address the -software fault which overloaded the node which will helpfully mitigate -such problems in the future. - -Thanks again for the response, have a great week! - -Best, - -Joe - -.. raw:: html - - </blockquote> - -.. 
raw:: html - - </details> - -🔎 Five Why’s ---------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - -**What?** -~~~~~~~~~ - -Several of our services became unavailable because their volumes could -not be mounted. - -Why? -~~~~ - -A node recycle left the node unable to mount volumes using the Linode -CSI. - -.. _why-1: - -Why? -~~~~ - -A node recycle was used because PostgreSQL had a connection surge. - -.. _why-2: - -Why? -~~~~ - -A Django feature deadlocked a table 62 times and suddenly started using -~70 connections to the database, saturating the maximum connections -limit. - -.. _why-3: - -Why? -~~~~ - -The root cause of why Django does this is unclear, and someone with more -Django proficiency is absolutely welcome to share any knowledge they may -have. I presume it’s some sort of worker race condition, but I’ve not -been able to reproduce it. - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrence* - -A node being forcefully restarted left volumes in a limbo state where -mounting was difficult, it took multiple hours for this to be resolved -since we had to wait for the volumes to unlock so they could be cloned. - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - -Volumes are painful. - -We need to look at why Django is doing this and mitigations of the fault -to prevent this from occurring again. - -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - -- ☒ `Follow up on ticket at - Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__ -- ☐ Investigate why Django could be connection surging and locking - tables diff --git a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png b/docs/postmortems/images/2021-01-12/site_cpu_throttle.png Binary files differdeleted file mode 100644 index b530ec6..0000000 --- a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png b/docs/postmortems/images/2021-01-12/site_resource_abnormal.png Binary files differdeleted file mode 100644 index e1e07af..0000000 --- a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png b/docs/postmortems/images/2021-01-30/linode_loadbalancers.png Binary files differdeleted file mode 100644 index f0eae1f..0000000 --- a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-30/memory_charts.png b/docs/postmortems/images/2021-01-30/memory_charts.png Binary files differdeleted file mode 100644 index 370d19e..0000000 --- a/docs/postmortems/images/2021-01-30/memory_charts.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-30/prometheus_status.png b/docs/postmortems/images/2021-01-30/prometheus_status.png Binary files differdeleted file mode 100644 index e95b8d7..0000000 --- a/docs/postmortems/images/2021-01-30/prometheus_status.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-30/scaleios.png b/docs/postmortems/images/2021-01-30/scaleios.png Binary files differdeleted file mode 100644 index 584d74d..0000000 --- a/docs/postmortems/images/2021-01-30/scaleios.png +++ /dev/null diff --git a/docs/postmortems/index.rst b/docs/postmortems/index.rst deleted file mode 100644 index 
e28dc7a..0000000 --- a/docs/postmortems/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -Postmortems -=========== - -Browse the pages under this category to view historical postmortems for -Python Discord outages. - -.. toctree:: - :maxdepth: 1 - - 2020-12-11-all-services-outage - 2020-12-11-postgres-conn-surge - 2021-01-10-primary-kubernetes-node-outage - 2021-01-12-site-cpu-ram-exhaustion - 2021-01-30-nodebalancer-fails-memory - 2021-07-11-cascading-node-failures diff --git a/docs/queries/index.rst b/docs/queries/index.rst deleted file mode 100644 index 76218e4..0000000 --- a/docs/queries/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -Queries -======= - -Get the data you desire with these assorted handcrafted queries. - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - kubernetes - loki - postgres diff --git a/docs/queries/kubernetes.rst b/docs/queries/kubernetes.rst deleted file mode 100644 index f8d8984..0000000 --- a/docs/queries/kubernetes.rst +++ /dev/null @@ -1,29 +0,0 @@ -Kubernetes tips -=============== - -Find top pods by CPU/memory ---------------------------- - -.. code:: bash - - $ kubectl top pods --all-namespaces --sort-by='memory' - $ top pods --all-namespaces --sort-by='cpu' - -Find top nodes by CPU/memory ----------------------------- - -.. code:: bash - - $ kubectl top nodes --sort-by='cpu' - $ kubectl top nodes --sort-by='memory' - -Kubernetes cheat sheet ----------------------- - -`Open Kubernetes cheat -sheet <https://kubernetes.io/docs/reference/kubectl/cheatsheet/>`__ - -Lens IDE --------- - -`OpenLens <https://github.com/MuhammedKalkan/OpenLens>`__ diff --git a/docs/queries/loki.rst b/docs/queries/loki.rst deleted file mode 100644 index 2ee57a3..0000000 --- a/docs/queries/loki.rst +++ /dev/null @@ -1,25 +0,0 @@ -Loki queries -============ - -Find any logs containing “ERROR” --------------------------------- - -.. code:: shell - - {job=~"default/.+"} |= "ERROR" - -Find all logs from bot service ------------------------------- - -.. code:: shell - - {job="default/bot"} - -The format is ``namespace/object`` - -Rate of logs from a service ---------------------------- - -.. code:: shell - - rate(({job="default/bot"} |= "error" != "timeout")[10s]) diff --git a/docs/queries/postgres.rst b/docs/queries/postgres.rst deleted file mode 100644 index 5120145..0000000 --- a/docs/queries/postgres.rst +++ /dev/null @@ -1,336 +0,0 @@ -PostgreSQL queries -================== - -Disk usage ----------- - -Most of these queries vary based on the database you are connected to. - -General Table Size Information Grouped For Partitioned Tables -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: sql - - WITH RECURSIVE pg_inherit(inhrelid, inhparent) AS - (select inhrelid, inhparent - FROM pg_inherits - UNION - SELECT child.inhrelid, parent.inhparent - FROM pg_inherit child, pg_inherits parent - WHERE child.inhparent = parent.inhrelid), - pg_inherit_short AS (SELECT * FROM pg_inherit WHERE inhparent NOT IN (SELECT inhrelid FROM pg_inherit)) - SELECT table_schema - , TABLE_NAME - , row_estimate - , pg_size_pretty(total_bytes) AS total - , pg_size_pretty(index_bytes) AS INDEX - , pg_size_pretty(toast_bytes) AS toast - , pg_size_pretty(table_bytes) AS TABLE - FROM ( - SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes - FROM ( - SELECT c.oid - , nspname AS table_schema - , relname AS TABLE_NAME - , SUM(c.reltuples) OVER (partition BY parent) AS row_estimate - , SUM(pg_total_relation_size(c.oid)) OVER (partition BY parent) AS total_bytes - , SUM(pg_indexes_size(c.oid)) OVER (partition BY parent) AS index_bytes - , SUM(pg_total_relation_size(reltoastrelid)) OVER (partition BY parent) AS toast_bytes - , parent - FROM ( - SELECT pg_class.oid - , reltuples - , relname - , relnamespace - , pg_class.reltoastrelid - , COALESCE(inhparent, pg_class.oid) parent - FROM pg_class - LEFT JOIN pg_inherit_short ON inhrelid = oid - WHERE relkind IN ('r', 'p') - ) c - LEFT JOIN pg_namespace n ON n.oid = c.relnamespace - ) a - WHERE oid = parent - ) a - ORDER BY total_bytes DESC; - -General Table Size Information -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: sql - - SELECT *, pg_size_pretty(total_bytes) AS total - , pg_size_pretty(index_bytes) AS index - , pg_size_pretty(toast_bytes) AS toast - , pg_size_pretty(table_bytes) AS table - FROM ( - SELECT *, total_bytes-index_bytes-coalesce(toast_bytes,0) AS table_bytes FROM ( - SELECT c.oid,nspname AS table_schema, relname AS table_name - , c.reltuples AS row_estimate - , pg_total_relation_size(c.oid) AS total_bytes - , pg_indexes_size(c.oid) AS index_bytes - , pg_total_relation_size(reltoastrelid) AS toast_bytes - FROM pg_class c - LEFT JOIN pg_namespace n ON n.oid = c.relnamespace - WHERE relkind = 'r' - ) a - ) a; - -Finding the largest databases in your cluster -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: sql - - SELECT d.datname as Name, pg_catalog.pg_get_userbyid(d.datdba) as Owner, - CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT') - THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname)) - ELSE 'No Access' - END as Size - FROM pg_catalog.pg_database d - order by - CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT') - THEN pg_catalog.pg_database_size(d.datname) - ELSE NULL - END desc -- nulls first - LIMIT 20; - -Finding the size of your biggest relations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Relations are objects in the database such as tables and indexes, and -this query shows the size of all the individual parts. - -.. code:: sql - - SELECT nspname || '.' || relname AS "relation", - pg_size_pretty(pg_relation_size(C.oid)) AS "size" - FROM pg_class C - LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) - WHERE nspname NOT IN ('pg_catalog', 'information_schema') - ORDER BY pg_relation_size(C.oid) DESC - LIMIT 20; - -Finding the total size of your biggest tables -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: sql - - SELECT nspname || '.' 
|| relname AS "relation", - pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size" - FROM pg_class C - LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) - WHERE nspname NOT IN ('pg_catalog', 'information_schema') - AND C.relkind <> 'i' - AND nspname !~ '^pg_toast' - ORDER BY pg_total_relation_size(C.oid) DESC - LIMIT 20; - -Indexes -------- - -Index summary -~~~~~~~~~~~~~ - -.. code:: sql - - SELECT - pg_class.relname, - pg_size_pretty(pg_class.reltuples::bigint) AS rows_in_bytes, - pg_class.reltuples AS num_rows, - count(indexname) AS number_of_indexes, - CASE WHEN x.is_unique = 1 THEN 'Y' - ELSE 'N' - END AS UNIQUE, - SUM(case WHEN number_of_columns = 1 THEN 1 - ELSE 0 - END) AS single_column, - SUM(case WHEN number_of_columns IS NULL THEN 0 - WHEN number_of_columns = 1 THEN 0 - ELSE 1 - END) AS multi_column - FROM pg_namespace - LEFT OUTER JOIN pg_class ON pg_namespace.oid = pg_class.relnamespace - LEFT OUTER JOIN - (SELECT indrelid, - max(CAST(indisunique AS integer)) AS is_unique - FROM pg_index - GROUP BY indrelid) x - ON pg_class.oid = x.indrelid - LEFT OUTER JOIN - ( SELECT c.relname AS ctablename, ipg.relname AS indexname, x.indnatts AS number_of_columns FROM pg_index x - JOIN pg_class c ON c.oid = x.indrelid - JOIN pg_class ipg ON ipg.oid = x.indexrelid ) - AS foo - ON pg_class.relname = foo.ctablename - WHERE - pg_namespace.nspname='public' - AND pg_class.relkind = 'r' - GROUP BY pg_class.relname, pg_class.reltuples, x.is_unique - ORDER BY 2; - -Index size/usage statistics -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: sql - - SELECT - t.schemaname, - t.tablename, - indexname, - c.reltuples AS num_rows, - pg_size_pretty(pg_relation_size(quote_ident(t.schemaname)::text || '.' || quote_ident(t.tablename)::text)) AS table_size, - pg_size_pretty(pg_relation_size(quote_ident(t.schemaname)::text || '.' || quote_ident(indexrelname)::text)) AS index_size, - CASE WHEN indisunique THEN 'Y' - ELSE 'N' - END AS UNIQUE, - number_of_scans, - tuples_read, - tuples_fetched - FROM pg_tables t - LEFT OUTER JOIN pg_class c ON t.tablename = c.relname - LEFT OUTER JOIN ( - SELECT - c.relname AS ctablename, - ipg.relname AS indexname, - x.indnatts AS number_of_columns, - idx_scan AS number_of_scans, - idx_tup_read AS tuples_read, - idx_tup_fetch AS tuples_fetched, - indexrelname, - indisunique, - schemaname - FROM pg_index x - JOIN pg_class c ON c.oid = x.indrelid - JOIN pg_class ipg ON ipg.oid = x.indexrelid - JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid - ) AS foo ON t.tablename = foo.ctablename AND t.schemaname = foo.schemaname - WHERE t.schemaname NOT IN ('pg_catalog', 'information_schema') - ORDER BY 1,2; - -Duplicate indexes -~~~~~~~~~~~~~~~~~ - -.. code:: sql - - SELECT pg_size_pretty(sum(pg_relation_size(idx))::bigint) as size, - (array_agg(idx))[1] as idx1, (array_agg(idx))[2] as idx2, - (array_agg(idx))[3] as idx3, (array_agg(idx))[4] as idx4 - FROM ( - SELECT indexrelid::regclass as idx, (indrelid::text ||E'\n'|| indclass::text ||E'\n'|| indkey::text ||E'\n'|| - coalesce(indexprs::text,'')||E'\n' || coalesce(indpred::text,'')) as key - FROM pg_index) sub - GROUP BY key HAVING count(*)>1 - ORDER BY sum(pg_relation_size(idx)) DESC; - -Maintenance ------------ - -`PostgreSQL wiki <https://wiki.postgresql.org/wiki/Main_Page>`__ - -CLUSTER-ing -~~~~~~~~~~~ - -`CLUSTER <https://www.postgresql.org/docs/current/sql-cluster.html>`__ - -.. 
code:: sql - - CLUSTER [VERBOSE] table_name [ USING index_name ] - CLUSTER [VERBOSE] - -``CLUSTER`` instructs PostgreSQL to cluster the table specified by -``table_name`` based on the index specified by ``index_name``. The index -must already have been defined on ``table_name``. - -When a table is clustered, it is physically reordered based on the index -information. - -The -`clusterdb <https://www.postgresql.org/docs/current/app-clusterdb.html>`__ -CLI tool is recommended, and can also be used to cluster all tables at -the same time. - -VACUUM-ing -~~~~~~~~~~ - -Proper vacuuming, particularly autovacuum configuration, is crucial to a -fast and reliable database. - -`Introduction to VACUUM, ANALYZE, EXPLAIN, and -COUNT <https://wiki.postgresql.org/wiki/Introduction_to_VACUUM,_ANALYZE,_EXPLAIN,_and_COUNT>`__ - -It is not advised to run ``VACUUM FULL``, instead look at clustering. -VACUUM FULL is a much more intensive task and acquires an ACCESS -EXCLUSIVE lock on the table, blocking reads and writes. Whilst -``CLUSTER`` also does acquire this lock it’s a less intensive and faster -process. - -The -`vacuumdb <https://www.postgresql.org/docs/current/app-vacuumdb.html>`__ -CLI tool is recommended for manual runs. - -Finding number of dead rows -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code:: sql - - SELECT relname, n_dead_tup FROM pg_stat_user_tables WHERE n_dead_tup <> 0 ORDER BY 2 DESC; - -Finding last vacuum/auto-vacuum date -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code:: sql - - SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables; - -Checking auto-vacuum is enabled -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code:: sql - - SELECT name, setting FROM pg_settings WHERE name='autovacuum'; - -View all auto-vacuum setting -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code:: sql - - SELECT * from pg_settings where category like 'Autovacuum'; - -Locks ------ - -Looking at granted locks -~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: sql - - SELECT relation::regclass, * FROM pg_locks WHERE NOT granted; - -Сombination of blocked and blocking activity -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: sql - - SELECT blocked_locks.pid AS blocked_pid, - blocked_activity.usename AS blocked_user, - blocking_locks.pid AS blocking_pid, - blocking_activity.usename AS blocking_user, - blocked_activity.query AS blocked_statement, - blocking_activity.query AS current_statement_in_blocking_process - FROM pg_catalog.pg_locks blocked_locks - JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid - JOIN pg_catalog.pg_locks blocking_locks - ON blocking_locks.locktype = blocked_locks.locktype - AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database - AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation - AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page - AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple - AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid - AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid - AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid - AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid - AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid - AND blocking_locks.pid != blocked_locks.pid - - JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid - WHERE NOT blocked_locks.granted; diff --git a/docs/runbooks/index.rst b/docs/runbooks/index.rst deleted file mode 100644 index 18690c7..0000000 --- a/docs/runbooks/index.rst +++ /dev/null @@ -1,17 +0,0 @@ -Runbooks -======== - -Learn how to do anything in our infrastructure with these guidelines. - -.. note:: - - In general, we try to codify manual processes as much as possible. Still, - this section is important for tasks that are either hard to automate or are - run so infrequently that it does not make sense to regularly run them. - - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - postgresql-upgrade diff --git a/docs/runbooks/postgresql-upgrade.rst b/docs/runbooks/postgresql-upgrade.rst deleted file mode 100644 index 98b1642..0000000 --- a/docs/runbooks/postgresql-upgrade.rst +++ /dev/null @@ -1,149 +0,0 @@ -Upgrading PostgreSQL -==================== - -Step 1 - Enable maintenance mode --------------------------------- - -Add a worker route for ``pythondiscord.com/*`` to forward to the -``maintenance`` Cloudflare worker. - -Step 2 - Scale down all services that use PostgreSQL ----------------------------------------------------- - -Notably site, metricity, bitwarden and the like should be scaled down. - -Services that are read only such as Grafana (but NOT Metabase, Metabase -uses PostgreSQL for internal storage) do not need to be scaled down, as -they do not update the database in any way. - -.. code:: bash - - $ kubectl scale deploy --replicas 0 site metricity metabase bitwarden ... - -Step 3 - Take a database dump and gzip --------------------------------------- - -Using ``pg_dumpall``, dump the contents of all databases to a ``.sql`` -file. - -Make sure to gzip for faster transfer. - -Take a SHA512 sum of the output ``.sql.gz`` file to validate integrity -after copying. - -.. 
code:: bash - - $ pg_dumpall -U pythondiscord > backup.sql - $ gzip backup.sql - $ sha512sum backup.sql - a3337bfc65a072fd93124233ac1cefcdfbe8a708e5c1d08adaca2cf8c7cbe9ae4853ffab8c5cfbe943182355eaa701012111a420b29cc4f74d1e87f9df3af459 backup.sql - -Step 4 - Move database dump locally ------------------------------------ - -Use ``kubectl cp`` to move the ``backup.sql.gz`` file from the remote -pod to your local machine. - -Validate the integrity of the received file. - -Step 5 - Attempt local import to new PostgreSQL version -------------------------------------------------------- - -Install the new version of PostgreSQL locally and import the data. Make -sure you are operating on a **completely empty database server.** - -.. code:: bash - - $ gzcat backup.sql.gz | psql -U joe - -You can use any PostgreSQL superuser for the import. Ensure that no -errors other than those mentioned below occur, you may need to attempt -multiple times to fix errors listed below. - -Handle import errors -~~~~~~~~~~~~~~~~~~~~ - -Monitor the output of ``psql`` to check that no errors appear. - -If you receive locale errors ensure that the locale your database is -configured with matches the import script, this may require some usage -of ``sed``: - -.. code:: bash - - $ sed -i '' "s/en_US.utf8/en_GB.UTF-8/g" backup.sql - -Ensure that you **RESET THESE CHANGES** before attempting an import on -the remote, if they come from the PostgreSQL Docker image they will need -the same locale as the export. - -Step 7 - Spin down PostgreSQL ------------------------------ - -Spin down PostgreSQL to 0 replicas. - -Step 8 - Take volume backup at Linode -------------------------------------- - -Backup the volume at Linode through a clone in the Linode UI, name it -something obvious. - -Step 9 - Remove the Linode persistent volume --------------------------------------------- - -Delete the volume specified in the ``volume.yaml`` file in the -``postgresql`` directory, you must delete the ``pvc`` first followed by -the ``pv``, you can find the relevant disks through -``kubectl get pv/pvc`` - -Step 10 - Create a new volume by re-applying the ``volume.yaml`` file ---------------------------------------------------------------------- - -Apply the ``volume.yaml`` so a new, empty, volume is created. - -Step 11 - Bump the PostgreSQL version in the ``deployment.yaml`` file ---------------------------------------------------------------------- - -Update the Docker image used in the deployment manifest. - -Step 12 - Apply the deployment ------------------------------- - -Run ``kubectl apply -f postgresql/deployment.yaml`` to start the new -database server. - -Step 13 - Copy the data across ------------------------------- - -After the pod has initialised use ``kubectl cp`` to copy the gzipped -backup to the new Postgres pod. - -Step 14 - Extract and import the new data ------------------------------------------ - -.. code:: bash - - $ gunzip backup.sql.gz - $ psql -U pythondiscord -f backup.sql - -Step 15 - Validate data import complete ---------------------------------------- - -Ensure that all logs are successful, you may get duplicate errors for -the ``pythondiscord`` user and database, these are safe to ignore. - -Step 16 - Scale up services ---------------------------- - -Restart the database server - -.. 
code:: bash - - $ kubectl scale deploy --replicas 1 metricity bitwarden metabase - -Step 17 - Validate all services interact correctly --------------------------------------------------- - -Validate that all services reconnect successfully and start exchanging -data, ensure that no abnormal logs are outputted and performance remains -as expected. diff --git a/docs/tooling/bots.rst b/docs/tooling/bots.rst deleted file mode 100644 index 7b5e165..0000000 --- a/docs/tooling/bots.rst +++ /dev/null @@ -1,55 +0,0 @@ -Bots -==== - -Our GitHub repositories are supported by two custom bots: - -- Our **Fast Forward Bot**, which ensures that commits merged into main - are either merged manually on the command line or via a fast-forward, - ensuring that cryptographic signatures of commits remain intact. - Information on the bot can be found `in the ff-bot.yml - configuration <https://github.com/python-discord/infra/blob/main/.github/ff-bot.yml>`__. - Merges over the GitHub UI are discouraged for this reason. You can - use it by running ``/merge`` on a pull request. Note that attempting - to use it without permission to do so will be reported. - -- Our **Craig Dazey Emulator Bot**, which ensures team morale stays - high at all times by thanking team members for submitted pull - requests. [1]_ - -Furthermore, our repositories all have dependabot configured on them. - -Dealing with notifications --------------------------- - -This section collects some of our team members’ ways of dealing with the -notifications that originate from our bots. - -Sieve (RFC 5228) script -~~~~~~~~~~~~~~~~~~~~~~~ - -If your mail server supports the `Sieve mail filtering -language <https://datatracker.ietf.org/doc/html/rfc5228.html>`__, which -it should, you can adapt the following script to customize the amount of -notifications you receive: - -.. code:: sieve - - require ["envelope", "fileinto", "imap4flags"]; - - if allof (header :is "X-GitHub-Sender" ["coveralls", "github-actions[bot]", "netlify[bot]"], - address :is "from" "[email protected]") { - setflag "\\seen"; - fileinto "Trash"; - stop; - } - -If you also want to filter out notifications from renovate, which we use -for dependency updates, you can add ``renovate[bot]`` to the -``X-GitHub-Sender`` list above. - -.. [1] - Craig Dazey Emulator Bot stands in no affiliation, direct or - indirect, with Craig Dazey. Craig Dazey Emulator Bot. Craig Dazey - Emulator Bot is not endorsed by Craig Dazey. Craig Dazey Emulator Bot - is an independent project of Craig Dazey. No association is made - between Craig Dazey Emulator Bot and Craig Dazey. diff --git a/docs/tooling/index.rst b/docs/tooling/index.rst deleted file mode 100644 index 2381849..0000000 --- a/docs/tooling/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -Tooling -======= - -Learn about the helperlings that keep Python Discord DevOps running like a -well-oiled machine. - - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - bots |