| author | 2024-08-07 18:41:02 +0100 | |
|---|---|---|
| committer | 2024-08-07 18:41:02 +0100 | |
| commit | dcbb78959177537cf1fdda813380996a4b2daf8f (patch) | |
| tree | 0a53ded19896aaddf93cc8f1e4ff34ac3f70464e /docs | |
| parent | Revert "Enable fail2ban jails for postfix" (diff) | |
Remove old documentation
Diffstat (limited to 'docs')
59 files changed, 0 insertions, 3003 deletions
| diff --git a/docs/Makefile b/docs/Makefile deleted file mode 100644 index d4bb2cb..0000000 --- a/docs/Makefile +++ /dev/null @@ -1,20 +0,0 @@ -# Minimal makefile for Sphinx documentation -# - -# You can set these variables from the command line, and also -# from the environment for the first two. -SPHINXOPTS    ?= -SPHINXBUILD   ?= sphinx-build -SOURCEDIR     = . -BUILDDIR      = _build - -# Put it first so that "make" without argument is like "make help". -help: -	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) - -.PHONY: help Makefile - -# Catch-all target: route all unknown targets to Sphinx using the new -# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS). -%: Makefile -	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/_static/.gitkeep b/docs/_static/.gitkeep deleted file mode 100644 index e69de29..0000000 --- a/docs/_static/.gitkeep +++ /dev/null diff --git a/docs/_static/logo.png b/docs/_static/logo.pngBinary files differ deleted file mode 100644 index 1c125c7..0000000 --- a/docs/_static/logo.png +++ /dev/null diff --git a/docs/_templates/.gitkeep b/docs/_templates/.gitkeep deleted file mode 100644 index e69de29..0000000 --- a/docs/_templates/.gitkeep +++ /dev/null diff --git a/docs/conf.py b/docs/conf.py deleted file mode 100644 index d9c0855..0000000 --- a/docs/conf.py +++ /dev/null @@ -1,40 +0,0 @@ -# Configuration file for the Sphinx documentation builder. 
-# -# For the full list of built-in configuration values, see the documentation: -# https://www.sphinx-doc.org/en/master/usage/configuration.html - -# -- Project information ----------------------------------------------------- -# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information - -project = "DevOps" -copyright = "2024, Python Discord" -author = "Joe Banks <[email protected]>, King Arthur <[email protected]>" - -# -- General configuration --------------------------------------------------- -# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration - -extensions = [] - -templates_path = ["_templates"] -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"] - - -# -- Options for HTML output ------------------------------------------------- -# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output - -html_theme = "alabaster" -html_static_path = ["_static"] -html_theme_options = { -    "logo": "logo.png", -    "logo_name": True, -    "logo_text_align": "center", -    "github_user": "python-discord", -    "github_repo": "infra", -    "github_button": True, -    "extra_nav_links": { -        "DevOps on YouTube": "https://www.youtube.com/watch?v=b2F-DItXtZs", -        "git: Infra": "https://github.com/python-discord/infra/", -        "git: King Arthur": "https://github.com/python-discord/king-arthur/", -        "Kanban Board": "https://github.com/orgs/python-discord/projects/17/views/4", -    }, -} diff --git a/docs/general/index.rst b/docs/general/index.rst deleted file mode 100644 index 60a04cb..0000000 --- a/docs/general/index.rst +++ /dev/null @@ -1,9 +0,0 @@ -General -======= - - -.. 
toctree:: -   :maxdepth: 2 -   :caption: Contents: - -   manual-deploys diff --git a/docs/general/manual-deploys.rst b/docs/general/manual-deploys.rst deleted file mode 100644 index 0d874ea..0000000 --- a/docs/general/manual-deploys.rst +++ /dev/null @@ -1,27 +0,0 @@ -Manual Deployments -================== - -When the DevOps team are not available, Administrators and Core -Developers can redeploy our critical services, such as Bot, Site and -ModMail. - -This is handled through workflow dispatches on this repository. To get -started, head to the -`Actions <https://github.com/python-discord/kubernetes/actions>`__ tab -of this repository and select ``Manual Redeploy`` in the sidebar, -alternatively navigate -`here <https://github.com/python-discord/kubernetes/actions/workflows/manual_redeploy.yml>`__. - -.. image:: https://user-images.githubusercontent.com/20439493/116442084-00d5f400-a84a-11eb-8e8a-e9e6bcc327dd.png - -Click ``Run workflow`` on the right hand side and enter the service name -that needs redeploying, keep the branch as ``main``: - -.. image:: https://user-images.githubusercontent.com/20439493/116442202-22cf7680-a84a-11eb-8cce-a3e715a1bf68.png - -Click ``Run`` and refresh the page, you’ll see a new in progress Action -which you can track. Once the deployment completes notifications will be -sent to the ``#dev-ops`` channel on Discord. - -If you encounter errors with this please copy the Action run link to -Discord so the DevOps team can investigate when available. diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100644 index 348575d..0000000 --- a/docs/index.rst +++ /dev/null @@ -1,50 +0,0 @@ -.. Python Discord DevOps documentation master file, created by -   sphinx-quickstart on Wed Jul 24 19:49:56 2024. -   You can adapt this file completely to your liking, but it should at least -   contain the root `toctree` directive. - -Python Discord DevOps -===================== - -Welcome to the Python Discord DevOps knowledgebase. 
- -Within this set of pages you will find: - -- Changelogs - -- Post-mortems - -- Common queries - -- Runbooks - - -Table of contents ------------------ - -.. toctree:: -   :maxdepth: 2 - -   general/index -   onboarding/index -   postmortems/index -   queries/index -   runbooks/index -   tooling/index - - -Meeting notes -------------- - -.. toctree:: -   :maxdepth: 2 - -   meeting_notes/index - - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/docs/make.bat b/docs/make.bat deleted file mode 100644 index 954237b..0000000 --- a/docs/make.bat +++ /dev/null @@ -1,35 +0,0 @@ -@ECHO OFF - -pushd %~dp0 - -REM Command file for Sphinx documentation - -if "%SPHINXBUILD%" == "" ( -	set SPHINXBUILD=sphinx-build -) -set SOURCEDIR=. -set BUILDDIR=_build - -%SPHINXBUILD% >NUL 2>NUL -if errorlevel 9009 ( -	echo. -	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx -	echo.installed, then set the SPHINXBUILD environment variable to point -	echo.to the full path of the 'sphinx-build' executable. Alternatively you -	echo.may add the Sphinx directory to PATH. -	echo. -	echo.If you don't have Sphinx installed, grab it from -	echo.https://www.sphinx-doc.org/ -	exit /b 1 -) - -if "%1" == "" goto help - -%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% -goto end - -:help -%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% - -:end -popd diff --git a/docs/meeting_notes/2022-04-07.rst b/docs/meeting_notes/2022-04-07.rst deleted file mode 100644 index ee23a5d..0000000 --- a/docs/meeting_notes/2022-04-07.rst +++ /dev/null @@ -1,20 +0,0 @@ -2022-04-07 -========== - -Agenda ------- - --  No updates, as last week’s meeting did not take place - -Roadmap review & planning -------------------------- - -What are we working on for the next meeting? 
- --  Help wanted for #57 (h-asgi) --  #58 (postgres exporter) needs a new review --  #54 (firewall in VPN) will be done by Johannes --  We need a testing environment #67 --  Johannes will add a Graphite role #31 --  Sofi will take a look at #29 --  #41 (policy bot) will be taken care of by Johannes diff --git a/docs/meeting_notes/2022-09-18.rst b/docs/meeting_notes/2022-09-18.rst deleted file mode 100644 index 163434c..0000000 --- a/docs/meeting_notes/2022-09-18.rst +++ /dev/null @@ -1,74 +0,0 @@ -2022-09-18 -========== - -*Migrated from Notion*. - -Agenda ------- - --  Joe will grant Chris access to the netcup hosts. - -NetKube status -~~~~~~~~~~~~~~ - --  **Rollout** - -   -  ☒ RBAC configuration and access granting -   -  ☒ Most nodes are enrolled, Joe will re-check -   -  ``turing``, ``ritchie``, ``lovelace`` and ``neumann`` will be -      Kubernetes nodes -   -  ``hopper`` will be the storage server - --  **Storage drivers** - -   -  Not needed, everything that needs persistent storage will run on -      hopper -   -  Netcup does not support storage resize -   -  We can download more RAM if we need it -   -  A couple of services still need volume mounts: Ghost, Grafana & -      Graphite - --  **Control plane high availability** - -   -  Joe mentions that in the case the control plane dies, everything -      else will die as well -   -  If the control plane in Germany dies, so will Johannes - --  **Early plans for migration** - -   -  We can use the Ansible repository issues for a good schedule -   -  Hopper runs ``nginx`` -   -  Statement from Joe: > “There is an nginx ingress running on every -      node in the cluster, okay, > okay? We don’t, the way that’s, -      that’s as a service is a NodePort, right? > So it has a normal IP, -      but the port will be like a random port in the range > of the -      30,000s. Remember that? Hold on. Is he writing rude nodes? 
And -      then… > We have nginx, so this is where it’s like a little bit, -      like, not nice, I > guess we just like, cronjob it, to pull the -      nodes, like, every minute or > so, and then update the config if -      they change. But then it’s just like… > nginx is like a catalogue -      of nodes. Wahhh, you drive me crazy.” - -      -  “Nah, it makes sense!” - -         -  “It does!” - -      -  Joe will figure this out with assistance from his voices. - -Open authentication -~~~~~~~~~~~~~~~~~~~ - --  Joe and Johannes will check out OpenLDAP as a JumpCloud alternative -   starting from this evening --  Sofi has experience with OpenLDAP - -Sponsorship ------------ - -This meeting has been sponsored by Chris Hemsworth Lovering’s -relationship therapy company, “Love To Love By Lovering”. You can sign -up by sending a mail to [email protected]. - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-10-05.rst b/docs/meeting_notes/2022-10-05.rst deleted file mode 100644 index e069299..0000000 --- a/docs/meeting_notes/2022-10-05.rst +++ /dev/null @@ -1,13 +0,0 @@ -2022-10-05 -========== - -*Migrated from Notion*. - -Agenda ------- - --  Joe Banks configured proper RBAC for Chris, Johannes and Joe himself - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-10-19.rst b/docs/meeting_notes/2022-10-19.rst deleted file mode 100644 index 6de7f33..0000000 --- a/docs/meeting_notes/2022-10-19.rst +++ /dev/null @@ -1,31 +0,0 @@ -2022-10-19 -========== - -*Migrated from Notion*. - -Agenda ------- - --  One hour of gartic phone, for team spirit. --  Created user accounts for Sofi and Hassan --  Joe created an architecture diagram of the NGINX setup - -   -  *This is still in Notion* - --  Joe explained his NGINX plans: > “It’s not actually that hard, right? -   So you spawn 5 instances of nginx in a > DaemonSet, because then one -   gets deployed to every node okay, following? 
> Then we get NodePort, -   instead of LoadBalancers or whatever, which will get > a random port -   allocatead in the 35000 range, and that will go to nginx, and > on -   each of those ports, it will go to nginx, right? And then we poll the -   > Kubernetes API and what is the port that each of these nginx -   instances is > running on, and add that into a roundrobin on the -   fifth node. Right? Yeah. > That’s correct. That won’t do TLS though, -   so that will just HAProxy. Yeah.” --  Joe will terminate our JumpCloud account --  Chris reset the Minecraft server --  Email alerting needs to be configured - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-10-26.rst b/docs/meeting_notes/2022-10-26.rst deleted file mode 100644 index 69f8c70..0000000 --- a/docs/meeting_notes/2022-10-26.rst +++ /dev/null @@ -1,18 +0,0 @@ -2022-10-26 -========== - -*Migrated from Notion*. - -Agenda ------- - --  Chris upgraded PostgreSQL to 15 in production --  Johannes added the Kubernetes user creation script into the -   Kubernetes repository in the docs - -*(The rest of the meeting was discussion about the NetKube setup, which -has been scrapped since)*. - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-11-02.rst b/docs/meeting_notes/2022-11-02.rst deleted file mode 100644 index d9f415d..0000000 --- a/docs/meeting_notes/2022-11-02.rst +++ /dev/null @@ -1,27 +0,0 @@ -2022-11-02 -========== - -*Migrated from Notion*. - -Agenda ------- - -Hanging behaviour of ModMail -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - --  `Source <https://discord.com/channels/267624335836053506/675756741417369640/1036720683067134052>`__ - --  Maybe use `Signals + a -   debugger <https://stackoverflow.com/a/25329467>`__? - --  … using `something like pdb for the -   debugger <https://wiki.python.org/moin/PythonDebuggingTools>`__? 
- --  Or `GDB, as it seems handy to poke at stuck multi-threaded python -   software <https://wiki.python.org/moin/DebuggingWithGdb>`__? - --  ModMail has been upgraded to version 4 - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2022-11-23.rst b/docs/meeting_notes/2022-11-23.rst deleted file mode 100644 index 19edd06..0000000 --- a/docs/meeting_notes/2022-11-23.rst +++ /dev/null @@ -1,30 +0,0 @@ -2022-11-23 -========== - -*Migrated from Notion*. - -Agenda ------- - -*(This meeting was mostly about NetKube, with the following strange text -included, and everything outside of the text has been removed since the -NetKube plans have been scrapped)*. - -Joe Banks, after a month-long hiatus to become a dad to every second -girl on uni campus, has managed to pull up to the DevOps meeting. - -We are considering using Kubespray (https://kubespray.io/#/) in order to -deploy a production-ready bare-metal Kubernetes cluster without -involvement from Joe “Busy With Poly Girlfriend #20” Banks. - -At the moment cluster networking is not working and Joe mentions that -the last time he has touched it, it worked perfectly fine. However, the -last time he touched it there was only 1 node, and therefore no -inter-node communications. - -Joe thinks he remembers installing 3 nodes, however, we at the DevOps -team believe this to be a marijuana dream - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-02-08.rst b/docs/meeting_notes/2023-02-08.rst deleted file mode 100644 index a161ba5..0000000 --- a/docs/meeting_notes/2023-02-08.rst +++ /dev/null @@ -1,17 +0,0 @@ -2023-02-08 -========== - -*Migrated from Notion*. - -Agenda ------- - --  Investigation into deploying a VPN tool such as WireGuard to have -   inter-node communication between the Netcup hosts. - -*(The rest of this meeting was mostly about NetKube, which has since -been scrapped)*. - -.. 
raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-02-21.rst b/docs/meeting_notes/2023-02-21.rst deleted file mode 100644 index 9de644c..0000000 --- a/docs/meeting_notes/2023-02-21.rst +++ /dev/null @@ -1,31 +0,0 @@ -2023-02-21 -========== - -*Migrated from Notion*. - -Agenda ------- - -Reusable status embed workflows -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - --  Further discussion with Bella followed --  Upstream pull request can be found at -   `python-discord/bot#2400 <https://github.com/python-discord/bot/pull/2400>`__ - -Local vagrant testing setup -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - --  Our new `testing setup using Vagrant -   VMs <https://github.com/python-discord/infra/pull/78>`__ has been -   merged. - -A visit from Mina -~~~~~~~~~~~~~~~~~ - -Mina checked in to make sure we’re operating at peak Volkswagen-like -efficiency. - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-02-28.rst b/docs/meeting_notes/2023-02-28.rst deleted file mode 100644 index 1fb1093..0000000 --- a/docs/meeting_notes/2023-02-28.rst +++ /dev/null @@ -1,16 +0,0 @@ -2023-02-28 -========== - -*Migrated from Notion*. - -Agenda ------- - --  Black knight’s CI & dependabot configuration has been mirrored across -   all important repositories - --  The test server has been updated for the new configuration - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-05-16.rst b/docs/meeting_notes/2023-05-16.rst deleted file mode 100644 index 79272a6..0000000 --- a/docs/meeting_notes/2023-05-16.rst +++ /dev/null @@ -1,15 +0,0 @@ -2023-05-16 -========== - -*Migrated from Notion*. - -Agenda ------- - --  Bella set up `CI bot docker image -   build <https://github.com/python-discord/bot/pull/2603>`__ to make -   sure that wheels are available. - -.. 
raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-07-11.rst b/docs/meeting_notes/2023-07-11.rst deleted file mode 100644 index 68b1085..0000000 --- a/docs/meeting_notes/2023-07-11.rst +++ /dev/null @@ -1,41 +0,0 @@ -2023-07-11 -========== - -Participants ------------- - --  Chris, Johannes, Bella, Bradley - -Agenda ------- - -New Ansible setup -~~~~~~~~~~~~~~~~~ - -Chris presented the new Ansible setup he’s been working on. We plan to -use WireGuard for networking. We agreed that selfhosting Kubernetes is -not the way to go. In general, the main benefit from switching away to -Linode to Netcup is going to be a ton more resources from the Netcup -root servers we were given. The original issue with Linode’s AKS of -constantly having problems with volumes has not been present for a -while. Chris mentions the one remaining issue is that we’re at half our -memory capacity just at idle. - -It’s our decision where to go from here - we can stick to the Kubernetes -setup or decide on migrating to the Ansible setup. But we have bare -metal access to the Netcup hosts, which makes e.g. managing databases a -lot easier. Chris mentions the possibility to only use Netcup for our -persistence and Linode AKS for anything else, but this has the issue of -us relying on two sponsors for our infrastructure instead of one. - -PostgreSQL was set up to run on ``lovelace``. - -Decision -~~~~~~~~ - -**It was decided to hold a vote on the core development channel, which -will be evaluated next week to see how to proceed with the setup**. - -.. 
raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-07-18.rst b/docs/meeting_notes/2023-07-18.rst deleted file mode 100644 index f37b2dc..0000000 --- a/docs/meeting_notes/2023-07-18.rst +++ /dev/null @@ -1,42 +0,0 @@ -2023-07-18 -========== - -Secret management improvements ------------------------------- - -To allow for **better management of our Kubernetes secrets**, Chris set -out to configure ``git-crypt`` in GPG key mode. For comparison, the -previous approach was that secrets were stored in Kubernetes only and -had to be accessed via ``kubectl``, and now ``git-crypt`` allows us to -transparently work with the files in unencrypted manner locally, whilst -having them secure on the remote, all via ``.gitattributes``. - -The following people currently have access to this: - --  Johannes Christ [email protected] -   (``8C05D0E98B7914EDEBDCC8CC8E8E09282F2E17AF``) --  Chris Lovering [email protected] -   (``1DA91E6CE87E3C1FCE32BC0CB6ED85CC5872D5E4``) --  Joe Banks [email protected] (``509CDFFC2D0783A33CF87D2B703EE21DE4D4D9C9``) - -For Hassan, we are still waiting on response regarding his GPG key -accuracy. - -The pull request for the work can be found `at -python-discord/kubernetes#156 <https://github.com/python-discord/kubernetes/pull/156>`__. - -**To have your key added, please contact any of the existing key -holders**. More documentation on this topic is pending to be written, -see -`python-discord/kubernetes#157 <https://github.com/python-discord/kubernetes/issues/157>`__. - -Infrastructure migration decision ---------------------------------- - -The voting started `last week <./2023-07-11.md>`__ will be properly -talked about `next week <./2023-07-25.md>`__, so far it looks like we’re -definitely not selfhosting Kubernetes at the very least. - -.. 
raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-07-25.rst b/docs/meeting_notes/2023-07-25.rst deleted file mode 100644 index 0a3204c..0000000 --- a/docs/meeting_notes/2023-07-25.rst +++ /dev/null @@ -1,4 +0,0 @@ -2023-07-25 -========== - -Postponed to next week due to Joe having a severe bellyache. diff --git a/docs/meeting_notes/2023-08-01.rst b/docs/meeting_notes/2023-08-01.rst deleted file mode 100644 index 67e4ee1..0000000 --- a/docs/meeting_notes/2023-08-01.rst +++ /dev/null @@ -1,66 +0,0 @@ -2023-08-01 -========== - -Agenda ------- - -Infrastructure migration -~~~~~~~~~~~~~~~~~~~~~~~~ - -The vote is tied. Chris and Johannes decided that we should test out -migrating the PostgreSQL database at the very least. We then have more -freedom about our data. What we need to do: - --  Allow PostgreSQL connections from LKE’s static IPs in the firewall --  Whitelist the static IPs from Linode via ``pg_hba.conf`` --  Schedule downtime for the PostgreSQL database --  **At downtime** - -   -  Take writers offline -   -  Dump database from Linode into Netcup -   -  Update all the client’s database URLs to point to netcup -   -  Restart writers - -We want to rely on the restore to create everything properly, but will -need to test run this beforehand. The following ``pg_virtualenv`` -command has showcased that it works properly: - -.. code:: sh - -   kubectl exec -it postgres-... -- pg_dumpall -U pythondiscord \ -   | pg_virtualenv psql -v ON_ERROR_STOP=1 - -Note however that the database extension ``pg_repack`` needs to be -installed. - -Before we can get started, we need to allow the PostgreSQL role to -configure ``pg_hba.conf`` and ``postgresql.conf`` entries. - -Meeting notes -~~~~~~~~~~~~~ - -We’re using GitHub at the moment. Some are left in Notion. 
We should -migrate these to GitHub to have a uniform interface: Johannes will pick -up -`python-discord/infra#108 <https://github.com/python-discord/infra/issues/108>`__ -to merge them together into Git, as its more open than Notion. - -Ansible lint failures in the infra repository -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Excluding the vault was found as the working solution here, as -implemented by Chris. - -Kubernetes repository pull requests -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -These were cleaned up thanks to Chris. - -Roadmap review & planning -------------------------- - --  Chris will prepare the PostgreSQL configuration mentioned above. - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-08-08.rst b/docs/meeting_notes/2023-08-08.rst deleted file mode 100644 index 0082cd3..0000000 --- a/docs/meeting_notes/2023-08-08.rst +++ /dev/null @@ -1,54 +0,0 @@ -2023-08-08 -========== - -Agenda ------- - --  Configuration of PostgreSQL and the PostgreSQL exporter - -   -  **No time so far**. Chris has been busy with renovating his living -      room, and Johannes has been busy with renovating his bedroom. -      Bradley prefers to remain quiet. - -   -  Chris will try to work on this in the coming week and will try to -      have Bella around as well, since he wanted to join the setup. - --  **Potential slot for GPG key signing of DevOps members**. External -   verification will be necessary. - -   -  Skipped. No webcam on Chris. - --  We need to assign a **librarian** to keep our documents organized -   according to a system. Johannes is happy to do this for now. - -   -  Let’s move the existing documentation from the Kubernetes -      repository into the infra repository. See -      `kubernetes#161 <https://github.com/python-discord/kubernetes/issues/161>`__. - -   -  **Our Notion DevOps space is full of junk**. 
Outside of that, it’s -      not open to read for outside contributors, and does not leave much -      choice over which client to use for editing content. - -      -  Chris agrees, without looking on it - just from memory. We -         should move it to the infra repository. (The meeting notes have -         already been transferred). - -   -  Bella suggests to add some automation to make keeping everything -      in clean order less tedious. - --  We may want to integrate the **Kubernetes repository** and the infra -   repository together altogether, however there are a lot of -   repositories referencing the deployment manifests that would need to -   be updated. - -   -  Chris mentions that regardless of what we do, we should - at the -      very least move all documentation into the ``infra`` repository, -      including the static site generator. At the moment we’re using -      Jekyll but we’re open to trying alternatives such as Hugo. - --  We closed some issues and pull requests in the repositories for late -   spring cleaning. - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2 autoindent conceallevel=2: --> diff --git a/docs/meeting_notes/2023-08-22.rst b/docs/meeting_notes/2023-08-22.rst deleted file mode 100644 index a8d1287..0000000 --- a/docs/meeting_notes/2023-08-22.rst +++ /dev/null @@ -1,40 +0,0 @@ -2023-08-22 -========== - -.. raw:: html - -   <!-- - -   Useful links - -   - Infra open issues: https://github.com/python-discord/infra/issues - -   - infra open pull requests: https://github.com/python-discord/infra/pulls - -   - *If* any open issue or pull request needs discussion, why was the existing -     asynchronous logged communication over GitHub insufficient? - -   --> - -Agenda ------- - --  Bella said he is on the streets. **We should start a gofundme**. - -   -  After some more conversation this just means he is on vacation and -      currently taking a walk. 
- --  Chris has been busy with turning his living room into a picasso art -   collection, Johannes has been busy with renovating his bedroom, and -   Bella is not home. - -   -  Our next priority is winning. - --  We checked out some issues with documentation generation in -   ``bot-core`` that Bella has mentioned. We managed to fix one issue -   with pydantic by adding it to an exclude list but ran into another -   problem next. - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-08-29.rst b/docs/meeting_notes/2023-08-29.rst deleted file mode 100644 index da49c1e..0000000 --- a/docs/meeting_notes/2023-08-29.rst +++ /dev/null @@ -1,65 +0,0 @@ -2023-08-29 -========== - -.. raw:: html - -   <!-- - -   Useful links - -   - Infra open issues: https://github.com/python-discord/infra/issues - -   - infra open pull requests: https://github.com/python-discord/infra/pulls - -   - *If* any open issue or pull request needs discussion, why was the existing -     asynchronous logged communication over GitHub insufficient? - -   --> - -Agenda ------- - --  **Bella is still on the streets** - -   -  The Python Discord Bella On The Streets Fundraising Campaign Q3 -      2023 has not been successful so far. To help Bella receive French -      citizenship, Joe has put up a French flag behind himself in the -      meeting. - -      -  Joe corrects my sarcasm. It is an Italian flag, not a French -         flag. The reason for this flag is that his new prime interest -         on campus was born in Italy. - --  **The SnekBox CI build is pretty slow** - -   -  Guix and Nix are not alternatives. 
Neither is Ubuntu - -   -  We use pyenv to build multiple Python versions for a new feature - -      -  The feature is not rolled out yet - -   -  Part of the problem is that we build twice in the ``build`` and -      the ``deploy`` stage - -   -  On rollout, Joe tested it and it works fine - --  No update on the Hugo build yet - --  For snowflake, Johannes will write a proposal to the admins for -   hosting it - -   -  We should consider talking about the following points: - -      -  statistically ~8% of Tor traffic is problematic (10% of traffic -         is to hidden services, 80% of hidden service traffic is for -         illegal services) - -      -  overall the project’s position and our ideal is to help people -         for a good cause - -      -  all traffic is forwarded to the Tor network, the service is -         lightweight and only proxies encrypted traffic there - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-09-05.rst b/docs/meeting_notes/2023-09-05.rst deleted file mode 100644 index 7556ab6..0000000 --- a/docs/meeting_notes/2023-09-05.rst +++ /dev/null @@ -1,53 +0,0 @@ -2023-09-05 -========== - -.. raw:: html - -   <!-- - -   Useful links - -   - Infra open issues: https://github.com/python-discord/infra/issues - -   - infra open pull requests: https://github.com/python-discord/infra/pulls - -   - *If* any open issue or pull request needs discussion, why was the existing -     asynchronous logged communication over GitHub insufficient? - -   --> - -Agenda ------- - --  No update on the Hugo build yet - --  Johannes wrote a proposal for snowflake proxy to be deployed to our -   netcup hosts - -   -  Admins discussed and came to the conclusion that since we don’t -      own the servers, we got the servers from netcup as a sponsorship -      to host our infra, so using them to host something that isn’t our -      infra doesn’t seem right. 
- --  Lots of dependabot PRs closed - -   -  https://github.com/search?q=org%3Apython-discord++is%3Apr+is%3Aopen+label%3A%22area%3A+dependencies%22&type=pullrequests&ref=advsearch -   -  Closed ~50% of PRs - --  Workers repo has had its CI rewritten, all workers have consistent -   package.json, scripts, and using the new style of cloudflare workers -   which don’t use webpack - --  Metricity updated to SQLAlchemy 2 - --  Olli CI PR is up - -   -  https://github.com/python-discord/olli/pull/25 - --  Sir-Robin pydantic constants PR is up - -   -  https://github.com/python-discord/sir-robin/pull/93 - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2023-09-12.rst b/docs/meeting_notes/2023-09-12.rst deleted file mode 100644 index 6dbb7c8..0000000 --- a/docs/meeting_notes/2023-09-12.rst +++ /dev/null @@ -1,73 +0,0 @@ -2023-09-12 -========== - -.. raw:: html - -   <!-- - -   Useful links - -   - Infra open issues: https://github.com/python-discord/infra/issues - -   - infra open pull requests: https://github.com/python-discord/infra/pulls - -   - *If* any open issue or pull request needs discussion, why was the existing -     asynchronous logged communication over GitHub insufficient? - -   --> - -Agenda ------- - --  We have reason to believe that Bella is still on the streets. Worse, -   Bella is not available at the moment, leading us to believe that -   Bella has still not found a home. - -   -  Eight minutes into the meeting, Bella joins, complaining about the -      bad internet. He mentions he is still on the streets (this may -      contribute to the bad internet factor). - --  Chris made Mina leave with his repeated comments about Bella being -   homeless, reminding Mina of the growing unemployment rate within the -   DevOps team. As head of HR she cannot further support this matter. 
- --  About #139, Bella mentions that online websites may cover the same -   need that we have, but it may not be really useful for having it as a -   command. - -   -  Chris adds that “if someone wants to do it, I don’t mind” and “I -      don’t think it would be very useful for a command, but I think it -      would be fun to learn for someone implementing it”. As long as -      whoever is implementing is is aware that it would not be used too -      much, it would be fine. - --  No progress on the hugo front - --  Our email service with workers will be forward only - -   -  With postfix you will be able to reply. Joe wants to have an -      excuse to play with Cloudflare workers though. - --  `50 open pull requests from -   dependabot <https://github.com/search?q=org%3Apython-discord++is%3Apr+is%3Aopen+author%3Aapp%2Fdependabot&type=pullrequests&ref=advsearch>`__ - -   -  Tip from The Man: press ^D to make a bookmark in your browser - -   -  “Those can just be blindly merged” - Chris - --  Grouping of dependencies: Dependabot now allows you to group together -   multiple dependency updates into a single pull request. - -   -  Possible approaches suggested: Group all the docker updates -      together, group any linting dependencies together (would just -      require a big RegEx). Dependabot natively works with its own -      dependency groups here (e.g. Docker, Pip). - --  Mr. Hemlock wants to raise his roof: It’s his project for this -   Autumn. We, the team, are looking forward to his project - especially -   Bella, who is currently looking for housing. “It’s all coming -   together”, said Chris to the situation. - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2024-07-02.rst b/docs/meeting_notes/2024-07-02.rst deleted file mode 100644 index 4d2ba03..0000000 --- a/docs/meeting_notes/2024-07-02.rst +++ /dev/null @@ -1,171 +0,0 @@ -2024-07-02 -========== - -.. 
raw:: html - -   <!-- - -   Useful links - -   - Infra open issues: https://github.com/python-discord/infra/issues - -   - infra open pull requests: https://github.com/python-discord/infra/pulls - -   - *If* any open issue or pull request needs discussion, why was the existing -     asynchronous logged communication over GitHub insufficient? - -   --> - -Attendees ---------- - -Joe and Johannes. - -Chris unfortunately died in a fatal train accident and could not attend -the meeting. This incident will be rectified in the next release, -“Lovering 2.0: Immortability”. - -Bella is out on the streets again. We are waiting for approval from the -Python Discord admins to run another fundraiser. - -Agenda ------- - --  **Configuration of renovate** (Joe) - -   We are replacing dependabot with renovatebot. Johannes welcomes this -   decision. Joe says we are looking for automatic deployment from -   Kubernetes to make sure that any updates are automatically deployed. - -   **Conclusion**: Implemented. - --  **Resizing Netcup servers** (Joe, Johannes) - -   We can probably get rid of turing, assess what else we want to deploy -   on lovelace, and then ask for a resize. - -   **Conclusion**: Create issue to move things off turing, remove it -   from the inventory, remove it from documentation, power it off, then -   have Joe ask for server removal. - --  **Updating the public statistics page** (Johannes) - -   Discussing and showcasing possible alternatives to the current -   infrastructure powering https://stats.pythondiscord.com via the -   https://github.com/python-discord/public-stats repository. Johannes -   presents his current scripts that cuddle RRDTool into loading data -   out of metricity, Joe says we will discuss with Chris what to do -   here. 
- -   The likely way going forward will be that *we will open an issue to -   set it up*, the setup will contain an Ansible role to deploy the -   cronjob and the script onto lovelace alongside with the ``rrdtool`` -   PostgreSQL user. - -   **Conclusion**: Johannes will create an issue and codify the setup in -   Ansible. - --  **New blog powered by Hugo** (Johannes) - -   Our current Ghost-powered blog is a tiny bit strange, and the -   onboarding ramp to contribute articles is large. We want to migrate -   this to Hugo - Johannes is leading the effort on it. The main work -   will be building an appropriate theme, as no nicely suitable -   replacement theme has been found so far. Front-end contributors would -   be nice for this, although currently everything is still local on my -   machine. - -   Joe mentions that we don’t need to take anything particularly similar -   to the current Ghost theme, just some vague resemblance would be -   nice. Most of the recommended Hugo themes would probably work. -   Johannes will check it out further. - -   **Conclusion**: Try the `hugo-casper-two -   theme <https://github.com/eueung/hugo-casper-two>`__ and report back. - --  **Finger server** (Joe, Johannes) - -   Joe recently proposed `the deployment of a finger -   server <https://github.com/python-discord/infra/pull/373>`__. Do we -   want this and if yes, how are we going to proceed with this? If we do -   not want any, running the ``pinky`` command locally or via ``ssh`` -   would be a sound idea. We also need to consider whether members will -   update their files regularly - we may want to incorporate -   functionality for this into e.g. King Arthur. - -   Joe says that we shouldn’t put a lot of development effort into it, -   it would be simply a novelty thing. 
- -   **Conclusion**: This is a nice cheap win for some fun which should -   just be a simple Python file (via Twisted’s Finger protocol support -   or whatever) that connects to LDAP (see Keycloak authentication -   server) and outputs information. We could possibly integrate this -   into King Arthur as well, so the querying workflow could look like KA -   -> fingerd -> LDAP, or people could use finger commands directly. - --  **Keycloak authentication server** (Joe) - -   Joe mentions that we are deploying a Keycloak server because for some -   members authenticating via GitHub is cumbersome, for instance because -   their GitHub account is connected to their employer’s GitHub -   Enterprise installation. We could hook up a finger server to the LDAP -   endpoint. Joe also mentions that we might want to set up e-mail -   forwarding from pydis addresses to users via the user database that -   will be stored in Keycloak. - -   Currently we only have a Keycloak installation that stores items in -   PostgreSQL. This installation can federate to LDAP - we would simply -   have to settle on some directory service backend. Joe suggests -   FreeIPA because he’s familar with it (including the Keycloak -   integration). The problem is that it doesn’t work on Debian. The -   alternative proposal, given that we’re saving ~50$/month on Linode, -   would be spinning up a Rocky VM with FreeIPA on it on Linode (we -   already have the budget) or ask Netcup for another VM. Ultimately, -   the system to run FreeIPA would be something CentOS-based. One aspect -   to consider is networking security: in Linode we could use their -   private cloud endpoint feature to securely expose the LDAP server to -   Keycloak and other services in Kubernetes, if we were to run it in -   Netcup, we would need to use a similar setup to what we currently -   have with PostgreSQL. - -   Any Python Discord user would be managed in LDAP, and Keycloak has -   the necessary roles to write back into LDAP. 
Keeping the users in -   FreeIPA up-to-date would be a somewhat manual procedure. Joe’s plan -   was to pick up the user’s Discord username and use -   ``[email protected]`` as their name and do account setup as part of -   the staff onboarding. - -   **Conclusion**: Will wait for Chris to discuss this further, but we -   simply need to decide where we want to run the LDAP service. - --  **Flux CD** (Joe) - -   Joe proposes deploying `flux <https://fluxcd.io/>`__ as a way to -   improve the way we manage our CI/CD. We want the cluster to be able -   to synchronize its state with the git repository. There are some -   manifests in the repository currently that are not in sync with the -   cluster version. - -   **Conclusion**: Approved, Joe will create an issue and do it. - --  **Polonium** (Chris) - -   Question came up regarding why the bot does not write to the database -   directly. Joe said it’s not perfect to have the bot write to it -   directly - in metricity it works but it’s not perfect. Chris probably -   had good reason: separation of intent. - -   **Conclusion**: Approved, write to R&D for financing. - --  **Rethinking Bella: Suggested measures to gain autonomy** (Chris) - -   Chris will present our current plans to biologically re-think and -   improve Bella’s current architecture by means of -   hypertrophy-supported capillary enlargements, with the final goal of -   gaining complete control and ownership over the World Economic Forum -   by 2026. As Bella is currently on parental leave, we will send him -   the result of this voting via NNCP. - -.. raw:: html - -   <!-- vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/meeting_notes/2024-07-25.rst b/docs/meeting_notes/2024-07-25.rst deleted file mode 100644 index 8d3175c..0000000 --- a/docs/meeting_notes/2024-07-25.rst +++ /dev/null @@ -1,46 +0,0 @@ -2024-07-25 -========== - -.. 
-   Useful links - -   - Infra Kanban board: https://github.com/orgs/python-discord/projects/17/views/4 - -   - Infra open issues: https://github.com/python-discord/infra/issues - -   - infra open pull requests: https://github.com/python-discord/infra/pulls - -   - *If* any open issue or pull request needs discussion, why was the existing -     asynchronous logged communication over GitHub insufficient? - -Attendees ---------- - -Bella, Joe, Fredrick, Chris, Johannes - -Agenda ------- - -- **Open issues and pull requests in Joe's repositories** - -  Joe has plenty of pending changes in his open source repositories on GitHub. -  Together with Chris, he went through these and reviewed them. Most were -  accepted. Fredrick proposed some further changes to the ff-bot merge routine -  which Joe will check out after the meeting. - -- **LDAP** - -  Bella is instructed to enter his street address into LDAP for t-shirt -  shipping. - -- **New documentation** - -  Johannes merged our new documentation. Unfortunately, he forgot to test it -  first. Joe visits it and discovers some problems. Johannes fixes it live. - -- **Turing** - -- **SMTP server** - - -.. vim: set textwidth=80 sw=2 ts=2: diff --git a/docs/meeting_notes/index.rst b/docs/meeting_notes/index.rst deleted file mode 100644 index 4ba97ea..0000000 --- a/docs/meeting_notes/index.rst +++ /dev/null @@ -1,31 +0,0 @@ -Meeting notes -============= - -Minutes for previous Devops meetings. - -.. 
toctree:: -   :maxdepth: 1 -   :caption: Contents: - -   2022-04-07 -   2022-09-18 -   2022-10-05 -   2022-10-19 -   2022-10-26 -   2022-11-02 -   2022-11-23 -   2023-02-08 -   2023-02-21 -   2023-02-28 -   2023-05-16 -   2023-07-11 -   2023-07-18 -   2023-07-25 -   2023-08-01 -   2023-08-08 -   2023-08-22 -   2023-08-29 -   2023-09-05 -   2023-09-12 -   2024-07-02 -   2024-07-25 diff --git a/docs/meeting_notes/template.rst b/docs/meeting_notes/template.rst deleted file mode 100644 index 0ea8a63..0000000 --- a/docs/meeting_notes/template.rst +++ /dev/null @@ -1,22 +0,0 @@ -:orphan:  .. Connor McFarlane - - -DevOps Meeting Notes -==================== - -.. -   Useful links - -   - Infra Kanban board: https://github.com/orgs/python-discord/projects/17/views/4 - -   - Infra open issues: https://github.com/python-discord/infra/issues - -   - infra open pull requests: https://github.com/python-discord/infra/pulls - -   - *If* any open issue or pull request needs discussion, why was the existing -     asynchronous logged communication over GitHub insufficient? - -Agenda ------- - -.. vim: set textwidth=80 sw=2 ts=2: diff --git a/docs/onboarding/access.rst b/docs/onboarding/access.rst deleted file mode 100644 index 940cd8b..0000000 --- a/docs/onboarding/access.rst +++ /dev/null @@ -1,50 +0,0 @@ -Access table -============ - -+--------------------+-------------------------+-----------------------+ -| **Resource**       | **Description**         | **Keyholders**        | -+====================+=========================+=======================+ -| Linode Kubernetes  | The primary cluster     | Hassan, Joe, Chris,   | -| Cluster            | where all resources are | Leon, Sebastiaan,     | -|                    | deployed.               
| Johannes              | -+--------------------+-------------------------+-----------------------+ -| Linode Dashboard   | The online dashboard    | Joe, Chris            | -|                    | for managing and        |                       | -|                    | allocating resources    |                       | -|                    | from Linode.            |                       | -+--------------------+-------------------------+-----------------------+ -| Netcup Dashboard   | The dashboard for       | Joe, Chris            | -|                    | managing and allocating |                       | -|                    | resources from Netcup.  |                       | -+--------------------+-------------------------+-----------------------+ -| Netcup servers     | Root servers provided   | Joe, Chris, Bella,    | -|                    | by the Netcup           | Johannes              | -|                    | partnership.            |                       | -+--------------------+-------------------------+-----------------------+ -| Grafana            | The primary aggregation | Admins, Moderators,   | -|                    | dashboard for most      | Core Developers and   | -|                    | resources.              | DevOps (with varying  | -|                    |                         | permissions)          | -+--------------------+-------------------------+-----------------------+ -| Prometheus         | The Prometheus query    | Hassan, Joe,          | -| Dashboard          | dashboard. Access is    | Johannes, Chris       | -|                    | controlled via          |                       | -|                    | Cloudflare Access.      |                       | -+--------------------+-------------------------+-----------------------+ -| Alertmanager       | The alertmanager        | Hassan, Joe,          | -| Dashboard          | control dashboard.      
| Johannes, Chris       | -|                    | Access is controlled    |                       | -|                    | via Cloudflare Access.  |                       | -+--------------------+-------------------------+-----------------------+ -| ``git-crypt``\ ed  | ``git-crypt`` is used   | Chris, Joe, Hassan,   | -| files in infra     | to encrypt certain      | Johannes, Xithrius    | -| repository         | files within the        |                       | -|                    | repository. At the time |                       | -|                    | of writing this is      |                       | -|                    | limited to kubernetes   |                       | -|                    | secret files.           |                       | -+--------------------+-------------------------+-----------------------+ -| Ansible Vault      | Used to store sensitive | Chris, Joe, Johannes, | -|                    | data for the Ansible    | Bella                 | -|                    | deployment              |                       | -+--------------------+-------------------------+-----------------------+ diff --git a/docs/onboarding/index.rst b/docs/onboarding/index.rst deleted file mode 100644 index 3929d7e..0000000 --- a/docs/onboarding/index.rst +++ /dev/null @@ -1,17 +0,0 @@ -Onboarding -========== - -This section documents who manages which access to our DevOps resources, -and how access is managed. - - -.. toctree:: -   :maxdepth: 2 -   :caption: Contents: - -   access -   resources -   rules -   tools - -.. vim: set textwidth=80 sw=2 ts=2: --> diff --git a/docs/onboarding/resources.rst b/docs/onboarding/resources.rst deleted file mode 100644 index f9ef44b..0000000 --- a/docs/onboarding/resources.rst +++ /dev/null @@ -1,35 +0,0 @@ -Resources -========= - -The following is a collection of important reference documents for the -DevOps team. 
- -`Infra Repo <https://github.com/python-discord/infra>`__ --------------------------------------------------------- - -This GitHub repo contains most of the manifests and configuration -applied to our cluster. It’s kept up to date manually and is considered -a source of truth for what we should have in the cluster. - -It is mostly documented, but improvements for unclear or outdated aspects are -always welcome. If you have any questions, please feel free `to open a GitHub -issue on the infra repository -<https://github.com/python-discord/infra/issues/new>`__ or ask in the -``#dev-oops`` channel. - - -`Knowledge base <https://python-discord.github.io/infra/>`__ ------------------------------------------------------------- - -Deployed using GH pages, source can be found `in the docs directory of -the infra repository <https://github.com/python-discord/infra>`__. - -This includes: - --  Changelogs --  Post-mortems --  Common queries --  Runbooks - -The sidebar of the infra documentation contains some other links to -DevOps-related projects. diff --git a/docs/onboarding/rules.rst b/docs/onboarding/rules.rst deleted file mode 100644 index bd0ea0e..0000000 --- a/docs/onboarding/rules.rst +++ /dev/null @@ -1,16 +0,0 @@ -Rules -===== - -The rules any DevOps team member must follow. - -1. LMAO - **L**\ ogging, **M**\ onitoring, **A**\ lerting, -   **O**\ bservability -2. Modmail is the greatest piece of software ever written -3. Modmail needs at least 5 minutes to gather all its greatness at -   startup -4. We never blame Chris, it’s always <@233481908342882304>’s fault -5. LKE isn’t bad, it’s your fault for not paying for the high -   availability control plane -6. Our software is never legacy, it’s merely well-aged -7. 
Ignore these rules (however maybe not 1, 1 seems important to -   remember) diff --git a/docs/onboarding/tools.rst b/docs/onboarding/tools.rst deleted file mode 100644 index 52a5e7f..0000000 --- a/docs/onboarding/tools.rst +++ /dev/null @@ -1,50 +0,0 @@ -Tools -===== - -We use a few tools to manage, monitor, and interact with our -infrastructure. Some of these tools are not unique to the DevOps team, -and may be shared by other teams. - -Most of these are gated behind a Cloudflare Access system, which is -accessible to the `DevOps -Team <https://github.com/orgs/python-discord/teams/devops>`__ on GitHub. -These are marked with the ☁️ emoji. If you don’t have access, please -contact Chris or Joe. - -`Grafana <https://grafana.pydis.wtf/>`__ ----------------------------------------- - -Grafana provides access to some of the most important resources at your -disposal. It acts as an aggregator and frontend for a large amount of -data. These range from metrics, to logs, to stats. Some of the most -important are listed below: - --  **Service Logs / All App Logs Dashboard** - -   Service logs is a simple log viewer which gives you access to a large -   majority of the applications deployed in the default namespace. The -   All App logs dashboard is an expanded version of that which gives you -   access to all apps in all namespaces, and allows some more in-depth -   querying. - --  **Kubernetes Dashboard** - -   This dashboard gives quick overviews of all the most important -   metrics of the Kubernetes system. For more detailed information, -   check out other dashboard such as Resource Usage, NGINX, and Redis. - -Accessed via a GitHub login, with permission for anyone in the dev-core -or dev-ops team. - -`Prometheus Dashboard <https://prometheus.pydis.wtf/>`__ (☁️)) --------------------------------------------------------------- - -This provides access to the Prometheus query console. You may also enjoy -the `Alertmanager Console <https://alertmanager.pydis.wtf/>`__. 
- -`King Arthur <https://github.com/python-discord/king-arthur/>`__ ----------------------------------------------------------------- - -King Arthur is a discord bot which provides information about, and -access to our cluster directly in discord. Invoke its help command for -more information (``M-x help``). diff --git a/docs/postmortems/2020-12-11-all-services-outage.rst b/docs/postmortems/2020-12-11-all-services-outage.rst deleted file mode 100644 index 9c29303..0000000 --- a/docs/postmortems/2020-12-11-all-services-outage.rst +++ /dev/null @@ -1,121 +0,0 @@ -2020-12-11: All services outage -=============================== - -At **19:55 UTC, all services became unresponsive**. The DevOps were -already in a call, and immediately started to investigate. - -Postgres was running at 100% CPU usage due to a **VACUUM**, which caused -all services that depended on it to stop working. The high CPU left the -host unresponsive and it shutdown. Linode Lassie noticed this and -triggered a restart. - -It did not recover gracefully from this restart, with numerous core -services reporting an error, so we had to manually restart core system -services using Lens in order to get things working again. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -Postgres triggered a **AUTOVACUUM**, which lead to a CPU spike. This -made Postgres run at 100% CPU and was unresponsive, which caused -services to stop responding. This lead to a restart of the node, from -which we did not recover gracefully. - -🥏 Impact ---------- - -*Describe how internal and external users were impacted during the -incident* - -All services went down. Catastrophic failure. We did not pass go, we did -not collect $200. - --  Help channel system unavailable, so people are not able to -   effectively ask for help. --  Gates unavailable, so people can’t successfully get into the -   community. 
--  Moderation and raid prevention unavailable, which leaves us -   defenseless against attacks. - -👁️ Detection ------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -We noticed that all PyDis services had stopped responding, -coincidentally our DevOps team were in a call at the time, so that was -helpful. - -We may be able to improve detection time by adding monitoring of -resource usage. To this end, we’ve added alerts for high CPU usage and -low memory. - -🙋🏿♂️ Response ----------------- - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded to the incident. - -We noticed our node was entirely unresponsive and within minutes a -restart had been triggered by Lassie after a high CPU shutdown occurred. - -The node came back and we saw a number of core services offline -(e.g. Calico, CoreDNS, Linode CSI). - -**Obstacle: no recent database back-up available** - -🙆🏽♀️ Recovery ------------------ - -*How was the incident resolved? How can we improve future mitigation -times?* - -Through `Lens <https://k8slens.dev/>`__ we restarted core services one -by one until they stabilised, after these core services were up other -services began to come back online. - -We finally provisioned PostgreSQL which had been removed as a component -before the restart (but too late to prevent the CPU errors). Once -PostgreSQL was up we restarted any components that were acting buggy -(e.g. site and bot). - -🔎 Five Why’s -------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - --  Major service outage --  **Why?** Core service failures (e.g. 
Calico, CoreDNS, Linode CSI) --  **Why?** Kubernetes worker node restart --  **Why?** High CPU shutdown --  **Why?** Intensive PostgreSQL AUTOVACUUM caused a CPU spike - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrance* - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - --  We must ensure we have working database backups. We are lucky that we -   did not lose any data this time. If this problem had caused volume -   corruption, we would be screwed. --  Sentry is broken for the bot. It was missing a DSN secret, which we -   have now restored. --  The https://sentry.pydis.com redirect was never migrated to the -   cluster. **We should do that.** - -☑️ Follow-up tasks ------------------- - -*List any tasks we’ve created as a result of this incident* - --  ☒ Push forward with backup plans diff --git a/docs/postmortems/2020-12-11-postgres-conn-surge.rst b/docs/postmortems/2020-12-11-postgres-conn-surge.rst deleted file mode 100644 index 6ebcb01..0000000 --- a/docs/postmortems/2020-12-11-postgres-conn-surge.rst +++ /dev/null @@ -1,130 +0,0 @@ -2020-12-11: Postgres connection surge -===================================== - -At **13:24 UTC,** we noticed the bot was not able to infract, and -`pythondiscord.com <http://pythondiscord.com>`__ was unavailable. The -DevOps team started to investigate. - -We discovered that Postgres was not accepting new connections because it -had hit 100 clients. This made it unavailable to all services that -depended on it. - -Ultimately this was resolved by taking down Postgres, remounting the -associated volume, and bringing it back up again. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -The bot infractions stopped working, and we started investigating. 
- -🥏 Impact ---------- - -*Describe how internal and external users were impacted during the -incident* - -Services were unavailable both for internal and external users. - --  The Help Channel System was unavailable. --  Voice Gate and Server Gate were not working. --  Moderation commands were unavailable. --  Python Discord site & API were unavailable. CloudFlare automatically -   switched us to Always Online. - -👁️ Detection ------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -We noticed HTTP 524s coming from CloudFlare, upon attempting database -connection we observed the maximum client limit. - -We noticed this log in site: - -.. code:: yaml - -   django.db.utils.OperationalError: FATAL:  sorry, too many clients already - -We should be monitoring number of clients, and the monitor should alert -us when we’re approaching the max. That would have allowed for earlier -detection, and possibly allowed us to prevent the incident altogether. - -We will look at -`wrouesnel/postgres_exporter <https://github.com/wrouesnel/postgres_exporter>`__ -for monitoring this. - -🙋🏿♂️ Response ----------------- - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded to the incident. The obstacles were mostly a lack of -a clear response strategy. - -We should document our recovery procedure so that we’re not so dependent -on Joe Banks should this happen again while he’s unavailable. - -🙆🏽♀️ Recovery ----------------- - -*How was the incident resolved? How can we improve future mitigation?* - --  Delete PostgreSQL deployment ``kubectl delete deployment/postgres`` --  Delete any remaining pods, WITH force. 
-   ``kubectl delete <pod name> --force --grace-period=0`` --  Unmount volume at Linode --  Remount volume at Linode --  Reapply deployment ``kubectl apply -f postgres/deployment.yaml`` - -🔎 Five Why’s -------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - --  Postgres was unavailable, so our services died. --  **Why?** Postgres hit max clients, and could not respond. --  **Why?** Unknown, but we saw a number of connections from previous -   deployments of site. This indicates that database connections are not -   being terminated properly. Needs further investigation. - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrance* - -We’re not sure what the root cause is, but suspect site is not -terminating database connections properly in some cases. We were unable -to reproduce this problem. - -We’ve set up new telemetry on Grafana with alerts so that we can -investigate this more closely. We will be let know if the number of -connections from site exceeds 32, or if the total number of connections -exceeds 90. - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - --  We must ensure the DevOps team has access to Linode and other key -   services even if our Bitwarden is down. --  We need to ensure we’re alerted of any risk factors that have the -   potential to make Postgres unavailable, since this causes a -   catastrophic outage of practically all services. --  We absolutely need backups for the databases, so that this sort of -   problem carries less of a risk. --  We may need to consider something like -   `pg_bouncer <https://wiki.postgresql.org/wiki/PgBouncer>`__ to manage -   a connection pool so that we don’t exceed 100 *legitimate* clients -   connected as we connect more services to the postgres database. 
- -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - --  ☒ All database backup diff --git a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst b/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst deleted file mode 100644 index 5852c46..0000000 --- a/docs/postmortems/2021-01-10-primary-kubernetes-node-outage.rst +++ /dev/null @@ -1,117 +0,0 @@ -2021-01-10: Primary Kubernetes node outage -========================================== - -We had an outage of our highest spec node due to CPU exhaustion. The -outage lasted from around 20:20 to 20:46 UTC, but was not a full service -outage. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -I ran a query on Prometheus to try figure out some statistics on the -number of metrics we are holding, this ended up scanning a lot of data -in the TSDB database that Prometheus uses. - -This scan caused a CPU exhaustion which caused issues with the -Kubernetes node status. - -🥏 Impact ---------- - -*Describe how internal and external users were impacted during the -incident* - -This brought down the primary node which meant there was some service -outage. Most services transferred successfully to our secondary node -which kept up some key services such as the Moderation bot and Modmail -bot, as well as MongoDB. - -👁️ Detection ------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -This was noticed when Discord services started having failures. The -primary detection was through alerts though! I was paged 1 minute after -we started encountering CPU exhaustion issues. - -🙋🏿♂️ Response ----------------- - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded to the incident. - -No major obstacles were encountered during this. - -🙆🏽♀️ Recovery ----------------- - -*How was the incident resolved? 
How can we improve future mitigation?* - -It was noted that in the response to ``kubectl get nodes`` the primary -node’s status was reported as ``NotReady``. Looking into the reason it -was because the node had stopped responding. - -The quickest way to fix this was triggering a node restart. This shifted -a lot of pods over to node 2 which encountered some capacity issues -since it’s not as highly specified as the first node. - -I brought this back the first node by restarting it at Linode’s end. -Once this node was reporting as ``Ready`` again I drained the second -node by running ``kubectl drain lke13311-20304-5ffa4d11faab``. This -command stops the node from being available for scheduling and moves -existing pods onto other nodes. - -Services gradually recovered as the dependencies started. The incident -lasted overall around 26 minutes, though this was not a complete outage -for the whole time and the bot remained functional throughout (meaning -systems like the help channels were still functional). - -🔎 Five Why’s -------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - -**Why?** Partial service outage - -**Why?** We had a node outage. - -**Why?** CPU exhaustion of our primary node. - -**Why?** Large prometheus query using a lot of CPU. - -**Why?** Prometheus had to scan millions of TSDB records which consumed -all cores. - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrance* - -A large query was run on Prometheus, so the solution is just to not run -said queries. - -To protect against this more precisely though we should write resource -constraints for services like this that are vulnerable to CPU exhaustion -or memory consumption, which are the causes of our two past outages as -well. - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - --  Don’t run large queries, it consumes CPU! 
--  Write resource constraints for our services. - -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - --  ☒ Write resource constraints for our services. diff --git a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst b/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst deleted file mode 100644 index f621782..0000000 --- a/docs/postmortems/2021-01-12-site-cpu-ram-exhaustion.rst +++ /dev/null @@ -1,155 +0,0 @@ -2021-01-12: Django site CPU/RAM exhaustion outage -================================================= - -At 03:01 UTC on Tuesday 12th January we experienced a momentary outage -of our PostgreSQL database, causing some very minor service downtime. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -We deleted the Developers role which led to a large user diff for all -the users where we had to update their roles on the site. - -The bot was trying to post this for over 24 hours repeatedly after every -restart. - -We deployed the bot at 2:55 UTC on 12th January and the user sync -process began once again. - -This caused a CPU & RAM spike on our Django site, which in turn -triggered an OOM error on the server which killed the Postgres process, -sending it into a recovery state where queries could not be executed. - -The Django site did not have any tools in place to batch the requests so was -trying to process all 80k user updates in a single query, something that -PostgreSQL probably could handle, but not the Django ORM. During the -incident site jumped from its average RAM usage of 300-400MB to -**1.5GB.** - -.. image:: ./images/2021-01-12/site_resource_abnormal.png - -RAM and CPU usage of site throughout the incident. The period just -before 3:40 where no statistics were reported is the actual outage -period where the Kubernetes node had some networking errors. 
- -🥏 Impact ---------- - -*Describe how internal and external users were impacted during the -incident* - -This database outage lasted mere minutes, since Postgres recovered and -healed itself and the sync process was aborted, but it did leave us with -a large user diff and our database becoming further out of sync. - -Most services that did not depend on PostgreSQL stayed up, and the site -remained stable after the sync had been cancelled. - -👁️ Detection ---------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -We were immediately alerted to the PostgreSQL outage on Grafana and -through Sentry, meaning our response time was under a minute. - -We reduced some alert thresholds in order to catch RAM & CPU spikes -faster in the future. - -It was hard to immediately see the cause of things since there is -minimal logging on the site and the bot logs gave no indication that -anything was at fault, so our only detection was through machine -metrics. - -We did manage to recover exactly what PostgreSQL was trying to do at the -time of crashing by examining the logs, which pointed us towards the user -sync process. - -🙋🏿♂️ Response ------------------------ - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded to the issue. There were no real obstacles -encountered other than the node being less performant than we would like -due to the CPU starvation. - -🙆🏽♀️ Recovery ---------------------------- - -*How was the incident resolved? How can we improve future mitigation?* - -The incident was resolved by stopping the sync process and writing a -more efficient one through an internal eval script. We batched the -updates into groups of 1,000 users and, instead of doing one large update, did 80 -smaller updates. This led to much higher efficiency with a cost of -taking a little longer (~7 minutes). - -.. 
code:: python - -   from bot.exts.backend.sync import _syncers -   syncer = _syncers.UserSyncer -   diff = await syncer._get_diff(ctx.guild) - -   def chunks(lst, n): -       for i in range(0, len(lst), n): -           yield lst[i:i + n] - -   for chunk in chunks(diff.updated, 1000): -       await bot.api_client.patch("bot/users/bulk_patch", json=chunk) - -Resource limits were also put into place on site to prevent RAM and CPU -spikes, and throttle the CPU usage in these situations. This can be seen -in the below graph: - -.. image:: ./images/2021-01-12/site_cpu_throttle.png - -CPU throttling is where a container has hit the limits and we need to -reel it in. Ideally this value stays as close to 0 as possible, however -as you can see site hit this twice (during the periods where it was -trying to sync 80k users at once). - -🔎 Five Why’s ---------------------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - --  We experienced a major PostgreSQL outage --  PostgreSQL was killed by the system OOM due to the RAM spike on site. --  The RAM spike on site was caused by a large query. --  This was because we do not chunk queries on the bot. --  The large query was caused by the removal of the Developers role -   resulting in 80k users needing updating. - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrence* - -The removal of the Developers role created a large diff which could not -be applied by Django in a single request. - -See the follow-up tasks for exactly how we can avoid this in future; it’s -a relatively easy mitigation. - -🤔 Lessons learned ------------------------ - -*What did we learn from this incident?* - --  Django (or DRF) does not like huge update queries. 
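The batching used during recovery can be sketched without the bot's internals. This is only a minimal illustration: ``FakeApiClient``, ``sync_in_batches`` and ``run_demo`` are hypothetical names standing in for the bot's API client and sync routine, and only the ``chunks`` helper and the ``bot/users/bulk_patch`` endpoint come from the eval script above. With 80,000 updates and a batch size of 1,000 it issues 80 requests.

```python
import asyncio


def chunks(lst, n):
    """Yield successive slices of at most n items from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


class FakeApiClient:
    """Hypothetical stand-in for the bot's API client; it only records calls."""

    def __init__(self):
        self.calls = []

    async def patch(self, endpoint, json):
        # A real client would send an HTTP PATCH request here.
        self.calls.append((endpoint, len(json)))


async def sync_in_batches(client, updates, batch_size=1000):
    """Send updates in fixed-size batches and return the number of requests made."""
    sent = 0
    for chunk in chunks(updates, batch_size):
        await client.patch("bot/users/bulk_patch", json=chunk)
        sent += 1
    return sent


def run_demo(total_users=80_000, batch_size=1000):
    """Run the sketch against placeholder user records."""
    client = FakeApiClient()
    updates = [{"id": i} for i in range(total_users)]
    requests_made = asyncio.run(sync_in_batches(client, updates, batch_size))
    return requests_made, client.calls
```

Sending the batches one after another trades a longer total runtime for a bounded payload per request, which matches the trade-off described in the recovery notes.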
- -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - --  ☒ Make the bot syncer more efficient (batch requests) --  ☐ Increase logging on bot, state when an error has been hit (we had -   no indication of this inside Discord, we need that) --  ☒ Adjust resource alerts to page DevOps members earlier. --  ☒ Apply resource limits to site to prevent major spikes diff --git a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst b/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst deleted file mode 100644 index b13ecd7..0000000 --- a/docs/postmortems/2021-01-30-nodebalancer-fails-memory.rst +++ /dev/null @@ -1,146 +0,0 @@ -2021-01-30: NodeBalancer networking faults due to memory pressure -================================================================= - -At around 14:30 UTC on Saturday 30th January we started experiencing -networking issues at the LoadBalancer level between Cloudflare and our -Kubernetes cluster. It seems that the misconfiguration was due to memory -and CPU pressure. - -[STRIKEOUT:This post-mortem is preliminary, we are still awaiting word -from Linode’s SysAdmins on any problems they detected.] - -**Update 2nd February 2021:** Linode have migrated our NodeBalancer to a -different machine. - -⚠️ Leadup ---------- - -*List the sequence of events that led to the incident* - -At 14:30 we started receiving alerts that services were becoming -unreachable. We first experienced some momentary DNS errors which -resolved themselves, however traffic ingress was still degraded. - -Upon checking Linode our NodeBalancer, the service which balances -traffic between our Kubernetes nodes was reporting the backends (the -services it balances to) as down. It reported all 4 as down (two for -port 80 + two for port 443). This status was fluctuating between up and -down, meaning traffic was not reaching our cluster correctly. Scaleios -correctly noted: - -.. 
image:: ./images/2021-01-30/scaleios.png - -The config seems to have been set incorrectly due to memory and CPU -pressure on one of our nodes. Here is the memory throughout the -incident: - -.. image:: ./images/2021-01-30/memory_charts.png - -Here is the display from Linode: - -.. image:: ./images/2021-01-30/linode_loadbalancers.png - -🥏 Impact ---------- - -*Describe how internal and external users were impacted during the -incident* - -Since traffic could not correctly enter our cluster multiple services -which were web based were offline, including services such as site, -grafana and bitwarden. It appears that no inter-node communication was -affected as this uses a WireGuard tunnel between the nodes which was not -affected by the NodeBalancer. - -The lack of Grafana made diagnosis slightly more difficult, but even -then it was only a short trip to the - -👁️ Detection ------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -We were alerted fairly promptly through statping which reported services -as being down and posted a Discord notification. Subsequent alerts came -in from Grafana but were limited since outbound communication was -faulty. - -🙋🏿♂️ Response ----------------- - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded! - -Primary obstacle was the DevOps tools being out due to the traffic -ingress problems. - -🙆🏽♀️ Recovery ----------------- - -*How was the incident resolved? How can we improve future mitigation?* - -The incident resolved itself upstream at Linode, we’ve opened a ticket -with Linode to let them know of the faults, this might give us a better -indication of what caused the issues. Our Kubernetes cluster continued -posting updates to Linode to refresh the NodeBalancer configuration, -inspecting these payloads the configuration looked correct. 
- -We’ve set up alerts for when Prometheus services stop responding since -this seems to be a fairly tell-tale symptom of networking problems. This -was the Prometheus status graph throughout the incident: - -.. image:: ./images/2021-01-30/prometheus_status.png - -🔎 Five Why’s -------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - -**What?** Our service experienced an outage due to networking faults. - -**Why?** Incoming traffic could not reach our Kubernetes nodes - -**Why?** Our Linode NodeBalancers were not using correct configuration - -**Why?** Memory & CPU pressure seemed to cause invalid configuration -errors upstream at Linode. - -**Why?** Unknown at this stage, NodeBalancer migrated. - -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrence* - -The configuration of our NodeBalancer was invalid. We cannot say why at -this point since we are awaiting contact back from Linode, but -indicators point to it being an upstream fault since memory & CPU -pressure should **not** cause a load balancer misconfiguration. - -Linode are going to follow up with us at some point during the week with -information from their System Administrators. - -**Update 2nd February 2021:** Linode have concluded investigations at -their end, taken notes and migrated our NodeBalancer to a new machine. -We haven’t experienced problems since. - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - -We should be careful about over-scheduling onto nodes since even while -operating within reasonable constraints we risk sending invalid -configuration upstream to Linode and therefore preventing traffic from -entering our cluster. 
- -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - --  ☒ Monitor for follow up from Linode --  ☒ Carefully monitor the allocation rules for our services diff --git a/docs/postmortems/2021-07-11-cascading-node-failures.rst b/docs/postmortems/2021-07-11-cascading-node-failures.rst deleted file mode 100644 index b2e5cdf..0000000 --- a/docs/postmortems/2021-07-11-cascading-node-failures.rst +++ /dev/null @@ -1,335 +0,0 @@ -2021-07-11: Cascading node failures and ensuing volume problems -=============================================================== - -A PostgreSQL connection spike (00:27 UTC) caused by Django moved a node -to an unresponsive state (00:55 UTC). Upon performing a recycle of the -affected node, volumes were placed into a state where they could not be -mounted. - -⚠️ Leadup ----------- - -*List the sequence of events that led to the incident* - --  **00:27 UTC:** Django starts rapidly using connections to our -   PostgreSQL database --  **00:32 UTC:** DevOps team is alerted that PostgreSQL has saturated -   its 115 max connections limit. Joe is paged. --  **00:33 UTC:** DevOps team is alerted that a service has claimed 34 -   dangerous table locks (it peaked at 61). --  **00:42 UTC:** Status incident created and backdated to 00:25 UTC. -   `Status incident <https://status.pythondiscord.com/incident/92712>`__ --  **00:55 UTC:** It’s clear that the node which PostgreSQL was on is no -   longer healthy after the Django connection surge, so it’s recycled -   and a new one is to be added to the pool. --  **01:01 UTC:** Node ``lke13311-16405-5fafd1b46dcf`` begins its -   restart --  **01:13 UTC:** Node has been restored and regained healthy status, but -   volumes will not mount to the node. Support ticket opened at Linode -   for assistance. --  **06:36 UTC:** DevOps team alerted that Python is offline. 
This is -   due to Redis being a dependency of the bot, which as a stateful -   service was not healthy. - -🥏 Impact ----------- - -*Describe how internal and external users were impacted during the -incident* - -Initially, this manifested as a standard node outage where services on -that node experienced some downtime as the node was restored. - -Post-restore, all stateful services (e.g. PostgreSQL, Redis, PrestaShop) -were unexecutable due to the volume issues, and so any dependent -services (e.g. Site, Bot, Hastebin) also had trouble starting. - -PostgreSQL was restored early on so for the most part Moderation could -continue. - -👁️ Detection ---------------- - -*Report when the team detected the incident, and how we could improve -detection time* - -DevOps were initially alerted at 00:32 UTC due to the PostgreSQL -connection surge, and acknowledged at the same time. - -Further alerting could be used to catch surges earlier on (looking at -conn delta vs. conn total), but for the most part alerting time was -satisfactory here. - -🙋🏿♂️ Response ------------------ - -*Who responded to the incident, and what obstacles did they encounter?* - -Joe Banks responded. The primary issue encountered was failure upstream -at Linode to remount the affected volumes, a support ticket has been -created. - -🙆🏽♀️ Recovery ------------------- - -*How was the incident resolved? How can we improve future mitigation?* - -Initial node restoration was performed by @Joe Banks by recycling the -affected node. - -Subsequent volume restoration was also @Joe Banks and once Linode had -unlocked the volumes affected pods were scaled down to 0, the volumes -were unmounted at the Linode side and then the deployments were -recreated. - -.. raw:: html - -   <details> - -.. raw:: html - -   <summary> - -Support ticket sent - -.. raw:: html - -   </summary> - -.. 
raw:: html - -   <blockquote> - -Good evening, - -We experienced a resource surge on one of our Kubernetes nodes at 00:32 -UTC, causing a node to go unresponsive. To mitigate problems here the -node was recycled and began restarting at 1:01 UTC. - -The node has now rejoined the ring and started picking up services, but -volumes will not attach to it, meaning pods with stateful storage will -not start. - -An example events log for one such pod: - -:: - -     Type     Reason       Age    From               Message -     ----     ------       ----   ----               ------- -     Normal   Scheduled    2m45s  default-scheduler  Successfully assigned default/redis-599887d778-wggbl to lke13311-16405-5fafd1b46dcf -     Warning  FailedMount  103s   kubelet            MountVolume.MountDevice failed for volume "pvc-bb1d06139b334c1f" : rpc error: code = Internal desc = Unable to find device path out of attempted paths: [/dev/disk/by-id/linode-pvcbb1d06139b334c1f /dev/disk/by-id/scsi-0Linode_Volume_pvcbb1d06139b334c1f] -     Warning  FailedMount  43s    kubelet            Unable to attach or mount volumes: unmounted volumes=[redis-data-volume], unattached volumes=[kube-api-access-6wwfs redis-data-volume redis-config-volume]: timed out waiting for the condition - -I’ve been trying to manually resolve this through the Linode Web UI but -get presented with attachment errors upon doing so. Please could you -advise on the best way forward to restore Volumes & Nodes to a -functioning state? As far as I can see there is something going on -upstream since the Linode UI presents these nodes as mounted however as -shown above LKE nodes are not locating them, there is also a few failed -attachment logs in the Linode Audit Log. - -Thanks, - -Joe - -.. raw:: html - -   </blockquote> - -.. raw:: html - -   </details> - -.. raw:: html - -   <details> - -.. raw:: html - -   <summary> - -Response received from Linode - -.. raw:: html - -   </summary> - -.. 
raw:: html - -   <blockquote> - -Hi Joe, - -   Were there any known issues with Block Storage in Frankfurt today? - -Not today, though there were service issues reported for Block Storage -and LKE in Frankfurt on July 8 and 9: - --  `Service Issue - Block Storage - EU-Central -   (Frankfurt) <https://status.linode.com/incidents/pqfxl884wbh4>`__ --  `Service Issue - Linode Kubernetes Engine - -   Frankfurt <https://status.linode.com/incidents/13fpkjd32sgz>`__ - -There was also an API issue reported on the 10th (resolved on the 11th), -mentioned here: - --  `Service Issue - Cloud Manager and -   API <https://status.linode.com/incidents/vhjm0xpwnnn5>`__ - -Regarding the specific error you were receiving: - -   ``Unable to find device path out of attempted paths`` - -I’m not certain it’s specifically related to those Service Issues, -considering this isn’t the first time a customer has reported this error -in their LKE logs. In fact, if I recall correctly, I’ve run across this -before too, since our volumes are RWO and I had too many replicas in my -deployment that I was trying to attach to, for example. - -   is this a known bug/condition that occurs with Linode CSI/LKE? - -From what I understand, yes, this is a known condition that crops up -from time to time, which we are tracking. However, since there is a -workaround at the moment (e.g. - “After some more manual attempts to fix -things, scaling down deployments, unmounting at Linode and then scaling -up the deployments seems to have worked and all our services have now -been restored.”), there is no ETA for addressing this. With that said, -I’ve let our Storage team know that you’ve run into this, so as to draw -further attention to it. - -If you have any further questions or concerns regarding this, let us -know. - -Best regards, [Redacted] - -Linode Support Team - -.. raw:: html - -   </blockquote> - -.. raw:: html - -   </details> - -.. raw:: html - -   <details> - -.. 
raw:: html - -   <summary> - -Concluding response from Joe Banks - -.. raw:: html - -   </summary> - -.. raw:: html - -   <blockquote> - -Hey [Redacted]! - -Thanks for the response. We ensure that stateful pods only ever have one -volume assigned to them, either with a single replica deployment or a -statefulset. It appears that the error generally manifests when a -deployment is being migrated from one node to another during a redeploy, -which makes sense if there is some delay on the unmount/remount. - -Confusion occurred because Linode was reporting the volume as attached -when the node had been recycled, but I assume that was because the node -did not cleanly shutdown and therefore could not cleanly unmount -volumes. - -We’ve not seen any resurgence of such issues, and we’ll address the -software fault which overloaded the node which will helpfully mitigate -such problems in the future. - -Thanks again for the response, have a great week! - -Best, - -Joe - -.. raw:: html - -   </blockquote> - -.. raw:: html - -   </details> - -🔎 Five Why’s ---------------- - -*Run a 5-whys analysis to understand the true cause of the incident.* - -**What?** -~~~~~~~~~ - -Several of our services became unavailable because their volumes could -not be mounted. - -Why? -~~~~ - -A node recycle left the node unable to mount volumes using the Linode -CSI. - -.. _why-1: - -Why? -~~~~ - -A node recycle was used because PostgreSQL had a connection surge. - -.. _why-2: - -Why? -~~~~ - -A Django feature deadlocked a table 62 times and suddenly started using -~70 connections to the database, saturating the maximum connections -limit. - -.. _why-3: - -Why? -~~~~ - -The root cause of why Django does this is unclear, and someone with more -Django proficiency is absolutely welcome to share any knowledge they may -have. I presume it’s some sort of worker race condition, but I’ve not -been able to reproduce it. 
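The detection notes for this incident suggest alerting on the connection delta rather than only on the connection total. A minimal sketch of that idea follows. The function name, the sampling interval and both thresholds are illustrative assumptions rather than values from our alerting configuration; only the 115 connection limit comes from the timeline.

```python
def surge_alerts(samples, max_connections=115, delta_threshold=20, total_ratio=0.9):
    """Return (index, reason) pairs for samples that look like a connection surge.

    samples is a list of connection counts taken at a fixed interval.
    delta_threshold and total_ratio are assumed values for illustration only.
    The first sample has no predecessor, so it is never flagged.
    """
    alerts = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta >= delta_threshold:
            # Connections grew quickly between two consecutive samples.
            alerts.append((i, "delta"))
        elif samples[i] >= max_connections * total_ratio:
            # Connections are close to the configured limit.
            alerts.append((i, "total"))
    return alerts
```

In this sketch a rapid climb is flagged as soon as the jump between two samples is large, which can happen well before the total approaches the limit.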
- -🌱 Blameless root cause ------------------------ - -*Note the final root cause and describe what needs to change to prevent -reoccurrence* - -A node being forcefully restarted left volumes in a limbo state where -mounting was difficult, it took multiple hours for this to be resolved -since we had to wait for the volumes to unlock so they could be cloned. - -🤔 Lessons learned ------------------- - -*What did we learn from this incident?* - -Volumes are painful. - -We need to look at why Django is doing this and mitigations of the fault -to prevent this from occurring again. - -☑️ Follow-up tasks ------------------- - -*List any tasks we should complete that are relevant to this incident* - --  ☒ `Follow up on ticket at -   Linode <https://www.notion.so/Cascading-node-failures-and-ensuing-volume-problems-1c6cfdfcadfc4422b719a0d7a4cc5001>`__ --  ☐ Investigate why Django could be connection surging and locking -   tables diff --git a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png b/docs/postmortems/images/2021-01-12/site_cpu_throttle.pngBinary files differ deleted file mode 100644 index b530ec6..0000000 --- a/docs/postmortems/images/2021-01-12/site_cpu_throttle.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png b/docs/postmortems/images/2021-01-12/site_resource_abnormal.pngBinary files differ deleted file mode 100644 index e1e07af..0000000 --- a/docs/postmortems/images/2021-01-12/site_resource_abnormal.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png b/docs/postmortems/images/2021-01-30/linode_loadbalancers.pngBinary files differ deleted file mode 100644 index f0eae1f..0000000 --- a/docs/postmortems/images/2021-01-30/linode_loadbalancers.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-30/memory_charts.png b/docs/postmortems/images/2021-01-30/memory_charts.pngBinary files differ deleted file mode 100644 index 370d19e..0000000 --- 
a/docs/postmortems/images/2021-01-30/memory_charts.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-30/prometheus_status.png b/docs/postmortems/images/2021-01-30/prometheus_status.pngBinary files differ deleted file mode 100644 index e95b8d7..0000000 --- a/docs/postmortems/images/2021-01-30/prometheus_status.png +++ /dev/null diff --git a/docs/postmortems/images/2021-01-30/scaleios.png b/docs/postmortems/images/2021-01-30/scaleios.pngBinary files differ deleted file mode 100644 index 584d74d..0000000 --- a/docs/postmortems/images/2021-01-30/scaleios.png +++ /dev/null diff --git a/docs/postmortems/index.rst b/docs/postmortems/index.rst deleted file mode 100644 index e28dc7a..0000000 --- a/docs/postmortems/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -Postmortems -=========== - -Browse the pages under this category to view historical postmortems for -Python Discord outages. - -.. toctree:: -   :maxdepth: 1 - -   2020-12-11-all-services-outage -   2020-12-11-postgres-conn-surge -   2021-01-10-primary-kubernetes-node-outage -   2021-01-12-site-cpu-ram-exhaustion -   2021-01-30-nodebalancer-fails-memory -   2021-07-11-cascading-node-failures diff --git a/docs/queries/index.rst b/docs/queries/index.rst deleted file mode 100644 index 76218e4..0000000 --- a/docs/queries/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -Queries -======= - -Get the data you desire with these assorted handcrafted queries. - -.. toctree:: -   :maxdepth: 2 -   :caption: Contents: - -   kubernetes -   loki -   postgres diff --git a/docs/queries/kubernetes.rst b/docs/queries/kubernetes.rst deleted file mode 100644 index f8d8984..0000000 --- a/docs/queries/kubernetes.rst +++ /dev/null @@ -1,29 +0,0 @@ -Kubernetes tips -=============== - -Find top pods by CPU/memory ---------------------------- - -.. code:: bash - -   $ kubectl top pods --all-namespaces --sort-by='memory' -   $ kubectl top pods --all-namespaces --sort-by='cpu' - -Find top nodes by CPU/memory ----------------------------- - -.. 
code:: bash - -   $ kubectl top nodes --sort-by='cpu' -   $ kubectl top nodes --sort-by='memory' - -Kubernetes cheat sheet ----------------------- - -`Open Kubernetes cheat -sheet <https://kubernetes.io/docs/reference/kubectl/cheatsheet/>`__ - -Lens IDE --------- - -`OpenLens <https://github.com/MuhammedKalkan/OpenLens>`__ diff --git a/docs/queries/loki.rst b/docs/queries/loki.rst deleted file mode 100644 index 2ee57a3..0000000 --- a/docs/queries/loki.rst +++ /dev/null @@ -1,25 +0,0 @@ -Loki queries -============ - -Find any logs containing “ERROR” --------------------------------- - -.. code:: shell - -   {job=~"default/.+"} |= "ERROR" - -Find all logs from bot service ------------------------------- - -.. code:: shell - -   {job="default/bot"} - -The format is ``namespace/object`` - -Rate of logs from a service ---------------------------- - -.. code:: shell - -   rate(({job="default/bot"} |= "error" != "timeout")[10s]) diff --git a/docs/queries/postgres.rst b/docs/queries/postgres.rst deleted file mode 100644 index 5120145..0000000 --- a/docs/queries/postgres.rst +++ /dev/null @@ -1,336 +0,0 @@ -PostgreSQL queries -================== - -Disk usage ----------- - -Most of these queries vary based on the database you are connected to. - -General Table Size Information Grouped For Partitioned Tables -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: sql - -   WITH RECURSIVE pg_inherit(inhrelid, inhparent) AS -       (select inhrelid, inhparent -       FROM pg_inherits -       UNION -       SELECT child.inhrelid, parent.inhparent -       FROM pg_inherit child, pg_inherits parent -       WHERE child.inhparent = parent.inhrelid), -   pg_inherit_short AS (SELECT * FROM pg_inherit WHERE inhparent NOT IN (SELECT inhrelid FROM pg_inherit)) -   SELECT table_schema -       , TABLE_NAME -       , row_estimate -       , pg_size_pretty(total_bytes) AS total -       , pg_size_pretty(index_bytes) AS INDEX -       , pg_size_pretty(toast_bytes) AS toast -       , pg_size_pretty(table_bytes) AS TABLE -     FROM ( -       SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes -       FROM ( -            SELECT c.oid -                 , nspname AS table_schema -                 , relname AS TABLE_NAME -                 , SUM(c.reltuples) OVER (partition BY parent) AS row_estimate -                 , SUM(pg_total_relation_size(c.oid)) OVER (partition BY parent) AS total_bytes -                 , SUM(pg_indexes_size(c.oid)) OVER (partition BY parent) AS index_bytes -                 , SUM(pg_total_relation_size(reltoastrelid)) OVER (partition BY parent) AS toast_bytes -                 , parent -             FROM ( -                   SELECT pg_class.oid -                       , reltuples -                       , relname -                       , relnamespace -                       , pg_class.reltoastrelid -                       , COALESCE(inhparent, pg_class.oid) parent -                   FROM pg_class -                       LEFT JOIN pg_inherit_short ON inhrelid = oid -                   WHERE relkind IN ('r', 'p') -                ) c -                LEFT JOIN pg_namespace n ON n.oid = c.relnamespace -     ) a -     WHERE oid = parent -   ) a -   ORDER BY total_bytes DESC; - -General Table Size Information -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: sql - -   SELECT *, pg_size_pretty(total_bytes) AS total -       , pg_size_pretty(index_bytes) AS index -       , pg_size_pretty(toast_bytes) AS toast -       , pg_size_pretty(table_bytes) AS table -     FROM ( -     SELECT *, total_bytes-index_bytes-coalesce(toast_bytes,0) AS table_bytes FROM ( -         SELECT c.oid,nspname AS table_schema, relname AS table_name -                 , c.reltuples AS row_estimate -                 , pg_total_relation_size(c.oid) AS total_bytes -                 , pg_indexes_size(c.oid) AS index_bytes -                 , pg_total_relation_size(reltoastrelid) AS toast_bytes -             FROM pg_class c -             LEFT JOIN pg_namespace n ON n.oid = c.relnamespace -             WHERE relkind = 'r' -     ) a -   ) a; - -Finding the largest databases in your cluster -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: sql - -   SELECT d.datname as Name,  pg_catalog.pg_get_userbyid(d.datdba) as Owner, -       CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT') -           THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname)) -           ELSE 'No Access' -       END as Size -   FROM pg_catalog.pg_database d -       order by -       CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT') -           THEN pg_catalog.pg_database_size(d.datname) -           ELSE NULL -       END desc -- nulls first -       LIMIT 20; - -Finding the size of your biggest relations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Relations are objects in the database such as tables and indexes, and -this query shows the size of all the individual parts. - -.. code:: sql - -   SELECT nspname || '.' 
|| relname AS "relation", -       pg_size_pretty(pg_relation_size(C.oid)) AS "size" -     FROM pg_class C -     LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) -     WHERE nspname NOT IN ('pg_catalog', 'information_schema') -     ORDER BY pg_relation_size(C.oid) DESC -     LIMIT 20; - -Finding the total size of your biggest tables -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: sql - -   SELECT nspname || '.' || relname AS "relation", -       pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size" -     FROM pg_class C -     LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) -     WHERE nspname NOT IN ('pg_catalog', 'information_schema') -       AND C.relkind <> 'i' -       AND nspname !~ '^pg_toast' -     ORDER BY pg_total_relation_size(C.oid) DESC -     LIMIT 20; - -Indexes -------- - -Index summary -~~~~~~~~~~~~~ - -.. code:: sql - -   SELECT -       pg_class.relname, -       pg_size_pretty(pg_class.reltuples::bigint) AS rows_in_bytes, -       pg_class.reltuples AS num_rows, -       count(indexname) AS number_of_indexes, -       CASE WHEN x.is_unique = 1 THEN 'Y' -          ELSE 'N' -       END AS UNIQUE, -       SUM(case WHEN number_of_columns = 1 THEN 1 -                 ELSE 0 -               END) AS single_column, -       SUM(case WHEN number_of_columns IS NULL THEN 0 -                WHEN number_of_columns = 1 THEN 0 -                ELSE 1 -              END) AS multi_column -   FROM pg_namespace -   LEFT OUTER JOIN pg_class ON pg_namespace.oid = pg_class.relnamespace -   LEFT OUTER JOIN -          (SELECT indrelid, -              max(CAST(indisunique AS integer)) AS is_unique -          FROM pg_index -          GROUP BY indrelid) x -          ON pg_class.oid = x.indrelid -   LEFT OUTER JOIN -       ( SELECT c.relname AS ctablename, ipg.relname AS indexname, x.indnatts AS number_of_columns FROM pg_index x -              JOIN pg_class c ON c.oid = x.indrelid -              JOIN pg_class ipg ON ipg.oid = x.indexrelid  ) -       AS 
foo -       ON pg_class.relname = foo.ctablename -   WHERE -        pg_namespace.nspname='public' -   AND  pg_class.relkind = 'r' -   GROUP BY pg_class.relname, pg_class.reltuples, x.is_unique -   ORDER BY 2; - -Index size/usage statistics -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: sql - -   SELECT -       t.schemaname, -       t.tablename, -       indexname, -       c.reltuples AS num_rows, -       pg_size_pretty(pg_relation_size(quote_ident(t.schemaname)::text || '.' || quote_ident(t.tablename)::text)) AS table_size, -       pg_size_pretty(pg_relation_size(quote_ident(t.schemaname)::text || '.' || quote_ident(indexrelname)::text)) AS index_size, -       CASE WHEN indisunique THEN 'Y' -           ELSE 'N' -       END AS UNIQUE, -       number_of_scans, -       tuples_read, -       tuples_fetched -   FROM pg_tables t -   LEFT OUTER JOIN pg_class c ON t.tablename = c.relname -   LEFT OUTER JOIN ( -       SELECT -           c.relname AS ctablename, -           ipg.relname AS indexname, -           x.indnatts AS number_of_columns, -           idx_scan AS number_of_scans, -           idx_tup_read AS tuples_read, -           idx_tup_fetch AS tuples_fetched, -           indexrelname, -           indisunique, -           schemaname -       FROM pg_index x -       JOIN pg_class c ON c.oid = x.indrelid -       JOIN pg_class ipg ON ipg.oid = x.indexrelid -       JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid -   ) AS foo ON t.tablename = foo.ctablename AND t.schemaname = foo.schemaname -   WHERE t.schemaname NOT IN ('pg_catalog', 'information_schema') -   ORDER BY 1,2; - -Duplicate indexes -~~~~~~~~~~~~~~~~~ - -.. 
code:: sql - -   SELECT pg_size_pretty(sum(pg_relation_size(idx))::bigint) as size, -          (array_agg(idx))[1] as idx1, (array_agg(idx))[2] as idx2, -          (array_agg(idx))[3] as idx3, (array_agg(idx))[4] as idx4 -   FROM ( -       SELECT indexrelid::regclass as idx, (indrelid::text ||E'\n'|| indclass::text ||E'\n'|| indkey::text ||E'\n'|| -                                            coalesce(indexprs::text,'')||E'\n' || coalesce(indpred::text,'')) as key -       FROM pg_index) sub -   GROUP BY key HAVING count(*)>1 -   ORDER BY sum(pg_relation_size(idx)) DESC; - -Maintenance ------------ - -`PostgreSQL wiki <https://wiki.postgresql.org/wiki/Main_Page>`__ - -CLUSTER-ing -~~~~~~~~~~~ - -`CLUSTER <https://www.postgresql.org/docs/current/sql-cluster.html>`__ - -.. code:: sql - -   CLUSTER [VERBOSE] table_name [ USING index_name ] -   CLUSTER [VERBOSE] - -``CLUSTER`` instructs PostgreSQL to cluster the table specified by -``table_name`` based on the index specified by ``index_name``. The index -must already have been defined on ``table_name``. - -When a table is clustered, it is physically reordered based on the index -information. - -The -`clusterdb <https://www.postgresql.org/docs/current/app-clusterdb.html>`__ -CLI tool is recommended, and can also be used to cluster all tables at -the same time. - -VACUUM-ing -~~~~~~~~~~ - -Proper vacuuming, particularly autovacuum configuration, is crucial to a -fast and reliable database. - -`Introduction to VACUUM, ANALYZE, EXPLAIN, and -COUNT <https://wiki.postgresql.org/wiki/Introduction_to_VACUUM,_ANALYZE,_EXPLAIN,_and_COUNT>`__ - -It is not advised to run ``VACUUM FULL``, instead look at clustering. -VACUUM FULL is a much more intensive task and acquires an ACCESS -EXCLUSIVE lock on the table, blocking reads and writes. Whilst -``CLUSTER`` also does acquire this lock it’s a less intensive and faster -process. 
- -The -`vacuumdb <https://www.postgresql.org/docs/current/app-vacuumdb.html>`__ -CLI tool is recommended for manual runs. - -Finding number of dead rows -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code:: sql - -   SELECT relname, n_dead_tup FROM pg_stat_user_tables WHERE n_dead_tup <> 0 ORDER BY 2 DESC; - -Finding last vacuum/auto-vacuum date -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code:: sql - -   SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables; - -Checking auto-vacuum is enabled -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code:: sql - -   SELECT name, setting FROM pg_settings WHERE name='autovacuum'; - -View all auto-vacuum settings -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code:: sql - -   SELECT * FROM pg_settings WHERE category LIKE 'Autovacuum'; - -Locks ------ - -Looking at locks not yet granted -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code:: sql - -   SELECT relation::regclass, * FROM pg_locks WHERE NOT granted; - -Combination of blocked and blocking activity -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code:: sql - -   SELECT blocked_locks.pid     AS blocked_pid, -            blocked_activity.usename  AS blocked_user, -            blocking_locks.pid     AS blocking_pid, -            blocking_activity.usename AS blocking_user, -            blocked_activity.query    AS blocked_statement, -            blocking_activity.query   AS current_statement_in_blocking_process -      FROM  pg_catalog.pg_locks         blocked_locks -       JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid -       JOIN pg_catalog.pg_locks         blocking_locks -           ON blocking_locks.locktype = blocked_locks.locktype -           AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database -           AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation -           AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page -           AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple -           AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid -           AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid -           AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid -           AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid -           AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid -           AND blocking_locks.pid != blocked_locks.pid - -       JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid -      WHERE NOT blocked_locks.granted; diff --git a/docs/runbooks/index.rst b/docs/runbooks/index.rst deleted file mode 100644 index 18690c7..0000000 --- a/docs/runbooks/index.rst +++ /dev/null @@ -1,17 +0,0 @@ -Runbooks -======== - -Learn how to do anything in our infrastructure with these guidelines. - -.. note:: - -   In general, we try to codify manual processes as much as possible. 
Still, -   this section is important for tasks that are either hard to automate or are -   run so infrequently that it does not make sense to regularly run them. - - -.. toctree:: -   :maxdepth: 2 -   :caption: Contents: - -   postgresql-upgrade diff --git a/docs/runbooks/postgresql-upgrade.rst b/docs/runbooks/postgresql-upgrade.rst deleted file mode 100644 index 98b1642..0000000 --- a/docs/runbooks/postgresql-upgrade.rst +++ /dev/null @@ -1,149 +0,0 @@ -Upgrading PostgreSQL -==================== - -Step 1 - Enable maintenance mode --------------------------------- - -Add a worker route for ``pythondiscord.com/*`` to forward to the -``maintenance`` Cloudflare worker. - -Step 2 - Scale down all services that use PostgreSQL ----------------------------------------------------- - -Notably site, metricity, bitwarden and the like should be scaled down. - -Services that are read only such as Grafana (but NOT Metabase, which -uses PostgreSQL for internal storage) do not need to be scaled down, as -they do not update the database in any way. - -.. code:: bash - -   $ kubectl scale deploy --replicas 0 site metricity metabase bitwarden ... - -Step 3 - Take a database dump and gzip --------------------------------------- - -Using ``pg_dumpall``, dump the contents of all databases to a ``.sql`` -file. - -Make sure to gzip for faster transfer. - -Take a SHA512 sum of the output ``.sql.gz`` file to validate integrity -after copying. - -.. code:: bash - -   $ pg_dumpall -U pythondiscord > backup.sql -   $ gzip backup.sql -   $ sha512sum backup.sql.gz -   a3337bfc65a072fd93124233ac1cefcdfbe8a708e5c1d08adaca2cf8c7cbe9ae4853ffab8c5cfbe943182355eaa701012111a420b29cc4f74d1e87f9df3af459  backup.sql.gz - -Step 4 - Move database dump locally ------------------------------------ - -Use ``kubectl cp`` to move the ``backup.sql.gz`` file from the remote -pod to your local machine. - -Validate the integrity of the received file. 
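The checksum round trip described in Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the team's actual procedure: it uses a stand-in dump in a scratch directory, the ``.sha512`` file name is an assumption, and it relies on GNU ``sha512sum`` (on macOS, ``shasum -a 512`` is the usual equivalent).

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so the sketch never touches a real backup.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the real dump; in practice this is the pg_dumpall output.
printf 'SELECT 1;\n' > backup.sql
gzip backup.sql                       # replaces backup.sql with backup.sql.gz

# On the pod: record the checksum of the compressed file.
# The .sha512 file name is hypothetical, chosen for this sketch.
sha512sum backup.sql.gz > backup.sql.gz.sha512

# Locally, after copying both files across: verify the archive.
if sha512sum -c backup.sql.gz.sha512 >/dev/null; then
    verify_status=ok
else
    verify_status=mismatch
fi
echo "checksum: $verify_status"
```

Keeping the checksum in a file next to the archive means the comparison is done by ``sha512sum -c`` rather than by eye.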
- -Step 5 - Attempt local import to new PostgreSQL version -------------------------------------------------------- - -Install the new version of PostgreSQL locally and import the data. Make -sure you are operating on a **completely empty database server.** - -.. code:: bash - -   $ gzcat backup.sql.gz | psql -U joe - -You can use any PostgreSQL superuser for the import. Ensure that no -errors other than those mentioned below occur; you may need several -attempts to fix the errors listed below. - -Handle import errors -~~~~~~~~~~~~~~~~~~~~ - -Monitor the output of ``psql`` to check that no errors appear. - -If you receive locale errors, ensure that the locale your database is -configured with matches the import script. This may require some usage -of ``sed``: - -.. code:: bash - -   $ sed -i '' "s/en_US.utf8/en_GB.UTF-8/g" backup.sql - -Ensure that you **RESET THESE CHANGES** before attempting an import on -the remote. If the databases come from the PostgreSQL Docker image, they -will need the same locale as the export. - -Step 7 - Spin down PostgreSQL ------------------------------ - -Spin down PostgreSQL to 0 replicas. - -Step 8 - Take volume backup at Linode -------------------------------------- - -Back up the volume at Linode through a clone in the Linode UI, and name -it something obvious. - -Step 9 - Remove the Linode persistent volume --------------------------------------------- - -Delete the volume specified in the ``volume.yaml`` file in the -``postgresql`` directory. You must delete the ``pvc`` first, followed by -the ``pv``. You can find the relevant disks through -``kubectl get pv`` and ``kubectl get pvc``. - -Step 10 - Create a new volume by re-applying the ``volume.yaml`` file ---------------------------------------------------------------------- - -Apply the ``volume.yaml`` so that a new, empty volume is created. 
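Steps 7, 9 and 10 above can be sketched as a dry-run script that prints each ``kubectl`` command instead of executing it. Several names here are assumptions for illustration only: the deployment name ``postgresql``, the ``PVC_NAME`` and ``PV_NAME`` values, and the ``postgresql/volume.yaml`` path (inferred from the ``postgresql`` directory mentioned in Step 9). The Linode clone in Step 8 is a manual step in the UI and is only marked by a comment.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default) only prints commands; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
# Hypothetical object names; look up the real ones with `kubectl get pvc` / `kubectl get pv`.
PVC_NAME=${PVC_NAME:-postgresql-pvc}
PV_NAME=${PV_NAME:-postgresql-pv}
steps=0

run() {
    steps=$((steps + 1))
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 7: spin PostgreSQL down (deployment name is an assumption).
run kubectl scale deploy --replicas 0 postgresql

# Step 8 happens here, by hand: clone the volume in the Linode UI.

# Step 9: find the claim and the volume, then delete the PVC before the PV.
run kubectl get pvc
run kubectl get pv
run kubectl delete pvc "$PVC_NAME"
run kubectl delete pv "$PV_NAME"

# Step 10: re-create an empty volume from the manifest.
run kubectl apply -f postgresql/volume.yaml
```

Running it with the default ``DRY_RUN=1`` gives a checklist to review before anything destructive happens; the PVC is deleted before the PV because that is the order the runbook requires.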
- -Step 11 - Bump the PostgreSQL version in the ``deployment.yaml`` file ---------------------------------------------------------------------- - -Update the Docker image used in the deployment manifest. - -Step 12 - Apply the deployment ------------------------------- - -Run ``kubectl apply -f postgresql/deployment.yaml`` to start the new -database server. - -Step 13 - Copy the data across ------------------------------- - -After the pod has initialised use ``kubectl cp`` to copy the gzipped -backup to the new Postgres pod. - -Step 14 - Extract and import the new data ------------------------------------------ - -.. code:: bash - -   $ gunzip backup.sql.gz -   $ psql -U pythondiscord -f backup.sql - -Step 15 - Validate data import complete ---------------------------------------- - -Ensure that the import logs show no unexpected errors. You may get -duplicate errors for the ``pythondiscord`` user and database; these are -safe to ignore. - -Step 16 - Scale up services ---------------------------- - -Scale the services that were scaled down in Step 2 back up: - -.. code:: bash - -   $ kubectl scale deploy --replicas 1 metricity bitwarden metabase - -Step 17 - Validate all services interact correctly --------------------------------------------------- - -Validate that all services reconnect successfully and start exchanging -data. Ensure that no abnormal logs are output and that performance -remains as expected. diff --git a/docs/tooling/bots.rst b/docs/tooling/bots.rst deleted file mode 100644 index 7b5e165..0000000 --- a/docs/tooling/bots.rst +++ /dev/null @@ -1,55 +0,0 @@ -Bots -==== - -Our GitHub repositories are supported by two custom bots: - --  Our **Fast Forward Bot**, which ensures that commits merged into main -   are either merged manually on the command line or via a fast-forward, -   ensuring that cryptographic signatures of commits remain intact. -   Information on the bot can be found `in the ff-bot.yml -   configuration <https://github.com/python-discord/infra/blob/main/.github/ff-bot.yml>`__. 
-   Merges over the GitHub UI are discouraged for this reason. You can -   use it by running ``/merge`` on a pull request. Note that attempting -   to use it without permission to do so will be reported. - --  Our **Craig Dazey Emulator Bot**, which ensures team morale stays -   high at all times by thanking team members for submitted pull -   requests. [1]_ - -Furthermore, our repositories all have dependabot configured on them. - -Dealing with notifications --------------------------- - -This section collects some of our team members’ ways of dealing with the -notifications that originate from our bots. - -Sieve (RFC 5228) script -~~~~~~~~~~~~~~~~~~~~~~~ - -If your mail server supports the `Sieve mail filtering -language <https://datatracker.ietf.org/doc/html/rfc5228.html>`__, which -it should, you can adapt the following script to customize the amount of -notifications you receive: - -.. code:: sieve - -   require ["envelope", "fileinto", "imap4flags"]; - -   if allof (header :is "X-GitHub-Sender" ["coveralls", "github-actions[bot]", "netlify[bot]"], -             address :is "from" "[email protected]") { -       setflag "\\seen"; -       fileinto "Trash"; -       stop; -   } - -If you also want to filter out notifications from renovate, which we use -for dependency updates, you can add ``renovate[bot]`` to the -``X-GitHub-Sender`` list above. - -.. [1] -   Craig Dazey Emulator Bot stands in no affiliation, direct or -   indirect, with Craig Dazey. Craig Dazey Emulator Bot. Craig Dazey -   Emulator Bot is not endorsed by Craig Dazey. Craig Dazey Emulator Bot -   is an independent project of Craig Dazey. No association is made -   between Craig Dazey Emulator Bot and Craig Dazey. diff --git a/docs/tooling/index.rst b/docs/tooling/index.rst deleted file mode 100644 index 2381849..0000000 --- a/docs/tooling/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -Tooling -======= - -Learn about the helperlings that keep Python Discord DevOps running like a -well-oiled machine. 
- - -.. toctree:: -   :maxdepth: 2 -   :caption: Contents: - -   bots | 
