  | Commit message | Date | Lines (-/+) | |
---|---|---|---|---|
* | Exclude 401 from NGINX 4xx alerts | 2025-07-18 | -1/+1 | |
| | ||||
* | Bump dns cache miss alert time | 2025-05-16 | -2/+3 | |
| | ||||
* | Update ingresses with NGINX ingress upgrade | 2025-04-05 | -3/+3 | |
| | ||||
* | Update quay.io/prometheus/node-exporter Docker tag to v1.9.0 | 2025-02-23 | -1/+1 | |
| | | | | Bumps docker image quay.io/prometheus/node-exporter from v1.8.2 to v1.9.0. | |||
* | Update registry.k8s.io/kube-state-metrics/kube-state-metrics Docker tag to v2.15.0 | 2025-02-06 | -1/+1 | |
| | | | | Bumps docker image registry.k8s.io/kube-state-metrics/kube-state-metrics from v2.13.0 to v2.15.0. | |||
* | Remove prestashop from alert.d nginx exception for p99 timing alert | 2024-10-01 | -1/+1 | |
| | ||||
* | Disable GitHub authentication in Grafana | 2024-09-19 | -11/+0 | |
| | ||||
* | Further optimisations to AlertManager initscript | 2024-09-19 | -1/+1 | |
| | ||||
* | AlertManager init use Alpine | 2024-09-04 | -2/+2 | |
| | | | | Smaller image and faster startup | |||
* | Raise time threshold for 4xx alerts | 2024-09-01 | -1/+1 | |
| | | | | | | At present we get plenty of unactionable, flapping alarms. So far, they have shown us nothing of value. Raise the length of time that consecutive errors must be seen before we alert. | |||
* | Show status code in nginx alerts | 2024-08-29 | -4/+4 | |
| | ||||
* | Install prometheus-postfix-exporter | 2024-08-26 | -0/+22 | |
| | | | | | | | | As a data-obsessed administrator I want to have more data such that I can widen my sense of power. This also installs rsyslog, because prometheus-postfix-exporter doesn't work with journald's binary log format. | |||
* | Improve alertmanager text e-mail format | 2024-08-26 | -1/+12 | |
| | | | | Already deployed. | |||
* | Update alertmanager config with mail port | 2024-08-25 | -1/+1 | |
| | ||||
* | Configure alertmanager to send e-mails | 2024-08-25 | -0/+11 | |
| | ||||
* | Unify alertmanager naming | 2024-08-25 | -31/+31 | |
| | | | | Closes #451. | |||
* | Mount new config for LDAP to Grafana and add IPA CA cert | 2024-07-26 | -2/+14 | |
| | ||||
* | Add LDAP bind user password for Grafana | 2024-07-26 | -0/+0 | |
| | ||||
* | Add new Grafana LDAP config and ldap.toml config | 2024-07-26 | -0/+65 | |
| | ||||
* | chore(deps): update registry.k8s.io/kube-state-metrics/kube-state-metrics docker tag to v2.13.0 (#412) | 2024-07-24 | -1/+1 | |
| | | | | Bumps docker image registry.k8s.io/kube-state-metrics/kube-state-metrics from v2.12.0 to v2.13.0. Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> | |||
* | Update node_exporter daemonset to 1.27+ featureset | 2024-07-18 | -3/+3 | |
| | ||||
* | chore(deps): update quay.io/prometheus/node-exporter docker tag to v1.8.2 | 2024-07-18 | -1/+1 | |
| | | | | Bumps docker image quay.io/prometheus/node-exporter from v1.2.0 to v1.8.2. | |||
* | Add Admins to Grafana authorized Team IDs | 2024-07-14 | -1/+1 | |
| | ||||
* | Allow new kube-state-metrics image to watch ingresses | 2024-07-01 | -0/+1 | |
| | ||||
* | Move away from vendored kube-state-metrics | 2024-07-01 | -1/+1 | |
| | ||||
* | Scale AM back to 3 replicas | 2024-06-24 | -1/+1 | |
| | ||||
* | Add Kubernetes volume alerts | 2024-06-16 | -0/+11 | |
| | | | | | | | | | | It seems that Linode has added storage reporting to the CSI driver, allowing us to pick up on the storage use of persistent volume claims within the cluster. This creates and deploys an alert that fires if any volume has less than 10% of its space left (a rule sketch follows the log below). I have excluded Prometheus, as our TSDB retention settings mean it will always stay just below its volume size by design. | |||
* | Update Prometheus deployment with a tmpfs for the reloader | 2024-06-10 | -0/+9 | |
| | ||||
* | Add secrets for reloader webhook | 2024-06-10 | -0/+0 | |
| | ||||
* | Add sidecar container to reload Prometheus config on change | 2024-06-10 | -0/+25 | |
| | ||||
* | Add reloader hook configmap to reload prometheus on change | 2024-06-10 | -0/+38 | |
| | ||||
* | Add Alert for Prometheus config reload failure | 2024-06-10 | -0/+9 | |
| | ||||
* | Enable scraping of Prometheus pods | 2024-06-10 | -0/+3 | |
| | ||||
* | Remove PostgreSQL Exporter from Kubernetes | 2024-06-02 | -55/+0 | |
| | ||||
* | Remove Kubernetes PostgreSQL Alerts | 2024-06-02 | -29/+0 | |
| | ||||
* | Fix AlertManager Discord instance formatting | 2024-05-27 | -1/+1 | |
| | | | | | | | | | | We made a change to include the instance in alerts sent to Discord, but not all of our configured alerts carry this field. As a result, incorrectly formatted alerts were being sent to Discord and were tricky to read. The format template has now been changed to render the instance label only when it is present on a triggered alert (a template sketch follows the log below). | |||
* | Take 15 minutes before alerting on high latency | 2024-05-20 | -2/+2 | |
| | ||||
* | Annotations.instance => Labels.instance | 2024-05-18 | -1/+1 | |
| | ||||
* | Add instance to AlertManager Discord embeds | 2024-05-17 | -1/+1 | |
| | ||||
* | Move AlertManager to 4 replicas | 2024-05-16 | -1/+1 | |
| | ||||
* | Move AlertManager to pydis.wtf | 2024-05-14 | -4/+5 | |
| | ||||
* | Move prometheus to pydis.wtf | 2024-05-14 | -3/+4 | |
| | ||||
* | Update Grafana configmap to grafana.pydis.wtf | 2024-05-14 | -2/+2 | |
| | ||||
* | Update Grafana ingress to grafana.pydis.wtf | 2024-05-14 | -3/+3 | |
| | ||||
* | Stop alerting for slow GitHub webhook filter endpoint calls (#235) | 2024-04-29 | -2/+2 | |
| | | | | | These are directly forwarded to GitHub with no time-consuming processing done on the site. We would therefore be alerting for GitHub's slowness, which is rather useless. | |||
* | Update all secrets to new PostgreSQL service | 2024-04-27 | -0/+0 | |
| | ||||
* | Exclude home and tag views from latency alerts | 2024-04-24 | -2/+2 | |
| | | | | These are known issues and we probably won't do anything about them, so stop alerting us about them. | |||
* | Update ContainerOOMEvent alert | 2024-04-17 | -4/+4 | |
| | ||||
* | Move Redis to databases namespace | 2024-04-15 | -0/+0 | |
| | ||||
* | Move Grafana to monitoring namespace | 2024-04-15 | -0/+151 | |
| |
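
The "Add Kubernetes volume alerts" entry above describes a rule that fires when a persistent volume claim drops below 10% free space, with Prometheus excluded. A minimal sketch of such a rule, assuming the standard kubelet volume metrics; the group name, alert name, `for` duration, and label matchers are illustrative, not the repository's actual rule:

```yaml
# Sketch only: names, duration, and matchers are assumptions.
groups:
  - name: kubernetes-volumes
    rules:
      - alert: PersistentVolumeSpaceLow
        # Fire when a PVC has less than 10% of its capacity available,
        # skipping Prometheus volumes, which sit near-full by design.
        expr: |
          kubelet_volume_stats_available_bytes{persistentvolumeclaim!~"prometheus.*"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim!~"prometheus.*"}
          < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} has less than 10% space remaining"
```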
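
The "Fix AlertManager Discord instance formatting" entry describes rendering the instance label only when an alert carries it. A minimal sketch of that idea using Alertmanager's Go templating, wrapped in a hypothetical ConfigMap; the names and the way the template is wired into the Discord notifier are assumptions and may differ from the repository's actual setup:

```yaml
# Sketch only: ConfigMap and template names are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-templates
  namespace: monitoring
data:
  discord.tmpl: |
    {{ define "discord.alert.text" }}
    {{ range .Alerts }}
    {{ .Annotations.summary }}{{ if .Labels.instance }} (instance: {{ .Labels.instance }}){{ end }}
    {{ end }}
    {{ end }}
```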