promk/README.md

Grafana

The grafana.ini file should almost never change. If it does, delete the pod and let Kubernetes restart it so that the new config gets read.

Edit the config file by running the ./edit-grafana-config.sh script.
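The edit-then-restart cycle described above can be sketched as a small shell helper. This is only an illustration: the label selector app=grafana is an assumption, not something stated in this README, so check the labels in the Grafana YAML before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: GRAFANA_SELECTOR is a hypothetical label selector. Override it
# to match the labels actually used by the Grafana pod.
GRAFANA_SELECTOR="${GRAFANA_SELECTOR:-app=grafana}"

# Delete the Grafana pod(s) matching the selector. Kubernetes then starts a
# replacement pod, which reads the updated grafana.ini on startup.
restart_grafana() {
  kubectl delete pod -l "${GRAFANA_SELECTOR}"
}
```

After ./edit-grafana-config.sh has finished, calling restart_grafana picks up the change. Running kubectl get pods with the same selector shows when the replacement pod is ready.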

Prometheus

Admins

Before deploying YAML files that contain service accounts, you need to give yourself cluster-admin rights:

  kubectl create clusterrolebinding \
    ${USER}-cluster-admin-binding \
    --clusterrole=cluster-admin \
    --user=${USER}@google.com
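To confirm that the binding took effect, or to remove it afterwards, something like the following may help. It is a sketch, not part of the documented workflow; the function names are made up for illustration, and the binding name simply mirrors the command above.

```shell
#!/usr/bin/env bash
# Name of the binding created by the kubectl create clusterrolebinding command.
binding_name() {
  echo "${USER}-cluster-admin-binding"
}

# Show the binding, then ask the API server whether the current credentials
# may perform any verb on any resource (prints "yes" or "no").
check_cluster_admin() {
  kubectl get clusterrolebinding "$(binding_name)" &&
    kubectl auth can-i '*' '*'
}

# Optional cleanup: remove the binding once it is no longer needed.
drop_cluster_admin() {
  kubectl delete clusterrolebinding "$(binding_name)"
}
```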

Thanos

The best way to get an idea of all the parts of Thanos and how they work together is to look at the diagram in the Thanos Tutorial.

There are two protected URLs for Thanos:

Both sites above are restricted to Googlers only.

All alert rules are evaluated by thanos-rule, which then sends alerts to alert-to-pubsub.

If an alert is changed, only make push_config_thanos needs to be run.

A Thanos sidecar runs alongside each Prometheus instance. For each Prometheus instance that runs outside of skia-public, we also run a thanos-bouncer container that sets up a reverse SSH port-forward, which allows thanos-query to make queries against the Thanos sidecar.
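For readers unfamiliar with reverse SSH port-forwards, the idea can be pictured with a plain ssh -R invocation. This is a generic sketch and not the actual thanos-bouncer implementation: the function name, the ports, and the destination are all placeholders. The only fact relied on is that 10901 is the default gRPC port of a Thanos sidecar.

```shell
#!/usr/bin/env bash
# Open a reverse tunnel: connections made to remote_port on the destination
# host are carried back over SSH to sidecar_port on this machine.
#   $1 remote_port   port opened on the destination (placeholder value)
#   $2 sidecar_port  local sidecar gRPC port (Thanos default is 10901)
#   $3 destination   user@host reachable from this cluster (placeholder)
reverse_forward() {
  local remote_port="$1" sidecar_port="$2" destination="$3"
  # -N: run no remote command, only forward ports.
  ssh -N -R "${remote_port}:localhost:${sidecar_port}" "${destination}"
}
```

With a tunnel like this in place, a querier on the destination side can reach the sidecar at localhost:remote_port even though the sidecar's cluster accepts no inbound connections.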

Additionally, thanos-store runs in skia-public and allows querying against all the historical data written by the Thanos sidecars.

The long term storage bucket for metrics is gs://skia-thanos.

We do not currently run an instance of the Thanos compactor.
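Since no compactor runs, the blocks uploaded by the sidecars stay in the bucket as written. A quick way to look at them, assuming gsutil is installed and your account can read the bucket, might be the following. The function names are illustrative only.

```shell
#!/usr/bin/env bash
# Long-term storage bucket named in this README.
THANOS_BUCKET="gs://skia-thanos"

# List the top-level entries in the bucket. Thanos stores each uploaded
# TSDB block under a directory named by the block's ULID.
list_thanos_blocks() {
  gsutil ls "${THANOS_BUCKET}/"
}

# Print the total size of the bucket in human-readable units.
thanos_bucket_size() {
  gsutil du -s -h "${THANOS_BUCKET}"
}
```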

Grafana

We can't get alerts if thanos-ruler stops sending alerts to alert-to-pubsub, so we need a second path for that failure. We use Grafana's ability to send alert emails to cover that case. There is a dashboard for the Thanos setup at https://grafana2.skia.org/d/7giJAG3Wk/thanos?orgId=1, and its Liveness panel has an alert that fires if alert-to-pubsub goes too long without seeing an alert come from thanos-ruler. When firing, the alert sends email to skiabot@google.com.