The grafana.ini file should almost never change. If it does, delete the pod and let Kubernetes restart it so the new config gets read.
Edit the config file by running the ./edit-grafana-config.sh script.
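As a rough sketch of the restart step, deleting the pod might look like the following. The app=grafana label selector and the restart_grafana helper name are assumptions, not taken from this repo; check the real labels with kubectl get pods --show-labels first.

```shell
#!/bin/sh
# Sketch: restart Grafana by deleting its pod so that Kubernetes
# recreates it and the edited grafana.ini gets read.
# ASSUMPTION: the pod carries the label app=grafana; pass a different
# selector as the first argument if the deployment uses another label.
# KUBECTL can be overridden (e.g. KUBECTL="echo kubectl") to preview
# the command instead of running it.
restart_grafana() {
  selector="${1:-app=grafana}"
  ${KUBECTL:-kubectl} delete pod --selector "$selector"
}
```

Calling restart_grafana with no arguments uses the assumed selector; setting KUBECTL="echo kubectl" beforehand only prints what would be run.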
Before deploying yaml files with service accounts you need to give yourself cluster-admin rights:
kubectl create clusterrolebinding \
  ${USER}-cluster-admin-binding \
  --clusterrole=cluster-admin \
  --user=${USER}@google.com
The best way to get an idea of all the parts of Thanos and how they work together is to look at the diagram in the Thanos Tutorial.
There are two protected URLs for Thanos:
Both sites above are restricted to Googlers only.
All alert rules are evaluated by thanos-rule, which then sends alerts to alert-to-pubsub.
If you add/remove alerts, please run make update_alerts to deploy them. am.skia.org will take 5-10 minutes to see these changed alerts.
A Thanos sidecar runs alongside each Prometheus instance. For each Prometheus instance that runs outside of skia-public, we also run a thanos-bouncer container that sets up a reverse SSH port-forward that allows thanos-query to make queries against the Thanos sidecar.
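To illustrate the idea behind the reverse port-forward, a hand-rolled equivalent using plain ssh -R could look like the sketch below. This is not the actual thanos-bouncer implementation: the helper name, the host, and the remote port are placeholders, and 10901 is assumed only because it is the Thanos sidecar's default gRPC port.

```shell
#!/bin/sh
# Sketch: build the ssh command for a reverse port-forward.
#   -N  do not run a remote command, only forward ports
#   -R  REMOTE_PORT:localhost:LOCAL_PORT opens REMOTE_PORT on the remote
#       host and tunnels connections back to LOCAL_PORT on this side
# ASSUMPTIONS: remote host and ports are hypothetical; the local port
# defaults to 10901, the Thanos sidecar's default gRPC port.
reverse_forward_cmd() {
  remote_host="$1"          # host reachable from the outside cluster
  remote_port="$2"          # port a querier would connect to on that host
  local_port="${3:-10901}"  # port the sidecar listens on locally
  echo "ssh -N -R ${remote_port}:localhost:${local_port} ${remote_host}"
}
```

The function only prints the command, so it can be inspected before anything is run. The connection is initiated from outside skia-public, which is why a reverse (-R) forward is used rather than a local (-L) one.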
Additionally, thanos-store runs in skia-public and allows querying against all the historical data written by the thanos-sidecars.
The long term storage bucket for metrics is gs://skia-thanos.
We do not currently run an instance of the Thanos compactor.
Obviously we can't get alerts if thanos-ruler stops sending alerts to alert-to-pubsub, so we need a second path for such alerts. We use Grafana's ability to send alert emails to cover that case. There is a dashboard for the Thanos setup at https://grafana2.skia.org/d/7giJAG3Wk/thanos?orgId=1, and its Liveness panel has an alert that fires if alert-to-pubsub goes too long without seeing an alert come from thanos-ruler. When firing, the alert sends email to skiabot@google.com.
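The liveness alert is essentially a staleness check: it fires when the time since the last alert seen exceeds a threshold. The real rule lives in the Grafana panel; the logic can be sketched as below, where the is_stale helper and the 600-second default are hypothetical and not taken from the dashboard.

```shell
#!/bin/sh
# is_stale LAST_SEEN NOW [MAX_AGE]
# Succeeds (exit status 0) when more than MAX_AGE seconds have passed
# between LAST_SEEN and NOW, both given as Unix timestamps.
# ASSUMPTION: the 600-second default is illustrative only; the actual
# threshold is whatever the Grafana alert is configured with.
is_stale() {
  last_seen="$1"
  now="$2"
  max_age="${3:-600}"
  [ $((now - last_seen)) -gt "$max_age" ]
}
```

For example, is_stale "$last_alert_ts" "$(date +%s)" would succeed once no alert has been seen for over ten minutes, with last_alert_ts standing in for however the last-seen time is recorded.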
kube-state-metrics runs in all clusters, allowing collection of metrics on the state of objects, in particular states you normally can't get from default metrics, such as cronjobs that have failed.
To update the version of kube-state-metrics we use, the Makefile target release_kube-state-metrics can be updated to use a different tag.