Cluster Telemetry Production Manual

General information about Cluster Telemetry (CT) is available in the design doc. The maintenance doc describes how to maintain CT's components.

Alerts

ctfe_pending_tasks

CT normally picks up tasks in under a minute. A task that stays pending in the queue could mean that the CT poller is down (see ct_poller_health_check below) or that something is wrong with the CT framework, possibly because of a recent push.
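
As a rough illustration of the condition this alert describes, the sketch below flags tasks that have been pending longer than the usual one-minute pickup time. It is not CT's actual alerting code: the `PendingTask` record, the `overdue_tasks` helper, and the threshold constant are all hypothetical names chosen for this example, and the real alert may be evaluated differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: based on the statement that tasks are normally picked up in under a minute.
PICKUP_THRESHOLD = timedelta(minutes=1)


@dataclass
class PendingTask:
    # Hypothetical record; not the CT frontend's actual schema.
    task_id: int
    added_at: datetime


def overdue_tasks(tasks, now, threshold=PICKUP_THRESHOLD):
    """Return the tasks that have been pending longer than `threshold`."""
    return [t for t in tasks if now - t.added_at > threshold]


# Example queue: one task added 20 seconds ago, one added 5 minutes ago.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
queue = [
    PendingTask(1, now - timedelta(seconds=20)),
    PendingTask(2, now - timedelta(minutes=5)),
]

stuck = overdue_tasks(queue, now)
print([t.task_id for t in stuck])
```

A non-empty result here corresponds to the situation worth investigating: either the poller is not running or tasks are failing to start after being picked up.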

ct_poller_health_check

The CT poller health check is failing. The poller's error logs are here. The poller runs on the CT master in the Chrome Golo. See the instructions here for how to access the master.