Cluster Telemetry Production Manual

General information about Cluster Telemetry (CT) is available in the design doc. The maintenance doc describes how to maintain CT's different components.

Alerts

ctfe_pending_tasks

This alert indicates there are many tasks in the queue. There are several possibilities:

CT may not have enough capacity to handle the current task requests. If many bare-metal tasks are requested in a short period of time, it may take a while to complete them all. Bare-metal tasks currently include all CaptureSKPs and ChromiumPerf tasks, as well as ChromiumAnalysis tasks where RunOnGCE is false.
Check the “Task Details” of each task in the queue for "TsStarted": 0 (ignoring “scheduled in the future” tasks). CT normally picks up tasks in under a minute, so a task that has not started could mean that the CT poller is down (see below) or that something is wrong with the CT framework, possibly because of a recent push.
Check the status of the bots in the CT SwarmingPool.
- Note that the GCE bots will be idle if all pending tasks are bare-metal (see above for which tasks are bare-metal).
- If many build*-m5 bots are dead, investigate why the bots are dead and/or file a bug with ChOps (Chrome infra team).
- If many bots are idle, check the Swarming task logs (see next bullet) to see if the dimensions of the pending tasks match the bot dimensions.
Open the SwarmingLogs link shown in the “Task Details.” If build_chromium has been running for > 1h, something is probably wrong.
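
The "TsStarted": 0 check above can be scripted when the queue is long. The following is a minimal sketch, not part of CT itself. It assumes the "Task Details" of each queued task have already been parsed into Python dicts. Only the "TsStarted" field comes from this manual. The function name and the is_future_scheduled callback are hypothetical, and the caller must supply whatever test identifies a “scheduled in the future” task.

```python
from typing import Any, Callable, Dict, Iterable, List

Task = Dict[str, Any]


def unstarted_tasks(
    tasks: Iterable[Task],
    is_future_scheduled: Callable[[Task], bool] = lambda task: False,
) -> List[Task]:
    """Return the tasks that CT has not picked up yet.

    A task counts as unstarted when its "TsStarted" field is 0. A missing
    "TsStarted" field is also treated as unstarted (an assumption of this
    sketch). Tasks for which is_future_scheduled returns True are skipped,
    because they are not expected to have started.
    """
    return [
        task
        for task in tasks
        if task.get("TsStarted", 0) == 0 and not is_future_scheduled(task)
    ]
```

If this returns any tasks that have been queued for more than a few minutes, continue with the poller and Swarming checks described above.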

ct_poller_health_check

The CT poller health check is failing. The poller's error logs are here. The poller runs on the CT master (ct-master Google Cloud Kubernetes service).