Cluster Telemetry Production Manual
General information about Cluster Telemetry (CT) is available in the design doc. The maintenance doc details how to maintain CT's different components.
Alerts
ctfe_pending_tasks
This alert indicates that there are many pending tasks in the queue. Possible causes and things to check:
- CT may not have enough capacity to handle the current task requests. If there are many bare-metal tasks (these currently include all CaptureSKPs and ChromiumPerf tasks as well as ChromiumAnalysis tasks where RunOnGCE is false) requested in a short period of time, it may take a while to complete all tasks.
- Check the “Task Details” of each task in the queue for "TsStarted": 0 (ignoring “scheduled in the future” tasks). CT normally picks up tasks in < 1m, so if a task is not started, that could mean that the CT poller is down (see below) or that something is wrong with the CT framework, possibly related to a recent push.
- Check the status of the bots in the CT SwarmingPool.
  - Note that the GCE bots will be dead if all pending tasks are bare-metal (see above for which tasks are bare-metal).
  - If many build*-m5 bots are dead, investigate why the bots are dead and/or file a bug with ChOps (Chrome infra team).
  - If many bots are idle, check the Swarming task logs (see next bullet) to see if the dimensions of the pending tasks match the bot dimensions.
- Open the SwarmingLogs link shown in the “Task Details.” If build_chromium has been running for > 1h, something is probably wrong.
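The first two checks above can be sketched as a small script. This is only an illustrative sketch, not part of the CT tooling: it assumes the “Task Details” of each queued task have been copied into Python dicts. The keys TsStarted and RunOnGCE come from the text above; task_type, scheduled_in_future, waiting_secs, and id are hypothetical names chosen for this example.

```python
from typing import Any, Dict, List

# Task types the manual lists as always running on bare-metal bots.
ALWAYS_BARE_METAL = {"CaptureSKPs", "ChromiumPerf"}

# "CT normally picks up tasks in < 1m".
PICKUP_THRESHOLD_SECS = 60


def is_bare_metal(task: Dict[str, Any]) -> bool:
    """Classify a task as bare-metal using the rules stated in the manual."""
    task_type = task.get("task_type")  # hypothetical key
    if task_type in ALWAYS_BARE_METAL:
        return True
    if task_type == "ChromiumAnalysis":
        # Bare-metal only when RunOnGCE is false. Treating a missing
        # value as false is an assumption of this sketch.
        return not task.get("RunOnGCE", False)
    return False


def triage(tasks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize the queue: pending count, stuck tasks, bare-metal pressure."""
    # Pending = not started, ignoring "scheduled in the future" tasks.
    pending = [
        t for t in tasks
        if t.get("TsStarted", 0) == 0
        and not t.get("scheduled_in_future", False)  # hypothetical key
    ]
    # Stuck = pending for longer than the normal pickup time.
    stuck = [
        t for t in pending
        if t.get("waiting_secs", 0) > PICKUP_THRESHOLD_SECS  # hypothetical key
    ]
    return {
        "pending": len(pending),
        "stuck": [t["id"] for t in stuck],  # hypothetical key
        "all_pending_bare_metal": bool(pending)
        and all(is_bare_metal(t) for t in pending),
    }


if __name__ == "__main__":
    queue = [
        {"id": 1, "task_type": "ChromiumPerf", "TsStarted": 0, "waiting_secs": 300},
        {"id": 2, "task_type": "ChromiumAnalysis", "RunOnGCE": True,
         "TsStarted": 0, "waiting_secs": 30},
    ]
    print(triage(queue))
```

A non-empty "stuck" list points toward the poller or a recent push, while "all_pending_bare_metal" being true suggests a capacity problem on the bare-metal bots rather than on the GCE bots.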