Cluster Telemetry Production Manual

General information about the Cluster Telemetry is available in the design doc. The maintenance doc details how to maintain CT's different components.

Alerts

ctfe_pending_tasks

This alert indicates there are many tasks in the queue. There are several possibilities:

  • CT may not have enough capacity to handle the current task requests. If there are many bare-metal tasks (these currently include ChromiumPerf tasks as well as ChromiumAnalysis tasks where RunOnGCE is false) requested in a short period of time, it may take a while to complete all tasks.
  • Check the “Task Details” of each task in the queue for "TsStarted": 0 (ignoring “scheduled in the future” tasks). CT normally picks up tasks in < 1m, so if a task is not started, that could mean that the CT poller is down (see below) or that something is wrong with the CT framework possibly related to a recent push.
  • Check the status of the bots in the CT SwarmingPool.
    • Note that the GCE bots will be dead if all pending tasks are bare-metal (see above for which tasks are bare-metal).
    • If many build*-m5 bots are dead, investigate why the bots are dead and/or file a bug with ChOps (Chrome infra team).
    • If many bots are idle, check the Swarming task logs (see next bullet) to see if the dimensions of the pending tasks match the bot dimensions.
  • Open the SwarmingLogs link shown in the “Task Details.” If build_chromium has been running for > 1h, something is probably wrong.