# Cluster Telemetry Production Manual
General information about Cluster Telemetry (CT) is available in the
[design doc](./DESIGN.md).
The [maintenance doc](./maintenance.md) explains how to maintain CT's
different components.
# Alerts
## ctfe_pending_tasks
This alert indicates that there are many tasks in the
[queue](https://ct.skia.org/queue/). There are several possible causes and
things to check:
- CT may not have enough capacity to handle the current task requests. If there
are many bare-metal tasks (these currently include ChromiumPerf tasks as well
as ChromiumAnalysis tasks where `RunOnGCE` is
`false`) requested in a short period of time, it may take a while to complete
all tasks.
- Check the "Task Details" of each task in the
[queue](https://ct.skia.org/queue/) for `"TsStarted": 0` (ignoring "scheduled
in the future" tasks). CT normally picks up tasks in under 1 minute, so a task
that has not started could mean that the CT poller is down (see below) or that
something is wrong with the CT framework, possibly related to a recent push.
- Check the status of the bots in the [CT Swarming Pool](https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3ACT&l=100&s=id%3Aasc).
  - Note that the GCE bots will be dead if all pending tasks are bare-metal (see
    above for which tasks are bare-metal).
  - If many build\*-m5 bots are dead, investigate why the bots are dead and/or
    [file a bug](https://code.google.com/p/chromium/issues/entry?template=Build%20Infrastructure)
    with ChOps (Chrome infra team).
  - If many bots are idle, check the Swarming task logs (see next bullet) to see
    if the dimensions of the pending tasks match the bot dimensions.
- Open the `SwarmingLogs` link shown in the "Task Details". If `build_chromium`
has been running for more than 1 hour, something is probably wrong.
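
Two of the checks above (finding tasks with `"TsStarted": 0` and comparing pending task dimensions against bot dimensions) can be sketched in a few lines of Python. This is only an illustration, not part of CT: it assumes the task details and bot dimensions have already been copied into plain dictionaries by hand. `TsStarted` is the field named above; the `ScheduledInFuture` flag, the `Id` key, the function names, and the sample dimension values are hypothetical. The matching rule assumes each bot advertises a list of values per dimension and a task requests a single value per dimension.

```python
from typing import Any, Dict, List


def find_unstarted(tasks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return tasks that have not been picked up yet.

    `TsStarted` is the field shown in the Task Details. `ScheduledInFuture`
    is a hypothetical flag standing in for "scheduled in the future" tasks,
    which are ignored.
    """
    return [
        t for t in tasks
        if t.get("TsStarted", 0) == 0 and not t.get("ScheduledInFuture", False)
    ]


def bot_matches(task_dims: Dict[str, str], bot_dims: Dict[str, List[str]]) -> bool:
    """Return True if the bot advertises every dimension value the task requests.

    Assumption: bot dimensions map a key to a list of values, and a task
    requests exactly one value per key.
    """
    return all(value in bot_dims.get(key, []) for key, value in task_dims.items())


if __name__ == "__main__":
    # Hypothetical sample data, typed in by hand for illustration.
    queue = [
        {"Id": 1, "TsStarted": 0},
        {"Id": 2, "TsStarted": 1},  # any nonzero value means "started"
        {"Id": 3, "TsStarted": 0, "ScheduledInFuture": True},
    ]
    print([t["Id"] for t in find_unstarted(queue)])

    pending = {"pool": "CT", "os": "Linux"}
    idle_bot = {"pool": ["CT"], "os": ["Windows"]}
    print(bot_matches(pending, idle_bot))
```

If `bot_matches` returns `False` for every idle bot, the pending tasks are probably requesting dimensions that no bot in the pool provides, which would explain idle bots alongside a growing queue.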