Task Scheduler Production Manual
================================
General information about the Task Scheduler is available in the
[README](./README.md).
GS bucket lifecycle config
--------------------------
The file `bucket-lifecycle-config.json` configures the skia-task-scheduler Google
Storage bucket to move files to Nearline or Coldline storage after a period of
time. The configuration can be applied by running `gsutil lifecycle set
bucket-lifecycle-config.json gs://skia-task-scheduler`.
[More documentation of object lifecycle](https://cloud.google.com/storage/docs/lifecycle).
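
Before re-applying it, you can compare what the bucket currently has against the
checked-in file; a minimal sketch, assuming `gsutil` is authenticated with
access to the bucket:

```
# Show the lifecycle rules currently applied to the bucket.
gsutil lifecycle get gs://skia-task-scheduler

# Re-apply the checked-in configuration if it differs.
gsutil lifecycle set bucket-lifecycle-config.json gs://skia-task-scheduler
```
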
Troubleshooting
===============
git-related errors in the log
-----------------------------
E.g. `fatal: unable to access '$REPO': The requested URL returned error: 502`
Unfortunately, these are pretty common, especially in the early afternoon when
googlesource.com is under load. Usually they manifest as a 502 or "repository not
found". If these are occurring at an unusually high rate (more than one or two
per hour) or the errors look different, contact an admin and ask if there are
any known issues: http://go/gob-oncall
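
To gauge the rate, you can count matching log lines over the last hour; a
minimal sketch, assuming the scheduler's logs are readable via journald and the
unit is named task-scheduler (a placeholder; adjust to the actual deployment):

```
# Count git-related failures in the last hour; more than one or two is unusual.
journalctl -u task-scheduler --since "1 hour ago" \
  | grep -c -E "returned error: 502|repository not found"
```
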
Extremely slow startup
----------------------
I.e. more than a few minutes before the server starts responding to requests, or
`liveness_last_successful_task_scheduling_s` greater than a few minutes
immediately after server startup.
The task scheduler has to load a lot of data from the DB on startup.
Additionally, each of its clients reloads all of its data when the scheduler
goes offline and comes back. These long-running reads can interact with writes
such that operations get blocked and continue piling up. Requests time out and
are retried, compounding the problem. If you notice this happening (an extremely
long list of "task_scheduler_db Active Transactions" in the log is a clue), you
can ease the load on the scheduler: shut down both Status and Datahopper,
restart the scheduler and wait until it is up and running, then restart Status,
wait until it is up and running, and finally restart Datahopper. In each case,
watch the logs and ensure that all "Reading Tasks from $start_ts to $end_ts"
have completed successfully.
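
A sketch of that restart sequence, assuming all three services run under
systemd with the unit names below (the names and the single-host assumption are
placeholders; if Status or Datahopper are deployed elsewhere, restart them via
their own deployment mechanism):

```
# Take the clients offline first so their bulk reloads stop hitting the DB.
systemctl stop status datahopper

# Restart the scheduler and watch its logs until all
# "Reading Tasks from $start_ts to $end_ts" lines have completed successfully.
systemctl restart task-scheduler
journalctl -u task-scheduler -f

# Bring Status back, wait until it is up and running, then bring back Datahopper.
systemctl start status
journalctl -u status -f
systemctl start datahopper
```
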
TODO(borenet): This should not be necessary with the new DB implementation.
DB is very slow
---------------
E.g. `liveness_last_successful_task_scheduling_s` is consistently greater than a
few minutes, and "task_scheduler_db Active Transactions" in the log are piling
up.
Similar to the above, but not caused by long-running reads. BoltDB performance
can degrade for a number of reasons, including excessive free pages. A potential
fix is to stop the scheduler and run `bolt compact` over the database file.
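
A sketch of that procedure, assuming the scheduler runs under systemd and using
a placeholder unit name and DB path (verify both before running; `bolt` here is
the BoltDB command-line tool):

```
# Stop the scheduler so the DB file is not open for writes.
systemctl stop task-scheduler

# Compact into a new file, then swap it into place.
DB=/mnt/pd0/task_scheduler_workdir/task_scheduler.bdb  # placeholder path
bolt compact -o "${DB}.compacted" "${DB}"
mv "${DB}" "${DB}.old"
mv "${DB}.compacted" "${DB}"

systemctl start task-scheduler
```
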
TODO(borenet): This should not be necessary with the new DB implementation.
Alerts
======
scheduling_failed
-----------------
The Task Scheduler has failed to schedule tasks for some time. You should check the
logs to try to diagnose what's failing. It's also possible that the scheduler
has slowed down substantially and simply hasn't actually completed a scheduling
loop in the required time period. That needs to be addressed with additional
optimization.
http_latency
------------
The server is taking too long to respond. Look at the logs to determine why it
is slow.
error_rate
----------
The server is logging errors at a higher-than-normal rate. This warrants
investigation in the logs.
old_db_backup
-------------
The most recent backup of the local BoltDB database on Google Storage is more
than 25 hours old.
- If db_backup_trigger_liveness is firing, resolve that first.
- Look for backup files in the
[skia-task-scheduler bucket](https://console.cloud.google.com/storage/browser/skia-task-scheduler/db-backup/)
that are more recent, in case the alert is incorrect.
- Check that task-scheduler-db-backup is deployed to the server and the systemd
service is enabled.
- Check if there are any files in the directory
`/mnt/pd0/task_scheduler_workdir/trigger-backup`. If not, check the systemd
logs for task-scheduler-db-backup for errors.
- If the systemd timer failed to execute, you can trigger a manual
backup by running `touch
/mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual`.
- Otherwise, check logs for "Automatic DB backup failed" or other errors.
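
To work through the checks above from a shell on the server (the bucket path,
unit name, and trigger directory come from this section):

```
# Look for recent backup files in case the alert is stale.
gsutil ls -l gs://skia-task-scheduler/db-backup/**

# Check the backup service and its recent logs.
systemctl status task-scheduler-db-backup
journalctl -u task-scheduler-db-backup --since "1 day ago"

# Inspect the trigger directory; creating the file below requests a manual backup.
ls /mnt/pd0/task_scheduler_workdir/trigger-backup/
touch /mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual
```
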
too_many_recent_db_backups
--------------------------
There are too many recent backups in the
[skia-task-scheduler bucket](https://console.cloud.google.com/storage/browser/skia-task-scheduler/db-backup/).
This indicates that a runaway process is creating unnecessary backups. Review the
task scheduler logs for "Beginning manual DB backup" to determine what is
triggering the excessive backups.
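
A quick way to see how many backups have landed recently and what is triggering
them (the task-scheduler unit name is a placeholder):

```
# Count backup objects in the bucket and compare against the expected cadence.
gsutil ls gs://skia-task-scheduler/db-backup/** | wc -l

# Find what requested the extra backups.
journalctl -u task-scheduler --since "1 day ago" | grep "Beginning manual DB backup"
```
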
db_backup_trigger_liveness
--------------------------
The function `DBBackup.Tick` is not being called periodically. If the
scheduling_failed alert is firing, resolve that first. Otherwise, check for
recent code changes that may have unintentionally removed the callback to
trigger a DB backup from the task scheduler loop.
incremental_backup_liveness
---------------------------
The function `gsDBBackup.incrementalBackupStep` has not succeeded recently. Check
logs for "Incremental Job backup failed". If Task Scheduler is otherwise
operating normally, this is not a critical alert, since we also perform a full
nightly backup.
incremental_backup_reset
------------------------
The function `gsDBBackup.incrementalBackupStep` is not able to keep up with the
rate of new and modified Jobs. This likely indicates a problem with the
connection to Google Storage or the need for additional concurrency. Check logs
for "Incremental Job backup failed" or "incrementalBackupStep too slow". This
alert will also resolve itself after the next full backup, which can be manually
triggered by running `touch
/mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual`.
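
A sketch of those checks plus the manual trigger (the task-scheduler unit name
is a placeholder):

```
# Look for the relevant failure messages in recent logs.
journalctl -u task-scheduler --since "6 hours ago" \
  | grep -E "Incremental Job backup failed|incrementalBackupStep too slow"

# Request a full backup, which also resets this alert.
touch /mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual
```
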
db_too_many_free_pages
----------------------
The number of cached free pages in the Task Scheduler BoltDB has grown
large. As this number grows, DB performance suffers. Please file a bug and
increase the threshold in `alerts.cfg`. It's unclear what causes this issue, but
it might be due to killing the process without gracefully closing the DB or due
to large read transactions concurrent with write transactions.
too_many_candidates
-------------------
The number of task candidates for a given dimension set is very high. This may
not actually indicate that anything is wrong with the Task Scheduler. Instead,
it may just mean that demand has exceeded bot capacity for one or more types of
bots for an extended period. If possible, increase the bot capacity by adding
more bots or by fixing offline or quarantined bots. Consider temporarily
skipping backfill tasks for these bots to reduce load on the scheduler. An
alternative long-term fix is to remove tasks for overloaded bots.
trigger_nightly
---------------
The nightly trigger has not run in over 25 hours. Check that the
task-scheduler-trigger-nightly.service has run. If not, check the systemd
configuration on the server. If so, check the Task Scheduler logs.
trigger_weekly
--------------
The weekly trigger has not run in over 8 days. Check that the
task-scheduler-trigger-weekly.service has run. If not, check the systemd
configuration on the server. If so, check the Task Scheduler logs.
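
For either trigger, a quick check of the systemd timers and the most recent run
(assuming the triggers are driven by systemd timers on the server; substitute
-weekly as needed):

```
# When each trigger timer last fired and when it fires next.
systemctl list-timers 'task-scheduler-trigger-*'

# Status and recent output of the nightly run.
systemctl status task-scheduler-trigger-nightly.service
journalctl -u task-scheduler-trigger-nightly.service --since "2 days ago"
```
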
overdue_metrics_liveness
------------------------
The function `TaskScheduler.updateOverdueJobSpecMetrics` is not being called
periodically. If the scheduling_failed alert is firing, resolve that first.
Otherwise, check the logs for error messages, check the
`timer_func_timer_ns{func="updateOverdueJobSpecMetrics"}` metric, or look
for recent changes that may have affected this function.
overdue_job_spec
----------------
Tasks have not completed recently for the indicated job, even though a
reasonable amount of time has elapsed since an eligible commit. If any other
task scheduler alerts are firing, resolve those first. Otherwise:
- Check Status for pending or running tasks for this job. The Swarming UI
provides the best information on why the task has not completed.
- Check that the dimensions specified for the job's tasks match the bot that
should run those tasks.
- Check that the bots are available to run the tasks. Remember that forced jobs
will always be completed before other jobs, and tryjobs get a higher score
than regular jobs.
- If there are many forced jobs that were triggered accidentally, the [Job
search UI](https://task-scheduler.skia.org/jobs/search) can be used to
bulk-cancel jobs.
latest_job_age
--------------
Jobs have not been triggered recently enough for the indicated job spec. This
normally indicates that the periodic triggers have stopped working for some
reason. Double-check that the "periodic-trigger" cron jobs have run at the
expected time in Kubernetes. If they have not, look into why. If they have,
check the Task Scheduler logs to verify that the scheduler received the Pub/Sub
message and, if so, determine why it did not create the job.
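
A sketch of the Kubernetes check, assuming the cron jobs contain
"periodic-trigger" in their names and live in the default namespace (adjust
both to the actual cluster config):

```
# Confirm the cron jobs exist and when they were last scheduled.
kubectl get cronjobs | grep periodic-trigger

# Look at the most recent executions and whether they completed.
kubectl get jobs --sort-by=.metadata.creationTimestamp | grep periodic-trigger | tail
```
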
update_repos_failed
-------------------
The scheduler (job creator) has failed to update its git repos for too long.
Check the logs and determine what is going on. If the git servers are down or
having problems, make sure that the team is aware by filing a bug or pinging
IRC:
http://go/gob-oncall
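
A quick way to check whether the git server itself is responding, independent
of the scheduler (the repo URL is just an example; use the repo named in the
error):

```
# A healthy server returns the HEAD commit hash; a 502 or timeout points at the
# git hosting side rather than the scheduler.
git ls-remote https://skia.googlesource.com/skia.git HEAD
```
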
poll_buildbucket_failed
-----------------------
The scheduler (job creator) has not successfully polled Buildbucket for new
tryjobs in a while. Any tryjobs started by the CQ or manually during this period
have not been picked up yet. Check the logs and determine what is going on.
If the git servers are down or having problems, make sure that the team is aware
by filing a bug or pinging IRC:
http://go/gob-oncall
If Buildbucket is down or having problems, see https://g.co/bugatrooper
You may want to notify skia-team about the disruption.
update_buildbucket_failed
-------------------------
The scheduler (job creator) has not successfully sent heartbeats to Buildbucket
for in-progress tryjobs or sent status updates to Buildbucket for completed
tryjobs in a while. Any tryjobs that have completed during this time will not be
reflected on Gerrit or the CQ. Check the logs and determine what is going on.
If Buildbucket is down or having problems, see https://g.co/bugatrooper
You may want to notify skia-team about the disruption.