Task Scheduler Production Manual

General information about the Task Scheduler is available in the README.

GS bucket lifecycle config

The file bucket-lifecycle-config.json configures the skia-task-scheduler Google Storage bucket to move files to nearline or coldline storage after a period of time. Apply the configuration by running gsutil lifecycle set bucket-lifecycle-config.json gs://skia-task-scheduler.
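To verify or re-apply the configuration from a machine with gsutil installed, something like the following works (a sketch; the actual rules, typically SetStorageClass actions with age conditions, live in bucket-lifecycle-config.json itself):

  # Show the lifecycle rules currently applied to the bucket.
  gsutil lifecycle get gs://skia-task-scheduler
  # Re-apply the checked-in config after making changes.
  gsutil lifecycle set bucket-lifecycle-config.json gs://skia-task-scheduler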

More documentation on object lifecycle management is available in the Google Cloud Storage docs.

Troubleshooting

git-related errors in the log

E.g. fatal: unable to access '$REPO': The requested URL returned error: 502

Unfortunately, these are pretty common, especially in the early afternoon when googlesource is under load. Usually they manifest as a 502, or “repository not found”. If these are occurring at an unusually high rate (more than one or two per hour) or the errors look different, contact an admin and ask if there are any known issues: http://go/gob-oncall
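A rough way to gauge the rate, assuming the scheduler's logs are available via journald under a task-scheduler unit (the unit name and patterns are assumptions; adjust to your deployment):

  # Count git-related errors seen in the last hour; more than one or two
  # suggests a real problem. The unit name is an assumption.
  journalctl -u task-scheduler --since "1 hour ago" \
    | grep -c -E "returned error: 502|repository not found"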

Extremely slow startup

I.e. more than a few minutes before the server starts responding to requests, or liveness_last_successful_task_scheduling_s greater than a few minutes immediately after server startup.

The task scheduler has to load a lot of data from the DB on startup. Additionally, each of its clients reloads all of its data when the scheduler goes offline and comes back. These long-running reads can interact with writes such that operations get blocked and continue piling up; requests time out and are retried, compounding the problem. If you notice this happening (an extremely long list of “task_scheduler_db Active Transactions” in the log is a clue), you can ease the load on the scheduler by shutting down both Status and Datahopper, restarting the scheduler and waiting until it is up and running, then restarting Status, waiting until it is up and running, and finally restarting Datahopper. In each case, watch the logs and ensure that all “Reading Tasks from $start_ts to $end_ts” operations have completed successfully. TODO(borenet): This should not be necessary with the new DB implementation.
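A sketch of that restart sequence, assuming Status, Datahopper, and the scheduler run as systemd services with these unit names (adjust to your deployment):

  # Unit names are assumptions; adjust to your deployment.
  sudo systemctl stop status datahopper
  sudo systemctl restart task-scheduler
  # Watch the scheduler log until all "Reading Tasks from ... to ..." reads have
  # completed, then bring the clients back one at a time:
  sudo systemctl start status
  # Wait until Status is up and running, then:
  sudo systemctl start datahopper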

DB is very slow

E.g. liveness_last_successful_task_scheduling_s is consistently greater than a few minutes, and “task_scheduler_db Active Transactions” in the log are piling up.

Similar to the above, but not caused by long-running reads. BoltDB performance can degrade for a number of reasons, including excessive free pages. A potential fix is to stop the scheduler and run “bolt compact” over the database file. TODO(borenet): This should not be necessary with the new DB implementation.
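A sketch of the compaction, assuming the DB file lives in the scheduler's working directory (the file name is an assumption, and the exact flags depend on the bolt/bbolt CLI version; check bolt compact -h):

  # Stop the scheduler so the DB file is not open during compaction.
  sudo systemctl stop task-scheduler
  # Write a compacted copy, then swap it into place. The DB file name is an
  # assumption; check "bolt compact -h" for the exact flags of your CLI version.
  bolt compact -o /mnt/pd0/task_scheduler_workdir/task_scheduler.bdb.compact \
      /mnt/pd0/task_scheduler_workdir/task_scheduler.bdb
  mv /mnt/pd0/task_scheduler_workdir/task_scheduler.bdb.compact \
      /mnt/pd0/task_scheduler_workdir/task_scheduler.bdb
  sudo systemctl start task-scheduler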

Alerts

scheduling_failed

The Task Scheduler has failed to schedule for some time. Check the logs to try to diagnose what's failing. It's also possible that the scheduler has slowed down substantially and simply hasn't completed a scheduling loop in the required time period; that needs to be addressed with additional optimization.

http_latency

The server is taking too long to respond. Look at the logs to determine why it is slow.

error_rate

The server is logging errors at a higher-than-normal rate. This warrants investigation in the logs.

old_db_backup

The most recent backup of the local BoltDB database on Google Storage is more than 25 hours old.

  • If db_backup_trigger_liveness is firing, resolve that first.

  • Look for backup files in the skia-task-scheduler bucket that are more recent, in case the alert is incorrect (see the command sketch after this list).

  • Check that task-scheduler-db-backup is deployed to the server and the systemd service is enabled.

  • Check if there are any files in the directory /mnt/pd0/task_scheduler_workdir/trigger-backup. If not, check the systemd logs for task-scheduler-db-backup for errors.

  • If the systemd timer failed to execute, you can trigger a manual backup by running touch /mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual.

  • Otherwise, check logs for “Automatic DB backup failed” or other errors.
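A sketch of the checks above, assuming the backups live in the skia-task-scheduler bucket and the backup service logs to journald (object paths, timer name, and unit names are assumptions; adjust to your deployment):

  # List recent backup objects, newest last. Paths within the bucket are an assumption.
  gsutil ls -l 'gs://skia-task-scheduler/**' | sort -k 2 | tail -n 5
  # Check that the backup service and its timer are present and enabled.
  systemctl status task-scheduler-db-backup.service task-scheduler-db-backup.timer
  # Check for pending trigger files and recent errors from the backup service.
  ls -l /mnt/pd0/task_scheduler_workdir/trigger-backup
  journalctl -u task-scheduler-db-backup --since "1 day ago"
  # If needed, trigger a manual backup.
  touch /mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual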

too_many_recent_db_backups

There are too many recent backups in the skia-task-scheduler bucket. This indicates a runaway process is creating unnecessary backups. Review the task scheduler logs for “Beginning manual DB backup” to determine what is triggering the excessive backups.
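For example, assuming the scheduler logs via journald (the unit name is an assumption):

  # Find what is triggering the extra backups.
  journalctl -u task-scheduler --since "1 day ago" | grep "Beginning manual DB backup"
  # Check whether something keeps creating manual-backup trigger files.
  ls -l /mnt/pd0/task_scheduler_workdir/trigger-backup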

db_backup_trigger_liveness

The function DBBackup.Tick is not being called periodically. If the scheduling_failed alert is firing, resolve that first. Otherwise, check for recent code changes that may have unintentionally removed the callback to trigger a DB backup from the task scheduler loop.

incremental_backup_liveness

The function gsDBBackup.incrementalBackupStep has not succeeded recently. Check logs for “Incremental Job backup failed”. If Task Scheduler is otherwise operating normally, this is not a critical alert, since we also perform a full nightly backup.

incremental_backup_reset

The function gsDBBackup.incrementalBackupStep is not able to keep up with the rate of new and modified Jobs. This likely indicates a problem with the connection to Google Storage or the need for additional concurrency. Check logs for “Incremental Job backup failed” or “incrementalBackupStep too slow”. This alert will also resolve itself after the next full backup, which can be manually triggered by running touch /mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual.
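For example, to look for the relevant errors and force a full backup (the unit name is an assumption):

  # Look for the errors described above.
  journalctl -u task-scheduler --since "6 hours ago" \
    | grep -E "Incremental Job backup failed|incrementalBackupStep too slow"
  # Force a full backup rather than waiting for the nightly one.
  touch /mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual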

db_too_many_free_pages

The number of cached free pages in the Task Scheduler BoltDB has grown large. As this number grows, DB performance suffers. Please file a bug and increase the threshold in alerts.cfg. It's unclear what causes this issue, but it might be due to killing the process without gracefully closing the DB or due to large read transactions concurrent with write transactions.

too_many_candidates

The number of task candidates for a given dimension set is very high. This may not actually indicate that anything is wrong with the Task Scheduler. Instead, it may just mean that demand has exceeded bot capacity for one or more types of bots for an extended period. If possible, increase the bot capacity by adding more bots or by fixing offline or quarantined bots. Consider temporarily skipping backfill tasks for these bots to reduce load on the scheduler. An alternative long-term fix is to remove tasks for overloaded bots.

trigger_nightly

The nightly trigger has not run in over 25 hours. Check that the task-scheduler-trigger-nightly.service has run. If not, check the systemctl settings on the server. If so, check the Task Scheduler logs.

trigger_weekly

The weekly trigger has not run in over 8 days. Check that the task-scheduler-trigger-weekly.service has run. If not, check the systemctl settings on the server. If so, check the Task Scheduler logs.
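For either trigger, a quick way to check the timers and their recent runs, assuming they are driven by systemd timers on the server (the timer names are assumptions; the service names are as given above):

  # Show when the trigger timers last fired and when they fire next.
  systemctl list-timers | grep task-scheduler-trigger
  # Inspect recent runs of the trigger services.
  journalctl -u task-scheduler-trigger-nightly.service --since "2 days ago"
  journalctl -u task-scheduler-trigger-weekly.service --since "9 days ago"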

overdue_metrics_liveness

The function TaskScheduler.updateOverdueJobSpecMetrics is not being called periodically. If the scheduling_failed alert is firing, resolve that first. Otherwise, check the logs for error messages, check the timer_func_timer_ns{func="updateOverdueJobSpecMetrics"} metric, or look for recent changes that may have affected this function.

overdue_job_spec

Tasks have not completed recently for the indicated job, even though a reasonable amount of time has elapsed since an eligible commit. If any other task scheduler alerts are firing, resolve those first. Otherwise:

  • Check Status for pending or running tasks for this job. The Swarming UI provides the best information on why the task has not completed.

  • Check that the dimensions specified for the job's tasks match the bot that should run those tasks.

  • Check that the bots are available to run the tasks. Remember that forced jobs will always be completed before other jobs, and tryjobs get a higher score than regular jobs.

    • If there are many forced jobs that were triggered accidentally, the Job search UI can be used to bulk-cancel jobs.

latest_job_age

Jobs have not been triggered recently enough for the indicated job spec. This normally indicates that the periodic triggers have stopped working for some reason. Double check that the “periodic-trigger” cron jobs have run at the expected time in Kubernetes. If they have not, look into why. If they have, check the Task Scheduler logs to verify that the scheduler received the pubsub message and if so determine why it did not create the job.
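A sketch of checking the periodic triggers in Kubernetes (the namespace and exact CronJob names are assumptions; adjust to your cluster):

  # List the periodic-trigger CronJobs and their last scheduled times.
  kubectl get cronjobs | grep periodic-trigger
  # Show recent Jobs spawned by those CronJobs, newest last.
  kubectl get jobs --sort-by=.status.startTime | grep periodic-trigger | tail -n 5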

update_repos_failed

The scheduler (job creator) has failed to update its git repos for too long. Check the logs and determine what is going on. If the git servers are down or having problems, make sure that the team is aware by filing a bug or pinging IRC: http://go/gob-oncall
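A quick way to check whether the git server itself is reachable, independent of the scheduler (substitute the repo that is failing to update):

  # If this hangs or returns an HTTP error, the problem is likely on the git server side.
  git ls-remote https://skia.googlesource.com/skia.git HEAD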

poll_buildbucket_failed

The scheduler (job creator) has not successfully polled Buildbucket for new tryjobs in a while. Any tryjobs started by the CQ or manually during this period have not been picked up yet. Check the logs and determine what is going on.

If the git servers are down or having problems, make sure that the team is aware by filing a bug or pinging IRC: http://go/gob-oncall

If Buildbucket is down or having problems, see https://g.co/bugatrooper

You may want to notify skia-team about the disruption.

update_buildbucket_failed

The scheduler (job creator) has not successfully sent heartbeats to Buildbucket for in-progress tryjobs or sent status updates to Buildbucket for completed tryjobs in a while. Any tryjobs that have completed during this time will not be reflected on Gerrit or the CQ. Check the logs and determine what is going on.

If Buildbucket is down or having problems, see https://g.co/bugatrooper

You may want to notify skia-team about the disruption.