Task Scheduler Production Manual
================================
General information about the Task Scheduler is available in the
[README](./README.md).
GS bucket lifecycle config
--------------------------
The file `bucket-lifecycle-config.json` configures the skia-task-scheduler Google
Storage bucket to move files to Nearline or Coldline storage after a period of
time. The configuration can be applied by running `gsutil lifecycle set
bucket-lifecycle-config.json gs://skia-task-scheduler`.
[More documentation of object lifecycle](https://cloud.google.com/storage/docs/lifecycle).
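
Before re-applying it, you can compare what the bucket currently has against the
checked-in file; a minimal sketch, assuming `gsutil` is authenticated with
access to the bucket:

```
# Show the lifecycle rules currently applied to the bucket.
gsutil lifecycle get gs://skia-task-scheduler

# Re-apply the checked-in configuration if it differs.
gsutil lifecycle set bucket-lifecycle-config.json gs://skia-task-scheduler
```
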
Troubleshooting
===============
git-related errors in the log
-----------------------------
E.g. `fatal: unable to access '$REPO': The requested URL returned error: 502`
Unfortunately, these are pretty common, especially in the early afternoon when
googlesource.com is under load. Usually they manifest as a 502 or "repository not
found". If these are occurring at an unusually high rate (more than one or two
per hour) or the errors look different, contact an admin and ask if there are
any known issues: http://go/gob-oncall
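
To gauge the rate, you can count matching log lines over the last hour; a
minimal sketch, assuming the scheduler's logs are readable via journald and the
unit is named task-scheduler (a placeholder; adjust to the actual deployment):

```
# Count git-related failures in the last hour; more than one or two is unusual.
journalctl -u task-scheduler --since "1 hour ago" \
  | grep -c -E "returned error: 502|repository not found"
```
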
Extremely slow startup
----------------------
I.e. more than a few minutes before the server starts responding to requests, or
`liveness_last_successful_task_scheduling_s` greater than a few minutes
immediately after server startup.
The task scheduler has to load a lot of data from the DB on startup.
Additionally, each of its clients reloads all of its data when the scheduler
goes offline and comes back. These long-running reads can interact with writes
such that operations get blocked and continue piling up. Requests time out and
are retried, compounding the problem. If you notice this happening (an extremely
long list of "task_scheduler_db Active Transactions" in the log is a clue), you
can ease the load on the scheduler: shut down both Status and Datahopper,
restart the scheduler and wait until it is up and running, then restart Status,
wait until it is up and running, and finally restart Datahopper. In each case,
watch the logs and ensure that all "Reading Tasks from $start_ts to $end_ts"
have completed successfully.
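
A sketch of that restart sequence, assuming all three services run under
systemd with the unit names below (the names and the single-host assumption are
placeholders; if Status or Datahopper are deployed elsewhere, restart them via
their own deployment mechanism):

```
# Take the clients offline first so their bulk reloads stop hitting the DB.
systemctl stop status datahopper

# Restart the scheduler and watch its logs until all
# "Reading Tasks from $start_ts to $end_ts" lines have completed successfully.
systemctl restart task-scheduler
journalctl -u task-scheduler -f

# Bring Status back, wait until it is up and running, then bring back Datahopper.
systemctl start status
journalctl -u status -f
systemctl start datahopper
```
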
TODO(borenet): This should not be necessary with the new DB implementation.
DB is very slow
---------------
E.g. `liveness_last_successful_task_scheduling_s` is consistently greater than a
few minutes, and "task_scheduler_db Active Transactions" in the log are piling
up.
Similar to the above, but not caused by long-running reads. BoltDB performance
can degrade for a number of reasons, including excessive free pages. A potential
fix is to stop the scheduler and run `bolt compact` over the database file.
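
A sketch of that procedure, assuming the scheduler runs under systemd and using
a placeholder unit name and DB path (verify both before running; `bolt` here is
the BoltDB command-line tool):

```
# Stop the scheduler so the DB file is not open for writes.
systemctl stop task-scheduler

# Compact into a new file, then swap it into place.
DB=/mnt/pd0/task_scheduler_workdir/task_scheduler.bdb  # placeholder path
bolt compact -o "${DB}.compacted" "${DB}"
mv "${DB}" "${DB}.old"
mv "${DB}.compacted" "${DB}"

systemctl start task-scheduler
```
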
TODO(borenet): This should not be necessary with the new DB implementation.
Alerts
======
scheduling_failed
-----------------
The Task Scheduler has failed to schedule tasks for some time. You should check the
logs to try to diagnose what's failing. It's also possible that the scheduler
has slowed down substantially and simply hasn't actually completed a scheduling
loop in the required time period. That needs to be addressed with additional
optimization.
http_latency
------------
The server is taking too long to respond. Look at the logs to determine why it
is slow.
error_rate
----------
The server is logging errors at a higher-than-normal rate. This warrants
investigation in the logs.
old_db_backup
-------------
The most recent backup of the local BoltDB database on Google Storage is more
than 25 hours old.
- If db_backup_trigger_liveness is firing, resolve that first.
- Look for backup files in the
[skia-task-scheduler bucket](https://console.cloud.google.com/storage/browser/skia-task-scheduler/db-backup/)
that are more recent, in case the alert is incorrect.
- Check that task-scheduler-db-backup is deployed to the server and the systemd
service is enabled.
- Check if there are any files in the directory
`/mnt/pd0/task_scheduler_workdir/trigger-backup`. If not, check the systemd
logs for task-scheduler-db-backup for errors.
- If the systemd timer failed to execute, you can trigger a manual
backup by running `touch
/mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual`.
- Otherwise, check logs for "Automatic DB backup failed" or other errors.
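
To work through the checks above from a shell on the server (the bucket path,
unit name, and trigger directory come from this section):

```
# Look for recent backup files in case the alert is stale.
gsutil ls -l gs://skia-task-scheduler/db-backup/**

# Check the backup service and its recent logs.
systemctl status task-scheduler-db-backup
journalctl -u task-scheduler-db-backup --since "1 day ago"

# Inspect the trigger directory; creating the file below requests a manual backup.
ls /mnt/pd0/task_scheduler_workdir/trigger-backup/
touch /mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual
```
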
too_many_recent_db_backups
--------------------------
There are too many recent backups in the
[skia-task-scheduler bucket](https://console.cloud.google.com/storage/browser/skia-task-scheduler/db-backup/).
This indicates that a runaway process is creating unnecessary backups. Review the
task scheduler logs for "Beginning manual DB backup" to determine what is
triggering the excessive backups.
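
A quick way to see how many backups have landed recently and what is triggering
them (the task-scheduler unit name is a placeholder):

```
# Count backup objects in the bucket and compare against the expected cadence.
gsutil ls gs://skia-task-scheduler/db-backup/** | wc -l

# Find what requested the extra backups.
journalctl -u task-scheduler --since "1 day ago" | grep "Beginning manual DB backup"
```
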
db_backup_trigger_liveness
--------------------------
The function `DBBackup.Tick` is not being called periodically. If the
scheduling_failed alert is firing, resolve that first. Otherwise, check for
recent code changes that may have unintentionally removed the callback to
trigger a DB backup from the task scheduler loop.
incremental_backup_liveness
---------------------------
The function `gsDBBackup.incrementalBackupStep` has not succeeded recently. Check
logs for "Incremental Job backup failed". If Task Scheduler is otherwise
operating normally, this is not a critical alert, since we also perform a full
nightly backup.
incremental_backup_reset
------------------------
The function `gsDBBackup.incrementalBackupStep` is not able to keep up with the
rate of new and modified Jobs. This likely indicates a problem with the
connection to Google Storage or the need for additional concurrency. Check logs
for "Incremental Job backup failed" or "incrementalBackupStep too slow". This
alert will also resolve itself after the next full backup, which can be manually
triggered by running `touch
/mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual`.
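
A sketch of those checks plus the manual trigger (the task-scheduler unit name
is a placeholder):

```
# Look for the relevant failure messages in recent logs.
journalctl -u task-scheduler --since "6 hours ago" \
  | grep -E "Incremental Job backup failed|incrementalBackupStep too slow"

# Request a full backup, which also resets this alert.
touch /mnt/pd0/task_scheduler_workdir/trigger-backup/task-scheduler-manual
```
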
db_too_many_free_pages
----------------------
The number of cached free pages in the Task Scheduler BoltDB has grown
large. As this number grows, DB performance suffers. Please file a bug and
increase the threshold in `alerts.cfg`. It's unclear what causes this issue, but
it might be due to killing the process without gracefully closing the DB or due
to large read transactions concurrent with write transactions.
too_many_candidates
-------------------
The number of task candidates for a given dimension set is very high. This may
not actually indicate that anything is wrong with the Task Scheduler. Instead,
it may just mean that demand has exceeded bot capacity for one or more types of
bots for an extended period. If possible, increase the bot capacity by adding
more bots or by fixing offline or quarantined bots. Consider temporarily
skipping backfill tasks for these bots to reduce load on the scheduler. An
alternative long-term fix is to remove tasks for overloaded bots.
trigger_nightly
---------------
The nightly trigger has not run in over 25 hours. Check that the
task-scheduler-trigger-nightly.service has run. If not, check the systemd
configuration on the server. If so, check the Task Scheduler logs.
trigger_weekly
--------------
The weekly trigger has not run in over 8 days. Check that the
task-scheduler-trigger-weekly.service has run. If not, check the systemd
configuration on the server. If so, check the Task Scheduler logs.
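
For either trigger, a quick check of the systemd timers and the most recent run
(assuming the triggers are driven by systemd timers on the server; substitute
-weekly as needed):

```
# When each trigger timer last fired and when it fires next.
systemctl list-timers 'task-scheduler-trigger-*'

# Status and recent output of the nightly run.
systemctl status task-scheduler-trigger-nightly.service
journalctl -u task-scheduler-trigger-nightly.service --since "2 days ago"
```
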
overdue_metrics_liveness
------------------------
The function `TaskScheduler.updateOverdueJobSpecMetrics` is not being called
periodically. If the scheduling_failed alert is firing, resolve that first.
Otherwise, check the logs for error messages, check the
`timer_func_timer_ns{func="updateOverdueJobSpecMetrics"}` metric, or look
for recent changes that may have affected this function.
overdue_job_spec
----------------
Tasks have not completed recently for the indicated job, even though a
reasonable amount of time has elapsed since an eligible commit. If any other
task scheduler alerts are firing, resolve those first. Otherwise:
- Check Status for pending or running tasks for this job. The Swarming UI
provides the best information on why the task has not completed.
- Check that the dimensions specified for the job's tasks match the bot that
should run those tasks.
- Check that the bots are available to run the tasks. Remember that forced jobs
will always be completed before other jobs, and tryjobs get a higher score
than regular jobs.
- If there are many forced jobs that were triggered accidentally, the [Job
search UI](https://task-scheduler.skia.org/jobs/search) can be used to
bulk-cancel jobs.
latest_job_age
--------------
Jobs have not been triggered recently enough for the indicated job spec. This
normally indicates that the periodic triggers have stopped working for some
reason. Double-check that the "periodic-trigger" cron jobs have run at the
expected time in Kubernetes. If they have not, look into why. If they have,
check the Task Scheduler logs to verify that the scheduler received the Pub/Sub
message and, if so, determine why it did not create the job.
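
A sketch of the Kubernetes check, assuming the cron jobs contain
"periodic-trigger" in their names and live in the default namespace (adjust
both to the actual cluster config):

```
# Confirm the cron jobs exist and when they were last scheduled.
kubectl get cronjobs | grep periodic-trigger

# Look at the most recent executions and whether they completed.
kubectl get jobs --sort-by=.metadata.creationTimestamp | grep periodic-trigger | tail
```
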
update_repos_failed
-------------------
The scheduler (job creator) has failed to update its git repos for too long.
Check the logs and determine what is going on. If the git servers are down or
having problems, make sure that the team is aware by filing a bug or pinging
IRC:
http://go/gob-oncall
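
A quick way to check whether the git server itself is responding, independent
of the scheduler (the repo URL is just an example; use the repo named in the
error):

```
# A healthy server returns the HEAD commit hash; a 502 or timeout points at the
# git hosting side rather than the scheduler.
git ls-remote https://skia.googlesource.com/skia.git HEAD
```
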
poll_buildbucket_failed
-----------------------
The scheduler (job creator) has not successfully polled Buildbucket for new
tryjobs in a while. Any tryjobs started by the CQ or manually during this period
have not been picked up yet. Check the logs and determine what is going on.
If the git servers are down or having problems, make sure that the team is aware
by filing a bug or pinging IRC:
http://go/gob-oncall
If Buildbucket is down or having problems, see https://g.co/bugatrooper
You may want to notify skia-team about the disruption.
update_buildbucket_failed
-------------------------
The scheduler (job creator) has not successfully sent heartbeats to Buildbucket
for in-progress tryjobs or sent status updates to Buildbucket for completed
tryjobs in a while. Any tryjobs that have completed during this time will not be
reflected on Gerrit or the CQ. Check the logs and determine what is going on.
If Buildbucket is down or having problems, see https://g.co/bugatrooper
You may want to notify skia-team about the disruption.