 # Datahopper Production Manual

 # Alerts

 ## job_metrics

 The [job
 metrics](https://skia.googlesource.com/buildbot/+show/main/datahopper/go/datahopper/jobs.go)
 goroutine has not successfully updated its job cache for some time.

 If there are Task Scheduler alerts, resolve those first.

 Otherwise, you should check the logs to try to diagnose what's failing.

 ## bot_coverage_metrics

 The [bot coverage
 metrics](https://skia.googlesource.com/buildbot/+show/main/datahopper/go/bot_metrics/bot_metrics.go)
 goroutine has not successfully completed a cycle for some time. You should
 check the logs to try to diagnose what's failing.

 ## swarming_task_metrics

 The [Swarming task
 metrics](https://skia.googlesource.com/buildbot/+show/main/datahopper/go/swarming_metrics/tasks.go)
 goroutine has not successfully queried for Swarming tasks for some time. You should
 check the logs to try to diagnose what's failing.

 ## event_metrics

 The [event
 metrics](https://skia.googlesource.com/buildbot/+show/main/go/metrics2/events/events.go)
 goroutine has not successfully updated metrics based on event data for some
 time. You should check the logs to try to diagnose what's failing. Double-check
 the instance name to verify which log stream to investigate.

 ## swarming_bot_metrics

 The [Swarming bot
 metrics](https://skia.googlesource.com/buildbot/+show/main/datahopper/go/swarming_metrics/bots.go)
 goroutine has not successfully queried for Swarming bots for some time. See the
 alert for which pool and server are failing. You should check the logs to try
 to diagnose what's failing.

 ## firestore_backup_metrics

 The [Firestore backup
 metrics](https://skia.googlesource.com/buildbot/+show/main/datahopper/go/datahopper/firestore_backup_metrics.go)
 goroutine has not successfully updated the metric for the most recent Firestore
 backup for some time.

 Try running `gcloud beta firestore operations list --project=skia-firestore`. If
 there is no output or the command returns an error, check for a GCP Firestore outage.

 Otherwise, you should check the logs to try to diagnose what's failing.
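The `gcloud` check above has three outcomes: an error, empty output, or a list of operations. A minimal sketch of that distinction in shell follows. `classify_ops_output` is a hypothetical helper name, not part of Datahopper or `gcloud`, and it only looks at the exit status and whether anything was printed.

```shell
# classify_ops_output CMD [ARGS...]
# Hypothetical helper: runs the given command and prints one word:
#   error - the command exited non-zero
#   empty - the command succeeded but printed nothing
#   ok    - the command succeeded and printed something
classify_ops_output() (
  if ! out="$("$@" 2>&1)"; then
    echo "error"
  elif [ -z "$out" ]; then
    echo "empty"
  else
    echo "ok"
  fi
)
```

For example, `classify_ops_output gcloud beta firestore operations list --project=skia-firestore` would print `error` or `empty` in the cases where the text above suggests checking for a GCP Firestore outage, and `ok` when the next step is the Datahopper logs.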

 ## firestore_weekly_backup

 The weekly backup of all Firestore collections in the skia-firestore project
 has not succeeded in more than 24 hours. There are several things to check:

 - Run
   `gcloud beta firestore operations list --project=skia-firestore "--filter=metadata.outputUriPrefix~^gs://skia-firestore-backup/everything/" | grep -C 14 "endTime: '$(date -u +%Y-%m-)"`.
   The `grep` pattern matches only operations that ended in the current UTC
   month, so adjust it if it's the first week of the month.

   - If you see a recent endTime with `operationState: SUCCESSFUL`, see below
     for diagnosing issues in Datahopper.
   - If you see a recent endTime with any other operationState, see below for
     diagnosing issues with the Firestore export.
   - If you don't see a recent endTime, see below for diagnosing issues with the
     Kubernetes CronJob.
   - If the command (run without the `grep` filter) produces no output or
     returns an error, check for a GCP Firestore outage.

 - Check the Datahopper logs for any warnings or errors. One likely
   problem is a change in the output of the REST API. See [the
   code](https://skia.googlesource.com/buildbot/+show/main/datahopper/go/datahopper/firestore_backup_metrics.go)
   for the URL used to retrieve Firestore export operations. You can also run
   Datahopper locally using the `--local` flag to set up a TokenSource to
   authenticate to this URL. Add logging of the HTTP response to see what has
   changed.

 - If the export operation is in progress more than an hour after the startTime
   (remember it's UTC), it's probably stuck. You can cancel it with
   `gcloud beta firestore operations cancel --project=skia-firestore <value of name field>`.
   Then manually trigger a new export (see below).

 - If the export operation failed for any other reason, look for an error
   message in the output from `operations list` above. If the error is
   transient, manually trigger a new export (see below). Otherwise, try a Google
   search for the error.

 - Check the logs for the most recent run of the
   [firestore-export-everything-weekly](https://console.cloud.google.com/kubernetes/cronjob/us-central1-a/skia-public/default/firestore-export-everything-weekly?project=skia-public&folder&organizationId=433637338589)
   CronJob. If there is no recent run, check for misconfiguration. You can update the
   CronJob by running `make push` in the `firestore` directory. The
   configuration for the CronJob is
   [here](https://skia.googlesource.com/k8s-config/+show/master/skia-public/firestore-export-everything-weekly.yaml).

 - To manually trigger a new export, run
   `gcloud beta firestore export --project=skia-firestore --async gs://skia-firestore-backup/everything/$(date -u +%Y-%m-%dT%H:%M:%SZ)`.
   Alternatively, run
   `kubectl create job --from=cronjob/firestore-export-everything-weekly firestore-export-everything-manual`,
   wait for the job to finish, then run
   `kubectl delete job firestore-export-everything-manual`.
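The `grep` in the first check matches only `endTime` values in the current UTC month, which is why it needs adjusting in the first week of a month. One way to make that adjustment is sketched here under assumptions: `prev_month_prefix` and `end_time_pattern` are hypothetical helper names, and the resulting pattern is meant for `grep -E` rather than the plain `grep` used above.

```shell
# prev_month_prefix YEAR MONTH
# Hypothetical helper: prints the "YYYY-MM-" prefix of the month before
# the given one, e.g. "2024 01" -> "2023-12-".
prev_month_prefix() (
  year=$1
  month=${2#0}  # Strip a leading zero so "08" and "09" are not read as octal.
  if [ "$month" -eq 1 ]; then
    year=$((year - 1))
    month=12
  else
    month=$((month - 1))
  fi
  printf '%04d-%02d-\n' "$year" "$month"
)

# end_time_pattern YEAR MONTH
# Hypothetical helper: prints an extended regular expression that matches an
# endTime line in either the given month or the one before it.
end_time_pattern() (
  current=$(printf '%04d-%02d-' "$1" "${2#0}")
  previous=$(prev_month_prefix "$1" "$2")
  printf "endTime: '(%s|%s)\n" "$current" "$previous"
)
```

With these defined, the first check could be run as `gcloud beta firestore operations list --project=skia-firestore "--filter=metadata.outputUriPrefix~^gs://skia-firestore-backup/everything/" | grep -E -C 14 "$(end_time_pattern "$(date -u +%Y)" "$(date -u +%m)")"`, which shows operations that ended in either month.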