Datahopper Production Manual

Alerts

job_metrics

The job metrics goroutine has not successfully updated its job cache for some time.

If there are Task Scheduler alerts, resolve those first.

Otherwise, you should check the logs to try to diagnose what's failing.

bot_coverage_metrics

The bot coverage metrics goroutine has not successfully completed a cycle for some time. You should check the logs to try to diagnose what's failing.

swarming_task_metrics

The Swarming task metrics goroutine has not successfully queried for Swarming tasks for some time. You should check the logs to try to diagnose what's failing.

event_metrics

The event metrics goroutine has not successfully updated metrics based on event data for some time. You should check the logs to try to diagnose what's failing. Double-check the instance name to verify which log stream to investigate.

swarming_bot_metrics

The Swarming bot metrics goroutine has not successfully queried for Swarming bots for some time. See the alert for which pool and server are failing. You should check the logs to try to diagnose what's failing.

firestore_backup_metrics

The Firestore backup metrics goroutine has not successfully updated the metric for most recent Firestore backup for some time.

Try running gcloud beta firestore operations list --project=skia-firestore. If the command prints nothing or returns an error, check for a GCP Firestore outage.

Otherwise, you should check the logs to try to diagnose what's failing.
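The first check above can be sketched as a small shell helper. This is a minimal sketch and not part of Datahopper: classify_output is a hypothetical name, and it only distinguishes a failing command, empty output, and non-empty output.

```shell
# classify_output CMD [ARGS...]
# Hypothetical helper: runs the given command and prints one of
#   "error" - the command exited with a non-zero status
#   "empty" - the command succeeded but printed nothing
#   "ok"    - the command succeeded and printed something
classify_output() {
  local out
  if ! out="$("$@" 2>/dev/null)"; then
    echo "error"
  elif [ -z "$out" ]; then
    echo "empty"
  else
    echo "ok"
  fi
}

# Example invocation (requires gcloud and access to the project):
#   classify_output gcloud beta firestore operations list --project=skia-firestore
```

A result of "error" or "empty" corresponds to the outage check above; "ok" means the operations list is reachable and the logs are the next place to look.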

firestore_weekly_backup

The weekly backup of all Firestore collections in the skia-firestore project has not succeeded in more than 24 hours. There are several things to check:

  • Run gcloud beta firestore operations list --project=skia-firestore "--filter=metadata.outputUriPrefix~^gs://skia-firestore-backup/everything/" | grep -C 14 "endTime: '$(date --utc +%Y-%m-)". The grep pattern only matches the current month, so in the first week of the month adjust it to match the previous month as well.

    • If you see a recent endTime with "operationState: SUCCESSFUL", see below for diagnosing issues in Datahopper.
    • If you see a recent endTime with any other operationState, see below for diagnosing issues with the Firestore export.
    • If you don't see a recent endTime, see below for diagnosing issues with the Kubernetes CronJob.
    • If the command prints nothing (without filtering through grep) or returns an error, check for a GCP Firestore outage.
  • Check the Datahopper logs for any warnings or errors. One likely problem is a change in the output of the REST API; see the code for the URL used to retrieve Firestore export operations. To investigate, you can run Datahopper locally using the --local flag to set up a TokenSource to authenticate to this URL, and add logging of the HTTP response.

  • If the export operation is still in progress more than an hour after its startTime (remember it's UTC), it's probably stuck. You can cancel it with gcloud beta firestore operations cancel --project=skia-firestore <value of name field>. Then manually trigger a new export (see below).

  • If the export operation failed for any other reason, look for an error message in the output from operations list above. If the error is transient, manually trigger a new export (see below). Otherwise, try a Google search for the error.

  • Check the logs for the most recent run of the firestore-export-everything-weekly CronJob. If there is no recent run, check for misconfiguration. You can update the CronJob by running make push in the firestore directory. The configuration for the CronJob is here.

  • To manually trigger a new export, run gcloud beta firestore export --project=skia-firestore --async gs://skia-firestore-backup/everything/$(date --utc +%Y-%m-%dT%H:%M:%SZ). Alternatively, run kubectl create job --from=cronjob/firestore-export-everything-weekly firestore-export-everything-manual, wait for the job to finish, then run kubectl delete job firestore-export-everything-manual.
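Two of the checks in the list above involve date arithmetic that is easy to get wrong by hand: widening the endTime grep pattern in the first week of the month, and deciding whether an export has been running for more than an hour. The following is a minimal sketch of both, assuming GNU date (the same date --utc used in the commands above). The function names end_time_pattern and is_stuck are hypothetical, and the timestamp format passed to is_stuck is an assumption about what operations list prints.

```shell
# end_time_pattern [DATE]
# Hypothetical helper: prints an extended-regex pattern for grep -E that
# matches endTime values in the month of DATE (default: now). If DATE falls
# in the first 7 days of the month, the previous month is matched as well.
end_time_pattern() {
  local now this_month day prev
  now="${1:-now}"
  this_month="$(date --utc -d "$now" +%Y-%m)"
  day="$(date --utc -d "$now" +%d)"
  # Strip a leading zero so the day is not read as an octal number.
  if [ "${day#0}" -le 7 ]; then
    # The day before the first of this month is in the previous month.
    prev="$(date --utc -d "${this_month}-01 -1 day" +%Y-%m)"
    echo "endTime: '(${prev}|${this_month})-"
  else
    echo "endTime: '${this_month}-"
  fi
}

# is_stuck START_TIME [NOW]
# Hypothetical helper: succeeds if NOW (default: now) is more than one hour
# after START_TIME. Both arguments are assumed to be UTC timestamps that
# GNU date can parse, such as 2024-03-05T10:00:00Z.
is_stuck() {
  local start now
  start="$(date --utc -d "$1" +%s)" || return 2
  now="$(date --utc -d "${2:-now}" +%s)" || return 2
  [ $((now - start)) -gt 3600 ]
}

# Example usage with the operations list command from above:
#   gcloud beta firestore operations list --project=skia-firestore \
#     "--filter=metadata.outputUriPrefix~^gs://skia-firestore-backup/everything/" \
#     | grep -E -C 14 "$(end_time_pattern)"
```

Note that end_time_pattern produces an extended regular expression, so it needs grep -E rather than plain grep. is_stuck only compares timestamps; whether the operation is actually still in progress must be read from the operations list output.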