| Gold Production Manual |
| ====================== |
| |
First, make sure you are familiar with the design of Gold by reading the
[architectural overview](https://goto.google.com/life-of-a-gold-image) doc.
| |
| Clients can file a bug against Gold at [go/gold-bug](https://goto.google.com/gold-bug). |
| |
| General Metrics |
| =============== |
| The following dashboard is for the skia-public instances: |
| <https://grafana2.skia.org/d/m8kl1amWk/gold-panel-public>. |
| |
| The following dashboard is for the skia-corp instances: |
| <https://skia-mon.corp.goog/d/m8kl1amWk/gold-panel-corp> |
| |
Some things to look for:

- Do goroutines or memory increase continuously (e.g. leaks)? See the sample
  query after this list.
- How fresh is the tile data? (Stale data could indicate something is stuck.)
- How is ingestion liveness? Is anything stuck?
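
The dashboards above plot most of these. To check goroutine growth directly in
thanos-query, a query along these lines can help; note that the `go_goroutines`
metric name and the `app` label are assumptions (the standard Prometheus Go client
metric and a label used elsewhere in this doc), so verify them in the metric
browser first:

```
# Average rate of change of the goroutine count over the last hour, per pod.
# A value that stays positive for a long time suggests a leak.
deriv(go_goroutines{app=~"gold.+"}[1h])
```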
| |
| QPS |
| --- |
To investigate the load on Gold's RPCs, navigate to <https://thanos-query.skia.org> and
try a query like:
| |
| rate(gold_rpc_call_counter[1m]) |
| |
You can even use the func timers to search by package, e.g. to find the QPS of all
Firestore-related functions:
| |
| rate(timer_func_timer_ns_count{package=~".+fs_.+"}[1m]) |
| |
If you find something problematic, `timer_func_timer_ns{appgroup=~"gold.+"}/1000000000`
shows how many seconds a given timer actually took.
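
For example, to see only the Firestore-related timers expressed in seconds, the two
filters above can be combined (a sketch; adjust the label matchers to whatever you
are investigating):

```
# Duration, in seconds, reported by each Firestore-related func timer.
timer_func_timer_ns{appgroup=~"gold.+", package=~".+fs_.+"} / 1000000000
```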
| |
| General Logs |
| ============ |
| Logs for Gold instances in skia-public/skia-corp are in the usual |
| GKE container grouping, for example: |
| <https://console.cloud.google.com/logs/viewer?project=skia-public&resource=container&logName=projects%2Fskia-public%2Flogs%2Fgold-flutter-skiacorrectness> |
| |
| Alerts |
| ====== |
| |
Each item below should be linked as the target of the corresponding alert.
| |
| GoldStreamingIngestionStalled |
| -------------------- |
Gold has a PubSub subscription for events created in its GCS bucket.
This alert means we haven't successfully ingested a file in over 24 hours.
This could mean that ingestion is throwing errors on every file, or simply that
the repo isn't very busy.
| |
| This has happened before because gitsync stopped, so check that out too. |
| |
| Key metrics: liveness_gold_bt_s{metric="last-successful-process"}, liveness_last_successful_git_sync_s |
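
A minimal query to see which instances are past the threshold, assuming the liveness
metric is reported in seconds (as the `_s` suffix suggests):

```
# Seconds since the last successfully processed file; the alert threshold is 24 hours.
liveness_gold_bt_s{metric="last-successful-process"} > 86400
```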
| |
| |
| GoldPollingIngestionStalled |
| -------------------- |
Gold regularly polls its GCS buckets for any files that were not
successfully ingested via the PubSub event emitted when the file was created (aka "streaming").
This alert means it has been at least 10 minutes since a polling cycle completed;
polling should run every 5 minutes or so, even in repos that are not busy.
| |
| This has happened before because gitsync stopped, so check that out too. |
| |
| Key metrics: liveness_gold_bt_s{metric="since-last-run"}, liveness_last_successful_git_sync_s |
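
As above, a minimal query to see which instances are past the threshold (assuming the
liveness metric is in seconds):

```
# Seconds since the last polling cycle completed; the alert threshold is 10 minutes.
liveness_gold_bt_s{metric="since-last-run"} > 600
```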
| |
| |
| GoldIgnoreMonitoring |
| -------------------- |
This alert means Gold was unable to calculate which ignore rules have expired.
| Search the logs for "ignorestore.go" to get a hint as to why. |
| |
This has happened before because of manually-edited (and incorrect) Firestore data,
so it may be worth checking the raw data:
<https://console.cloud.google.com/firestore/data/gold/skia/ignorestore_rules?project=skia-firestore>
| |
| Key metrics: gold_expired_ignore_rules_monitoring |
| |
| GoldCommitTooOldWallTime |
| ---------------------- |
Too much time has elapsed since Gold noticed a commit. This is occasionally a false positive
if a commit simply hasn't landed recently in the repo we are tracking.
| |
| In the past, this has indicated git-sync might have had problems, so check out |
| the logs of the relevant git-sync instance. |
| |
| Key metrics: gold_last_commit_age_s |
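
This metric is shared with GoldCommitTooOldNewerCommit below. To see how far behind
each instance is, something like the following works (a sketch; the metric appears to
be in seconds, and the `appgroup` matcher is borrowed from the queries earlier in this
doc and may need adjusting):

```
# Hours since Gold last noticed a commit, per instance.
gold_last_commit_age_s{appgroup=~"gold.+"} / 3600
```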
| |
| GoldCommitTooOldNewerCommit |
| ---------------------- |
Gold has noticed there is a newer commit available for processing, but hasn't
succeeded in processing it.
| |
| This would usually indicate an issue with Gold itself, so check |
| the logs of the Gold instance. |
| |
| Key metrics: gold_last_commit_age_s |
| |
| GoldStatusStalled |
| ---------------------- |
| The underlying metric here is reset when the frontend status is recomputed. This |
| normally gets recomputed when the Gold sliding window of N commits (aka "tile") |
| is updated or when expectations are changed (e.g. something gets triaged). |
| |
This could fire because of a problem in golden/go/status.go, or because computing the current
tile takes longer than the alert's threshold.
| |
| Key metrics: liveness_gold_status_monitoring_s |
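
To see how long it has been since the status was recomputed (assuming the liveness
metric is in seconds):

```
# Minutes since the frontend status was last recomputed, per instance.
liveness_gold_status_monitoring_s / 60
```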
| |
| GoldIngestionErrorRate |
| ---------------------- |
The recent rate of errors for ingestion is high; it is typically well below 0.1.
| See the error logs for the given instance for more. |
| |
| GoldDiffServerErrorRate |
| ---------------------- |
The recent rate of errors for the diff server is high; it is typically well
below 0.1.
| See the error logs for the given instance for more. |
| |
| GoldErrorRate |
| ---------------------- |
The recent rate of errors for the main Gold instance is high; it is
typically well below 0.1.
| See the error logs for the given instance for more. |
| |
| GoldExpectationsStale |
| ---------------------- |
Currently, our baseline servers use QuerySnapshotIterators when fetching expectations out of
Firestore. Those iterators run on their own goroutines. This alert will be active if any of those
sharded iterators has stopped, thus yielding stale results.
| |
| To fix, delete one baseliner pod of the affected instance at a time until all of them |
| have restarted and are healthy. |
| |
| If this alert fires, it probably means the related logic in fs_expstore needs to be rethought. |
| |
| GoldCorruptTryJobData |
| --------------------- |
This section covers both GoldCorruptTryJobParamMaps and GoldTryJobResultsIncompleteData, which are
probably either both active or both inactive. TryJobResults are stored in Firestore in a separate
document from the Param maps that store the keys, so as to reduce the amount of data per request.
However, if data was only partially uploaded or was corrupted, there might be TryJobResults that
reference Params that don't exist.
| |
If this happens, we might need to re-ingest the TryJob data to reconstruct the missing data.
| |
| GoldNoDataAtHead |
| ---------------- |
The last 20 commits (100 for Chrome, since their tests are slower) have seen no data. This probably
means something is wrong with goldctl or whatever mechanism is uploading data to Gold.

Check the instance's bucket to confirm whether anything is being uploaded; if newer files are in
the bucket but haven't been processed, check the logs of the ingester. (If it's an issue with
ingestion, expect other alerts to be firing.)
| |
| Key metrics: gold_empty_commits_at_head |
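
A quick way to see which instances are affected, using the thresholds from the alert
text above (a sketch; thresholds differ per instance):

```
# Commits at head that have seen no data, per instance; the alert fires
# around 20 (100 for Chrome).
gold_empty_commits_at_head >= 20
```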
| |
| GoldTooManyCLs |
| -------------- |
There are many open CLs that have recently seen data from Gold. Having too many open CLs may cause
a higher load than usual on Code Review Systems (e.g. Gerrit, GitHub), as we scan over all of these
to see if they are still open. Seeing this alert may indicate issues with marking CLs as closed
or some other problem with processing CLs.
| |
| Key metrics: gold_num_recent_open_cls |
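
Two separate queries (a sketch) can show the current count per instance and whether it
is still climbing; `deriv` merely estimates the trend:

```
# Open CLs that have recently produced data, per instance.
gold_num_recent_open_cls

# Roughly how fast that count has been changing over the last hour.
deriv(gold_num_recent_open_cls[1h])
```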
| |
| GoldCommentingStalled |
| --------------------- |
Gold hasn't recently been able to go through all the open CLs that have produced data and decide
whether or not to comment on them. The presence of this alert might mean we are seeing errors
when talking to Firestore or to the Code Review System (CRS). Check the logs on that pod's
frontend server (skiacorrectness) to see what's up.
| |
| This might mean we are doing too much and running out of quota to talk to the CRS. Usually |
| out of quota messages will be in the error messages or the bodies of the failing requests. |
| |
| Key metrics: liveness_gold_comment_monitoring_s, gold_num_recent_open_cls |
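
To see how stale the commenting loop is on each instance (assuming the liveness metric
is in seconds):

```
# Minutes since Gold last finished a pass over the open CLs, per instance.
liveness_gold_comment_monitoring_s / 60
```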
| |
| HighFirestoreUsageBurst or HighFirestoreUsageSustainedGold |
| ---------------------------------------------------------- |
| This type of alert means that Gold is probably using more Firestore quota than expected. In an |
| extreme case, this can exhaust our project's entire Firestore quota (it's shared, unfortunately) |
| causing wider outages. |
| |
In addition to the QPS advice above, it can be helpful to identify which collections
are receiving a lot of reads/writes. For this, a query like:
| |
| ``` |
| rate(firestore_ops_count{app=~"gold.+"}[10m]) > 100 |
| ``` |
| |
can help identify those collections and possibly home in on the cause. `rate(gold_rpc_call_counter[1m]) > 1`
is also a good query to cross-reference this with.
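
To actually break the usage down by collection and operation, a grouping query along
these lines can help; the `collection` and `op` label names here are guesses, so check
which labels `firestore_ops_count` really carries in thanos-query and substitute them:

```
# Firestore operations per second over the last 10 minutes, grouped by the
# (assumed) collection and operation labels, highest first.
sort_desc(sum by (collection, op) (rate(firestore_ops_count{app=~"gold.+"}[10m])))
```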
| |
| GoldHeavyTraffic |
| ---------------- |
| Gold is seeing over 50 QPS to a specific RPC. As of writing, there are only two RPCs that |
| are not throttled from anonymous traffic, so it is likely one of these. See <https://skbug.com/9476> |
| and <https://skbug.com/10768> for more context on these. |
| |
This is potentially problematic in that the excess load could be causing Gold to act slowly
or even affect other tenants of the k8s node. The cause of this load should be identified.
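
The alert threshold can be reproduced directly in thanos-query to see which RPC (and
which labels) are responsible:

```
# RPCs currently receiving more than 50 requests per second.
rate(gold_rpc_call_counter[1m]) > 50
```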
| |