Perf Production Manual

Alerts

success_rate_too_low

The rate of successful ingestion is too low. Look for errors in the logs of the perf-ingest process.
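
For example, assuming the ingesters run in Kubernetes with an app=perf-ingest label (the label is an assumption), a quick sketch:

# Tail recent logs from every perf-ingest pod and filter for errors.
kubectl logs -l app=perf-ingest --tail=500 | grep -i error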

android_clustering_rate

Android Clustering Rate is too low. Look to see if PubSub events are being sent: http://go/android-perf-ingest-stall

Also confirm that files are being sent with actual data in them (sometimes they can be corrupted by a bad config on the sending side). Look in:

gs://skia-perf/android-master-ingest/tx_log/
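
For example, a quick sanity check with gsutil (a sketch; if the file names sort chronologically the newest files appear last):

# The long listing shows file sizes and upload times; zero-byte files
# suggest a bad config on the sending side.
gsutil ls -l gs://skia-perf/android-master-ingest/tx_log/ | tail -n 20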

clustering_rate

Perf Clustering Rate is too low. Look to see if PubSub events are being sent.

Also confirm that files are being sent with actual data in them (sometimes they can be corrupted by a bad config on the sending side). Look in:

gs://skia-perf/nano-json-v1/
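
You can also spot-check the contents of a recent file, for example (the file name below is hypothetical):

# Print the first few lines of one uploaded file to confirm it contains real data.
gsutil cat gs://skia-perf/nano-json-v1/<recent-file>.json | head -n 20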

regression_detection_slow

The Perf instance has not detected any regressions in the last hour, which is unlikely given the large number of traces these instances ingest.

Check that data is arriving at the instances that do event-driven regression detection:

https://prom2.skia.org/graph?g0.range_input=6h&g0.max_source_resolution=0s&g0.expr=rate(ack%5B30m%5D)&g0.tab=0

Check that PubSub messages are being processed: http://go/android-perf-ingest-stall

And determine when regression detection stopped:

https://prom2.skia.org/graph?g0.range_input=1d&g0.max_source_resolution=0s&g0.expr=sum(rate(perf_regression_store_found%7Bapp%3D~%22perf-clustering-android%7Cskiaperf%7Cskiaperf-android-x%22%7D%5B30m%5D))%20by%20(app)&g0.tab=0

issue_tracker_rate_high

The number of issues being created in the issue tracker is too high.

To quickly stop Perf from creating more issues, set the --noemail flag on the offending instance.
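
A sketch of one way to do that, assuming the instance runs as a Kubernetes deployment (the deployment name here is hypothetical):

# Open the deployment for editing and add --noemail to the container args;
# the pods will restart with the new flag.
kubectl edit deployment/skiaperf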

querying_wedged

This alert is raised when the number of long-running queries remains high. It might just be high traffic, but if perf_progress_tracker_num_entries_in_cache looks constant over time then querying might be wedged. Check the logs for goroutine leaks or contexts that time out.
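
One way to check for a goroutine leak, assuming the serving binary exposes Go's standard net/http/pprof handlers (the port and deployment name are assumptions):

# In one terminal: forward a local port to the serving pod.
kubectl port-forward deployment/skiaperf 8000:8000
# In another: dump the goroutine profile; a count that grows steadily
# across repeated dumps points at a leak.
curl -s 'http://localhost:8000/debug/pprof/goroutine?debug=1' | head -n 30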

too_much_data

There are times when a process may inject too much data into Perf.

These may be useful queries to run against the database to find where the data is coming from:

Sample a subset of param_values.

SELECT param_value FROM paramsets WHERE tile_number=265 AND param_key='sub_result' ORDER BY random() LIMIT 100;

You can look at the [Perf dashboard](https://grafana2.skia.org/d/VNdBF9Ciz/perf) to see relevant tile numbers.

Once you find the affected paramsets you can count how many new paramset values are being added. In this case the sampling query above shows that the param_value always begins with showmap_granular, so we construct a query that limits us to those param_values:

SELECT
    COUNT(param_value)
FROM
    paramsets
AS OF SYSTEM TIME '-5s'
WHERE
  tile_number=283
  AND param_key='sub_result'
  AND param_value>'showmap_granular'
  AND param_value<'showmap_granulas';

For example:

root@perf-cockroachdb-public:26257/android> SELECT
  COUNT(param_value)
FROM
  paramsets
WHERE
  tile_number=282
  AND param_key='sub_result'
  AND param_value>'showmap_granular'
  AND param_value<'showmap_granulas';

   count
+---------+
  9687198
(1 row)

Time: 8.331250012s

Then you can use the same WHERE clause in a DELETE to remove all the matching paramsets:

DELETE
FROM
  paramsets
WHERE
  tile_number=282
  AND param_key='sub_result'
  AND param_value>'showmap_granular'
  AND param_value<'showmap_granulas';

Make sure to remove the erroneous params from all the tiles where they appear.
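
To find every tile that still contains the bad values you can reuse the same range predicate, for example (a sketch; the connection flags are assumptions):

# List the tile numbers that still need cleaning.
cockroach sql --insecure --host=perf-cockroachdb-public --database=android \
  -e "SELECT DISTINCT tile_number FROM paramsets WHERE param_key='sub_result' AND param_value>'showmap_granular' AND param_value<'showmap_granulas';"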

You may encounter contention that will slow the deletes, particularly if there are many rows to delete. It will help to temporarily scale the number of clusterers down to zero:

kubectl scale --replicas=0 deployment/perf-clustering-android

Don't forget to scale them back up!
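
For example, record the current replica count before scaling down so you can restore it afterwards (the count of 3 below is a placeholder):

# Note the desired replica count before scaling to zero.
kubectl get deployment/perf-clustering-android
# Once the deletes finish, restore it.
kubectl scale --replicas=3 deployment/perf-clustering-android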

Deleting in batches will also speed things up; try different values for the LIMIT:

DELETE
FROM
    paramsets
WHERE
    tile_number=282 AND
    param_key='sub_result' AND
    param_value>'showmap_granular' AND
    param_value<'showmap_granulas' LIMIT 100000;

The bash script //perf/migrations/batch-delete.sh runs the deletes in batches, using //perf/migrations/batch-delete.sql as the SQL to run. Modify that file to control which params to delete.
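
A minimal sketch of such a batching loop (the connection flags and fixed iteration count are assumptions; the real logic lives in the scripts named above):

# Repeatedly apply the LIMITed DELETE in batch-delete.sql until the rows are gone.
for i in $(seq 1 100); do
  cockroach sql --insecure --host=perf-cockroachdb-public --database=android \
    < batch-delete.sql
done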