# Perf Production Manual
## Alerts
### success_rate_too_low
The rate of successful ingestion is too low. Look for errors in the logs of the
perf-ingest process.
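If the ingester runs in Kubernetes (as the clusterers later in this document do), one
quick way to scan the logs is sketched below; `deployment/perf-ingest` is an assumption
based on the process name, so substitute the deployment for the instance you are debugging.
```
# Assumes the ingester runs as a Kubernetes deployment named perf-ingest.
kubectl logs deployment/perf-ingest --since=1h | grep -iE 'error|fatal'
```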
### android_clustering_rate
Android Clustering Rate is too low. Look to see if PubSub events are being sent:
http://go/android-perf-ingest-stall
Also confirm that the files being sent contain actual data (sometimes they
can be corrupted by a bad config on the sending side). Look in:
gs://skia-perf/android-master-ingest/tx_log/
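A quick sanity check on those files can be done with `gsutil`; the exact object
paths under `tx_log/` will vary, so substitute a real file name.
```
# List the most recent objects in the transaction log.
gsutil ls gs://skia-perf/android-master-ingest/tx_log/ | tail

# Spot-check one file for non-empty data (replace <file> with a real object name).
gsutil cat gs://skia-perf/android-master-ingest/tx_log/<file> | head
```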
### clustering_rate
Perf Clustering Rate is too low. Check whether PubSub events are being sent.
Also confirm that the files being sent contain actual data (sometimes they
can be corrupted by a bad config on the sending side). Look in:
gs://skia-perf/nano-json-v1/
### regression_detection_slow
The Perf instance has not detected any regressions in the past hour, which is
unlikely given the large number of traces these instances ingest.
Check that data is arriving to the instances that do event driven regression:
https://prom2.skia.org/graph?g0.range_input=6h&g0.max_source_resolution=0s&g0.expr=rate(ack%5B30m%5D)&g0.tab=0
Check that PubSub messages are being processed:
http://go/android-perf-ingest-stall
And determine when regression detection stopped:
https://prom2.skia.org/graph?g0.range_input=1d&g0.max_source_resolution=0s&g0.expr=sum(rate(perf_regression_store_found%7Bapp%3D~%22perf-clustering-android%7Cskiaperf%7Cskiaperf-android-x%22%7D%5B30m%5D))%20by%20(app)&g0.tab=0
## too_much_data
There are times when a process may inject too much data into Perf.
These queries may be useful to run against the database to find where the data
is coming from:
```
-- Sample a subset of param_values.
SELECT param_value FROM paramsets WHERE tile_number=265 AND param_key='sub_result' ORDER BY random() LIMIT 100;
```
You can look at the [Perf dashboard](https://grafana2.skia.org/d/VNdBF9Ciz/perf)
to see relevant tile numbers.
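If the dashboard is unavailable, a rough alternative is to ask the database for
the most recent tiles directly (a sketch that assumes the `paramsets` schema used
in the queries below):
```
-- The highest tile numbers are the ones currently receiving data.
SELECT DISTINCT tile_number FROM paramsets ORDER BY tile_number DESC LIMIT 5;
```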
Once you find the affected paramsets you can count how many new
paramset values are being added. In this case the sampling query above
showed that the param_value always begins with `showmap_granular`, so we
construct a query that limits us to those param_values.
```
SELECT
COUNT(param_value)
FROM
paramsets
AS OF SYSTEM TIME '-5s'
WHERE
tile_number=283
AND param_key='sub_result'
AND param_value>'showmap_granular'
AND param_value<'showmap_granulas';
```
For example:
```
root@perf-cockroachdb-public:26257/android> SELECT
COUNT(param_value)
FROM
paramsets
WHERE
tile_number=282
AND param_key='sub_result'
AND param_value>'showmap_granular'
AND param_value<'showmap_granulas';
count
+---------+
9687198
(1 row)
Time: 8.331250012s
```
You can then use the same WHERE clause in a DELETE to remove all the matching paramsets:
```
DELETE
FROM
paramsets
WHERE
tile_number=282
AND param_key='sub_result'
AND param_value>'showmap_granular'
AND param_value<'showmap_granulas';
```
Make sure to remove the erroneous params from all the tiles where they appear.
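One way to enumerate those tiles is to drop the `tile_number` filter and group by
it; this is a sketch using the same `showmap_granular` range as above:
```
SELECT
  tile_number,
  COUNT(param_value)
FROM
  paramsets
AS OF SYSTEM TIME '-5s'
WHERE
  param_key='sub_result'
  AND param_value>'showmap_granular'
  AND param_value<'showmap_granulas'
GROUP BY
  tile_number;
```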
You may encounter contention that will slow the deletes, particularly if there
are many rows to delete. It will help to temporarily scale the number of
clusterers down to zero:
```
kubectl scale --replicas=0 deployment/perf-clustering-android
```
Don't forget to scale them back up!
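For example, record the current replica count before scaling down and restore it
once the deletes finish (the count of 1 below is only a placeholder):
```
# Note the desired replica count before scaling down.
kubectl get deployment perf-clustering-android

# After the cleanup, restore the recorded count (1 is a placeholder).
kubectl scale --replicas=1 deployment/perf-clustering-android
```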
Deleting in batches will also speed things up; try different values for
the LIMIT:
```
DELETE
FROM
paramsets
WHERE
tile_number=282 AND
param_key='sub_result' AND
param_value>'showmap_granular' AND
param_value<'showmap_granulas' LIMIT 100000;
```
The bash file `//perf/migrations/batch-delete.sh` does batches of deletes using
`//perf/migrations/batch-delete.sql` as the SQL to run. Modify that file to
control which params to delete.
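For illustration only, a batched delete loop has roughly the following shape. This
is a sketch, not the contents of the real script; it assumes the `cockroach` CLI can
reach the cluster and that `batch-delete.sql` holds a `DELETE ... LIMIT` statement
like the one above.
```
#!/bin/bash
# Hypothetical sketch; the real //perf/migrations/batch-delete.sh may differ.
# Repeat the batched DELETE until a batch removes zero rows.
while true; do
  out=$(cockroach sql --insecure --host=perf-cockroachdb-public \
        --database=android < batch-delete.sql)
  echo "${out}"
  # The SQL shell reports how many rows each DELETE removed.
  if echo "${out}" | grep -q "DELETE 0$"; then
    break
  fi
done
```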