The rate of successful ingestion is too low. Look for errors in the logs of the perf-ingest process.
Android Clustering Rate is too low. Look to see if PubSub events are being sent: http://go/android-perf-ingest-stall
Also confirm that the files being sent actually contain data (they can be corrupted by a bad config on the sending side). Look in:
gs://skia-perf/android-master-ingest/tx_log/
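A quick spot-check with gsutil can confirm that recent files are non-empty and hold real data. This is only a sketch; it assumes the files sit directly under tx_log/ and that their names sort roughly chronologically:

```
# List recent transaction log files with their sizes; zero-byte files
# point at a misconfigured sender.
gsutil ls -l gs://skia-perf/android-master-ingest/tx_log/ | tail -n 20

# Pull the last file listed and look at the start of its contents to
# confirm it holds real measurement data rather than an empty or
# truncated body.
LATEST=$(gsutil ls gs://skia-perf/android-master-ingest/tx_log/ | tail -n 1)
gsutil cat "${LATEST}" | head -c 1000
```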
Perf Clustering Rate is too low. Look to see if PubSub events are being sent:
Also confirm that the files being sent actually contain data (they can be corrupted by a bad config on the sending side). Look in:
gs://skia-perf/nano-json-v1/
The perf instance has not detected any regressions in an hour, which is unlikely given the large number of traces these instances ingest.
Check that data is arriving to the instances that do event driven regression:
Check that PubSub messages are being processed: http://go/android-perf-ingest-stall
And determine when regression detection stopped:
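If the relevant dashboard is not to hand, the instance logs can also show roughly when detection activity stopped. A sketch, where the deployment name and the log wording are assumptions:

```
# Scan the last day of logs for regression-detection activity and note
# the timestamp of the last matching line. Adjust the deployment name
# and grep pattern to match the instance you are investigating.
kubectl logs deployment/skiaperf --since=24h | grep -i regression | tail -n 5
```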
The number of issues being created in the issue tracker is too high.
To quickly stop Perf from creating more issues, set the --noemail flag on the offending instance.
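One way to apply the flag, assuming the instance runs as a Kubernetes Deployment (the deployment name below is only an example):

```
# Open the offending instance's deployment for editing and add
# --noemail to the container args, e.g.:
#
#   args:
#     - --noemail
#     - ...existing flags...
#
# The deployment name is illustrative; substitute the real one.
kubectl edit deployment/skiaperf
```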
This alert is raised when the number of long-running queries remains high. It might just be high traffic, but if perf_progress_tracker_num_entries_in_cache
looks constant over time then querying might be wedged. Check the logs for goroutine leaks or contexts that time out.
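A quick way to check the logs for timed-out contexts (the deployment name is an assumption; "context deadline exceeded" is the standard Go error text for an expired context):

```
# Count recent context timeouts; a steadily growing count suggests
# wedged queries rather than ordinary high traffic.
kubectl logs deployment/skiaperf --since=1h | grep -ci "context deadline exceeded"
```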
There are times when a process injects too much data into Perf.
The following queries, run against the database, can help find where the data is coming from:
```
SELECT param_value FROM paramsets WHERE tile_number=265 AND param_key='sub_result' ORDER BY random() LIMIT 100;
```
You can look at the [Perf dashboard](https://grafana2.skia.org/d/VNdBF9Ciz/perf) to see relevant tile numbers.
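If the dashboard is unavailable, the newest tile number can also be read straight from the database. A sketch, assuming the paramsets schema used in the queries here and that the cockroach CLI is pointed at the right instance:

```
# Tile numbers increase over time, so the maximum is the most recent tile.
# Connection flags are omitted; they depend on the instance.
cockroach sql --execute="SELECT MAX(tile_number) FROM paramsets;"
```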
Once you find the affected paramsets you can count how many new paramset values are being added. In this case the sampling query above shows that the param_value always begins with showmap_granular, so we construct a query that limits us to those param_values:
```
SELECT COUNT(param_value) FROM paramsets AS OF SYSTEM TIME '-5s' WHERE tile_number=283 AND param_key='sub_result' AND param_value>'showmap_granular' AND param_value<'showmap_granulas';
```
For example:
```
root@perf-cockroachdb-public:26257/android> SELECT COUNT(param_value) FROM paramsets WHERE tile_number=282 AND param_key='sub_result' AND param_value>'showmap_granular' AND param_value<'showmap_granulas';
   count
+---------+
  9687198
(1 row)

Time: 8.331250012s
```
You can then use the same predicate in a DELETE to remove all the matching paramsets:
```
DELETE FROM paramsets WHERE tile_number=282 AND param_key='sub_result' AND param_value>'showmap_granular' AND param_value<'showmap_granulas';
```
Make sure to remove the erroneous params from all the tiles where they appear.
You may encounter contention that slows the deletes, particularly if there are many rows to delete. It will help to temporarily scale the number of clusterers down to zero:
```
kubectl scale --replicas=0 deployment/perf-clustering-android
```
Don't forget to scale them back up!
Deleting in batches will also speed things up; try different values for the LIMIT:
```
DELETE FROM paramsets WHERE tile_number=282 AND param_key='sub_result' AND param_value>'showmap_granular' AND param_value<'showmap_granulas' LIMIT 100000;
```
The bash file //perf/migrations/batch-delete.sh
does batches of deletes using //perf/migrations/batch-delete.sql
as the SQL to run. Modify that file to control which params to delete.
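As a rough illustration only (not the actual script), a batched delete loop might look like the following, reusing the predicate from the example above. The cockroach connection flags and the "DELETE <n>" output tag it greps for are assumptions:

```
#!/bin/bash
# Repeatedly delete matching paramsets in batches until no rows remain.
# Smaller batches commit faster and reduce contention with other writers.
while true; do
  OUT=$(cockroach sql --execute="DELETE FROM paramsets WHERE tile_number=282 AND param_key='sub_result' AND param_value>'showmap_granular' AND param_value<'showmap_granulas' LIMIT 100000;")
  echo "${OUT}"
  # The CLI reports the number of rows deleted; stop once it hits zero.
  if echo "${OUT}" | grep -q "DELETE 0"; then
    break
  fi
done
```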