# Perf Production Manual
## Alerts
### success_rate_too_low
The rate of successful ingestion is too low. Look for errors in the logs of the
perf-ingest process.
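If the ingester runs in Kubernetes (as the clusterers later in this document do), one
quick way to scan the logs is sketched below; `deployment/perf-ingest` is an assumption
based on the process name, so substitute the deployment for the instance you are debugging.
```
# Assumes the ingester runs as a Kubernetes deployment named perf-ingest.
kubectl logs deployment/perf-ingest --since=1h | grep -iE 'error|fatal'
```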
### android_clustering_rate
Android Clustering Rate is too low. Look to see if PubSub events are being sent:
http://go/android-perf-ingest-stall
Also confirm that the files being sent contain actual data (sometimes they
can be corrupted by a bad config on the sending side). Look in:
gs://skia-perf/android-master-ingest/tx_log/
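A quick sanity check on those files can be done with `gsutil`; the exact object
paths under `tx_log/` will vary, so substitute a real file name.
```
# List the most recent objects in the transaction log.
gsutil ls gs://skia-perf/android-master-ingest/tx_log/ | tail

# Spot-check one file for non-empty data (replace <file> with a real object name).
gsutil cat gs://skia-perf/android-master-ingest/tx_log/<file> | head
```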
### clustering_rate
Perf Clustering Rate is too low. Check whether PubSub events are being sent.
Also confirm that the files being sent contain actual data (sometimes they
can be corrupted by a bad config on the sending side). Look in:
gs://skia-perf/nano-json-v1/
### regression_detection_slow
The Perf instance has not detected any regressions in the past hour, which is
unlikely given the large number of traces these instances ingest.
Check that data is arriving to the instances that do event driven regression:
https://prom2.skia.org/graph?g0.range_input=6h&g0.max_source_resolution=0s&g0.expr=rate(ack%5B30m%5D)&g0.tab=0
Check that PubSub messages are being processed:
http://go/android-perf-ingest-stall
And determine when regression detection stopped:
https://prom2.skia.org/graph?g0.range_input=1d&g0.max_source_resolution=0s&g0.expr=sum(rate(perf_regression_store_found%7Bapp%3D~%22perf-clustering-android%7Cskiaperf%7Cskiaperf-android-x%22%7D%5B30m%5D))%20by%20(app)&g0.tab=0
## too_much_data
There are times when a process may inject too much data into Perf.
These queries may be useful to run against the database to find where the data
is coming from:
```
-- Sample a subset of param_values.
SELECT param_value FROM paramsets WHERE tile_number=265 AND param_key='sub_result' ORDER BY random() LIMIT 100;
```
You can look at the [Perf dashboard](https://grafana2.skia.org/d/VNdBF9Ciz/perf)
to see relevant tile numbers.
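If the dashboard is unavailable, a rough alternative is to ask the database for
the most recent tiles directly (a sketch that assumes the `paramsets` schema used
in the queries below):
```
-- The highest tile numbers are the ones currently receiving data.
SELECT DISTINCT tile_number FROM paramsets ORDER BY tile_number DESC LIMIT 5;
```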
Once you find the affected paramsets you can count how many new
paramset values are being added. In this case the sampling query above
showed that the param_value always begins with `showmap_granular`, so we
construct a query that limits us to those param_values.
```
SELECT
COUNT(param_value)
FROM
paramsets
AS OF SYSTEM TIME '-5s'
WHERE
tile_number=283
AND param_key='sub_result'
AND param_value>'showmap_granular'
AND param_value<'showmap_granulas';
```
For example:
```
root@perf-cockroachdb-public:26257/android> SELECT
COUNT(param_value)
FROM
paramsets
WHERE
tile_number=282
AND param_key='sub_result'
AND param_value>'showmap_granular'
AND param_value<'showmap_granulas';
count
+---------+
9687198
(1 row)
Time: 8.331250012s
```
You can then use the same WHERE clause in a DELETE to remove all the matching paramsets:
```
DELETE
FROM
paramsets
WHERE
tile_number=282
AND param_key='sub_result'
AND param_value>'showmap_granular'
AND param_value<'showmap_granulas';
```
Make sure to remove the erroneous params from all the tiles where they appear.
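One way to enumerate those tiles is to drop the `tile_number` filter and group by
it; this is a sketch using the same `showmap_granular` range as above:
```
SELECT
  tile_number,
  COUNT(param_value)
FROM
  paramsets
AS OF SYSTEM TIME '-5s'
WHERE
  param_key='sub_result'
  AND param_value>'showmap_granular'
  AND param_value<'showmap_granulas'
GROUP BY
  tile_number;
```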
You may encounter contention that will slow the deletes, particularly if there
are many rows to delete. It will help to temporarily scale the number of
clusterers down to zero:
```
kubectl scale --replicas=0 deployment/perf-clustering-android
```
Don't forget to scale them back up!
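For example, record the current replica count before scaling down and restore it
once the deletes finish (the count of 1 below is only a placeholder):
```
# Note the desired replica count before scaling down.
kubectl get deployment perf-clustering-android

# After the cleanup, restore the recorded count (1 is a placeholder).
kubectl scale --replicas=1 deployment/perf-clustering-android
```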
Deleting in batches will also speed things up; try different values for
the LIMIT:
```
DELETE
FROM
paramsets
WHERE
tile_number=282 AND
param_key='sub_result' AND
param_value>'showmap_granular' AND
param_value<'showmap_granulas' LIMIT 100000;
```
The bash file `//perf/migrations/batch-delete.sh` does batches of deletes using
`//perf/migrations/batch-delete.sql` as the SQL to run. Modify that file to
control which params to delete.
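For illustration only, a batched delete loop has roughly the following shape. This
is a sketch, not the contents of the real script; it assumes the `cockroach` CLI can
reach the cluster and that `batch-delete.sql` holds a `DELETE ... LIMIT` statement
like the one above.
```
#!/bin/bash
# Hypothetical sketch; the real //perf/migrations/batch-delete.sh may differ.
# Repeat the batched DELETE until a batch removes zero rows.
while true; do
  out=$(cockroach sql --insecure --host=perf-cockroachdb-public \
        --database=android < batch-delete.sql)
  echo "${out}"
  # The SQL shell reports how many rows each DELETE removed.
  if echo "${out}" | grep -q "DELETE 0$"; then
    break
  fi
done
```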