# Kubernetes Checker Production Manual

The goal of this service is to monitor our Kubernetes (k8s) cluster and pipe relevant metrics into
Prometheus, so we can get alerts based on them.

When dealing with k8s pods/containers/deployments, it is handy to reference:
<https://github.com/skia-dev/textfiles/blob/master/kubectl-cheatsheet.md>

This app syncs <https://skia.googlesource.com/k8s-config/> and uses it to generate some metrics.

One instance of k8s-checker should run on each production GCE cluster where we run our services.
As of now, we do not run it in any skolo clusters.

# Alerts

Items below here should include target links from alerts.

## K8sCheckerLiveness

This alert signifies that the k8s-checker service is running, but has failed to update the metrics
about its cluster recently. It normally attempts to do so once per minute.

Check out the logs of the relevant k8s-checker service.

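A minimal sketch for finding and tailing those logs (the namespace and exact pod name vary by
cluster, so adjust as needed):

```
# Find the k8s-checker pod in this cluster.
kubectl get pods --all-namespaces | grep k8s-checker

# Tail its logs to see why the metric updates are failing.
kubectl logs -f <k8s-checker_pod_name> -n <namespace>
```
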
If the data for this alert is missing, that probably means k8s-checker is not running.

Key metrics: liveness_k8s_checker_s

## TooManyPodRestarts

This alert triggers if a pod has restarted many times since it was deployed. This can indicate a
rare crash (e.g. a nil dereference) or a burst of restarts due to an external dependency outage.

To gather more information, use `kubectl logs -f <pod_name>` to get the current logs from the
container (to see if it is currently running ok) and `kubectl logs <pod_name> --previous` to
attempt to ascertain the cause of the previous restarts.

`kubectl describe pod <pod_name> | grep -A 5 "Last State"` can also give information about the
previous life of the pod (e.g. "Reason: Error" or "Reason: OOMKilled").

Ways to address the alert include deploying a new (hopefully fixed) version of the container or
explicitly re-deploying the current version to reset the restart count metric. Contact the service
owner to discuss the best mitigation approach.

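If the service is managed by a Deployment, one way to re-deploy the current version (and thereby
reset the restart count) is a rollout restart; a sketch, with placeholder names:

```
kubectl rollout restart deployment/<deployment_name> -n <namespace>
```
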
Key metrics: pod_restart_count

## PodRestartingFrequently

This alert triggers if a pod has restarted multiple times in the last hour. This can indicate a
currently down or crash-looping service.

The same advice as for the TooManyPodRestarts alert applies.

Key metrics: pod_restart_count

## EvictedPod

A pod has been evicted, commonly for using too much memory.

To get the reason, try `kubectl describe pod <pod_name> | grep -A 4 "Status"`. Contact the service
owner with this reason, file a bug if necessary, and then clean up the Evicted pod with
`kubectl delete pod <pod_name> --grace-period=0 --force`.

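If several pods were evicted around the same time, a quick way to list them all (evicted pods show
`Evicted` in the STATUS column):

```
kubectl get pods --all-namespaces | grep Evicted
```
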
Key metrics: evicted_pod_metric

## DirtyCommittedK8sImage

A dirty image has been committed to the prod checkout. Check with the service owner and/or the
image author to see if they are done experimenting and if we can land/push a clean image.

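To find which checked-in file references the dirty image, one approach is to grep a local checkout
of the k8s-config repo; a sketch, assuming dirty images are identifiable by a "dirty" marker in
their tag:

```
cd /path/to/k8s-config  # local checkout of https://skia.googlesource.com/k8s-config/
git grep -n "dirty" -- "*.yaml"
```
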
Key metrics: dirty_committed_image_metric

## DirtyRunningK8sConfig

A dirty image has been running in production for at least two hours. Check with the service owner
and/or the image author to see if they are done experimenting and if we can land/push a clean image.

Key metrics: dirty_config_metric

## StaleK8sImage

The same k8s image has been running in production for at least 30 days. We should push an updated
image soon to pick up new changes and ensure things continue to work (and aren't secretly broken
for weeks).

Contact the service owner to see if the image can be updated.

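To list the images currently running in the cluster (a sketch; if the image tags embed a build
timestamp, as Skia's typically do, this also shows roughly how old each one is):

```
kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}" \
  | tr ' ' '\n' | sort | uniq -c
```
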
Key metrics: stale_image_metric

## CheckedInK8sAppNotRunning

An app has a checked-in .yaml file, but it is currently not running in production. This might mean
that a service owner forgot to push it after checking it in, or that it has somehow stopped running.

Check with the service owner to see if it needs to be deployed.

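To confirm whether anything for the app is running, a quick check (a sketch; <app_name> is the name
from the checked-in .yaml file):

```
kubectl get pods --all-namespaces | grep <app_name>
```
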
Known exceptions:

- Gold has some test server configs for doing integration tests of goldpushk. These shouldn't be
  running unless those tests are being run manually.

Key metrics: app_running_metric

## CheckedInK8sContainerNotRunning

A container exists in a checked-in .yaml file, but it is currently not running in production.
This might mean that a service owner forgot to push it after checking it in, or that it has
somehow stopped running.

Check with the service owner to see if it needs to be deployed.

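To compare what the checked-in .yaml file specifies against what the pod is actually running, a
sketch:

```
# Containers declared in the running pod's spec.
kubectl get pod <pod_name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'

# Per-container state (Running / Waiting / Terminated).
kubectl describe pod <pod_name> -n <namespace>
```
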
Key metrics: container_running_metric

## RunningK8sAppNotCheckedIn

An app is running in production, but does not belong to a checked-in .yaml file.

This typically happens if someone is testing out a new service. Reach out to them for more details.

Key metrics: running_app_has_config_metric

## RunningK8sContainerNotCheckedIn

A container is running in production, but does not belong to a checked-in .yaml file.

This typically happens if someone is testing out a new service. Reach out to them for more details.

Key metrics: running_container_has_config_metric

## AppRunningInDefaultNamespace

An app is running in the default namespace. This is undesirable for
organizational reasons and because we use namespaces as security boundaries. If
this app was pushed to the default namespace by accident, delete it and re-push
to the correct namespace. Otherwise, create a namespace following the example
of the other apps in the cluster.

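To see what is running in the default namespace, and to remove an accidental push, a sketch (this
assumes the app was pushed as a Deployment; adjust the resource type if not):

```
# Everything currently running in the default namespace.
kubectl get all -n default

# Remove an accidentally pushed Deployment (placeholder name).
kubectl delete deployment <app_name> -n default
```
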
Key metrics: pod_running

## K8sFailedScheduling

The Kubernetes cluster is failing to schedule a pod. This may be because the
node pool needs to scale up, or because the pod is requesting too many
resources. Read the associated message field or investigate cluster logs to
determine the cause and appropriate resolution.

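The scheduler's message is usually visible in the cluster events and on the affected pod; a sketch:

```
# Cluster-wide FailedScheduling events, with their messages.
kubectl get events --all-namespaces --field-selector reason=FailedScheduling

# The same message appears in the Events section of the affected pod.
kubectl describe pod <pod_name> -n <namespace>
```
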
Key metrics: k8s_events

## PodUnschedulable

The Kubernetes cluster is failing to schedule a pod. This may be because the
node pool needs to scale up, or because the pod is requesting too many
resources. Read the associated message field or investigate cluster logs to
determine the cause and appropriate resolution.

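Unschedulable pods sit in the Pending phase; to find them (a sketch), then use
`kubectl describe pod` as in the previous section to see why the scheduler cannot place them:

```
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```
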
Key metrics: pod_unschedulable

## K8sFailedScaleUp

The Kubernetes cluster is failing to scale up the node pool. This may be because
the node pool is at its upper limit, or because we've exceeded our quota. Read
the associated message field or investigate cluster logs to determine the cause
and appropriate resolution.

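To check the node pool's autoscaling limits and the project's quota usage, a sketch (the node pool,
cluster, zone, and region names are placeholders):

```
# Autoscaling bounds and current size of the node pool.
gcloud container node-pools describe <node_pool> --cluster=<cluster> --zone=<zone>

# Regional quota usage for the project (look for CPUS, IN_USE_ADDRESSES, etc.).
gcloud compute regions describe <region>
```
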
Key metrics: k8s_events