# Kubernetes Deployer Production Manual
The purpose of this service is to apply changes to a Kubernetes (k8s) cluster.
One instance of k8s-deployer should run in each cluster.
# Alerts
The sections below should be the targets of links in the corresponding alert definitions.
## K8sDeployerLiveness
This alert signifies that the k8s-deployer service is running, but has recently failed to
apply configurations to its cluster. Check the logs of the relevant k8s-deployer instance.
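A minimal sketch of the log check, assuming the k8s-deployer pods carry an `app=k8s-deployer` label (adjust the selector to your cluster):
```
# List the k8s-deployer pod(s); adjust the label selector if your cluster uses a different one.
kubectl get pods -l app=k8s-deployer

# Tail the logs of the pod found above.
kubectl logs -f <k8s-deployer_pod_name>
```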
If the data for this alert is missing, that probably means k8s-deployer is not
running.
Key metrics: liveness_k8s_deployer_s
## TooManyPodRestarts
This alert triggers if a pod has restarted many times since it was deployed.
This can indicate a rare crash (e.g. nil dereference) or a burst of restarts due
to an external dependency outage.
To gather more information, use `kubectl logs -f <pod_name>` to get the current logs from the
container (to see if it is currently running ok) and `kubectl logs -f <pod_name> --previous`
to attempt to ascertain the cause of the previous restarts.
`kubectl describe pod <pod_name> | grep -A 5 "Last State"` can also give information about the previous
life of the pod (e.g. "Reason: Error" or "Reason: OOMKilled").
Ways to address the alert include deploying a new (hopefully fixed) version of the container or
explicitly re-deploying the current version to reset the restart count metric. Contact the service
owner to discuss the best mitigation approach.
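If the pod is managed by a Deployment, one way to re-deploy the current version is a rollout restart; a sketch, assuming a Deployment named `<deployment_name>`:
```
# Re-create the pods of the Deployment without changing its spec; this resets the
# restart count metric. Coordinate with the service owner first.
kubectl rollout restart deployment/<deployment_name>

# Watch the rollout until the new pods are Ready.
kubectl rollout status deployment/<deployment_name>
```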
Key metrics: pod_restart_count
## PodRestartingFrequently
This alert triggers if a pod has restarted multiple times in the last hour. This can indicate a
currently down or crash-looping service.
The same advice as the TooManyPodRestarts alert applies.
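To see at a glance which pods are restarting, a sketch (assuming cluster-wide access):
```
# List all pods across namespaces, sorted by the restart count of their first container.
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'
```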
Key metrics: pod_restart_count
## EvictedPod
A pod has been evicted, commonly for using too much memory.
To get the reason, try `kubectl describe pod <pod_name> | grep -A 4 "Status"`. Contact the service owner
with this reason, file a bug if necessary, and then clean up the Evicted pod with
`kubectl delete pod <pod_name> --grace-period=0 --force`.
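To find all evicted pods in the cluster (evicted pods are reported in the `Failed` phase), a sketch:
```
# Evicted pods end up in the Failed phase; list them across all namespaces.
kubectl get pods -A --field-selector=status.phase=Failed
```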
Key metrics: evicted_pod_metric
## DirtyCommittedK8sImage
A dirty image has been committed to the prod checkout. Check with the service owner
and/or the image author to see if they are done experimenting and if we can land/push a clean image.
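A sketch for locating the offending config, assuming dirty images carry a `dirty` marker in their tag (adjust the pattern to the actual convention):
```
# From the root of the prod checkout, find .yaml files that reference a dirty image tag.
grep -rn "dirty" --include="*.yaml" .
```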
Key metrics: dirty_committed_image_metric
## DirtyRunningK8sConfig
A dirty image has been running in production for at least two hours. Check with the service owner
and/or the image author to see if they are done experimenting and if we can land/push a clean image.
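To see which running images look dirty, a sketch (again assuming a `dirty` marker in the tag):
```
# List every image currently running in the cluster and filter for dirty tags.
kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u | grep dirty
```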
Key metrics: dirty_config_metric
## StaleK8sImage
The same k8s image has been running in production for at least 30 days. We should push an updated
image soon to pick up new changes and ensure things continue to work (and aren't secretly broken
for weeks).
Contact the service owner to see if the image can be updated.
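To confirm which image an app is currently running, a sketch assuming the app is deployed as a Deployment named `<app_name>`:
```
# Show the image(s) a Deployment is currently running; the tag usually identifies the build.
kubectl get deployment <app_name> -o jsonpath='{.spec.template.spec.containers[*].image}'
```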
Key metrics: stale_image_metric
## CheckedInK8sAppNotRunning
An app has a checked-in .yaml file, but it is not currently running in production. This might mean
that a service owner forgot to push it after checking it in, or that it has somehow stopped running.
Check with the service owner to see if it needs to be deployed.
Known exceptions:
- Gold has some test server configs for doing integration tests of goldpushk. These shouldn't be
running unless those tests are being run manually.
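To confirm whether the app is running, a sketch assuming pods are labeled (or at least named) after the app:
```
# Look for pods belonging to the app; an empty result confirms the alert.
kubectl get pods -l app=<app_name>

# If the label is not set, fall back to matching on the pod name.
kubectl get pods | grep <app_name>
```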
Key metrics: app_running_metric
## CheckedInK8sContainerNotRunning
A container exists in a checked-in .yaml file, but it is not currently running in production.
This might mean that a service owner forgot to push it after checking it in, or that it has somehow
stopped running.
Check with the service owner to see if it needs to be deployed.
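To check which containers the running pods actually contain, a sketch:
```
# List the containers of each running pod and look for the missing one.
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}' | grep <container_name>
```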
Key metrics: container_running_metric
## RunningK8sAppNotCheckedIn
An app is running in production, but does not correspond to any checked-in .yaml file.
This typically happens if someone is testing out a new service. Reach out to them for more details.
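To gather context before reaching out, a sketch showing when the pod was created and which image it runs:
```
# Show when the unexpected pod was created and which image it runs, to help
# identify who deployed it.
kubectl get pod <pod_name> -o jsonpath='{.metadata.creationTimestamp}{" "}{.spec.containers[*].image}{"\n"}'
```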
Key metrics: running_app_has_config_metric
## RunningK8sContainerNotCheckedIn
A container is running in production, but does not correspond to any checked-in .yaml file.
This typically happens if someone is testing out a new service. Reach out to them for more details.
Key metrics: running_container_has_config_metric