General Production Manual
=========================
This file documents things that don't belong to a specific service.
Other Resources (For Googlers only)
-----------------------------------
- [https://goto.google.com/skia-infra-gardener]
- [https://goto.google.com/skolo-maintenance]
- [https://goto.google.com/skolo-playbook]
Alerts
======
Items below here should include target links from alerts.
DiskSpaceLow
------------
This means a given disk on one of our machines is low on free space. Running out of disk space causes
problems, so we try to keep a healthy buffer (which varies depending on the total disk size).
For machines running Swarming, low disk space can cause failures when downloading a task from Isolate,
which has been a problem before.
To fix, [connect to the machine](https://skia.org/dev/testing/swarmingbots#connecting-to-swarming-bots),
and use `df -h` or a similar command to identify which disk(s) are low. `du -hd 2` can be a useful
tool for identifying which folders are taking up a lot of space.
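
As a rough illustration of the first step, the sketch below filters `df -P` output down to the mount points at or above a usage threshold. The function name `low_disks` and the default threshold of 90% are assumptions made for this example; the real alert threshold varies with total disk size.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): print mount points whose usage
# is at or above a threshold, given `df -P` formatted output on stdin.
# The default of 90 is an illustrative assumption, not the alert's threshold.
low_disks() {
  threshold="${1:-90}"
  # In `df -P` output, column 5 is the capacity (e.g. "95%") and column 6 is
  # the mount point. Skip the header row, strip the "%" sign, then compare.
  awk -v t="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= t) print $6, use "%"
    }'
}
```

On a machine, this could be run as `df -P | low_disks 90`, followed by `sudo du -hd 2 <mount point>` on whatever it reports. Mount points containing spaces are not handled by this sketch.
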
- If a /root disk is full, try cleaning out the APT cache with `sudo apt-get clean`.
- If a /var disk is full, try deleting /var/log/* and restarting the machine.
- If a /tmp disk is full, it usually cleans itself up on a reboot.
- On a swarming machine, if /b (/mnt/pd0) is full, there are a few things to check:
  - The `/b/s/*_cache` folders may have gotten very large. If so, stop swarming, delete the folders, and
    reboot.
  - /b/docker (the docker cache) can take up 100+ GB. Clean it with `sudo docker system prune -fa`.
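
Before deleting anything under /b, it can help to confirm which of the two usual suspects is actually large. The sketch below is one way to do that; the function name `report_b_usage` is an assumption made for this example, and it only reports sizes without removing anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): show the size of the Swarming
# cache folders and the docker cache under a given root (default: /b).
# Read-only; nothing is deleted.
report_b_usage() {
  root="${1:-/b}"
  # Errors (a missing folder, or an unmatched *_cache glob) are suppressed so
  # the report simply omits anything that does not exist.
  du -sh "$root"/s/*_cache "$root"/docker 2>/dev/null
}
```

Running `report_b_usage` with no arguments inspects /b. Reading /b/docker may require `sudo`, in which case the `du` command can be run directly with elevated privileges.
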
If many machines are experiencing this, you may want to use the
[run_on_swarming_bots](../scripts/run_on_swarming_bots) script to fix them all at once.
Key metrics: collectd_df_df_complex