An excellent video by a Google Site Reliability Engineer, from Goto Conference 2017. What I liked in particular were three key points:
- Being honest that having operations act as border guards, vetting code changes against an increasingly long checklist before they go live, is a path to failure and frustration.
- Agreeing on a common definition of acceptable reliability, and then being honest about its consequences: you have an error budget, a number of errors per month that you deem OK.
- Setting the error budget up to act as a gate on the flow of new code to production. Error budget > 0 => new code can go live. Error budget = 0 => no new code goes live, and everyone (including programmers) works to make the system happy again. This leads to self-regulation amongst the programmers, who would much rather be doing new things than fire-fighting, so there are psychological forces at work that counter the urge to chuck new stuff over the fence to production.
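
To make the gate idea concrete, here is a minimal sketch in Python of how an error budget could be derived from a reliability target and used to decide whether new code may go live. The video gives no implementation. The class name `ErrorBudget`, its fields, the 99.9% target and the request counts are all my own assumptions for illustration, and counting failed requests is only one possible way to measure "errors per month".

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one measurement window (e.g. a month)."""

    slo_target: float     # agreed reliability, e.g. 0.999 (assumed value)
    total_requests: int   # requests in the window (assumed unit of measurement)
    failed_requests: int  # requests that have failed so far in the window

    def allowed_failures(self) -> int:
        # Whatever the target leaves over is the budget: (1 - target) * volume.
        return round((1 - self.slo_target) * self.total_requests)

    def remaining(self) -> int:
        # Can go negative if the budget has been overspent.
        return self.allowed_failures() - self.failed_requests

    def can_release(self) -> bool:
        # Budget > 0 => new code can go live; budget used up => releases stop.
        return self.remaining() > 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(budget.allowed_failures(), budget.remaining(), budget.can_release())
```

With these made-up numbers, a 99.9% target over a million requests allows 1,000 failures. After 400 failures there are 600 left, so releases continue. Once the count reaches 1,000, `can_release()` returns `False` and the team's attention shifts to reliability work.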