Psychology, not technology, is the key to Google’s reliability

An excellent video by a Google Site Reliability Engineer, from Goto Conference 2017.  What I liked in particular were three key points:

  1. Being honest that having operations act as border guards, who try to vet code changes against an ever-longer checklist before they go live, is a path to failure and frustration.
  2. Agreeing a common definition of acceptable reliability, and then being honest about its consequences, gives you an error budget: so many errors per month that you deem OK.
  3. You set the error budget up to act as a gate on the flow of new code to production.  Error budget > 0 => new code can go live.  Error budget = 0 => no new code goes live and everyone (including programmers) works to make the system happy again.  This leads to self-regulation amongst the programmers, who would much rather be doing new things than fire-fighting, so there are psychological forces at work that can counter the urge to chuck new stuff over the fence to production.
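
The gate in point 3 can be sketched in a few lines of Python. This is only an illustration of the rule as I understood it from the talk, not Google's actual tooling: the `ErrorBudget` class, its field names, and the example numbers are all made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical model of a monthly error budget (names are assumptions)."""

    allowed_errors_per_month: int  # the number of errors everyone agreed is OK
    errors_this_month: int = 0     # errors observed so far this month

    def remaining(self) -> int:
        # The budget never goes below zero; an overspend simply means "empty".
        return max(0, self.allowed_errors_per_month - self.errors_this_month)

    def can_release(self) -> bool:
        # Error budget > 0 => new code can go live.
        # Error budget = 0 => no new code goes live until reliability recovers.
        return self.remaining() > 0


if __name__ == "__main__":
    budget = ErrorBudget(allowed_errors_per_month=100, errors_this_month=40)
    print(budget.remaining(), budget.can_release())

    budget.errors_this_month = 100
    print(budget.remaining(), budget.can_release())
```

A release pipeline would call something like `can_release()` before each deploy. The point is that the decision comes from a number both sides agreed in advance, rather than from an operations checklist.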
