Psychology, not technology, is the key to Google’s reliability

This is an excellent video of a talk by a Google Site Reliability Engineer, from GOTO Conference 2017. Three key points stood out for me:

  1. Being honest that having operations act as border guards, who try to vet code changes against an increasingly long checklist before they go live, is a path to failure and frustration.
  2. Agreeing a common definition of acceptable reliability, and then being honest about the consequences: you have an error budget, i.e. a number of errors per month that you deem OK.
  3. Setting the error budget up to act as a gate on the flow of new code to production. Error budget > 0 => new code can go live. Error budget = 0 => no new code goes live, and everyone (including programmers) works to make the system happy again. This leads to self-regulation amongst the programmers, who would much rather be doing new things than fire-fighting, so there are psychological forces at work that counter the urge to chuck new stuff over the fence to production.
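
To make the gate in point 3 concrete, here is a minimal sketch in Python. It is only an illustration, not Google's actual tooling: the `ErrorBudget` class, its field names, the 99.9% target and the idea of counting failed requests are all my own assumptions (the talk only speaks of "so many errors per month").

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical monthly error budget derived from an agreed target."""

    slo_target: float     # assumed agreed reliability target, e.g. 0.999
    total_requests: int   # requests served so far this month
    failed_requests: int  # requests that failed so far this month

    def allowed_failures(self) -> int:
        # Failures the agreed target tolerates for this volume of traffic.
        # round() guards against floating-point noise in (1 - slo_target).
        return round(self.total_requests * (1 - self.slo_target))

    def remaining(self) -> int:
        # Clamp at zero: an overspent budget is simply treated as spent.
        return max(0, self.allowed_failures() - self.failed_requests)

    def can_release(self) -> bool:
        # Error budget > 0 => new code can go live.
        # Error budget = 0 => no new code goes live.
        return self.remaining() > 0


# Example figures, invented for illustration only.
healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
spent = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_000)

if healthy.can_release():
    print("Budget left:", healthy.remaining(), "- new code can go live")
if not spent.can_release():
    print("Budget spent - everyone works on reliability")
```

The point of the sketch is that the release decision is a simple function of numbers everyone agreed on beforehand, so nobody has to play border guard: the programmers can see for themselves how much budget their last few releases have used up.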
