Cloudy VP Benjamin Treynor Sloss said the problem started when engineers removed an unused Google Compute Engine (GCE) IP block from Google’s network configuration, and instructed Google’s automated systems to propagate the new configuration across the network.
Apparently Google announces the IP blocks it is using to help route traffic into its cloud.
However this time the propagation failed due to “a timing quirk in the IP block removal”. These quirks are an odd species, sort of like grammar nazis with stop watches.
“When propagation fails, Google usually fails over to the configuration in place before the new block was added. But on this occasion “a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.”
Normally Google says it has a “canary step” designed to catch messes like that described above. However in this case the canary also had a bug and was home in bed with a hot waterbottle and was not stepping anywhere.
So the push system decided in the absense of stepping canaries the new broken configuration was valid and began its progressive rollout.”
Google says it’s found the bugs in its network configuration software responsible for the first mess, and the canary is back at work singing like it is expected to.
It is also making “14 distinct engineering changes planned spanning prevention, detection and mitigation” and expects more will follow.