While cloud-based systems are being touted as a “cure for cancer” and the future of computing, Amazon’s spectacular crash last week has revealed all sorts of problems with the idea.
The online bookseller has revealed that its EC2 cloud computing network crashed when a routine server upgrade went wrong, causing a cascade of further problems that took down thousands of websites in a ‘perfect storm’.
Amazon promised to learn lessons from the crash and offered affected customers 10 free days of storage to compensate them for their loss.
Normally, when a cluster of computers, or ‘availability zone’, is being upgraded, all of its traffic is routed to another section on the primary network while the work is undertaken.
But an engineer mistakenly sent the huge volume of traffic to a backup system – which couldn’t handle the strain and crashed.
Engineers fixed the problem, but a fail-safe mechanism then kicked in and hard drives throughout the facility tried to back themselves up.
More drives joined in and the network ran out of space. This caused widespread failure on a number of supposedly isolated availability zones.
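The way a backup rush like this can exhaust a network can be sketched in a few lines of Python. This is a toy model only, not Amazon’s actual mechanism: the function name `remirror_storm`, the idea of one unit of spare space per drive, and every number used are assumptions made purely for illustration.

```python
def remirror_storm(spare_capacity, joining_per_round):
    """Toy model of a backup rush (all figures hypothetical).

    Each round, a new wave of drives tries to copy itself into spare
    space. Every successful copy uses up one unit of spare capacity.
    Drives that find no space are left stuck.
    """
    stuck = 0
    timeline = []
    for joining in joining_per_round:
        # Only as many drives as there is spare space can be served.
        served = min(joining, spare_capacity)
        spare_capacity -= served
        # Everyone else in this wave is left without a backup.
        stuck += joining - served
        timeline.append(
            {"served": served, "stuck": stuck, "spare": spare_capacity}
        )
    return timeline


if __name__ == "__main__":
    # Hypothetical waves of drives joining the rush, each larger than the last.
    for step in remirror_storm(spare_capacity=100, joining_per_round=[40, 80, 160]):
        print(step)
```

In this sketch the first wave is absorbed without trouble, the second uses up the last of the spare space, and every drive that joins afterwards is left stuck, which is the shape of the failure described above.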
While Amazon touts its cloud as ‘bomb-proof’, it is fairly clear that the cloud is a complex network in which chaos theory is king.
To make matters worse, it appears that Amazon has lost valuable customer data in the crash. It is not clear how much at this point. Although Amazon said it had lost only a small percentage of the total data stored, thousands of websites could be affected.
But if complex networks are more vulnerable to Murphy’s Law, then it would probably not be a good idea to keep all your data with one company. To be sensible, a company would need two clouds to back up its data. Such fragility would also make the cloud a brilliant target for a hacker who is only interested in causing damage.