Tuesday's major Amazon Web Services outage was caused by human error, the retailer has confirmed, with the downtime that affected a number of online services, including Apple's, traced back to a single incorrectly entered command issued during debugging.
Amazon's note to customers about the S3 (Simple Storage Service) disruption in the US-East-1 region explains that the team was working on an issue that caused the S3 billing system to run slower than expected. One team member executed a command from an "established playbook" to take down a small number of servers used for a subsystem in the billing process, but mistakenly took down more than required.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the Amazon note states.
The extra servers were used to support two other S3 subsystems. The first is the "index subsystem," which manages metadata and location information for all S3 objects in a region and is required for the service to perform data storage and management tasks. The second, the "placement subsystem," relies on the index subsystem in order to function, and is used to allocate storage for new data.
Enough servers were taken down in both of these subsystems to cause a drop in capacity that forced the team to restart all of the systems. During this restart period, S3 was unable to service requests, which also affected other AWS services in the region, including Amazon's Elastic Compute Cloud (EC2), Elastic Block Store (EBS) volumes, AWS Lambda, and the S3 console.
S3's subsystems are said by Amazon to be "designed to support the removal or failure of significant capacity with little or no customer impact," built with the assumption that systems will fail and can be replaced by others. However, Amazon notes there has not been a complete restart of the index subsystem for "many years," and that the massive growth of AWS caused the process of restarting the services and running safety checks to take "longer than expected."
To prevent such a mistake from affecting services as severely again, the tool has been modified to remove capacity more slowly, with added safeguards that will maintain the minimum required capacity level for each subsystem. Other operational tools will also be audited to ensure they have similar checks in place.
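As a rough illustration of the kind of safeguard described above, the following Python sketch refuses any removal request that would leave a subsystem below its minimum capacity, and splits an accepted request into small batches. It is not Amazon's tool: the names `Subsystem`, `plan_removal`, `min_required`, and `max_per_step`, along with all of the numbers, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List


class CapacityError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor needed to keep serving requests


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> List[int]:
    """Return the batch sizes for removing `requested` servers.

    Raises CapacityError if the removal would leave the subsystem
    below its minimum required capacity.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if max_per_step <= 0:
        raise ValueError("max_per_step must be a positive number of servers")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers from {subsystem.name} would leave "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Remove capacity slowly: never more than max_per_step servers at once.
    batches: List[int] = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches
```

With a hypothetical index subsystem of 100 servers and a floor of 80, a request to remove 5 servers would be planned as three small batches, while a mistyped request for 50 would be rejected before any server is touched.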
Additionally, work is being carried out to repartition the index subsystem, dividing it into smaller sections to speed up recovery.
The Service Health Dashboard, a page that shows AWS users the status of services, failed to show that there was an issue during the downtime, as it relied on S3 in order to function and couldn't update. Amazon is updating the dashboard so that it runs across multiple AWS regions, making sure it works without being dependent on any single region.
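The idea of a status page that does not depend on a single region can be sketched in a few lines of Python. This is only an assumption about the general approach, not a description of how Amazon built its dashboard: the function `read_status` and the per-region fetchers are hypothetical, and a real system would also need to keep the regional copies in sync.

```python
from typing import Callable, Dict, Optional, Tuple


def read_status(fetchers: Dict[str, Callable[[], str]]) -> Optional[Tuple[str, str]]:
    """Try each region in order and return (region, status) from the first
    one that responds, or None if every region is unreachable."""
    for region, fetch in fetchers.items():
        try:
            return region, fetch()
        except ConnectionError:
            # This region's copy is unavailable; fall through to the next one.
            continue
    return None


def unreachable() -> str:
    """Stand-in for a region whose storage cannot be reached."""
    raise ConnectionError("storage unavailable in this region")


def healthy_copy() -> str:
    """Stand-in for a region that still holds a readable status record."""
    return "S3: increased error rates"
```

Calling `read_status({"us-east-1": unreachable, "us-west-2": healthy_copy})` skips the failed region and returns the status held in the second one, so the page can still report a problem while the first region is down.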
Amazon ends the note by apologizing for the impact of the event on its customers. "While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their business."
"We will do everything we can to learn from this event and use it to improve our availability even further."
The outage caused a number of websites that relied on S3 to suffer issues, as well as a number of apps that used Amazon's cloud servers for their services. Apple customers were also affected by the outage, with some users of the iOS and Mac App Stores, iCloud Drive, Notes, iCloud backup, Apple TV, and Apple Music encountering issues during the downtime.
Apple is believed to be making progress moving away from relying on Amazon for its cloud services, by creating its own data centers instead. Apple's Mesa facility is being turned into a "global command center," with the company working to establish new data centers in Ireland and Denmark.
Apple's existing Reno data center, which handles Siri, FaceTime, and iMessage among other tasks, may grow in the future. It was recently reported that Apple is planning to expand the data center by over 375,000 square feet, at a cost of around $50.7 million.