Human error caused Amazon Web Services outage, Apple iCloud service issues

Tuesday's major Amazon Web Services outage was caused by human error, the retailer has confirmed, with the downtime, which affected a number of online services including Apple's, traced back to a single incorrectly entered command issued during debugging.

The note to customers about the S3 (Simple Storage Service) disruption in the US-East-1 region says the team was working on an issue that caused the S3 billing system to run slower than expected. One team member executed a command from an "established playbook" to take down a small number of servers used for a subsystem in the billing process, but mistakenly took down more than required.

"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the Amazon note states.

The extra servers were used to support two other S3 subsystems. The first is the "index subsystem," which manages metadata and location information for all S3 objects in a region and is required for the service to perform data storage and management tasks. The second, the "placement subsystem," relies on the index subsystem in order to function and is used to allocate storage for new data.

Enough servers were taken down in both of these subsystems to cause a drop in capacity, forcing the team to restart all of the systems. During this restart period, S3 was unable to service requests, which also affected other AWS services in the region, including Amazon's Elastic Compute Cloud (EC2), Elastic Block Store (EBS) volumes, AWS Lambda, and the S3 console.

S3's subsystems are said by Amazon to be "designed to support the removal or failure of significant capacity with little or no customer impact," built with the assumption that systems will fail and can be replaced by another. Noting there has not been a complete restart of the index subsystem for "many years," Amazon says the massive growth of AWS meant the process of restarting the services and running safety checks took "longer than expected."

To prevent such a mistake from affecting assorted services as profoundly again, the tool has been modified to remove capacity more slowly, with added safeguards that will maintain the minimum required capacity level for each subsystem. Other operational tools will also undergo auditing to ensure they have similar checks in place.
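As a rough illustration of the kind of safeguard described, the following Python sketch refuses any removal that would drop a subsystem below a minimum capacity level and removes servers in small batches. It is not Amazon's tool: the `Subsystem` class, the `remove_capacity` function, the `max_per_step` parameter, and all of the numbers are hypothetical, chosen only to show the idea.

```python
from __future__ import annotations

from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would breach the minimum capacity."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor; real values are not in Amazon's note


def remove_capacity(subsystem: Subsystem, count: int, max_per_step: int = 2) -> list[int]:
    """Remove `count` servers in small batches, never going below the floor.

    Returns the size of each batch that was removed.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    # Safeguard: reject the whole request before touching anything.
    if subsystem.active_servers - count < subsystem.min_required:
        raise CapacityError(
            f"removing {count} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )
    batches = []
    remaining = count
    while remaining > 0:
        # Slow removal: take at most `max_per_step` servers at a time.
        step = min(max_per_step, remaining)
        subsystem.active_servers -= step
        batches.append(step)
        remaining -= step
    return batches


index = Subsystem(name="index", active_servers=10, min_required=6)
print(remove_capacity(index, 3))  # a small, valid request
try:
    remove_capacity(index, 5)  # a mistyped, oversized request
except CapacityError as err:
    print("refused:", err)
```

In this sketch, a mistyped input fails outright instead of silently removing more servers than intended, which is the behavior the note says the modified tool now enforces.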

Additionally, work is being carried out on the index subsystem to repartition it, dividing it down into smaller sections to speed up the recovery time.

The Service Health Dashboard, a page that displays the status of services to AWS users, failed to show that there was an issue during the downtime, as it relied on S3 in order to function and couldn't update. Amazon is updating the dashboard so that it functions across multiple AWS regions, making sure it works without being dependent on any single region.
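The idea of not depending on a single region can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the dashboard is built: `read_status`, the two source functions, the status string, and the use of "us-west-2" as a second region are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Optional


def read_status(sources: Dict[str, Callable[[], str]]) -> Optional[str]:
    """Try each region's status source in order; return the first that answers."""
    for region, fetch in sources.items():
        try:
            status = fetch()
        except Exception:
            # This region's backend is unreachable; fall through to the next.
            continue
        return f"{status} (via {region})"
    return None  # every region failed


def failing_source() -> str:
    # Stand-in for a status store in the region that is down.
    raise ConnectionError("storage backend unavailable")


def healthy_source() -> str:
    # Stand-in for a replica of the status data held in another region.
    return "S3: increased error rates"


print(read_status({"us-east-1": failing_source, "us-west-2": healthy_source}))
```

With only the failing source configured, `read_status` returns `None`, which mirrors the situation described above, where the dashboard had nothing to fall back on.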

Amazon ends the note by apologizing for the impact of the event on its customers. "While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their business."

"We will do everything we can to learn from this event and use it to improve our availability even further."

The outage caused a number of websites that relied on S3 to suffer issues, as well as a number of apps that used Amazon's cloud servers for their services. Apple customers were also affected by the outage, with some users of the iOS and Mac App Stores, iCloud Drive, Notes, iCloud backup, Apple TV, and Apple Music encountering issues during the downtime.

Apple is believed to be making progress moving away from relying on Amazon for its cloud services, by creating its own data centers instead. Apple's Mesa facility is being turned into a "global command center," with the company working to establish new data centers in Ireland and Denmark.

Apple's existing Reno data center, handling Siri, FaceTime, and iMessage among other tasks, may increase its size in the future. It was recently reported Apple is planning to expand the data center by over 375,000 square feet, at a cost of around $50.7 million.

42 Comments

NY1822 620 comments · 8 Years

About 7 years ago

for some reason this made me think autonomous cars can't come soon enough...imagine all the human error that can go wrong getting behind the wheel at 50 mph

lekowsky5 2 comments · 8 Years

Am I the only one who STILL can't log into iCloud from this outage?

maestro64 5029 comments · 19 Years

Why is their system still using command line prompts? This is the reason Unix is not favored by your average IT worker: one wrong syntax error and you delete everything. Sounds like Amazon may have gotten off easy on this one.

macxpress 5846 comments · 16 Years

lekowsky5 said:

Am I the only one who STILL can't log into iCloud from this outage?

Yup...

MplsP 3956 comments · 8 Years

NY1822 said:

for some reason this made me think autonomous cars can't come soon enough...imagine all the human error that can go wrong getting behind the wheel at 50 mph

Except when some sleep deprived, slightly hungover coder at Ford makes a mistake it causes 100,000 cars to crash instead of one...