AWS outage: As services return to normal, the inquest is only just beginning

Amazon reports that yesterday’s huge Amazon Web Services (AWS) outage that “broke the internet” has now been fixed.

The outage took out dozens of popular apps and websites, from Amazon’s own Alexa to social media favourites like Reddit and Snapchat, and more prosaic online services such as HMRC in the UK, demonstrating the sheer extent of AWS’ reach into the global online infrastructure.

The problems started at around 7.40am BST on Monday, October 20 when a large spike on Downdetector showed reported problems with Amazon Web Services – which in turn took down the hundreds of services that rely on its cloud computing power.

The issues were apparently caused by a simple Domain Name System (DNS) error, but after hours of wrangling the AWS dashboard announced at 11.53pm BST last night that “all AWS returned to normal operations”.

The disruption over the course of the day was global, with over 1,000 businesses impacted, and many are still dealing with the knock-on effects.

Amazon has also revealed that its attempts to fix the problem actually caused other services to fail in the short term. Its status page noted yesterday that “after resolving the DynamoDB DNS issue, services began recovering but we had a subsequent impairment in the internal subsystem of EC2 that is responsible for launching EC2 instances due to its dependency on DynamoDB.”

The inability to launch EC2 instances meant Amazon’s foundational rent-a-server offering was degraded, a significant issue because many users rely on the ability to automatically create servers as and when needed.

While Amazon engineers tried to get EC2 working properly again, “Network Load Balancer health checks also became impaired, resulting in network connectivity issues in multiple services such as Lambda, DynamoDB, and CloudWatch.”

AWS said it recovered Network Load Balancer health checks at about 5.38pm BST, but “temporarily throttled some operations such as EC2 instance launches, processing of SQS queues via Lambda Event Source Mappings, and asynchronous Lambda invocations.”

Thankfully, “over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered,” with everything seemingly back to normal between 11pm and midnight, UK time.
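For readers unfamiliar with the term, throttling of the kind described in AWS’ statement can be pictured as a rate limiter whose allowance is raised in steps as a system recovers. The sketch below is a generic token bucket in Python, not AWS’ actual mechanism; the class name, rates and timestamps are all hypothetical and chosen purely for illustration.

```python
class TokenBucket:
    """Illustrative rate limiter: each request spends one token."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum tokens that can be stored
        self.tokens = capacity        # start with a full bucket
        self.last = 0.0               # time of the previous check, in seconds

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def set_rate(self, rate_per_sec):
        """Loosen (or tighten) the throttle, e.g. as recovery progresses."""
        self.rate = rate_per_sec


# Hypothetical usage: a tight limit during recovery, relaxed later.
bucket = TokenBucket(rate_per_sec=1.0, capacity=2)
during_recovery = [bucket.allow(0.0) for _ in range(3)]
bucket.set_rate(10.0)
after_relaxing = bucket.allow(0.5)
```

In this toy run the third back-to-back request is refused while the limit is tight, and requests pass again once the rate has been raised and a little time has elapsed.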

Unsurprisingly, the post-mortem has now begun, not least with questions from Europe about why we are so reliant on US tech for our day-to-day existence: “My robot vacuum cleaner no longer works and can someone explain why a robot in Paris is linked to U.S. East? Talk about European digital sovereignty…” mused Ulrike Franke, senior fellow at the European Council on Foreign Relations, on Bluesky.

A little closer to home, Dr Aybars Tuncdogan, reader in digital innovation and information security at the King’s Business School, noted: “[The issue] is tech monoculture. We are building global infrastructure with very little diversity in platforms or providers. That’s why we are seeing systemic failures: Amazon Web Services now, a multi-airport outage a few weeks ago, CrowdStrike last year. It’s like agricultural monoculture – when everything relies on a single strain, one disease can wipe out entire plantations, because they all have the same genetics.

“We need to diversify our technology infrastructure. Customers can design redundancy (i.e., systems that come online when something goes wrong) using on-premise failover or alternative providers. However, this can also be achieved by the providers themselves, such as by developing different competing infrastructures within their ecosystems.

“This incident will likely be resolved quickly. However, unless we rethink the architecture (that is, we decentralise and diversify), we should expect more outages of this scale, whether from glitches or targeted attacks.”
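The redundancy Dr Tuncdogan describes, where an alternative system comes online when the main one fails, can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the function names and the two stand-in “providers” are hypothetical, and a real setup would also need health checks, timeouts and data replication.

```python
from typing import Callable, Sequence


def call_with_failover(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # any failure moves us on to the next provider
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} providers failed")


# Hypothetical stand-ins for a primary cloud region and a backup elsewhere.
def primary_provider() -> str:
    raise ConnectionError("primary region unreachable")


def backup_provider() -> str:
    return "served by backup"


result = call_with_failover([primary_provider, backup_provider])
```

Here the call to the primary fails, so the request is answered by the backup instead; if every provider in the list failed, the function would raise an error rather than return.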

Mona Schroedel, a specialist data protection lawyer at Manchester law firm Freeths, added: “This is of course not the first major outage we have experienced in recent memory. Only a little over a year ago a Microsoft outage caused airports and banks to grind to a halt.

“Modern life, especially after the pandemic, has become dependent on virtual connectivity and systems. It isn’t that long ago that most people carried cash and would have been perfectly able to bridge a banking issue without complications. However, nowadays cashless payments are the norm and most of us don’t habitually carry cash anymore.

“As with the law in this area, the need for practical review and adjustments just cannot keep up with the speed of the advancement. That leaves end users vulnerable to be negatively impacted if the few big providers are targeted or have a technical issue. More ought to be done to ensure that there are (a) backup systems for critical services and (b) that the practical aspects of our modern convenient virtual life are reviewed and regulated.”

Amazon may have solved its problems for now, but for both the tech giant and its many, many users, it looks like discussion of the bigger problem may only just be beginning.



