What Comes Next After AWS Disruption

On Tuesday, on what was supposed to be AWS Innovation Day at re:Invent 2021, Amazon Web Services instead faced yet another regional outage that affected large segments of the internet. Analysts at Forrester and Gartner say that while the problem was significant, it is not a reason to roll back migration to the cloud, nor would doing so be realistic.

According to AWS updates, the cause of the outage was mostly resolved after about seven hours. Service recovery continued after that. Beyond questions of how this could happen, attention turns to what systemic failures of this scale mean in a world where the cloud is dominated by a small group of hyperscalers.

AWS noted that the most recent outage was caused by an impairment of several network devices in its Northern Virginia region in the US. The outage hit EC2, DynamoDB, Athena, and Chime, as well as other AWS APIs and services. This caused issues and downtime for third parties such as Disney Plus and Netflix. It also affected Amazon’s own operations, including its package delivery management software and the Alexa virtual assistant.

If this sounds a bit like déjà vu, it should. About one year ago, in late November 2020, AWS’ US-East-1 region experienced an outage that the company attributed to problems that arose as more capacity was added to the front-end servers for its Kinesis data streaming service.

Sid Nag, vice president of cloud services and technology research at Gartner, says that while the frequency of these cloud outages hasn’t necessarily increased, their overall impact has. “This was one of the largest outages since AWS started doing business.”

Mission-critical applications are more sensitive

Back when organizations mostly ran non-critical applications in the cloud, outages could be weathered more easily. Nag says that the migration to the cloud means that more mission-critical applications are vulnerable to such disruptions. “The cloud is a multi-tenant model,” he says. “Many different organizations were affected, not just IT services.” For example, the recent outage also cut off Amazon Prime Video and the Ring home monitoring service. “We’re seeing a bigger impact because of the reliance on the cloud,” says Nag.

The consolidation of the cloud landscape has placed the responsibility for maintaining this resource on the shoulders of a shrinking group of service providers. This concentration may be a concern. When these providers are affected, Nag says, it is almost a “too big to fail” situation. “This kind of thing worries me,” he says.

In addition to wanting to see greater architectural flexibility across data centers, he says it may be time for major cloud providers to work together and carry each other’s traffic during large-scale outages. “They don’t do that today,” says Nag.

He adds that there are competitive business reasons that prevent this from happening, but there may come a time when providers either act on their own or are compelled to by some form of regulation. “Cloud providers are getting too big; they can’t go down and crash the world around them for 24 to 48 hours. Unacceptable.”

If major cloud providers don’t adopt such a strategy, Nag says there could be a way for these providers to create ecosystems of smaller cloud providers as their backups. There may also be a way to use edge computing solutions to power the distributed cloud as another alternative, he says.

Hyperscalers have a different risk profile

Brent Ellis, senior analyst at Forrester, says hyperscalers have a different risk profile than other data centers because the complexity of their environments can cause problems to cascade. “A local problem can spread very quickly,” he says.

Outages are not just an issue for AWS. Other hyperscalers, such as Microsoft Azure and Google Cloud, have had their share of outages and issues to address, Ellis says. In some cases, an interruption may be caused by a mistyped command. However, human error shouldn’t be an issue, he says, if greater automation is deployed properly. He still sees great value in cloud adoption, but organizations must also think about how they can mitigate risks. Trying to move back to on-premises data centers can be more difficult than expected. “Once you start migrating in bulk, it’s hard to replicate that infrastructure,” says Ellis.

As systems and cloud infrastructure become more interconnected, he says outages could mean organizations will have to wait until the issue is resolved. “You can’t do much,” Ellis says. “There is a reason why everything is measured in nines.”
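To put the "nines" remark in concrete terms, the short Python sketch below converts an availability target into a yearly downtime budget and shows where a single outage of about seven hours (the figure cited above) would leave a service. It is an illustration only: the function names are hypothetical, and it assumes a 365-day year and no other downtime during that year. It does not reflect any AWS service-level agreement.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (3 means 99.9%)."""
    unavailability = 10 ** (-nines)
    return HOURS_PER_YEAR * unavailability


def availability_after_outage(outage_hours: float) -> float:
    """Yearly availability if the only downtime is one outage of this length."""
    return 1 - outage_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_hours(n):.2f} hours of downtime per year")
    # Assumption: one seven-hour outage and nothing else all year.
    print(f"after a 7-hour outage: {availability_after_outage(7):.4%}")
```

Under these assumptions, three nines allows roughly 8.76 hours of downtime per year, so one seven-hour outage uses most of that budget and puts four nines out of reach for the year.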

He adds that the consolidation of cloud resources increases risk, which can be a major concern in a country where so much of the economy relies on hyperscalers. “When a very large data center goes down, it affects tens of thousands, if not more, companies at the same time,” Ellis says.
