What we know:
- Amazon Web Services experienced an outage on Tuesday that caused other apps, including Netflix, Coinbase, and Tinder, to experience slowdowns.
- An internal AWS analysis pointed to traffic congestion across multiple network devices in the Northern Virginia region.
- But employees are still trying to pinpoint the root cause of the problem and appear puzzled by the depth of the outage, which has affected all parts of AWS’ global operations in some capacity.
As Amazon Web Services experienced one of the largest outages in company history on Tuesday, more than 600 employees joined an emergency conference call to assess the root cause of the service disruption.
The main culprit: a sudden increase in traffic that caused congestion across multiple network devices in Northern Virginia, the biggest region for AWS data centers.
The company had initially pegged the “root cause” of the outage on “a problem with several network devices within the internal AWS network,” according to a screenshot of an internal AWS communique from Tuesday morning obtained by Insider. “Specifically, these devices are receiving more traffic than they are able to process, which is leading to elevated latency and packet loss for the traffic traversing them.”
The problems were ongoing as of Tuesday afternoon and have resulted in hours of service disruptions across the web, causing some of the world’s biggest online services, including Disney Plus, Netflix, and even Amazon’s own e-commerce store, to experience widespread glitches and slowdowns. The list of companies that saw outages on Tuesday includes Spotify, Zoom, and Airbnb, to name a few.
While the outage originated in Northern Virginia, it has disrupted all parts of AWS’ global operations in some capacity. Moreover, Amazon’s retail and delivery networks, which rely on AWS’ tools, were in some cases brought to a screeching halt.
The outage snarled Amazon’s internal warehousing and logistics operations in the midst of the holiday shopping season. Some warehouse workers and drivers were sent home as the company’s internal communications, delivery routing and monitoring systems stalled.
The network issue “specifically impacted” Amazon’s internal DNS servers. As of 2:04 p.m. Seattle time, the company did not have an estimate on when the system would be fully operational, according to a message on the public AWS status console.
A separate internal note said “firewalls are being overwhelmed by an as of yet unknown source,” adding that the AWS networking teams are working on “blocking the traffic from the top talkers/offending hosts at the firewall.”
Activity from Amazon’s real-time digital advertising auction may be responsible for much of the traffic overwhelming the firewall, according to internal Slack messages seen by Insider.
In an email to Insider, an Amazon spokesperson said, “There is an AWS service event in the US-East Region (Virginia) affecting Amazon Operations and other customers with resources running from this region. The AWS team is working to resolve the issue as quickly as possible.”
Even inside AWS, however, information on the outage remains sketchy. As engineers and executives worked to decode the issue on a 600-person conference call, led by AWS’ VP of infrastructure Peter DeSantis, rumors spread among staff. One AWS employee speculated that the outage was caused by an “orchestrated DNS attack,” while another employee downplayed those concerns, saying it was more of an “internal thing” related to networking and firewall saturation.
“It’s the fog of war,” said an AWS manager.
In a message sent just before 2:00 p.m. Pacific Time, the company’s internal communications team told employees it was “beginning to see significant recovery for AWS service availability in the US-EAST-1 Region.” The division’s “most senior engineers” are continuing to monitor the issue, including “identifying the specific traffic flows that were leading to congestion within these devices,” the note said.