Slack outage 2021

8/26/2023

Laura Nolan of Slack recently published an excellent write-up of their Jan. 4, 2021 outage on Slack’s engineering blog. One of the things that struck me about this write-up is the contributing factors that aren’t part of this outage. There’s nothing about a bug that somehow made its way into production, or an accidentally incorrect configuration change, or how some corrupt data ended up in the database. On the other hand, it’s an outage story with multiple examples of saturation.

Saturation is a term often used by the safety science researcher David Woods: it refers to a system that is reaching the limit of what it can handle. If you’ve done software operations work, I bet you’ve encountered resource exhaustion, which is an example of saturation. Saturation plays a big role in Woods’s model of the adaptive universe. In particular, in socio-technical systems, people will adapt in order to reduce the risk of saturation.

In this post, I’m going to walk through Laura’s write-up, highlighting all of the examples of saturation and how the system adapted to it. I’m going purely from the text of the original write-up, which means I’ll likely get some things wrong here.

In the beginning, they (like, I presume, all small companies) started with a single AWS account.

In the beginning, Slack’s AWS footprint fit nicely into one account and VPC: that’s one happy cloud!

However, as Slack grew, it encountered problems. Here’s a quote from Slack’s blog post Building the Next Evolution of Cloud Networks at Slack by Archie Gunasekara:

“As our customer base grew and the tool evolved, we developed more services and built more infrastructure as needed. However, everything we built still lived in one big AWS account. Having all our infrastructure in a single AWS account led to AWS rate-limiting issues, cost-separation issues, and general confusion for our internal engineering service teams.”

That cloud isn’t looking so happy anymore.

The above quote makes reference to three different categories of saturation. The first is the traditional sort of limit we software folks think of: they were running into AWS rate limits associated with an individual AWS account. The other two limits are cognitive: the system made it harder for humans to separate out costs, and it led to confusion for internal teams. I still see these as a form of saturation: as a system gets more difficult for humans to deal with, it effectively increases the cost of using the system, and it makes errors more likely.

And so, the Slack Cloud Engineering team adapted to meet this saturation risk by adopting AWS child accounts.
