Jun 23, 2026
Disclaimer: We, ngrok, have sponsored Mac to write this post because we think it's an underexplored perspective on the topic of reliability. We're glad to have the opportunity to give writers the space and time to do this, but the opinions are Mac's, not the company's. Enjoy!
Picture yourself traveling back to August 7th, 1996. Close your eyes and imagine a world where tensions are high with Russia, with China, and in the Middle East; people are concerned about a tech bubble; and bell-bottoms are back in style. Difficult to imagine, I know.
Open your eyes, you're in 1996 now. You probably just got back from work or school, hoping to unwind. Maybe you put something on the stereo, still clinging to the waning grunge era. You sit down in your squeaky desk chair and are welcomed by the Windows 95 boot screen. But this time, when you try to connect to America Online, rather than seeing your email inbox, info about popular sitcoms, or NASA announcing evidence of life on Mars, instead you see:
America Online was down, and it would stay down for 19 hours. It pushed that news of life on Mars right off the front page of the New York Times.
Now, technically this outage shouldn't have been that notable. America Online went down for maintenance regularly. This regular maintenance was what triggered the outage in the first place. There was even a similar outage during peak hours a few months prior that didn't make the news at all (I only found out about it through oral history which I'll get into later). Why did this one make the front page?
At that time, the world was joining the internet in droves. The number of people online was beginning to hockey-stick. My theory is that we had clearly passed some kind of inflection point where the internet was starting to become integral to our daily lives. And we humans really don't like being reminded of the fragility of things we depend on.
As someone who works in the field of site reliability engineering (SRE), I became a little obsessed with researching this outage. It was essentially the first example of people outside of the industry realizing how important it is for internet stuff to keep running. And that collective desire is what keeps me employed.
So what does this 30-year-old outage have to do with today? I think it can teach us a lot about the way we experience outages, the economic forces we're subject to, and how the modern field of site reliability engineering should account for that. This article is my chance to write a more human postmortem, one that asks more than just five "whys" and digs into our messy techno-social reality that isn't captured by golden signals and SLOs.
If I'm going to write a postmortem, I guess I should start with some technical details. Contemporary reporting just has statements from spokespeople and pundits, so the language used is pretty vague. If I wanted to track down something more specific, I needed to talk to an AOL employee. I found some old financial documents from AOL on archive.org, which listed all of the board members, executives, and VPs in 1996, and that's where I found the VP of Operations: Matt Korn. All I could find was his LinkedIn, so I signed up for (and immediately cancelled) LinkedIn Premium so I could send him a message. And he responded!
He sent a lovely message where he mentioned digging up his old paper calendars from 1996 to jog his memory! The things I'd do to see those calendars in person… Anyway, he didn't have technical notes on the August outage but did have notes about a similar (never-before-reported-on) outage in May. He found it odd that no newspaper wasted a single inch of column space on the May outage, but they were suddenly all over the August outage.
In case you're curious, he said the May outage happened at Westwood Center Drive, the location of the old AOL headquarters. Only one phase of the three-phase power feed cut out, which meant the generators didn't notice the power was out and so didn't kick on, knocking the whole datacenter out once the batteries drained. Funnily enough, I had a similar thing happen to me a few years ago. I guess generator manufacturers need to do some postmortems of their own.
But all he remembered about the August 7th outage is that the system went down for maintenance and didn't come back online properly. Eventually they improved the system so it didn't need to be taken down for maintenance anymore, silently resolving the original issue. How mundane, right?
I could have kept searching for other AOL employees, but I started to realize: why was I so focused on the technicals? Here we have an event of national interest, where millions of people each have a story to tell about it, and I'm focused on what was happening inside one building in Virginia?
The thing that first got me interested in this 1996 AOL outage was a CBS News video where they went to an internet cafe and interviewed people affected by it. The impact varied widely: one company was unable to launch a new product, one person was just bored, and another person lost "a potential relationship." Maybe that last one was wishful thinking, but the uniqueness of these perspectives was starting to teach me something.
In search of more of these unique perspectives, I started looking up old websites from 1996. I thought putting your whole life online didn't exist until the 2000s, but there are quite a few fascinating people who were publishing diary entries online at that time. On the day of the outage, one person was busy working on a piece of the Hubble Telescope and healing from a back injury. Another person was visiting his wife's side of the family in China at the time and so, unfortunately for me, wrote about visiting the grave of his father-in-law rather than about his internet service provider.
The one person I could find who mentioned an outage at all that day was someone named Steve Schalchlin. Steve started his online diary in March 1996 because he was dying. His early posts discuss his declining health and updates about his viral load and T-Cell count, because he had acquired a certain immunodeficiency syndrome that you've probably heard of. In early 1996, your best bet was a drug that delayed your death by about one extra year compared to the control group, not unlike the name of his blog, "Bonus Round."
What does this have to do with the internet? Well, a mere four months prior to the AOL outage, Steve was using his internet connection to browse an online bulletin board service (BBS) for people with AIDS. That website is where he learned about a newly-approved antiretroviral drug called Crixivan. Fast-forward two months and his viral load (the amount of HIV RNA detected in your bloodstream) fell from 60,000 to under 100, putting him into the range of "viral suppression" and onto the road to recovery. Fast-forward 30 years and he's still posting regular updates. Would his life be the same if an outage had happened a little earlier? Perhaps that BBS post would have been bumped off the front page by the time he checked. Maybe he would have decided the outage was the last straw, and that the monthly fee was better spent paying back friends and family. If this alternate reality had happened, would we even hear about it?

Having just skimmed through someone's literal life story where the outage was (thankfully) just a footnote, I realized that we SREs often focus on the technology as the protagonist of a story where the people affected are reduced to statistics. We've got it all backwards. In reality, outages insert themselves into our unique lives and can be anything from benign to catastrophic. But with the impersonality of massive internet services like we have today, these stories rarely get told.
When I started my career, I had grand visions of building solid, reliable systems that helped thousands or millions of people. I'd put extra hours and extra care into the code I wrote since I was thinking of each and every person (like Steve) and how distraught I'd be if I ended up being the cause of a bug or outage. This was around the time of zero-interest-rate policy, so all the VC funny-money was flowing freely and I was happy receiving my small cut for all that extra care. When the music stopped, suddenly corners needed to be cut. Companies stopped asking "what can we do for our customers?" and started asking "what can we get away with?" With the tech sector being so centralized, unregulated, and depended-on, the answer to that question was "a lot." Then the term "enshittification" was coined in 2022 to put a name on this phenomenon.
Needless to say, I found this new (to me) world of glorifying lower quality work difficult to stomach. It felt like we were throwing quality and reliability overboard as ballast from the ship of capitalism. In researching the AOL outage, I was disappointed to see that enshittification was nothing new. Companies have been cutting corners since the invention of corners. For example, here are some quotes about the outage, which sound like they could easily have been said today about some startup:
"Analysts agree that online services such as AOL may be growing too fast for their own systems."
"A lot of these internet service providers that people rely on don't have that much experience with [outages] yet, and they screw things up."
"given the new nature of the medium", the CEO "could not guarantee it wouldn't happen again."
Other than when the dot-com bubble burst, AOL never really paid a price for this newsworthy outage. They were still cutting corners in 2011, churning out so much low-quality content that it allegedly gave employees panic attacks, and then got sold for $1.5B to a company famous for mass layoffs and price increases.
Why does this keep happening? Site reliability engineering is still all about money. Sure, if your reliability is lower than a competitor's, you might lose money. But you don't necessarily have to invest in reliability to fix that problem. You can just make it harder for customers to switch, or buy your competitors. In 1996, there were 3,840 internet service providers in the US. A magazine from 1996 lists 63 providers in California alone. Now you'd be lucky to find more than one in a given area.
With high switching costs and nowhere to run, suddenly those economic arguments for reliability can be turned against it. Now too much reliability is considered a waste of money. If economic arguments for reliability can so easily be used to argue for unreliability, then my answer is simple: I will totally concede the economic argument. People still deserve high-quality and reliable systems, and my job is to ensure that, even if it makes economic sense to do otherwise.
So how do we begin to argue for something that's unprofitable? For starters, you won't find an SRE textbook acknowledging that we have a highly unequal economic system which our technology reinforces. When asking my friends and family what they think about outages, a common theme I found is that many people who are solidly middle class and above see outages as a mild annoyance or sometimes a good thing, like an excuse to take a break from a white-collar job. However, some people below that line can get seriously screwed by outages, like the wifi on public transit being down, a gig work app crashing, or millions of people not getting their unemployment benefits on time during the pandemic. Guess which side of that line makes the investment decisions.
Even recovering from outages reveals our inequality. When a system goes down, it often pages a tiered system of engineers, where higher tiers make more money and get fewer calls. This may be apocryphal, but apparently during the 1996 outage, an AOL employee noticed that every 20-30 minutes, a newer and nicer set of cars came "screaming into the parking lot" until they were seeing the Ferraris and Lamborghinis of early shareholders. Oh, did I mention that AOL was being investigated at the time by the SEC for allegedly inflating profits?
So when we're cutting corners, the cost is often being borne by people already lacking economic power, on both sides of any outage. Clinging to economic arguments when we lack economic power is like bringing a knife to a gun fight. We have to create alternative narratives, ones that center the individuals over profit margins.
So how can we get more people to focus on the individual impact in an industry that was never designed for it? Or in an economic system that is allergic to it?
One idea could be borrowing the concept of "victim impact statements" from the criminal justice system (just borrowing that one concept, none of the others, no thank you!). We could ask a few impacted customers to explain from the heart how the outage impacted them, to help drive investment into reliability. Many postmortem templates already include an "impact statement", but it's written by the company who caused the outage, not by customers. Can you imagine if you were the victim of a crime and the court only asked the defendant what the impact was on you? Really this idea is just a form of strategic manipulation: playing on our human emotions and our love of stories to get people to prioritize something other than money. Maybe we can have a little manipulation, as a treat?
It's a bit of a quaint idea that I don't think will happen. A natural problem would be carving out the time to do this. SRE teams are often quite busy just staying above water, trying to keep those customers barely appeased above the switching costs.
Another idea would be to outsource this critical-but-unprofitable work to the last bastion of pure research: universities. After the next newsworthy outage, universities could send a swarm of grad students (who should be paid more btw) into action gathering these victim impact statements as raw material to hit those publishing quotas. I think it's a bit tragic that site reliability engineering is considered a STEM field. We leave out the sociology and economics at our own peril.
Other than those ideas, what are concerned SREs supposed to do then? We want to focus on individual impact even though we operate at massive scale. We want to build technological systems that are higher quality than our economic systems are willing to bear. We do want to solve hard problems and work with cool tech, but not at the price of our humanity.
Short of changing the whole system or burning ourselves out trying to fight within it, our role right now is to act as a backstop, preventing the back-sliding of reliability and quality as best we can. That could mean proposing outlandish ideas like the above, emphasizing individual impact, coordinating with coworkers to agree on standards, naming and shaming corner-cutters, and more. Creativity is the key here; we have 30 years of failed experiments we don't need to repeat. Just remember that it'll be a marathon, not a sprint. Clock in, fight like hell for the people using our systems, and clock out. The resulting compromise may not be exactly what we want, but it'll be way better than if we all just gave up. What good would a postmortem be if it concluded the status quo is fine? Let's fix it. I'll file a ticket for us.