4 Instructive Postmortems on Data Downtime and Loss

0

More than a decade ago, the concept of the ‘blameless’ postmortem changed how tech companies recognize failures at scale.

John Allspaw, who coined the term during his tenure at Etsy, argued postmortems were all about controlling our natural reaction to an incident, which is to point fingers: “One option is to assume the single cause is incompetence and scream at engineers to make them ‘pay attention!’ or ‘be more careful!’ Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.”

What can we, in turn, learn from some of the most honest and blameless—and public—postmortems of the last few years?

What happened: Back in 2017, GitLab experienced a painful 18-hour outage. That story, and GitLab’s subsequent honesty and transparency, has significantly impacted how organizations handle data security today.

The incident began when GitLab’s secondary database, which replicated the primary and acted as a failover, could no longer sync changes fast enough due to increased load. Assuming a temporary spam attack created said load, GitLab engineers decided to manually re-sync the secondary database by deleting its contents and running the associated script.

When the re-sync process failed, another engineer tried the process again, only to realize they had run it against the primary.

What was lost: Even though the engineer stopped their command in two seconds, it had already deleted 300GB of recent user data, affecting GitLab’s estimates, 5,000 projects, 5,000 comments, and 700 new user accounts.

How they recovered: Because engineers had just deleted the secondary database’s contents, they couldn’t use it for its intended purpose as a failover. Even worse, their daily database backups, which were supposed to be uploaded to S3 every 24 hours, had failed. Due to an email misconfiguration, no one received the notification emails informing them as much.

In any other circumstance, their only choice would have been to restore from their previous snapshot, which was nearly 24 hours old. Enter a very fortunate happenstance: Just 6 hours before the data loss, an engineer had taken a snapshot of the primary database for testing, inadvertently saving the company from 18 additional hours of lost data.

After an excruciatingly slow 18 hours of copying data across slow network disks, GitLab engineers fully restored service.

The deeper you diagnose what went wrong, the better you can build data security and business continuity systems that address the long chain of unfortunate events that might cause failure again.

Read the rest: Postmortem of database outage of January 31.

What happened: One morning in the summer of 2023, this one-person backup service went completely offline.

Tarsnap is run by Colin Percival, who’s been working on FreeBSD for over 20 years and is largely responsible for bringing that OS to Amazon’s EC2 cloud computing service. In other words, few people better understood how FreeBSD, EC2, and Amazon S3, which stored Tarsnap’s customer data, could work together… or fail.

Colin’s monitoring service notified him the central Tarsnap EC2 server had gone offline. When he checked on the instance’s health, he immediately found catastrophic filesystem damage—he knew right away he’d have to rebuild the service from scratch.

What was lost: No user backups, thanks to two smart decisions on Colin’s part.

First, Colin had built Tarsnap on a log-structured filesystem. While he cached logs on the EC2 instance, he stored all data in S3 object storage, which has its own data resilience and recovery strategies. He knew Tarsnap user backups were safe—the challenge was making them easily accessible again.

Second, when Colin built the system, he’d written automation scripts but had not configured them to run unattended. Instead of letting the infrastructure rebuild and restart services automatically, he wanted to double-check the state himself before letting scripts take over. He wrote, “‘Preventing data loss if something breaks’ is far more important than ‘maximize service availability.'”

How they recovered: Colin fired up a new EC2 instance to read the logs stored in S3, which took about 12 hours. After fixing a few bugs in his data restoration script, he could “replay” each log entry in the correct order, which took another 12 hours. With logs and S3 block data once again properly associated, Tarsnap was up and running again.

Read the rest: 2023-07-02 — 2023-07-03 Tarsnap outage post-mortem

What happened: Around Halloween 2021, a game played by millions every day on an infrastructure of 18,000 servers and 170,000 containers experienced a full-blown outage.

The service didn’t go down all at once—a few hours after Roblox engineers detected a single cluster with high CPU load, the number of online players had dropped to 50% below normal. This cluster hosted Consul, which operated like middleware between many distributed Roblox services, and when Consul could no longer handle even the diminished player count, it became a single point of failure for the entire online experience.

What was lost: Only system configuration data. Most Roblox services used other storage systems within their on-premises data centers. For those that did use Consul’s key-value store, data was either stored after engineers solved the load and contention issues or safely cached elsewhere.

How they recovered: Roblox engineers first attempted to redeploy the Consul cluster on much faster hardware and then very slowly let new requests enter the system, but neither worked.

With assistance from HashiCorp engineers and many long hours, the teams finally narrowed down two root causes:

After fixing these two bugs, the Roblox team restored service—a stressful 73 hours after that first high CPU alert.

Read the rest: Roblox Return to Service 10/28-10/31, 2021

What happened: A few days before Thanksgiving Day 2023, an attacker used stolen credentials to access Cloudflare’s on-premises Atlassian server, which ran Confluence and Jira. Not long after, they used those credentials to create a persistent connection to this piece of Cloudflare’s global infrastructure.

The attacker attempted to move laterally through the network but was denied access at every turn. The day after Thanksgiving, Atlassian engineers permanently removed the attacker and took down the affected Atlassian server.

In their postmortem, Cloudflare states their belief the attacker was backed by a nation-state eager for widespread access to Cloudflare’s network. The attacker had opened hundreds of internal documents in Confluence related to their network’s architecture and security management practices.

What was lost: No user data. Cloudflare’s Zero Trust architecture prevented the attacker from jumping from the Atlassian server to other services or accessing customer data.

Atlassian has been in the news for another reason lately—their Server offering has reached its end-of-life, forcing organizations to migrate to Cloud or Data Center alternatives. During or after that drawn-out process, engineers realize their new platform doesn’t come with the same data security and backup capabilities they were used to, forcing them to rethink their data security practices.

How they recovered: After booting the attacker, Cloudflare engineers rotated over 5,000 production credentials, triaged 4,893 systems, and reimaged and rebooted every machine. Because the attacker had attempted to access a new data center in Brazil, Cloudflare replaced all the hardware out of extreme precaution.

Read the rest: Thanksgiving 2023 security incident

These postmortems, detailing exactly what went wrong and elaborating on how engineers are preventing another occurrence, are more than just good role models for how an organization can act with honesty, transparency, and empathy for customers during a crisis.

If you can take a single lesson from allthese situations, someone in your organization, whether an ambitious engineer or an entire team, must own the data security lifecycle. Test and document everything because only practice makes perfect.

But also recognize that all these incidents occurred on owned cloud or on-premises infrastructure. Engineers had full access to systems and data to diagnose, protect, and restore them. You can’t say the same about the many cloud-based SaaS platforms your peers use daily, like versioning code and managing projects on GitHub or deploying lucrative email campaigns via Mailchimp. If something happens to those services, you can’t just SSH to check logs or rsync your data.

As shadow IT grows exponentially—a 1,525% increase in just seven years—the best continuity strategies won’t cover the infrastructure you own but the SaaS data your peers depend on. You could wait for a new postmortem to give you solid recommendations about the SaaS data frontier… or take the necessary steps to ensure you’re not the one writing it.

LEAVE A REPLY

Please enter your comment!
Please enter your name here