SRE NEWSLETTER

Issue #39 // October 8, 2021

The theme of this week's issue is complexity. For those not in the trenches, site reliability sounds simple - just add some monitoring and a load balancer, right?

Facebook, the 3rd most visited site on the internet, went down, proving that even with $1 million engineers, things can go wrong. AWS is struggling to figure out how to provide usable IaC tools. And finally, GitLab and eBay provide some great details on what it takes to observe complex, distributed systems.

Why we Spent the Last Month Eliminating PostgreSQL Subtransactions
// about.gitlab.com
[Long Read] Since last June, GitLab had been mysteriously stalling for minutes at a time, leading to 500 errors. While the solution is provided (even showing the specific merge requests), the beauty of this post is its detailed walkthrough of how they tracked down the issue.
More Details About the October 4 Outage
// engineering.fb.com
This week, Facebook and its affiliated services, WhatsApp and Instagram, went down due to a faulty script and a bug in their audit tools. This is the fluffy, simplified version of the event.
Understanding How Facebook Disappeared from the Internet
// blog.cloudflare.com
When Facebook and Co. went down, Cloudflare thought it might have been something on their end. This is a more in-depth explanation of how BGP and DNS work in relation to the Facebook outage, and of how traffic on other sites was affected as everyone tried to figure out what was going on.
Do Not use AWS CloudFormation
// gswallow.medium.com
When should you use CloudFormation? According to Greg Swallow, almost never. Terraform works directly against the AWS API, which makes it faster and easier to troubleshoot.
AWS Cloud Control API, a Uniform API to Access AWS & Third-Party Services
// aws.amazon.com
AWS has announced their Cloud Control API, a set of common APIs to make it easy to manage AWS services. The key here is common, since before you'd have very different request and response structures when working with AWS services. It exposes five common verbs: CreateResource, GetResource, UpdateResource, DeleteResource, and ListResources.
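To make "common" a little more concrete, here is a minimal Python sketch of how the request shape stays the same across resource types. It is not taken from the AWS post. The helper names (create_request, update_request, read_properties) are hypothetical, and the two resource types and their properties are only illustrative choices. The sketch assumes the boto3 "cloudcontrol" client, where the desired state and resource properties travel as JSON strings and updates are expressed as a JSON Patch document.

```python
import json


def create_request(type_name: str, desired_state: dict) -> dict:
    """Build keyword arguments for a CreateResource call (hypothetical helper).

    The desired state is serialized to a JSON string, whatever the resource type.
    """
    return {"TypeName": type_name, "DesiredState": json.dumps(desired_state)}


def update_request(type_name: str, identifier: str, patch_ops: list) -> dict:
    """Build keyword arguments for an UpdateResource call (hypothetical helper).

    Changes are described as a list of JSON Patch operations.
    """
    return {
        "TypeName": type_name,
        "Identifier": identifier,
        "PatchDocument": json.dumps(patch_ops),
    }


def read_properties(client, type_name: str, identifier: str) -> dict:
    """Call GetResource on the given client and decode the returned properties."""
    response = client.get_resource(TypeName=type_name, Identifier=identifier)
    return json.loads(response["ResourceDescription"]["Properties"])


# Two unrelated services, one request structure (illustrative values only).
log_group = create_request(
    "AWS::Logs::LogGroup",
    {"LogGroupName": "demo-log-group", "RetentionInDays": 90},
)
stream = create_request(
    "AWS::Kinesis::Stream",
    {"Name": "demo-stream", "ShardCount": 1},
)
retention_change = update_request(
    "AWS::Logs::LogGroup",
    "demo-log-group",
    [{"op": "replace", "path": "/RetentionInDays", "value": 30}],
)
```

With boto3 installed and credentials configured, these dictionaries would be passed along the lines of boto3.client("cloudcontrol").create_resource(**log_group). The sketch deliberately stops short of making real calls so that it runs without an AWS account.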
Groot: eBay’s Event-graph-based Approach for Root Cause Analysis
// tech.ebayinc.com
eBay partnered with the University of Illinois Urbana-Champaign and Peking University to create a novel, event-driven and graph-based approach to conduct root cause analysis investigations. The framework was able to provide the root cause of issues significantly more accurately than traditional service-dependency mappings.
2021 Accelerate State of DevOps Report
// cloud.google.com
Unfortunately, you'll have to sign up if you want to download the DevOps report from Google's DevOps Research and Assessment team, but this article provides a short summary of the key insights: a healthy team culture mitigates burnout during challenging times, the highest performers continue to raise the bar, SRE and DevOps are complementary philosophies, cloud adoption continues to drive performance, a secure software supply chain is both essential and drives performance, and finally, good documentation is foundational for successfully implementing DevOps capabilities.