// cloud.google.com
A recurring question is "is what we're currently doing 'SRE work'?" or, with a little more existential dread, "can we call ourselves SREs yet?" This article aims to help answer it by discussing some principles fundamental to how an SRE team operates.
// downtimeproject.com
After 10 post-mortems in the first season of the podcast, "The Downtime Project", Tom and Jamie reflect on the common issues they've seen.
// medium.com
Each additional 9 of reliability (e.g., moving from 99% to 99.9%) costs 10 times more to achieve. But what contributes to that cost increase?
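One way to see why each extra 9 is demanding is that the downtime allowed per year shrinks tenfold with every 9 added. The sketch below works through that arithmetic only; it assumes a 365-day year, the function name `downtime_budget_minutes` is illustrative rather than taken from the linked article, and it says nothing about the cost figure itself.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year (no leap day)


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability with `nines` nines."""
    unavailability = 10 ** -nines  # 2 nines -> 0.01, i.e. 99% availability
    return MINUTES_PER_YEAR * unavailability


# Print the yearly downtime budget for 99% through 99.999% availability.
for nines in range(2, 6):
    availability = 100 * (1 - 10 ** -nines)
    print(f"{availability:.3f}% -> {downtime_budget_minutes(nines):.1f} min/year")
```

Each row's budget is one tenth of the row before it, so every failure has to be detected and repaired in a tenth of the time.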
// review.firstround.com
What are the little daily habits that often go unnoticed, but when linked together add up to form an incredibly strong chain between manager and direct report?
// bunny.net
Bunny.net experienced a 2+ hour, near system-wide outage caused by a DNS failure. This article shares what happened and what they're doing to prevent a recurrence.
// rootly.io
Incident severity levels measure the impact an incident has on the business. Classifying the severity of an issue is critical to deciding how quickly and efficiently problems get resolved.
// pqvst.com
PQVST decides to move away from self-hosting Grafana and InfluxDB. This post provides a comparison between InfluxDB Cloud and Grafana Cloud.
// about.gitlab.com
GitLab pushed a major release that includes: epic boards, Terraform module registry, streamlined UI, merge request reviews in VS Code, and more.