Issue #33 // June 25, 2021

Assessing an SRE Team’s Maturity
A recurring question is "is what we're currently doing 'SRE work'?" or, with a little more existential dread, "can we call ourselves SREs yet?" This article hopes to help answer it by discussing some principles fundamental to how an SRE team operates.
7 Lessons From 10 Outages
After 10 post-mortems in the first season of the podcast, "The Downtime Project", Tom and Jamie reflect on the common issues they've seen.
The Cost of 100% Reliability
Each additional 9 of reliability (e.g., moving from 99% to 99.9%) costs 10 times more to achieve. But what contributes to that cost increase?
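To make the arithmetic behind the "nines" concrete, here is a minimal sketch that is not taken from the linked article. It computes the yearly downtime budget for a given number of nines, assuming a 365-day year; the function name `downtime_minutes_per_year` is hypothetical. It shows only that each extra nine cuts the allowed downtime by a factor of ten, and says nothing about what drives the cost.

```python
# Yearly downtime budget for each additional "nine" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year, no leap day


def downtime_minutes_per_year(nines: int) -> float:
    """Return the allowed downtime (minutes/year) for `nines` nines of availability."""
    unavailability = 10.0 ** -nines  # 2 nines -> 0.01, 3 nines -> 0.001, ...
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for nines in range(2, 6):
        availability = 100 * (1 - 10.0 ** -nines)
        budget = downtime_minutes_per_year(nines)
        print(f"{availability:.3f}% -> {budget:,.1f} minutes of downtime per year")
```

Going from 99% to 99.9% shrinks the budget from about 5,256 minutes a year to about 526.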
Testing Factorio
The Factorio team (yes, the game) talks about its software development and testing practices.
The 25 Micro-Habits of High-Impact Managers
What are the little daily habits that often go unnoticed, but when linked together add up to form an incredibly strong chain between manager and direct report?
The Stack Overflow of Death. How We Lost DNS.
// experienced a 2+ hour, near system-wide outage caused by a DNS failure. This article shares what happened and what they're doing to prevent this in the future.
Practical Guide to SRE: Incident Severity Levels
Incident severity levels are a measure of the impact an incident has on the business. Classifying the severity of an issue is critical to deciding how quickly and efficiently problems get resolved.
Hosted Monitoring: Evaluating InfluxDB Cloud and Grafana Cloud
PQVST decides to move away from self-hosting Grafana and InfluxDB. This post provides a comparison between InfluxDB Cloud and Grafana Cloud.
AWS Pricing Problems Deter Cloud Engineers
AWS users old and new are regularly hit by unexpected bills, and this could harm the development of the in-demand cloud workforce.
GitLab 14.0 Released
GitLab pushed a major release that includes epic boards, a Terraform module registry, a streamlined UI, merge request reviews in VS Code, and more.