// tomrussell.co.uk
Tom Russell describes how they plan eight-figure software projects that span multiple teams and several quarters of effort while still remaining somewhat agile.
// infoq.com
Cascading failures begin when something causes a reduction in capacity, an increase in latency, or a spike in errors. The way other components of your software system respond then spreads the failure: load increases, and your backends get flattened.
// firehydrant.io
Declare and run retros for the small incidents. Decrease the time it takes to analyze an incident. Alert on pain felt by people, not computers.
// rootly.io
De-siloing the organization is a crucial part of managing reliability. This post explains why breaking down the silos that separate SREs from other teams is so important, and offers practical strategies for doing so.
// blog.cloudflare.com
Cloudflare is open-sourcing Sciuro, their replacement for node-problem-detector that has one job: synchronize Kubernetes node conditions with currently firing alerts in Alertmanager.
// ably.com
Ably explains the tradeoffs of running thousands of Docker instances without Kubernetes, and why they think Kubernetes doesn't make sense for many of the companies that have adopted it.