Yet another great article by one of Uber's engineering managers, Gergely Orosz, on "Operating a Large, Distributed System in a Reliable Way". In this article, Gergely gives us a high-level overview of the key themes he has identified while managing the payments system at Uber. He covers fundamental (and super interesting) concepts including Monitoring, Oncall, Anomaly Detection, Alerting, Outages, Incident Management Processes, Postmortems, Incident Reviews, a Culture of Ongoing Improvements, Failover Drills, Capacity Planning & Blackbox Testing, and more (much, much more).