Site Reliability Engineering

Name: Site Reliability Engineering
Author: Betsy Beyer

How Google Runs Production Systems

Paperback, English, 2016, 1st edition, ISBN 9781491929124

€ 67.46

Delivery time: approximately 16 working days

Free shipping

Summary

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.

This book is divided into four sections:
- Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
- Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
- Practices—Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
- Management—Explore Google's best practices for training, communication, and meetings that your organization can use

Specifications

ISBN13: 9781491929124

Language: English

Binding: Paperback

Number of pages: 524

Publisher: O'Reilly

Edition: 1

Publication date: 19 April 2016

Main category: Computers and information technology

Expert reviews (1)

Site Reliability Engineering

Rik Lammers | 7 September 2017

This is an important publication. Its subject is a radically different approach to IT operations, in which Google deploys highly qualified engineers instead of Level 1 operators.

Table of contents

Preface

Part 1: Introduction
1. Introduction
2. The Production Environment at Google, from the Viewpoint of an SRE

Part 2: Principles
3. Embracing Risk
4. Service Level Objectives
5. Eliminating Toil
6. Monitoring Distributed Systems
7. The Evolution of Automation at Google
8. Release Engineering
9. Simplicity

Part 3: Practices
10. Practical Alerting from Time-Series Data
11. Being On-Call
12. Effective Troubleshooting
13. Emergency Response
14. Managing Incidents
15. Postmortem Culture: Learning from Failure
16. Tracking Outages
17. Testing for Reliability
18. Software Engineering in SRE
19. Load Balancing at the Frontend
20. Load Balancing in the Datacenter
21. Handling Overload
22. Addressing Cascading Failures
23. Managing Critical State: Distributed Consensus for Reliability
24. Distributed Periodic Scheduling with Cron
25. Data Processing Pipelines
26. Data Integrity: What You Read Is What You Wrote
27. Reliable Product Launches at Scale

Part 4: Management
28. Accelerating SREs to On-Call and Beyond
29. Dealing with Interrupts
30. Embedding an SRE to Recover from Operational Overload
31. Communication and Collaboration in SRE
32. The Evolving SRE Engagement Model

Part 5: Conclusions
33. Lessons Learned from Other Industries
34. Conclusion

Appendix A: Availability Table
Appendix B: A Collection of Best Practices for Production Services
Appendix C: Example Incident State Document
Appendix D: Example Postmortem
Appendix E: Launch Coordination Checklist
Appendix F: Example Production Meeting Minutes

Index