SREcon 2019, March 25, 2019 to March 27, 2019, Brooklyn, United States

Title Speakers Summary Topic Types
What Breaks Our Systems: A Taxonomy of Black Swans Laura Nolan Black swan events: unforeseen, unanticipated, and catastrophic issues. These are the incidents that take our ...
Complexity: The Crucial Ingredient in Your Kitchen Casey Rosenthal Software engineering is basically rocket science, so it comes as no surprise that we can ...
Case Study: Implementing SLOs for a New Service Arnaud Lawson Implementing service level objectives (SLOs) effectively is a hard task, especially for a service which ...
Fixing On-Call When Nobody Thinks It's (Too) Broken Tony Lykke What's a team to do when they receive more than 30 pages a day, every ...
Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value Aaron Wieczorek How do you monitor systems that don't want to be monitored or ones that you ...
Keeping the Balance: Internet-Scale Loadbalancing Demystified Laura Suriar Can you explain the entire path that an IP packet takes from your users to ...
Aperture: A Non-Cooperative, Client-Side Load Balancing Algorithm Ruben Oanta Twitter's RPC framework, Finagle, employs non-cooperative, client-side load balancing. That is, clients make load balancing ...
Capacity Prediction in External Services Jerome Kraus Applications are often limited by resources in third-party external systems. As an SRE, I want ...
How Did Things Go Right? Learning More from Incidents Ryan Kitchens Solely learning from failure isn't a fundamental; it's a limitation. A look into the New View of ...
Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way Michael Palino We will look at the process for Code Yellow, the term we use for this ...
Creating a Code Review Culture Johnathan Turner Code review is one of the best ways to keep code quality high, and for ...
Benefits of Taking the Less Traveled Road with Containers Infrastructure Eduard Iacoboaia After almost a year of running OpenShift Origin we decided to migrate to a vanilla ...
The Ops in Serverless Jennifer Davis In this talk, we will examine the increased need for specialized Operations Engineering in the ...
Testing in Production at Scale Amit Gud Once frowned upon, testing in production has started to become a viable solution, especially in ...
Tackling Kafka, with a Small Team Jaren Glover This is a story about what happens when a distributed system becomes a big part ...
Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance Lynn Root Do you maintain a Rube Goldberg-like service? Perhaps it's highly distributed? Or you recently walked ...
Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest Danny Chen Loggers and tracers have become crucial components of computing systems, providing invaluable visibility into the ...
Operating within Normal Parameters: Monitoring Kubernetes Elana Hashman After Kubernetes takes over your data centers, how can you be sure that it's operating ...
SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager Jen Wohlner SRE and product management—do those even go together? Yes! In this talk, we'll go over ...
Shipping Software with an SRE Mindset Theo Schlossnagle Most SRE techniques revolve around resiliency and reliability of service delivery. Most "product" is the ...
Using PRDs and User Journeys to Design User-Friendly Tools Gwendolyn Stockman Implementing software is one core aspect of the SRE role. Often this software will be ...
SRE Classroom - How to Design a Distributed System in 3 Hours Ryan Thomas, JC van Winkel, Phillip Tischler, and Jennifer Mace Participants in this workshop will learn principles of systems design, and work in small groups ...
Migrating a Monolith to the Cloud Keyur Govande After over a decade of hosting itself in the data center, Etsy.com moved to the ...
An Introduction to GraphQL Nat Welch GraphQL is a data sharing schema from Facebook. This talk will introduce the schema, common ...
Service Discovery Challenges at Scale Ruslan Nigmatullin We'll discuss the challenges one faces while building Service Discovery at a scale of millions ...
Inside the Kube: A Guided Tour of Kubernetes Cluster Setup Liz Frost A lot of SREs are (or will soon be) responsible for Kubernetes clusters. But what ...
Getting Started with Observability Lab: OpenTracing, Prometheus, and Jaeger Kevin Crawley Building a cloud native organization without having a robust understanding of what your applications are ...
What I Wish I Knew before Going On-call Chie Shu, Dorothy Jung, and Wenting Wang Firefighting a broken system is time-sensitive and stressful but becomes even more challenging as teams ...
Running Excellent Retrospectives: Talking for Humans Courtney Neva How many awful meetings have you been to in your life, where people are talking ...
Livetweeting Tech Conferences Bridget Kromhout N/A
5 Insights from 200 SREs on How Incident Response Affects Them Jaime Parzych N/A
Distributed Systems Need Deadlines Paul A. Henry N/A
Doughnut Dilemma: A Lesson in Resource Managers Ravi Lachhman N/A
Automating SRE Work: Focusing on High-Return Customer and Business Outcomes Aniket Kulkarni N/A
Durable Disorder Anthony Sandoval N/A
The Operation Maturity Model Matthew Fornaciari N/A
"Monitoring and Alerting, Ain't Nobody Got Time for That": How USDS Bootstrapped Basic SRE Best Practices a Week before Launch at FEMA David Holmes N/A
Optimizing for Learning Logan McDonald The talk is about the most powerful observability system SREs have at their disposal: the ...
Zero to SRE Kim Schlesinger Being able to transform a junior engineer into an excellent mid, then senior engineer is ...
One on One SRE Amy Tobey When Amy started at GitHub, support for SRE principles and technical solutions were well underway. ...
Scaling SRE Organizations: The Journey from 1 to Many Teams Gustavo Franco In this talk, the author will share their experience starting new teams, splitting and moving ...
The Curse of SRE Autonomy and How to Manage It Richard Bondi Within an SRE organization, teams usually develop very different automation tools and processes for accomplishing ...
Learning from Learnings: Anatomy of Three Incidents Randy Shoup The best response to a system outage is not "What did you do?", but "What ...
Fault Tree Analysis Applied to Apache Kafka Andrey Falko At last year's SREcon, we were inspired by talks that introduced fault tree analysis. We ...
Strategies to Edit Production Data Julie Qiu At some point, we all find ourselves at a SQL prompt making edits to the ...
Madaari: Ordering for the Monkeys Ashutosh Ellupuru Lineage Driven Fault Injection (LDFI) is a state of the art technique in chaos engineering ...
Sublinear Scaling in Practice: The 1k SRE Project Nikolaus Rath At Google, one of the primary objectives of SRE teams is sublinear scaling: the size ...
Pragmatic Automation Max Luebbe Automation is great, but how do you know when the right thing to do is ...
Differences in SRE Implementations across Companies Kurt Andersen With the popularity of "SRE" as a job role, people have become aware that not ...
Latency SLOs Done Right Fred Moyer Median, average, 90th, 99th percentile. We've all seen these metrics on our monitoring systems, both ...
Extending the Error Budget Model to Security and Feature Freshness Jim Laing Everyone knows about error budgets (most every SRE at this conference, anyway) and how to ...
You Don't Have to Love Your Job Leslie Carr "Do what you love, and you'll never work another day in your life." -- someone ...
Mindfulness in SRE: Monitoring and Alerting for One's Self Tommy Lutz As SREs, we are all permanently on-call for our own well-being. Without proper monitoring and ...
Automating the Management of the Operational Health of Cloud Accounts at Scale Jamie Walls In a large scale environment where engineers are empowered to independently deliver an application from ...
Designing Resilient Data Pipelines Andrew Bolin There are a number of questions that plague any operator of a complex data pipeline. ...
From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services Salim Villavieja Artificial intelligence is all around us, from the digital assistants in our microwaves to the ...
Resilience Engineering Mythbusting Will Gallego How confident are you in your prod servers staying up without your help? Too often ...
Why Are Distributed Systems So Hard? Denise Yu Distributed systems are known for being notoriously difficult to wrangle. But why? This talk will ...