SREConAsiaAustralia 2018 June 6, 2018 to June 8, 2018, Singapore, Singapore

Title | Speakers | Summary | Topic Types
The Evolution of Site Reliability Engineering | Benjamin Purgason | Few companies invest in SRE before there is a raging operational fire on their hands. ...
Safe Client Behaviour | Ariel Goh | Ubiquitous compute power has created frequent impedance mismatches between client capabilities and server capacity in ...
Service Monitoring Manual—2018 Edition | Nikola Dipanov | Monitoring a.k.a. figuring out what production code is doing is extremely important for an SRE ...
Introduction to Alibaba Monitoring System | Ren Xinchi | We all know that monitoring is one of the most important topics in the field ...
How Atlassian Is Tackling Error Budgets, Agile Style | Gui Vieiro | Striking a balance between feature delivery and reliability is a challenge many organizations face. Error ...
Building SRE: Culture from the Outside In | Todd Palino | Many companies want to grow a site reliability engineering team, but first need to ask ...
Quantifying Empathy with Service Level Objectives | Ketan Gangatirkar | The goal of a Site Reliability Engineer is to create a reliable, scalable, performant service. ...
Doing Things the Hard Way | Chris Sinjakli | Our discipline is one of tropes and maxims—the commoditisation of infrastructure, the golden signals of ...
Achieving Observability into Your Application with OpenCensus | Emil Mikulic | Application metrics and distributed traces are immensely powerful for developers, but are difficult to automatically ...
Efficient Trouble Shooting of Service Failures with Multi-Tag Data Analysis | Xuan Cao | One of the most important works for SREs is troubleshooting the problem causing KPI degradation ...
Autonomous Workload Rebalancing in Kafka | Indrajeet Kumar | Kafka at LinkedIn processes over 3 trillion messages a day with over 2000 Kafka brokers. ...
Know Thy Enemy: How to Prioritize and Communicate Risks | Matt Brown | Every SRE team attempting to manage, mitigate or eliminate the risks facing their system will ...
Data Visualization for SREs—an Essential Skill for Quick Debugging | Yash Shah | SREs are software engineers with a broad skill set who work with systems in general. ...
You Can't Stop Fires with an Ambulance | Piers Chamberlain | SRE is often perceived as an emergency response function—dealing with incidents and restoring system health. ...
Comprehensive Container-Based Service Monitoring with Kubernetes and Istio | Fred Moyer | Operating containerized infrastructure brings with it a new set of challenges. How do you instrument ...
How to Make Releases Safer in Baidu | Pingping Chen | Changes/updates are a major source of service faults. In Baidu, around 54% of the faults ...
Cultural Nuance and Effective Collaboration for Multicultural Teams | Ayyappadas Ravindran | What is considered "good-communication" is different for different cultures. In some cultures "good communication" is ...
Call to ARMs: Adopting an arm64 Server into x86 Infrastructure | Ignat Korchagin | Over the years Cloudflare have built a huge network: today we have over 5000 servers ...
Randomized Load Balancing, Caching, and Big-O Math | Julius Plenz | Randomized load balancing is a common strategy to distribute requests across a server farm. When ...
Getting Started with Chaos Engineering | Ana Medina | Chaos engineering is the practice of conducting thoughtful, planned experiments designed to reveal weaknesses in ...
Automatic Datacenter and Service Deployments Based on Capacity Planning Artifacts | Xiaoxiang Jian | When you first built services in one data center, it is always easy to do ...
From Monitoring to Automated Testing of Your Infrastructure Code | Jesse Reynolds | Don't have time to write automated tests for your infrastructure code? Don't see the point? ...
Shopify's Move from the Data Centre to the Cloud | Scott Francis | Shopify is one of the largest commerce web sites in the world, with over 500,000 ...
Ensuring Reliability of High-Performance Applications | Anoop Nayak | This talk goes a bit beyond the traditional SRE tasks. It throws light into the ...
Managing Distributed Systems with BOSH on GCP | Ronak Banka | N/A
Data Integrity: Key to Protect One of the Most Valuable Assets of Your Company | Chongxiu Wang | N/A
Trouble in the Data Center | Dan-claudiu Dragoș | N/A
Modern IT Incident Response at Scale | Abhijit Pendyal | N/A
Moving from the First Line of Defence to Last Line of Defence | Pradeep Thangavel | N/A
Demystifying the "A" Word—Accountability | Lenin Velu | N/A
Using SSH with CA-Signed Certificates | Marlon Dutra | N/A
Relentless Reliability—Handling Hotspots | Benjamin Kaehne | N/A
Smarter Disasters: End-to-End Automation for Incidents | Karthik Nilakant | In this talk, I will discuss the different aspects of incident management and how we've ...
Debugging at Scale—Going from Single Box to Production | Kumar Srinivasamurthy | It's very easy to launch a debugger on your dev box, attach to the right ...
Productionizing Machine-Learning Services: Lessons from Google SRE | Salim Villavieja | Have you thought that your model trained on a Monday might not work on Saturday? ...
Pro Tip: Save Money on Outages by Having a Bot Do the Heavy Lifting | Cezar Guimaraes | You will learn about how to effectively run your outages and the strategy that we ...
Evolution of SRE and Rising Need of SRE Catalyzers | Isha Ganeriwal | An SRE team is responsible for availability, performance, efficiency, change management, emergency response, and capacity ...
How to Serve and Protect (with Client Isolation) | Frances Johnson | Client isolation is an important consideration for the reliability of Google Maps. We want to ...
A Tale of One Billion Time Series | Ruiyao Yao | Monitoring system is vital for service stability and availability. To support Baidu’s massive services and ...
Introduction to Linux Container Internals | Tyler Mcmullen | Software Fault Isolation, or SFI, is a way of preventing errors or unexpected behavior in ...
Automatic Traffic Scheduling for Internet Connectivity Failures | Liuqing Zhang | When it comes to high availability (HA) or user experience (UE), people often think about ...
Lessons Learned from Our Main Database Migrations at Facebook | Yoshinori Matsunobu | At Facebook, we created a new MySQL storage engine called MyRocks (https://github.com/facebook/mysql-5.6). Our objective was ...
PV Monitoring Based on Linear Regression | Wang Bo | PV (Page View) curve is one of the most important curves for SREs. Every significant ...
Do Docs Better: Practical Tips on Delivering Value to Your Business through Better Documentation | Riona Macnamara | Missing, incomplete, or stale/inaccurate documentation hurts development velocity, software quality, and—critically—service reliability. And the frustration ...
Characterizing and Understanding Phases of SRE Practices | Kurt Andersen | Site Reliability is a journey, not a destination... Participants in the site reliability field come from ...
Scaling Yourself for Managing Distributed Teams Delivering Reliable Services | Paul Greig | Distributed teams are an attractive proposition for your customers, your services, your team and you. ...
Interviewing for Systems Design Skills | Sebastian Kirsch | Google SRE has developed a special interview format called "Non-Abstract Large Systems Design" or NALSD. ...
Scaling a Distributed Stateful System: A LinkedIn Case Study | Sai Kiran Kanuri | A system is called scalable if it manages to take additional users and requests without ...
Mentoring: A Newcomer's Perspective | Leoren Tanyag | How do we make an impact to the starters in our field regardless of your ...
You Get What You Measure—Why Metrics Are Important | Kumar Srinivasamurthy | Good engineers are goal-driven. We can work relentlessly to reach a metric performance goal for ...
Blame. Language. Sharing: Three Tips for Learning from Incidents in Your Organisation | Lindsay Holmwood | “Fail fast, fail often” is a refrain heard throughout the tech industry. We’ve seen organisations ...
A Theory and Practice of Alerting with Service Level Objectives | Jamie Wilkinson | As systems grow, they get more components, and more ways to fail. The alerts of ...
Production Engineering: Connect the Dots | Espen Roth | When recruiting and onboarding new grads and others who haven't worked in site reliability, how ...
Mental Models for SREs | Mohit Suley | A mental model is an explanation of someone's thought process about how something works in ...