Staff Software Engineer - SRE at Uber
San Francisco, CA, US
At Uber, we ignite opportunity by setting the world in motion. We take on big problems to help drivers, riders, delivery partners, and eaters get moving in more than 600 cities around the world.
We welcome people from all backgrounds who seek the opportunity to help build a future where everyone and everything can move independently. If you have the curiosity, passion, and collaborative spirit, work with us, and let’s move the world forward, together.
About The Team
Join the Uber Site Reliability Engineering Team and help us redefine what it means to be a Site Reliability Engineer! Uber SREs are generalists with a big-picture view of how everything at Uber works together. Uber SREs partner with development and product teams throughout the company with the ultimate goal of improving the reliability of Core Products, Features, and Flows.
Uber SREs spend as much time writing code for libraries, production automation, and operational tooling as they do improving the reliability of key systems by performing architecture and design reviews, investigating system failures and complex outages, improving our monitoring infrastructure, defining service level objectives and agreements, and much more.
What You’ll Do
An Uber Staff Site Reliability Engineer is a leader and visionary for the SRE organization. They are responsible for providing technical vision and strategy, and for driving large reliability programs and efforts to completion across all of Uber. The SRE organization looks to its Staff SREs to drive our reliability roadmaps and ultimately ensure that anyone who wants to take a trip (or order food or get a package …) anywhere in the world can successfully do so at any time.
Uber Staff SREs:
- Work with Engineering Leadership and development partners to shape the architecture, design, and implementations of new and existing systems to enhance their reliability, performance, efficiency, and scalability
- Lead teams of SREs to complete major reliability goals across engineering teams and organizations
- Identify opportunities to build platforms, shared libraries, operational tooling, production automation, and more
- Develop reliability tools and frameworks for use by all engineers
- Serve as Incident Commanders for Uber’s Rapid Response and Mitigation Teams
- Drive Blameless Postmortems and Root Cause Analysis for severe outages and use iterative interrogation techniques to identify organizational issues and push for improvements
- Help Engineers embrace complexity and manage reliability in a world of thousands of microservices and hundreds of thousands of hosts
- Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring and root cause analysis.
- Drive best practices around usage of Uber infrastructure and help development teams use infrastructure more effectively.
- Enable capacity planning across the entire organization and help teams anticipate and prepare for growth.
What You’ll Need
- Grit, drive and a deep feeling of ownership.
- BS or MS in Computer Science or a related technical discipline. Equivalent practical experience is a reasonable substitute.
- A deep understanding of Linux fundamentals and internals: filesystems and modern memory management, threads and processes, the user/kernel-space divide, etc.
- A deep understanding of large-scale distributed systems in practice, including multi-tier architectures, application security, monitoring and storage systems.
- Working knowledge of the TCP/IP stack, internet routing and load balancing.