Job Description
Roles & Responsibilities
End-to-End Reliability & Operations
- Take full ownership of availability, latency, scalability, and durability across all services and databases.
- Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for critical systems.
- Lead incident response, conduct blameless Root Cause Analyses (RCAs), and drive systemic fixes that reduce MTTD and MTTR.
- Build production readiness frameworks and establish best practices for capacity planning, deployments, rollbacks, and change management.
Database Reliability & Architecture
- Ensure the end-to-end reliability of relational databases, NoSQL databases, caching layers, and streaming platforms.
- Design highly available, multi-region architectures with robust cross-region replication and failover mechanisms.
- Formulate and implement comprehensive backup, restore, and disaster recovery (DR) strategies.
- Lead system design reviews with a strict focus on fault tolerance, scalability bottlenecks, data partitioning, and sharding.
Platform Automation & Tooling
- Build and evolve internal platforms for database provisioning, lifecycle management, and service deployment.
- Champion Infrastructure as Code (IaC) and GitOps practices to reduce operational toil through automation and self-healing systems.
- Define golden signals (latency, traffic, errors, saturation) and build comprehensive observability and tooling across the application, infrastructure, and database layers.
- Develop reusable frameworks for failover automation, chaos testing, and reliability validation.
Performance, Cost & Security
- Optimize system performance and drive cost efficiency across cloud infrastructure (compute, network, storage) and database usage (IOPS, replication, backups).
- Ensure systems comply with rigorous security and governance standards by implementing access controls, encryption (at rest and in transit), and audit logging.
The Impact You Can Create
As a Staff Engineer (IC4), you will act as a technical leader across the infrastructure, platform, and data layers. By blending Site Reliability Engineering (SRE) and Database Reliability Engineering (DBRE), you will:
- Drive the organization-wide reliability strategy and solve highly ambiguous, high-impact engineering problems.
- Influence system architecture across multiple teams, guiding product teams on resilient architecture patterns.
- Raise overall engineering standards through mentorship, design leadership, and a high degree of ownership and autonomy.