Applied Methods

Site Reliability Engineer

Engineers in this role maintain the reliability and performance of AI infrastructure at scale, spending their days on incident response, automation, and observability across distributed systems that power AI workloads. They differ from software engineers by focusing on operational excellence and system resilience rather than feature development, and from DevOps roles by owning broader platform-level reliability goals. These teams typically sit within infrastructure or platform organizations, partnering closely with product engineering teams to keep AI services fast, secure, and highly available across multiple regions.

$ titles --canonical
Site Reliability Engineer, Senior SRE, Staff SRE, Production Engineer, Reliability Engineer, Infrastructure SRE
Open Jobs: 105
Companies Hiring: 41
$02

Skills

What companies are looking for in this role.

$ skills --core

Designing and operating multi-cloud infrastructure using infrastructure-as-code principles

95%

Building and maintaining comprehensive monitoring, logging, and alerting systems for production infrastructure

93%

Managing and scaling Kubernetes clusters including lifecycle management, upgrades, networking, and resource orchestration

92%

Leading incident response processes, conducting root cause analysis, and driving postmortem-driven improvements

91%

Defining, implementing, and evolving Service Level Objectives (SLOs) and Service Level Indicators (SLIs)

90%

Automating operational tasks and reducing toil through infrastructure automation and tooling development

89%

Participating in on-call rotations and maintaining production system availability during operational emergencies

88%

Designing and implementing CI/CD pipelines and deployment infrastructure for reliable application delivery

88%

Operating containerized workloads and managing container runtimes in production environments

85%

Implementing GitOps workflows and managing infrastructure through declarative configuration

82%

Implementing security best practices including identity and access management, least-privilege principles, and compliance standards

80%

Operating distributed databases and data systems at scale including configuration, performance tuning, and capacity planning

78%

Managing disaster recovery strategies, backup procedures, and implementing recovery time objectives

78%

Designing network infrastructure including load balancing, service mesh, and DNS management at scale

76%

Conducting performance analysis and reliability testing of large-scale distributed systems

75%

Optimizing infrastructure costs while maintaining reliability and performance standards

70%

Implementing data pipeline reliability and managing high-throughput ingestion systems

68%

Designing and managing multi-tenant isolation strategies for shared infrastructure platforms

68%
$ skills --emerging

Analyzing measurement data and system telemetry to support engineering decision-making

72%

Managing GPU and accelerated computing infrastructure for high-performance workloads

65%

Implementing observability for AI and machine learning workloads including training job reliability

62%
$ skills --soft

Collaborating with software engineering teams to embed reliability principles into system design and deployment processes

85%

Writing and maintaining runbooks, operational procedures, and documentation for production systems

80%

Setting operational standards and quality expectations across engineering organizations

72%

Evaluating, negotiating, and managing vendor relationships for third-party services and migrations

65%
$03

Technology

The tools and technologies that define this role.

$ tech --language
Bash: high
Python: high
Go: moderate
Java: low
$ tech --framework
OpenTelemetry: moderate
$ tech --platform
AWS: very high
Kubernetes: very high
Linux: very high
Azure: high
Docker: high
Google Cloud Platform: high
ClickHouse: moderate
Kafka: moderate
Cloudflare Workers: low
MongoDB Atlas: low
Okta: low
Snowflake: low
$ tech --tool
Terraform: very high
Ansible: high
Grafana: high
Prometheus: high
Alertmanager: moderate
ArgoCD: moderate
Datadog: moderate
GitHub Actions: moderate
Helm: moderate
Jenkins: moderate
Loki: moderate
PagerDuty: moderate
pytest: moderate
Argo Workflows: low
Coralogix: low
Crossplane: low
Envoy: low
Falco: low
FluxCD: low
Opal: low
Sentry: low
Thanos: low
VictoriaMetrics: low
Wiz: low
$ tech --concept
eBPF: low
XDP: low
$04

Open Jobs

105 open Site Reliability Engineer jobs across 41 companies.

Gong·2d
Senior DevOps
Tel Aviv·Engineering
Block·4d
Senior Site Reliability Engineer
Melbourne, Australia·Engineering
Waymo·5d
Ridehailing, Site Reliability Engineer
Warsaw, Masovian Voivodeship, Poland·Engineering
RunPod·5d
Site Reliability Engineer
Remote, USA·Engineering
Replit·1w
Senior Site Reliability Engineer
Remote - Europe·Engineering
Replit·1w
Staff Site Reliability Engineer
Remote - Europe·Engineering
Databricks·1w
Site Reliability Engineer
Costa Rica·Engineering
Crusoe·1w
Senior Staff Data Center Operations Engineer, GPU Hardware Architecture
San Francisco, CA - US·Engineering
Together AI·1w
AI Infrastructure Engineer
San Francisco·Engineering
Palantir·2w
Production Engineer - Database Operations
London, United Kingdom·Engineering
Waymo·2w
Software Reliability Engineer
Mountain View, CA, USA; San Francisco, CA, USA·Engineering
Waymo·2w
Senior Site Reliability Engineer, Waymo Fleet
Mountain View, CA, USA; San Francisco, CA, USA·Engineering
Waymo·2w
Software Reliability Engineer, Waymo Fleet
Mountain View, CA, USA; San Francisco, CA, USA·Engineering
Cognition·2w
Site Reliability Engineer
San Francisco·Engineering
Lambda·2w
Senior Site Reliability Engineer - Observability
San Francisco Office (Fremont St)·Engineering
Nscale·3w
Principal Site Reliability Engineer - AI Infrastructure Operations
AMER·Engineering
Crusoe·3w
Senior Production Engineer, Operational Excellence
San Francisco, CA - US·Engineering
Graphcore·3w
Senior Systems Engineer – Performance & Reliability (Analysis)
Bristol, UK·Engineering
Graphcore·3w
Senior Systems Engineer – Performance & Reliability (Analysis)
London, UK·Engineering
Graphcore·3w
Senior Systems Engineer – Performance & Reliability
Gdańsk, Pomeranian Voivodeship, Poland·Engineering