Site Reliability Engineer

Engineering

Engineers in this role maintain the reliability and performance of AI infrastructure at scale, spending their days on incident response, automation, and observability across the distributed systems that power AI workloads. They differ from software engineers by focusing on operational excellence and system resilience rather than feature development, and from DevOps roles by owning broader platform-level reliability goals. These teams typically sit within infrastructure or platform organizations and partner closely with product engineering teams to keep AI services fast, secure, and highly available across multiple regions.

$ titles --canonical

Site Reliability Engineer, Senior SRE, Staff SRE, Production Engineer, Reliability Engineer, Infrastructure SRE

Open Jobs: 116

Companies Hiring: 43

$02_

Skills

What companies are looking for in this role.

$ skills --core

Designing and implementing monitoring, alerting, and observability systems across distributed infrastructure (95%)

Managing incident response processes including root cause analysis and postmortem facilitation (95%)

Automating operational tasks and building infrastructure-as-code deployment solutions (95%)

Defining, implementing, and tracking Service Level Objectives (SLOs) and Service Level Indicators (SLIs) (90%)

Troubleshooting and debugging production issues across complex system stacks (85%)

Operating and maintaining stateful storage and database systems at scale (85%)

Managing multi-cloud and multi-region infrastructure deployment and operations (85%)

Building CI/CD pipelines and managing deployment processes for reliable releases (85%)

Reducing Mean Time To Recovery (MTTR) through tooling, runbooks, and automation (80%)

Optimizing system performance, architecture, and scaling for maximum uptime and minimal latency (80%)

Participating in on-call rotations and maintaining incident escalation paths (80%)

Capacity planning and resource optimization for infrastructure scaling (80%)

Understanding Linux operating system internals, networking concepts, and system-level optimization (75%)

Designing self-healing and resilient systems that respond automatically to failure scenarios (75%)

Leading production readiness reviews and reliability standards enforcement across teams (75%)

$ skills --emerging

Applying AI and machine learning techniques to improve incident detection and operational efficiency (70%)

Building predictive maintenance and anomaly detection systems for infrastructure health (65%)

Maintaining observability for machine learning model-serving workloads and inference infrastructure (60%)

Developing agentic tooling and AI-driven automation for operational workflows (60%)

$ skills --soft

Conducting thorough, blameless postmortems and driving preventative improvements (80%)

Collaborating with product and engineering teams on architectural reliability improvements (80%)

Building developer tooling and empowering developer productivity through infrastructure improvements (70%)

Mentoring engineering teams and establishing reliability as a core organizational value (70%)

Communicating complex technical concepts and driving technical decision-making across stakeholders (70%)

Balancing long-term infrastructure strategic goals with immediate engineering needs (65%)

$03_

Technology

The tools and technologies that define this role.

$ tech --language

Go (high)

Python (high)

Java (moderate)

$ tech --framework

CUDA (low)

$ tech --platform

AWS (very high)

Kubernetes (very high)

Linux (very high)

Azure (high)

Docker (high)

Google Cloud Platform (GCP) (high)

NVIDIA GPU (moderate)

ELK Stack (low)

MongoDB Atlas (low)

$ tech --tool

Terraform (very high)

Datadog (high)

Ansible (moderate)

ArgoCD (moderate)

GitHub Actions (moderate)

Grafana (moderate)

Helm (moderate)

Jenkins (moderate)

PagerDuty (moderate)

Prometheus (moderate)

Pulumi (moderate)

Argo Workflows (low)

Coralogix (low)

Crossplane (low)

Sentry (low)

Wiz (low)

$ tech --concept

Distributed systems (high)

DNS (moderate)

GitOps (moderate)

Machine Learning infrastructure (moderate)

TCP/IP (moderate)

TLS/SSL (moderate)

$04_

Open Jobs

116 open Site Reliability Engineer jobs across 43 companies.

DataHub

DevOps

Engineering

Bengaluru, Karnataka, India

Posted 4d ago

Lambda

Senior Site Reliability Engineer - SDN

Engineering

San Francisco Office (Fremont St)

Posted 5d ago

Together AI

Lead/Manager Site Reliability Engineering Team (Amsterdam)

Engineering

Amsterdam

Posted 1w ago

StackBlitz

Staff Site Reliability Engineer

Engineering

Remote

Posted 1w ago

Nebius

Staff Network Site Reliability Engineer

Engineering

United States

Posted 1w ago

Tabs

Staff Site Reliability Engineer

Engineering

New York City, NY

Posted 2w ago

Nebius

Senior Site Reliability Engineer (In-Office Required)

Engineering

New York City, New York, United States

Posted 2w ago

Thinking Machines Lab

Reliability Engineer, Supercomputing

Engineering

San Francisco

Posted 2w ago

Synthesia

Senior Site Reliability Engineer

Engineering

US Remote

Posted 3w ago

MongoDB

Site Reliability Engineer (Senior or Staff)

Engineering

Toronto

Posted 3w ago

MongoDB

Site Reliability Engineer (Senior or Staff)

Engineering

Boston; Miami; New Jersey; New York City; Princeton; Raleigh; Washington DC

Posted 3w ago

Nabla

SRE / Backend Engineer

Engineering

New York office

Posted 3w ago

Crusoe

Production Engineer (Kubernetes)

Engineering

Dublin - IE

Posted 3w ago

Databricks

Staff Technical Solution Engineering

Engineering

McLean, Virginia

Posted 4w ago

Databricks

Sr Technical Solutions Engineering

Engineering

McLean, Virginia

Posted 4w ago

Nscale

Principal Site Reliability Engineer - AI Infrastructure Operations

Engineering

Houston; New York; San Francisco; Seattle

Posted 4w ago

Gong

Senior DevOps Engineer

Engineering

Tel Aviv

Posted 1mo ago

Nebius

Senior Site Reliability Engineer (DevTools)

Engineering

Amsterdam, Netherlands; Germany; Israel; London, United Kingdom; Remote - Europe; United Kingdom

Posted 1mo ago

Crusoe

Staff Network Engineer, Operations

Engineering

San Francisco, CA - US

Posted 1mo ago

$ roles --related --function=engineering

Other Engineering roles

Software Engineer

General-purpose software engineering roles focused on building and maintaining software systems. Covers generalist SWE positions that don't clearly fall into frontend, backend, fullstack, or other specialized tracks.

Backend Engineer

Engineers focused on server-side systems, APIs, services, and data processing pipelines. Includes roles explicitly labeled as backend or server-side development.

Frontend Engineer

Engineers specializing in user-facing interfaces, web applications, and client-side development. Includes UI/UX engineering and web development roles.

Fullstack Engineer

Engineers working across the entire application stack, handling both frontend and backend responsibilities.

Infrastructure & Platform Engineer

Engineers building and maintaining internal platforms, cloud infrastructure, compute systems, and developer tooling.