Staff Observability Platform Engineer

AI Infrastructure Operations | US | Posted today

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale simplifies AI development while enabling superior results, supporting strategic business outcomes such as cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency while contributing to the technology that powers the future.

About the Role

As a Staff Observability Platform Engineer, you'll play a critical role in building and evolving Nscale's observability platform, enabling deep visibility into GPU clusters, AI workloads, and the infrastructure that powers them.

You view observability as a product, not simply a collection of tools. You'll help define and implement scalable, reliable observability solutions that empower engineering teams to understand system behavior, diagnose issues quickly, and operate complex distributed systems with confidence.

You'll combine technical leadership with hands-on engineering, partnering across SRE, infrastructure, platform, and AI/ML teams to improve reliability, operational efficiency, and developer experience. You'll influence architectural decisions, establish engineering best practices, and help drive the evolution of observability capabilities across the organization.

This is a role for someone who enjoys solving difficult infrastructure problems, building platforms that scale, and helping engineering teams succeed through better visibility and operational insight.

What You'll Do

Design, build, and evolve observability platforms across metrics, logs, traces, alerting, and telemetry pipelines.
Lead the implementation of scalable observability solutions that support Nscale's growing GPU and AI infrastructure.
Partner with SRE, infrastructure, platform, and AI/ML teams to ensure observability is embedded throughout the software and infrastructure lifecycle.
Drive improvements in monitoring coverage, alert quality, service health visibility, and incident response effectiveness.
Develop standards, frameworks, and reusable patterns that simplify observability adoption across engineering teams.
Identify reliability risks and operational blind spots, helping teams proactively address them before they impact customers.
Contribute to architectural decisions around telemetry collection, storage, retention, cardinality management, and performance optimization.
Lead technical initiatives and projects that improve platform scalability, reliability, and operational efficiency.
Mentor engineers and provide technical guidance through design reviews, code reviews, and knowledge sharing.
Participate in incident investigations and postmortems, translating operational learnings into durable platform improvements.
Evaluate new observability technologies and practices, balancing innovation with operational simplicity and long-term maintainability.

About You

6+ years of experience in SRE, platform engineering, infrastructure engineering, observability engineering, or related disciplines.
Strong experience building and operating observability platforms in cloud-native, distributed environments.
Deep hands-on experience with several of the following technologies: Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic, or similar platforms.
Strong software engineering skills with proficiency in Go, Python, or equivalent languages.
Experience operating and troubleshooting Kubernetes-based platforms at scale.
Strong understanding of monitoring, logging, tracing, telemetry pipelines, and modern observability practices.
Experience designing systems with scalability, reliability, performance, and operational simplicity in mind.
Proficiency with Infrastructure-as-Code tools such as Terraform, Ansible, or equivalent.
Ability to lead technical initiatives and influence engineering decisions across multiple teams.
Excellent communication skills with the ability to explain technical tradeoffs and align stakeholders around pragmatic solutions.

Preferred

Experience operating observability systems in GPU, AI/ML, HPC, or large-scale compute environments.
Familiarity with Slurm, Kubernetes GPU scheduling, or AI infrastructure platforms.
Experience with high-volume telemetry pipelines and streaming technologies such as Kafka, Vector, or Fluent Bit.
Knowledge of observability challenges related to model training, inference workloads, GPU utilization, and distributed AI systems.
Experience mentoring engineers and helping grow technical capability across teams.

Equal Opportunities Statement

We strongly encourage applications from people of color, the LGBTQ+ community, people with disabilities, neurodivergent individuals, parents, carers, and people from lower socio-economic backgrounds.

If there's anything we can do to accommodate your specific situation, please let us know.

Note: Responsibilities outlined are not exhaustive and may evolve as business needs change.

For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.