Model Serving Engineer
About Fundamental
Fundamental is an AI company pioneering the future of enterprise decision-making. Founded by DeepMind alumni, Fundamental has developed NEXUS – the world's most powerful Large Tabular Model (LTM) – purpose-built for the structured records that actually drive enterprise decisions. Backed by world-class investors and trusted by Fortune 100 companies, Fundamental unlocks trillions of dollars of value by giving businesses the Power to Predict.
At Fundamental, you'll work on unprecedented technical challenges in foundation model development and build technology that transforms how the world's largest companies make decisions. This is your opportunity to be part of a category-defining company from the ground up. Join the team defining the future of enterprise AI.
About the role
Our Serving team is responsible for turning NEXUS, our Large Tabular Model, into a reliable and scalable production system. We own the infrastructure and execution stack that serves the model across multiple deployment environments, each with different requirements around scale, isolation, performance, and trust.
The team sits at the intersection of research and production engineering. We work closely with researchers to bring new model architectures into production, while building the systems needed to operate them efficiently and predictably under real-world workloads. Tabular foundation models introduce serving challenges that differ meaningfully from traditional LLM inference, including irregular computational behavior and complex resource tradeoffs across CPU, GPU, memory, and networking.
As a Model Serving Engineer, you’ll work across the full inference stack, from Python runtime performance and concurrency behavior to distributed orchestration, GPU serving infrastructure, and deployment architecture. You’ll identify bottlenecks, improve throughput and latency, and help define how new generations of the model are translated from research artifacts into production-grade systems.
This is a deeply technical, Python-heavy role for engineers who enjoy distributed systems, performance optimization, and low-level infrastructure challenges close to modern ML systems.
Key responsibilities
Optimize Python inference code for performance under real concurrency constraints, including GIL contention, multi-threading, multiprocessing, async execution, and long-running production workloads
Work closely with research to understand model internals and support the continuous evolution of the architecture, especially around complex and non-obvious computational behavior under production load
Collaborate with research and infrastructure teams to reason about hardware utilization and serving tradeoffs across GPU, CPU, memory, networking, batching, and concurrency
Define and evolve the architecture behind our distributed inference and asynchronous execution stack, including orchestration, worker coordination, and end-to-end concurrency patterns
Own the Triton serving layer for NEXUS, including how models are packaged, configured, and executed as part of our production inference pipeline
Build observability and performance tooling across the serving stack, and use production metrics to drive tuning decisions around latency, throughput, and resource efficiency
Solve cross-cutting serving challenges that emerge from deploying the same model across environments with very different scale, isolation, and reliability constraints
Evaluate and integrate new inference runtimes, serving strategies, and infrastructure approaches as the model ecosystem evolves
Must have
Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
5+ years of experience in model serving, ML infrastructure, or a closely related backend engineering role
Deep expertise in Python concurrency, including GIL behavior, multi-threading, thread safety, and multiprocessing
Experience building asynchronous and message-driven systems
Experience building high-performance, large-scale distributed systems
Ability to read and reason about ML model implementations at a computational level, including compute behavior, batching, memory usage, and inference characteristics
Experience profiling and optimizing performance across CPU, memory, I/O, and ideally GPU workloads, and translating findings into architectural improvements
Nice to have
Understanding of GPU architecture, performance characteristics, and resource utilization in high-performance compute workloads
Experience working with tabular and structured-data ML systems
Understanding of neural networks and modern deep learning architectures
Experience with Kubernetes and cloud infrastructure
Familiarity with DevOps and production infrastructure tooling, including containers, Helm, observability, and CI/CD systems
Benefits
Competitive compensation with salary and equity
Comprehensive benefits, including medical, dental, and vision coverage and a 401(k) plan
Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
Relocation support for employees moving to join the team in one of our office locations
A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action