Staff Site Reliability Engineer – Automation and Platform

Cerebras
9 months ago
Full-time
Remote
Worldwide
Remote Engineering
About the Role

We are building a high-performance SRE function to support one of the world’s fastest-growing AI inference services, powered by the Wafer-Scale Engine (WSE). This team will help deliver world-class, ultra-reliable inference infrastructure for leading model builders such as OpenAI and other frontier labs.

As a Staff SRE, you will lead the engineering effort to eliminate toil at scale by driving implementation of self-service delivery pipelines, shared observability, and common tooling. The role starts with about one month of hands-on operational immersion to gain deep familiarity with our current stack, production pain points, and high-stakes workflows. From there, your primary focus shifts to architecting and delivering the "tomorrow" layer: declarative, GitOps-driven CD for model releases, capacity provisioning, and cluster upgrades.

Success over the first year in this role will be defined by enabling core teams, product managers, external customers, and cluster stakeholders to operate in a fully self-service model with strong reliability guarantees.

You will partner with our early-career SRE sub-team, who own day-to-day operations. This will allow you to deeply understand their...