Software Engineer, Hardware Health
OpenAI
San Francisco · Full-time · Yesterday
About the role
ABOUT THE TEAM
The Hardware Health and Observability team owns the end-to-end health lifecycle of OpenAI’s global compute fleet.
Our mission is to maximize healthy, usable compute across accelerator vendors, generations, cloud providers, and regions through reliable health signals, automated remediation, and scalable operational tooling.
We build the systems that observe, detect, remediate, and verify hardware issues across GPUs, CPUs, networking, and platform infrastructure, enabling frontier model training and inference workloads to run reliably at hyperscale. We are the last line of defense for the success of OpenAI’s production and research workloads.
ABOUT THE ROLE
On the Hardware Health and Observability team, you’ll build critical infrastructure that keeps OpenAI’s largest compute clusters healthy and operational at scale.
Even small numbers of unhealthy systems can impact large-scale training and inference workloads. This team focuses on minimizing downtime, improving fleet