
Reliability Engineer, Supercomputing

Thinkingmachines

San Francisco · $350k – $475k · 7d ago

About the role

Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals. We are scientists, engineers, and builders who’ve created some of the most widely used AI products, including ChatGPT and Character.ai, open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

We're hiring an engineer to ensure the reliability of our GPU supercomputing fleet, owning the seam between hardware, firmware, and operating system. You will track the long tail of hardware issues: we are conducting frontier research in AI, and a single bad NIC or HBM, or a kernel driver edge case, can compr
