Posted 25 Nov 2023, 8:00 pm

Member of Technical Staff, High Performance Computing at Inflection AI


About the Role

At Inflection, the scale of our compute is critical to our mission of creating personal intelligence for everyone. We currently have several clusters in production and are continually expanding our compute capacity. You'll have the opportunity to work on the most powerful AI cluster in the world, comprising 22K NVIDIA H100 chips.


As a High Performance Computing practitioner, you will be responsible for the smooth day-to-day operation of these clusters. You will be expected to add monitoring and telemetry to the clusters to preempt issues, and when issues do occur, to put your firefighting skills to work and resolve them. You will monitor jobs running on thousands of GPUs and examine workloads and their utilization. You will partner with other members of technical staff at Inflection to understand their needs and how best to meet them.

Experience as an HPC practitioner and with schedulers such as Slurm and Kubernetes will be key to your success in this role. Knowledge of GPUs, their architectures, and their common failure modes is also important. Familiarity with LLMs and with the current state and trends in NLP will also help. Finally, being comfortable stepping into any problem with the HPC infrastructure and resolving it is essential.

Minimum Requirements:

  • Direct experience managing a Slurm cluster of several hundred nodes or more
  • Direct experience debugging massively parallel CPU/GPU jobs

Preferred experience:

  • Managing ML-specific workloads on large GPU clusters on Slurm or Kubernetes

Employee Pay Disclosures

At Inflection AI, we aim to attract and retain the best employees and compensate them in a way that appropriately and fairly values their individual contributions to the company. The pay range for this position in California is estimated to fall in the base range of approximately $150,000 - $300,000. This estimate can vary based on the factors described above, so the actual starting annual base salary may be above or below this range.




Source: Remote Ok