Job: L2 SRE / HPC Engineer

DevOps, Kubernetes, Python, Nvidia GPUs

Remote

Full Time

Apply for this job - or - Join our talent network

About the Company

Our client is an AI cloud. They work with many of the top AI companies on the planet, including Poolside, Meta, Modal, Reka, and many more.

Our HPC Engineers make sure their GPU infrastructure is working at peak performance and offer top tier support to our customers.

At its core, you will have three main responsibilities:

Deployment. They will be onboarding new clusters at least monthly – you will help take bare-metal servers and deploy them for their customers as high performance compute as a service.
Automation. Their GPU fleet is large and growing. You will help us to automate many of their processes and systems to allow them to support Fluidstack continuing to scale.
Support. This will be a client facing role – you will work closely with our client customers to make sure that they are able to utilize our infrastructure to achieve their goals. You will work on everything from GPU debugging, Slurm management, to training performance optimization.

Experience with HPC systems, System Administration, SRE, or DevOps
Experience with large scale workloads utilizing orchestrators like Slurm or Kubernetes.
Experience with automation of bare-metal machines and containers, using tools such as Ansible, Bash, or Python.
Experience with shared storage on platforms such as NFS, DDN ,Vast, CephFS, etc.
Experience provisioning large scale clusters and networks with e.g. BCM, UFM
Experience with large-scale GPU systems, working with Nvidia GPUs and Infiniband networks