Job: L2 SRE / HPC Engineer
DevOps, Kubernetes, Python, Nvidia GPUs
Remote
Full Time
About the Company
Our client is an AI cloud provider. They work with many of the top AI companies on the planet, including Poolside, Meta, Modal, and Reka.
About the Role
Their HPC Engineers keep the GPU infrastructure running at peak performance and offer top-tier support to their customers.
At its core, the role has three main responsibilities:
- Deployment. They onboard new clusters at least monthly, and you will help take bare-metal servers and deploy them for their customers as high-performance compute as a service.
- Automation. Their GPU fleet is large and growing. You will help automate many of their processes and systems so that Fluidstack can continue to scale.
- Support. This is a client-facing role: you will work closely with our client's customers to make sure they can use the infrastructure to achieve their goals. You will work on everything from GPU debugging and Slurm management to training performance optimization.
Skills & Experience
- Experience with HPC systems, system administration, SRE, or DevOps.
- Experience with large-scale workloads using orchestrators such as Slurm or Kubernetes.
- Experience with automation of bare-metal machines and containers, using tools such as Ansible, Bash, or Python.
- Experience with shared storage platforms such as NFS, DDN, Vast, or CephFS.
- Experience provisioning large-scale clusters and networks with tools such as BCM or UFM.
- Experience with large-scale GPU systems, working with Nvidia GPUs and InfiniBand networks.