What you will be doing:
Build and maintain CI/CD pipelines that support fast, reliable integration and deployment across complex systems.
Design tools and automation workflows that simplify software releases, manage dependencies, and increase reliability.
Accelerate development by modularizing systems and enabling independent release cycles.
Build infrastructure automation for provisioning, scaling, and maintaining GPU clusters.
Automate software updates and monitor system health to improve reliability and availability.
Troubleshoot and resolve operational issues across distributed infrastructure.
Manage firmware and software rollouts to minimize downtime and ensure consistency.
Work with global engineering teams to align infrastructure tools and support project goals.
What we need to see:
BS or MS in Computer Science, Computer Engineering, or a related field.
5+ years of experience managing infrastructure or systems in high-performance or distributed environments.
Expertise in scripting and automation using Python, Ansible, and Shell.
Practical experience with modern CI/CD tools and infrastructure-as-code frameworks.
Strong understanding of Linux, networking, and distributed system design.
Proven ability to break down monolithic systems into scalable, loosely coupled components.
Ability and flexibility to work and communicate effectively in a multi-national, multi-time-zone corporate environment.
Ways to stand out from the crowd:
Experience with cluster management tools like Slurm.
Familiarity with NVIDIA DGX/HGX systems and GPU-based clusters.
Knowledge of observability tools such as Prometheus and Grafana.
Proven ability to lead DevOps process improvements and drive team efficiency.