Expoint – all jobs in one place
The place where the best experts and companies meet

NVIDIA Principal Site Reliability Engineer, AI Infrastructure
United States, Texas 
554030201

US, CA, Santa Clara
US, TX, Austin
US, WA, Remote
US, CA, Remote
US, NV, Remote
Time type: Full time
Posted: 12 days ago
Job requisition ID:

What You Will Be Doing:

  • Architect, lead, and scale globally distributed production systems supporting AI/ML, HPC, and critical engineering platforms across hybrid and multi-cloud environments.

  • Design and lead implementation of automation frameworks that reduce manual tasks, promote resilience, and uphold standard methodologies for system health, change safety, and release velocity.

  • Define and evolve platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing approaches for sophisticated distributed systems.

  • Lead cross-organizational efforts to assess operational maturity, address systemic risks, and establish long-term reliability strategies in collaboration with engineering, infrastructure, and product teams.

  • Pioneer initiatives that influence NVIDIA’s AI platform roadmap, participating in co-development efforts with internal partners and external vendors, and staying ahead of academic and industry advances.

  • Publish technical insights (papers, patents, whitepapers) and drive innovation in production engineering and system design.

  • Lead and mentor global teams in a technical capacity, participating in recruitment and design reviews, and developing standard methodologies for incident response, observability, and system architecture.

What We Need to See:

  • 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure, with a strong track record of leading platform-scale efforts and high-impact programs.

  • Deep expertise in Linux/Unix systems engineering and public/private cloud platforms (AWS, GCP, Azure, OCI).

  • Expert-level programming in Python and one or more languages such as C++, Go, or Rust.

  • Demonstrated experience with Kubernetes at scale, CPU/GPU scheduling, microservice orchestration, and container lifecycle management in production.

  • Hands-on expertise in observability frameworks (Prometheus, Grafana, ELK, Loki, etc.) and Infrastructure as Code (Terraform, CDK, Pulumi).

  • Proficiency in Site Reliability Engineering concepts like error budgets, SLOs, distributed tracing, and architectural fault tolerance.

  • Ability to influence multi-functional collaborators and drive technical decisions through effective written and verbal communication.

  • Proven track record of delivering long-term, forward-looking platform strategies.

  • Degree in Computer Science or related field, or equivalent experience.

Ways to Stand Out from the Crowd:

  • Hands-on experience building platforms for large-scale AI training, inferencing, and data movement pipelines.

  • Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) and orchestration frameworks (e.g., Ray, Kubeflow).

  • Expertise in hardware fleet observability, predictive failure analysis, and power/resource-aware scheduling.

  • Experience leading operational readiness efforts and reliability engineering in GPU-heavy environments.

  • Track record of driving cultural improvements in incident management, root cause analysis, and postmortem processes across large teams.

You will also be eligible for equity and .