Architect and design high-performance storage systems for GPU clusters, supporting large checkpoints and low-latency context preemption and reload.
Develop monitoring and observability tools for GPU clusters.
Maintain high availability, fault tolerance, and disaster recovery strategies for AI infrastructure.
Work closely with AI/ML engineers, data scientists, and DevOps teams to streamline AI workflows.
What you will bring:
Master's or PhD in Electrical Engineering or Computer Science
Over 5 years of experience building HPC systems
C/C++ programming – for performance-critical components and integration tasks; parallel filesystems such as Lustre are written in C
Linux Kernel and OS internals – to optimize system behavior and support kernel-level customization for filesystems and networking
Filesystems knowledge – with a strong preference for experience in Lustre or similar distributed filesystems
Kubernetes – for container orchestration and management at scale
Hardware and networking familiarity – to work effectively with low-level infrastructure and performance tuning
Good to have:
Strong understanding of RDMA and RoCEv2 protocols
Hands-on experience with GPUs
Understanding of AI workflows, training, and inference
Understanding of AI/ML Python frameworks (TensorFlow, PyTorch)