What you will be doing:
Work directly with key customers to understand their technology and provide the best AI solutions and guidance on the training process in terms of tools and methodology.
Perform in-depth analysis and optimization to ensure the best performance on GPU architecture systems (in particular Grace/Arm-based systems). This includes support in optimizing distributed training pipelines.
Partner with Engineering, Product, and Sales teams to develop and plan the most suitable solutions for customers. Enable development and growth of product features through customer feedback and proof-of-concept evaluations.
What we need to see:
Excellent verbal and written communication and technical presentation skills in English.
MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or other Engineering fields.
5+ years of work or research experience with Python, C++, or other software development.
Work experience and knowledge of modern NLP, including a good understanding of transformer, state space, diffusion, and MoE model architectures. This can include expertise in either training or optimization/compression/operation of DNNs.
Understanding of key libraries used for NLP/LLM training (such as Megatron-LM, NeMo, DeepSpeed, etc.) and/or deployment (e.g. TensorRT-LLM, vLLM, Triton Inference Server).
Track record in neural network performance optimization and/or training robustness.
Excited to work with multiple levels and teams across the organization (Engineering, Product, Sales, and Marketing). Capable of working in a constantly evolving environment without losing focus.
Self-starter with a growth mindset and a passion for continuous learning and sharing findings across the team.
Ways to stand out from the crowd:
Ability to conduct LLM post-training, in particular knowledge of large-scale RL.
Track record in running large-scale training/HPC jobs with a focus on training robustness and failure resilience.
Understanding of HPC systems: data center design, high-speed interconnects (InfiniBand), cluster storage, and scheduling-related design and/or management experience.