Expoint – all jobs in one place
Where the best experts and companies meet

Nvidia Deep Learning Solutions Architect – Distributed Training 
United Kingdom, England, Southampton 
33740973

28.07.2025
UK, Remote
Poland, Remote
Spain, Remote
Switzerland, Zurich
Germany, Remote
Time type: Full time
Posted on: 19 days ago

What you will be doing:

  • Work directly with key customers to understand their technology and provide the best AI solutions and guidance on the training process, in terms of both tools and methodology.

  • Perform in-depth analysis and optimization to ensure the best performance on GPU architecture systems (in particular Grace/Arm-based systems). This includes supporting the optimization of distributed training pipelines.

  • Partner with Engineering, Product, and Sales teams to develop and plan the most suitable solutions for customers. Enable development and growth of product features through customer feedback and proof-of-concept evaluations.

What we need to see:

  • Excellent verbal and written communication and technical presentation skills in English.

  • MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or other Engineering fields.

  • 5+ years of work or research experience with Python, C++, or other software development.

  • Work experience with and knowledge of modern NLP, including a good understanding of transformer, state space, diffusion, and MoE model architectures. This can include expertise in either training or optimization/compression/operation of DNNs.

  • Understanding of key libraries used for NLP/LLM training (such as Megatron-LM, NeMo, DeepSpeed, etc.) and/or deployment (e.g. TensorRT-LLM, vLLM, Triton Inference Server).

  • Track record in neural network performance optimization and/or training robustness.

  • Excited to work with multiple levels and teams across organizations (Engineering, Product, Sales, and Marketing teams). Capable of working in a constantly evolving environment without losing focus.

  • Self-starter with a growth mindset, a passion for continuous learning, and a habit of sharing findings across the team.

Ways to Stand Out from The Crowd:

  • Ability to conduct LLM post-training, in particular knowledge of large-scale RL.

  • Track record in running large-scale training/HPC jobs with a focus on training robustness and failure resilience.

  • Understanding of HPC systems: data center design, high-speed interconnects (InfiniBand), cluster storage, and scheduling-related design and/or management experience.