What you'll be doing:
Own the deployment of scalable datacenter networking for enterprise AI/ML systems.
Deploy and validate cluster designs, optimizing them for enterprise facilities.
Collaborate closely with other experts in network, compute, software, and storage to drive innovation.
Lead multi-disciplinary projects, addressing high-level goals and complex challenges.
Engineer on-premises cloud-native solutions that flawlessly integrate with diverse cloud providers.
Assume a pivotal role for the compute and hardware architecture domain, driving expertise and excellence.
Showcase a multidisciplinary understanding of Ethernet, InfiniBand, data center LAN (local area networking), WAN (wide area networking), and SD (software-defined) networks.
Conduct TCO analysis, optimizing datacenter efficiency for cost-effectiveness.
Find opportunities for operational improvements and collaborate with teams to build solutions that improve excellence and sustainability in network operations.
What we need to see:
Bachelor's degree or equivalent experience, with 10+ years in hardware or infrastructure architecture.
Proven expertise in designing and deploying on-prem cloud-native platforms, with deep understanding of scaling and resilience at chassis, rack, cluster, and data center levels.
In-depth knowledge of networking protocols and technologies including Ethernet, TCP/IP, VLAN, VXLAN, BGP, EVPN, MPLS, QoS, and InfiniBand. Skilled in evaluating, designing, and optimizing complex network architectures for performance, security, and resilience.
Extensive experience with optical networking and cabling, fiber types, and transceiver modules (SFP/SFP+, QSFP, OSFP), including their signal modulation, FEC, and compatibility with multiple switch platforms and software configurations.
Strong grasp of cloud-native systems with emphasis on high availability, scalability, and security in compute environments. Demonstrated system-level thinking to enhance reference designs.
Hands-on experience with infrastructure as code and monitoring tools: Base Command Manager (BCM), Ansible, Terraform, Grafana, Prometheus.
Proficient with Linux (including Cumulus Linux), and scripting languages such as Python and Bash.
Familiarity with NVIDIA networking products including Mellanox switches, Cumulus Linux, BlueField DPUs, and InfiniBand technologies.
Demonstrated leadership in cluster design, especially in networking, security, and remote access management. Experienced in working independently and with distributed teams across time zones. Collaborates closely with SMEs to ensure swift production issue resolution and maintain customer satisfaction.
Strong written and verbal skills for effectively communicating complex technical concepts to diverse audiences. Capable of creating clear documentation including Methods of Procedure (MoPs) and deployment guides.
Ways to stand out from the crowd:
Certified in key vendor programs including Cisco (CCIE), Arista (ACE), Juniper (JNCIE), and NVIDIA (NCP-AIN), with deep expertise in RDMA technologies such as RoCE.
Broad experience across Networking, Compute, Storage, and Platform Sizing, with a focus on Infrastructure Cost Optimization and TCO analysis for datacenter environments.
Strong understanding of network topologies, load balancing, and congestion control algorithms; experienced in both practical and standards-based approaches, including engagement with open-source communities.
Proficient in Python with a personal GitHub showcasing relevant projects. Skilled in Kubernetes, Docker, and performance monitoring tools such as Grafana, Prometheus, and Datadog.
Hands-on experience with networking simulators including NVIDIA Air, GNS3, and EVE-NG, for digital twin and virtual network testing.
You will also be eligible for equity.