NVIDIA is looking for a Senior Solutions Architect, Customer Success to join its NVIDIA Infrastructure Specialist Team. Academic and commercial organizations around the world are using NVIDIA products to redefine deep learning and data analytics, and to power next-generation data centers. Join the team building and advising on many of the largest and fastest AI/HPC systems in the world!
We are looking for someone who blends deep technical expertise with a consultative, collaborative approach. This role engages directly with customers, partners, and cross-functional internal teams to assess infrastructure needs, architect scalable solutions, and guide the implementation of large-scale networking and AI infrastructure projects. The scope spans networking, system design, and automation, with the architect serving as a trusted strategic advisor and the technical face of NVIDIA to key accounts.
What You’ll Be Doing:
Serve as a senior technical authority and trusted consultant on NVIDIA technologies, contributing to architecture reviews, guiding infrastructure decisions at scale, and providing strategic recommendations aligned with each customer’s business objectives.
Establish and refine monitoring and optimization methodologies using analytics, telemetry, and automation to proactively detect bottlenecks, improve infrastructure resiliency, and drive continuous operational maturity.
Lead and advise on the analysis, optimization, and performance tuning of complex GPU-accelerated systems and AI workloads, ensuring high availability and efficiency across customer data centers.
Facilitate post-deployment reviews, incident retrospectives, and strategy sessions that shape the customer experience and feed actionable insights back into NVIDIA’s infrastructure roadmap.
Own and lead complex technical projects end-to-end—from initial discovery and solution design through implementation, knowledge transfer, and continuous improvement—ensuring alignment to SLAs and proactive mitigation of technical risks.
Support business growth by identifying AI infrastructure opportunities in cloud and enterprise environments, crafting compelling technical proposals, and driving initiatives that showcase NVIDIA’s leadership in this space.
What We Need to See:
Education & Experience: BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields, with 10+ years of professional experience in large-scale data center service operations with a focus on infrastructure.
NVIDIA GPU Expertise: Demonstrated hands-on experience deploying, configuring, and optimizing NVIDIA GPU-accelerated infrastructure, including driver and firmware management, CUDA toolkit integration, and GPU workload profiling and troubleshooting.
Customer Engagement: Track record of building long-term customer relationships and driving adoption through consultative engagement.
Analytical & Problem-Solving Skills: Strong analytical and decision-making capabilities, with a demonstrable ability to identify root causes, drive continuous improvement, and deliver resilient technical solutions.
System & Infrastructure Proficiency: Expertise in end-to-end data center architecture, spanning operating systems, Linux kernel drivers, GPU and NIC hardware, high-speed networking (InfiniBand, Ethernet, RDMA), and storage systems (Lustre, GPFS, NFS).
Leadership & Communication: Excellent communication, time management, and organizational skills, with the ability to lead complex cross-functional projects, guide technical teams, and present to executive stakeholders.
Travel: Willingness to travel up to 25% for customer engagements.
Ways to Stand Out from the Crowd:
Experience with Kubernetes for container orchestration, resource scheduling, and integration with GPU-accelerated workloads.
Familiarity with observability stacks (Grafana, Prometheus, Loki) for monitoring, alerting, and building fault-tolerant systems.
Experience with multi-tenant GPU cluster management and workload scheduling frameworks.
Experience with NVIDIA Base Command Manager (BCM) for provisioning, managing, and monitoring GPU clusters at scale.
Background with RDMA-based fabrics (InfiniBand or RoCE) in HPC or AI environments.
Knowledge of CI/CD pipelines, Infrastructure-as-Code (Terraform, Ansible), and GitOps workflows for infrastructure automation.