At Sully.ai, we’re building cutting-edge AI-native infrastructure to power real-time, intelligent healthcare applications. Our team operates at the intersection of high-performance computing, ML systems, and cloud infrastructure — optimizing inference pipelines to support next-generation multimodal AI agents. We're looking for a deeply technical engineer who thrives at the systems level and loves building performant, scalable infrastructure.
As a senior-level engineer, you'll lead efforts to deploy and optimize large language models on high-end GPU hardware and build the infrastructure that supports them. You'll work across the stack — from C++ and CUDA kernels to Python APIs — while also shaping our DevOps practices for scalable, multi-cloud deployments. This role blends systems performance, ML inference, and infrastructure-as-code to deliver low-latency, production-grade AI services.
LLM Inference Optimization: Develop and optimize inference pipelines using quantization, attention caching, speculative decoding, and memory-efficient serving.
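To give a flavor of this kind of work (an illustrative toy, not Sully.ai code): speculative decoding uses a cheap draft model to propose several tokens that the expensive target model then verifies, emitting the accepted prefix. The sketch below uses stand-in greedy "models" (plain Python callables) and calls the target once per verified token; in a real serving stack the k verifications would be a single batched forward pass.

```python
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[int]], int],  # expensive model: greedy next token
    draft_next: Callable[[List[int]], int],   # cheap draft model: greedy next token
    prompt: List[int],
    n_tokens: int,
    k: int = 4,
) -> List[int]:
    """Toy greedy speculative decoding.

    The draft proposes k tokens; the target checks each one. On the first
    disagreement the target's own token is emitted and a fresh draft round
    starts. The output always matches the target's greedy decoding.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        ctx = list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies the proposal; in production this is one batched pass.
        for t in proposal:
            expected = target_next(out)
            out.append(expected)
            if expected != t:
                break  # draft diverged; re-draft from the corrected context
            if len(out) - len(prompt) >= n_tokens:
                break
    return out[len(prompt): len(prompt) + n_tokens]
```

A key property worth noting: because the target's token is always the one emitted, the output is identical to plain greedy decoding from the target model; the draft only changes how many tokens are accepted per verification round.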
Systems Programming: Build and maintain low-level modules in C++/CUDA/NCCL to squeeze the most out of GPUs and high-throughput architectures.
DevOps & Infrastructure Engineering: Stand up and manage multi-cloud environments using modern IaC frameworks such as Pulumi or Terraform. Automate infrastructure provisioning, deployment pipelines, and GPU fleet orchestration.
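As a rough illustration of the IaC side of the role (a hedged sketch with placeholder names, not our actual configuration), provisioning a single GPU node on GCP with Terraform looks roughly like this:

```hcl
resource "google_compute_instance" "gpu_node" {
  name         = "inference-node-0"   # placeholder name
  machine_type = "n1-standard-8"      # placeholder machine type
  zone         = "us-central1-a"      # placeholder zone

  boot_disk {
    initialize_params {
      image = "projects/ml-images/global/images/family/common-gpu"
    }
  }

  guest_accelerator {
    type  = "nvidia-tesla-t4"         # attach one T4 GPU
    count = 1
  }

  scheduling {
    on_host_maintenance = "TERMINATE" # GCP requires this for GPU instances
  }

  network_interface {
    network = "default"
  }
}
```

In practice, fleet orchestration means templating resources like this across zones and providers, wiring them into deployment pipelines, and managing their lifecycle through code review rather than console clicks.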
Real-Time Architectures: Design low-latency streaming and decision-support systems leveraging embedding models, VRAM token caches, and fast interconnects.
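One piece of the token-cache bookkeeping can be sketched in a few lines (a toy, not our implementation): a prefix cache lets requests that share a prompt prefix reuse already-computed KV state instead of recomputing it. Here the "KV tensors" are opaque Python values standing in for VRAM-resident buffers, so the sketch shows only the lookup/eviction logic.

```python
from collections import OrderedDict
from typing import Optional, Tuple

class PrefixKVCache:
    """Toy LRU cache mapping token prefixes to stand-in KV state.

    In a real server the values would be VRAM-resident KV tensors; here
    they are arbitrary Python objects, illustrating the bookkeeping only.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store: "OrderedDict[Tuple[int, ...], object]" = OrderedDict()

    def longest_prefix(self, tokens: Tuple[int, ...]) -> Optional[Tuple[int, ...]]:
        """Return the longest cached prefix of `tokens`, refreshing its LRU slot."""
        for end in range(len(tokens), 0, -1):
            key = tokens[:end]
            if key in self._store:
                self._store.move_to_end(key)  # mark as recently used
                return key
        return None

    def put(self, tokens: Tuple[int, ...], kv: object) -> None:
        """Insert (or refresh) a prefix, evicting the least recently used entry."""
        self._store[tokens] = kv
        self._store.move_to_end(tokens)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)
```

Production systems (e.g. vLLM-style paged attention) manage this at block granularity with reference counting, but the hit/evict trade-off is the same.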
Developer Enablement: Build robust tooling, interfaces, and sandbox environments so that other engineers can contribute safely to the ML systems layer.
Proficiency in C++, CUDA, and Python with experience in systems or ML infrastructure engineering.
Deep understanding of GPU architectures, inference optimization, and large model serving techniques.
Hands-on experience with multi-cloud environments (GCP, AWS, etc.) and infrastructure-as-code tools such as Pulumi, Terraform, or similar.
Familiarity with ML deployment frameworks (TensorRT, vLLM, DeepSpeed, Hugging Face Transformers, etc.).
Comfortable with DevOps workflows, containerization (Docker), CI/CD, and distributed system debugging.
(Bonus) Experience with streaming embeddings, semantic search, or hybrid retrieval architectures.
(Bonus) Interest in building tools that democratize high-performance systems for broader engineering teams.
Collaborate with a highly technical team solving hard problems at the edge of AI and healthcare.
Work with bleeding-edge GPU infrastructure and build systems that push what's possible.
Be a foundational part of shaping AI-native infrastructure for real-time, mission-critical applications.
Help accelerate a meaningful product that improves how clinicians work and patients are cared for.
Sully.ai is an equal opportunity employer. In addition to EEO being the law, it is a policy that is fully consistent with our principles. All qualified applicants will receive consideration for employment without regard to status as a protected veteran or a qualified individual with a disability, or other protected status such as race, religion, color, national origin, sex, sexual orientation, gender identity, genetic information, pregnancy or age. Sully.ai prohibits any form of workplace harassment.