
Senior ML Systems Engineer, Frameworks & Tooling

Cohere
Full-time
Remote
Worldwide
Remote Engineering
Who are we?

Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

We obsess over what we build. Each one of us is responsible for advancing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what’s best for our customers.

Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products.

Join us on our mission and shape the future!

We’re looking for a senior engineer to help build, maintain, and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training, and build the tooling that connects research ideas to thousands of GPUs.

If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.


WHAT YOU’LL WORK ON

- Build and own the training framework responsible for large-scale LLM training.

- Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing); a minimal illustration follows this list.

- Improve training throughput and stability on multi-node clusters (e.g., NVIDIA GB200/GB300, H100/H200, and AMD accelerators).

- Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.

- Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training.

- Investigate and resolve performance bottlenecks across the ML systems stack.

- Build robust systems that ensure reproducible, debuggable, large-scale runs.
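
To give a concrete flavor of the abstractions this list refers to, here is a minimal, illustrative sketch of a sharded data-parallel training step in JAX. It is not Cohere's framework; the toy linear model, sharding layout, batch size, and learning rate are assumptions chosen to keep the example self-contained. A production framework layers tensor/pipeline parallelism, ZeRO-style optimizer sharding, and asynchronous checkpointing on top of primitives like these.

```python
# Illustrative only: a tiny data-parallel training step with jax.sharding.
# The linear model, batch size, and learning rate are placeholder assumptions.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-axis mesh over all local devices: pure data parallelism.
mesh = Mesh(jax.devices(), axis_names=("data",))
batch_sharding = NamedSharding(mesh, P("data"))   # shard the batch dimension
replicated = NamedSharding(mesh, P())             # replicate parameters

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@jax.jit
def train_step(params, x, y, lr=1e-3):
    grads = jax.grad(loss_fn)(params, x, y)       # XLA inserts the cross-device all-reduce
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = jax.device_put({"w": jnp.zeros((128, 1)), "b": jnp.zeros((1,))}, replicated)
x = jax.device_put(jnp.ones((256, 128)), batch_sharding)  # batch must divide evenly across devices
y = jax.device_put(jnp.ones((256, 1)), batch_sharding)
params = train_step(params, x, y)
```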



YOU MIGHT BE A GOOD FIT IF YOU HAVE

- Strong engineering experience in large-scale distributed training or HPC systems.

- Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.

- Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).

- Comfort debugging performance issues across CUDA/NCCL, networking, I/O, and data pipelines (a small probe of this kind is sketched after this list).

- Experience working with containerized environments (Docker, Singularity/Apptainer).

- A track record of building tools that increase developer velocity for ML teams.

- Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.

- Strong collaboration skills — you’ll work closely with infra, research, and deployment teams.
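
As a hedged illustration of the CUDA/NCCL debugging point above, the sketch below times a repeated all-reduce across local devices in JAX. The message size, iteration count, and the use of pmap/psum are assumptions for the example, not a prescribed tool; in practice a probe like this would be read alongside NCCL logs and profiler traces.

```python
# Illustrative only: time a repeated all-reduce across local devices to get a
# rough sense of collective throughput. Sizes and repeat counts are assumptions.
import time
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()
n_elems = 64 * 1024 * 1024 // 4                       # ~64 MiB of float32 per device
x = jnp.ones((n_dev, n_elems), dtype=jnp.float32)     # one shard per device

all_reduce = jax.pmap(lambda v: jax.lax.psum(v, axis_name="d"), axis_name="d")

jax.block_until_ready(all_reduce(x))                  # warm-up / compile

t0 = time.perf_counter()
for _ in range(10):
    out = all_reduce(x)
jax.block_until_ready(out)
elapsed = (time.perf_counter() - t0) / 10

print(f"all-reduce of ~64 MiB across {n_dev} devices: {elapsed * 1e3:.2f} ms per call")
```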



NICE TO HAVE