Director of Engineering, Data Infrastructure

Hubspotjobs
11 hours ago
Full-time
Remote
Worldwide
Remote Engineering

POS-31619


Director, Reliability Engineering

Role Summary

Our mission at HubSpot is to help millions of organizations grow better. HubSpot’s engineering organization has grown to more than 2,000 engineers shipping across thousands of services and deploying thousands of times per day. As HubSpot has become core infrastructure for over 200,000 customers worldwide, reliability isn’t just a priority — it’s foundational to customer trust and business growth.

Our Reliability Engineering team has matured from an early SRE function into a strategic pillar within Platform Infrastructure. The team has driven a 76% reduction in critical incidents while the platform scaled 19x in deployables, established company-wide SLO frameworks, and built the incident management practices that keep HubSpot running.

Now we’re entering the next phase: leveraging AI and agentic approaches to fundamentally transform how we detect, respond to, and prevent outages. As Director of Reliability Engineering, you’ll lead this evolution — deepening our reliability capabilities, pioneering AI-assisted operations, and ensuring HubSpot remains a platform customers can confidently bet their business on.

What You’ll Do

Lead and Develop the Team

  • Lead a team of ~20 reliability engineers, fostering a culture of operational excellence, continuous learning, and customer obsession
  • Attract, develop, and retain top talent; build career paths that keep engineers engaged and growing

Own Reliability Strategy

  • Define and drive HubSpot's reliability roadmap, balancing proactive resilience investments with reactive incident reduction
  • Partner with Infrastructure leadership to prioritize reliability initiatives alongside cost, performance, and platform evolution
  • Set and evolve SLO standards that align engineering effort with customer experience

Pioneer AI-Driven Operations

  • Lead the strategy for integrating AI and agentic approaches into incident detection, diagnosis, and mitigation, reducing time-to-resolution and human toil
  • Explore and implement AI-assisted tooling for pattern recognition across incidents, automated runbook execution, and predictive reliability insights
  • Build intelligent systems that learn from our operational history, proactively surface risks, and recommend (or execute) mitigation actions
  • Balance automation with human