Senior Site Reliability Engineer (SRE & Platform Reliability)

Affirm
10 days ago
Full-time
Remote
Worldwide
Remote Engineering

Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest.

Site Reliability Engineering at Affirm is a small yet crucial team that helps our Engineering partners “Operate What They Own” with excellence to protect their customers’ experience. SRE accomplishes this by defining frameworks and best practices for operating applications, building tooling, and providing training and consulting. Some of the many SRE responsibilities are:

  • Providing data and visibility to teams and leadership on application performance
  • Guiding the development of SLOs
  • Driving the Incident Management and Analysis process
  • Steering the implementation of Change Management and Deployment practices
  • Engaging in service and architectural conversations
  • Recommending observability and alerting configurations

The SRE team benefits from experience across many domains including:

  • infrastructure, platform, and distributed systems
  • capacity management, load and chaos testing
  • automation, observability, and configuration management
  • development and product experience

The SRE team is seeking motivated software and systems engineers with the experience to build, iterate on, and expand incident lifecycle, reliability, and resilience practices throughout Affirm's Engineering organization and beyond.


What You'll Do:

  • You will be responsible for owning and delivering quarterly goals for your team, leading engineers on your team through ambiguity to solve open-ended problems, and ensuring that everyone is supported throughout delivery.
  • You will support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience, and analytics: participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs.
  • You will proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis.
  • You will support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” and on-call efforts.
  • You will foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating f