We are seeking an experienced Site Reliability Engineer (SRE) with expertise in Infrastructure as Code tools such as Terraform, core CI/CD tools such as Azure DevOps, and monitoring tools including Datadog and AWS CloudWatch. The ideal candidate will have commercial experience in technologies such as .NET or Java, and be skilled in troubleshooting, incident resolution, and improving service and change management processes. Strong leadership in client-facing discussions and engagement with third-party suppliers is essential. An SRE Foundation certificate and a cloud provider associate-level certification are highly beneficial.
Commercial experience and proficiency with industry-standard:
IaC tooling (preferably Terraform, or ARM/Bicep and CloudFormation)
Core CI/CD tooling (Azure DevOps, GitHub Actions, or GitLab)
Monitoring tooling (Datadog, Splunk, New Relic, Azure Monitor, AWS CloudWatch)
Commercial experience in at least one core technology (.NET, Java, AI/Data Engineering, Golang)
Troubleshooting issues and identifying systemic failings indicated by incidents/failures
Implementing fixes
Proposing solutions for reducing toil
Providing leadership in the incident resolution process, including creating and maintaining documentation and providing key input to post-mortem analysis
Improving Service Request and Change Management processes, both technically and through stakeholder management
Participating in the security management process and proactively mitigating risks (vulnerabilities in code, infrastructure, and dependencies)
Leading client-facing meetings and discussions about the SRE process, and identifying areas for increasing the SRE footprint