Key Responsibilities
Incident & Problem Management
- Lead major incident (MI) bridges and restore service with minimal business impact.
- Handle all L3 escalations and perform deep diagnostics across Java, JVM, middleware, OS, and infrastructure.
- Own technical RCAs and drive long-term, systemic remediation.
- Identify recurring failure patterns and risks.
Reliability Engineering
- Apply SRE principles: SLIs/SLOs, error budgets, resilience patterns.
- Tune JVM parameters, analyze thread/heap dumps, and improve performance.
- Influence application architecture for fault tolerance, scalability, and recoverability.
- Validate DR readiness, failover behavior, and resilience testing outcomes.
Change, Release & Risk