

Senior Lead Site Reliability Engineer

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Consumer and Community Banking team, you will set clear quality gates across requirements, design, secure coding, testing, releases, and post-production monitoring to ensure reliability, performance, security, and observability.

Job responsibilities


* Set clear quality gates across requirements, design, secure coding, testing, releases, and post-production monitoring to ensure reliability, performance, security, and observability.


* Turn business goals into clear, testable requirements, and hold teams to an objective "Definition of Done" before release.


* Define and manage SLIs/SLOs and error budgets, and ensure they're reflected in roadmaps and delivery plans.


* Lead operational readiness reviews, assess delivery risk, and drive fixes through root-cause analysis, corrective actions, and automation to prevent repeat issues.


* Improve logging, monitoring, and alerting so dashboards are actionable and alerts are tuned to reduce noise and speed response.


* Own CI/CD controls (security, reliability, testing, change management) and drive automation to reduce toil and increase release confidence.


* Lead and participate in major incident response (including outside business hours when needed), run post-incident reviews, and drive improvements against KPIs like availability, MTTR, and change failure rate.

Required qualifications, capabilities, and skills


* 10+ years of experience supporting critical applications in large-scale environments, including experience leading and mentoring engineers and teams.


* Strong grounding in SDLC and secure development practices, with experience implementing objective quality gates and release readiness standards.


* Hands-on SRE experience, including SLIs/SLOs, error budgets, incident management, and post-incident reviews/root-cause analysis.


* Experience designing actionable monitoring/logging and dashboards (e.g., Splunk, AppDynamics, or equivalent), including alert tuning.


* Experience with CI/CD pipelines and automated testing (unit, integration, security), plus operational controls that reduce change risk.


* Calm, accountable incident leadership under pressure, with strong communication and stakeholder management.


* Comfortable collaborating with global teams and engaging during critical incidents outside standard business hours.

Preferred qualifications, capabilities, and skills


* Proficiency in Python; experience with LangChain, LangGraph, or similar agentic frameworks


* Experience implementing LLMs using vector databases and Retrieval-Augmented Generation (RAG), as well as model tuning


* Strong SRE fundamentals: SLOs, SLIs, error budgets, blameless post-mortems, capacity planning


* Hands-on with observability tooling (Datadog, Prometheus, OpenTelemetry, distributed tracing)


* Experience leading operational readiness reviews and maintaining "Defi...



