
   

Sr./Lead Site Reliability Engineer

Company

Federal Reserve Bank of San Francisco

We are the Federal Reserve Bank of San Francisco—public servants with a mission to advance the nation’s monetary, financial, and payment systems to build a stronger economy for all Americans.

We are a community-engaged bank, and are committed to understanding and serving the vibrant, expansive communities of the Twelfth District.

That means we seek and appreciate new perspectives.

We respect people for what they do and for who they are.

We build opportunities to learn and grow.

When you join the SF Fed, you become part of a diverse team united in its purpose to promote an economy that works for everyone.

As a Sr./Lead Site Reliability Engineer, you will work with the Cash Application Delivery Services (ADS) development, QA, DevOps, and National IT teams to manage the systems that support the Cash ADS application suite, both on-prem and in the cloud.

Your main focus will be to ensure that all of our applications operate optimally and that every aspect of each application is monitored, so that issues can be diagnosed and resolved quickly as they arise.

We empower our people to balance their life and work responsibilities.

That’s why we offer a flexible hybrid work model that allows you to collaborate with office colleagues on some days, and work from home on others.

Responsibilities:



* Establish and run playbooks to support the resolution of incidents that occur in production environments.


* Help design dashboards for effective monitoring of infrastructure resources in the cloud environments.


* Work with development teams to establish Service-Level Objectives and key Service-Level Indicators


* Conduct Production Readiness Reviews to ensure services meet accepted standards of operational readiness before going live.


* Ensure infrastructure aligns with Security standards, assist in audits, and implement recommended practices to protect data and systems.


* Facilitate the design and implementation of Disaster Recovery plans, including backups, failover, and recovery mechanisms, with the development and DBA subject matter experts.


* As one of the SREs, drive improvement opportunities in infrastructure, tooling, and workflows using a continuous feedback loop between development and CloudOps


* Ensure uptime and reliability of cloud-based infrastructure and systems by monitoring system performance and maintaining high availability of cloud-based assets.


* Participate in incident response and troubleshooting by conducting root cause analysis and implementing solutions to prevent recurrence.


* Establish thresholds for cloud-based services and capabilities, and set up and maintain monitoring systems to detect issues before they impact users.


* Configure alerts for system anomalies, develop monitoring dashboards, and monitor resource usage, latency, and error rates.


* Analyze system performance, establish metrics and thresholds, o...



