This job has been posted for more than 30 working days and has expired.

Site Reliability Engineer

Join the Mizuho team as a Lead Site Reliability Engineer (SRE)!

 

The successful candidate will bring the following:

•        10+ years of Software Engineering and Architecture experience, including at least 5 years of SRE-focused experience in Production Support, Application Support, and DevOps implementation.

•        Demonstrated experience enabling SRE principles and practices with technical and operations teams at different SRE maturity levels across the Engineering and Operations space.

•        Demonstrated experience influencing design committees and process teams to establish standards that improve approaches and maturity across IT teams.

•        Experience working closely with Infrastructure services and product teams to develop reliable solutions that improve availability, scalability, and performance against targets.

•        Experience across the SDLC, from architecture and software design, SLA/SLO definitions, tech debt reviews, CI/CD releases, and KPI monitoring to DevOps principles.

•        Experience in production systems analyzing performance and error metrics, leading triage and troubleshooting exercises, and tracking incident management targets (MTTx).

•        Strong experience in infrastructure and application technology components and designs, with the ability to assess problem areas (logs/events), support analysis (metrics/traces), and recommend solutions.

•        Hands-on experience coding and developing automation solutions leveraging API-based integrations and configuration using Ansible and Terraform for IaaS solutions.

•        Experience working with microservices and containerized platforms, supporting them through monitoring, alerting, and troubleshooting as part of service operations.

•        Technical knowledge and experience in cloud architectures, hybrid cloud, and cloud-native solutions, leveraging reliable cloud designs to improve operational efficiency.

•        Experience working in incident management, leveraging postmortem analysis and developing reliable solutions as part of driving multiple incident management initiatives.

•        Experience with observability tools and frameworks, the concept of golden signals, and MELT (metrics, events, logs, traces) data integration and analysis using market solutions to improve operational efficiency.

•        Experience managing and growing teams to achieve short-term and long-term goals as part of the SRE roadmap, in alignment with SRE strategic goals.

•        Experience partnering with multiple peers and stakeholders, and the ability to interact with leadership and technical teams at different levels.

•        Ability to adapt and to support multiple application and infrastructure groups with their SRE needs in a fast-paced, dynamic, and growing organization.

 

Must have:

•        10+ years of overall IT experience focusing on Software Engineering, Architecture and/or supporting...