
Site Reliability Engineering Leader


The Site Reliability Engineering (SRE) Lead is responsible for leading the SRE team to ensure the reliability, scalability, and performance of the organization’s critical systems and services.

This role involves managing the team’s day-to-day operations, developing automation strategies, implementing best practices, and collaborating with development and operations teams to optimize the entire software lifecycle.

The ideal candidate is a highly skilled engineer with strong leadership skills who can drive improvements in system reliability, monitoring, and incident response.

Key Accountabilities/Deliverables:


* Lead, mentor, and manage a team of SREs, fostering a culture of reliability, collaboration, and continuous improvement.


* Oversee the availability, performance, and scalability of services, ensuring that systems are reliable, efficient, and meet established SLAs.


* Develop and implement automation strategies to reduce manual intervention, improve efficiency, and minimize downtime.


* Lead incident response efforts, ensuring timely resolution of production issues and minimizing impact on customers. Conduct post-incident reviews to identify root causes and implement preventive measures.


* Design, implement, and maintain robust monitoring and alerting systems to ensure real-time visibility into the health of production environments.


* Perform capacity analysis and forecasting to ensure systems can handle growth and peak demand without degradation.


* Work closely with development, DevOps, and infrastructure teams to integrate reliability engineering practices into the software development lifecycle.


* Identify performance bottlenecks and work on tuning systems for optimal performance, including database, application, and infrastructure optimizations.


* Ensure that systems and processes adhere to security and compliance standards, integrating security best practices into SRE activities.


* Provide regular updates and reports to leadership on system performance, incidents, and improvement initiatives.

Technical Knowledge and Understanding:


* In-depth knowledge of CI/CD pipelines, release management, and software lifecycle processes.


* Exceptional leadership and team management skills, with the ability to motivate and develop high-performing teams.


* Strong problem-solving and analytical skills, with a focus on data-driven decision-making.


* Excellent communication skills, with the ability to articulate technical issues clearly and effectively to both technical and non-technical stakeholders.


* Ability to manage multiple priorities in a fast-paced, high-stakes environment.

Experience:
 


* Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field.


* Master’s degree preferred.


* 7+ years of experience in site reliability engineering, DevOps, or software engineering, with at least 3 years in a lead...



