

Senior Site Reliability Engineer

Essential Functions:


* Partner with software developers, platform engineers, and IT staff to improve system design, operability, deployment safety, and production support readiness.


* Define and maintain operational standards, runbooks, support procedures, escalation paths, and service-level objectives.


* Evaluate system architecture and changes to ensure they balance functional requirements, service quality, reliability, security, and compliance needs.


* Drive continuous improvement in platform stability, maintenance, and availability.


* Provide advanced technical support and troubleshooting for complex platform and service issues affecting internal users and stakeholders.

Experience and Skills Required:


* 8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, or related infrastructure roles supporting production services.


* Strong experience with Linux systems administration and troubleshooting in enterprise environments.


* Strong experience operating and maintaining on-prem Kubernetes platforms and all related components including CRI, CNI, and CSI plugins.


* Experience deploying and maintaining applications on Kubernetes using Helm, Kustomize, and similar tooling.


* Experience supporting DevOps tooling such as GitLab, Artifactory, Jira, and Confluence.


* Experience with GitOps tools such as FluxCD or ArgoCD.


* Proficiency scripting with at least one of Python, Go, or Bash.


* Strong experience designing, maintaining, and maturing observability tooling including monitoring, dashboards, logging and tracing, and supporting SLOs.


* Strong understanding of reliability engineering concepts:
+ Service health indicators
+ High availability design, failure reduction, and testing
+ Operational readiness practices, including developing documentation, runbooks, and architectural descriptions
+ Incident response, root cause analysis, remediation/recovery


* Ability to obtain a security clearance, which includes U.S. citizenship.

Preferred:


* Experience with multiple Linux distributions including Ubuntu.


* Experience with at least one of the following: Tanzu Kubernetes, Nutanix Kubernetes Platform, Canonical Kubernetes.


* Experience with cloud platforms such as AWS and Azure.


* Experience with infrastructure automation and configuration management.


* Experience managing AI tooling on Kubernetes, including MCP servers, LLM platforms (vLLM, Ollama), and Kubeflow.


* Experience with security and compliance considerations in regulated environments.


* DoD experience.


* Active or inactive Secret Security Clearance.

Education:


* Bachelor’s degree in CS, Software Engineering, or other IT-related field, or equivalent experience.

 

REMOTE WORK NOTICE: This position may be performed fully remote, hybrid, or onsite at an ARA office.

Preference will be given to c...



