

Lead - ML Ops Engineer

What will you do?


* Design, develop, and document Infrastructure as Code (Terraform) for ML/LLM platform components on AWS/Databricks; implement secure, scalable foundations for data, compute, networking, and secrets.


* Build and maintain GitHub-based pipelines (Actions/Workflows) for training, packaging, validation, and deployment of ML/LLM assets (models, evaluation suites, prompts, policies), using GitOps for environment promotion.


* Containerize models using Docker and deploy them primarily through managed endpoints (SageMaker/Azure ML); Kubernetes-based serving (KServe/Triton/Seldon) is a plus.


* Operate model registries and feature stores; enforce versioning, lineage, and artifact governance via MLflow/Databricks and cloud-native services.


* Implement logs/metrics/traces, performance profiling, and drift/quality monitors; define SLIs/SLOs and on-call runbooks; drive incident response and post-mortems with accountability (business hours support rotation).


* Embed DevSecOps: secrets management, IAM/RBAC, vulnerability scanning, image signing, policy-as-code, least-privilege access, backup/DR/resiliency patterns; align with enterprise security standards.


* Operationalize GenAI: prompt/content safety filters, evaluation harnesses (human-in-the-loop), grounding/attribution logging, token cost and latency tracking, and red-teaming pipelines integrated into CI/CD.


* Monitor and optimize compute/storage/bandwidth and inference costs; implement right-sizing, autoscaling, and caching strategies.


* Partner with Data Scientists to productize models; co-design platform features with stakeholders; deliver documentation, templates, and knowledge transfers that accelerate safe reuse.


* Run operations (RUN): Troubleshoot escalations, improve monitoring, automate administration/IRP tasks, and continuously harden reliability, performance, and security across environments.

What skills and capabilities will make you successful?


* Technical Experience:
+ Understanding of DevOps concepts such as reference implementation enforcement, use of shared DevOps stacks, infrastructure optimization (performance, cost, HA, resiliency), release management (GitOps best practices), and QA automation frameworks.
+ Strong knowledge of AWS ecosystems and Databricks integration.
+ Proficiency in Terraform for developing, testing, and maintaining Infrastructure-as-Code to manage cloud services for ML engineering.
+ Hands-on experience with CI/CD using GitHub, GitHub Actions, and Workflow automation to support continuous integration, delivery, and deployment of ML assets.
+ Strong experience with Docker; Kubernetes is a plus.
+ Experience with MLflow (tracking/registry), model registries, feature stores, experiment tracking, and lineage management; Databricks and cloud-native equivalents.
+ Build pipelines for training, testing (unit/integration/e2e), evaluation...



