Sr Director of Software Engineering - AI Infrastructure Platform

Your opportunity to make a real impact and shape the future of financial services is waiting for you.

Let's push the boundaries of what's possible together.

As a Senior Director of Software Engineering at JPMorganChase within the firmwide AI Infrastructure Platform organization, you will lead multiple technical areas and manage the activities of multiple departments responsible for delivering a unified AI infrastructure layer across on-premises environments, public cloud, and emerging accelerated-compute vendors.

You will collaborate across AI/ML engineering, infrastructure, security and controls, and vendor teams to ensure the firm remains at the forefront of AI platform capabilities, operational excellence, and industry best practices.

In this role, you will own training and experimentation on a Kubernetes-standardized platform.

While a dedicated architecture function exists, you will act as an active design partner, guiding architectural trade-offs and ensuring designs translate into reliable, secure, and operable systems at enterprise scale.

Job responsibilities



* Lead multiple technology and platform implementations across departments to deliver firmwide AI infrastructure objectives, with a primary focus on training and experimentation platforms operating at enterprise scale.


* Own the design, delivery, and evolution of a Kubernetes-first training and experimentation platform, including Kubernetes-native support for batch and distributed training jobs, lifecycle management, retry semantics, and failure recovery patterns.


* Standardize AI developer workflows for experimentation, enabling self-service job submission, reusable templates and golden paths, reproducibility mechanisms, and consistent runtime behavior across hybrid deployment environments.


* Build and evolve platform APIs and automation, including Kubernetes controllers and operators where appropriate, to ensure the platform is safe, scalable, and easy to adopt across teams.


* Drive measurable improvements in GPU availability and utilization through reliability engineering, fleet readiness patterns, and accelerated capacity onboarding.


* Define and implement governance-based scheduling and placement strategies, including:
  * Multi-tenant GPU quotas and guardrails
  * Priority, admission control, and reservation patterns
  * Preemption policies
  * Fragmentation reduction and topology-aware placement (GPU type, MIG, and topology awareness)


* Embed enterprise-grade security, risk, and control requirements into platform defaults, including IAM and RBAC controls, secrets management, audit logging, policy enforcement, network segmentation, and controlled change management.


* Drive operational excellence by establishing SLIs and SLOs, managing error budgets, leading incident management practices, forecasting capacity, and delivering end-to-end platform observability across job lifecycles and GPU telemetry.


* Act as the...



