Site Reliability Engineer Job at Berkley Hunt, San Jose, CA

  • Berkley Hunt
  • San Jose, CA

Job Description

Senior Site Reliability Engineer (GPU Compute) | Hybrid – Bay Area, CA

Berkley Hunt is supporting a fast-growing AI startup building a high-performance, cloud-native platform to power cutting-edge machine learning workloads. As they scale, they’re hiring a Senior/Staff Infrastructure Engineer to lead the development of a scalable GPU compute environment from the ground up.

About the Role:

This is a high-impact role for an experienced infrastructure engineer who thrives in fast-paced environments and wants to shape the future of AI infrastructure. You’ll design, build, and operate the systems that enable high-throughput GPU workloads at scale—collaborating closely with the core engineering team to optimize performance, efficiency, and reliability.

If you're excited about solving deep technical challenges in distributed compute and cloud automation, this could be a standout opportunity.

Responsibilities:

  • Build and maintain a large-scale, distributed GPU compute platform powering AI workloads.
  • Develop backend systems in Python to orchestrate GPU jobs and manage routing, observability, and capacity.
  • Design and implement infrastructure with tools like Terraform, Ansible, and Kubernetes across cloud and bare metal environments.
  • Own the reliability, scalability, and performance of the platform, from provisioning to deployment and monitoring.
  • Collaborate with the engineering team to shape infrastructure vision and technical strategy over the next 1–5 years.
  • Drive automation and improvements to minimize operational overhead and scale efficiently.

Requirements:

  • 6+ years of experience in cloud infrastructure or backend engineering roles.
  • Deep knowledge of distributed compute systems, especially involving GPU orchestration.
  • Proficiency with Python and infrastructure-as-code tools (e.g., Terraform, Ansible).
  • Solid experience with Kubernetes and CI/CD pipelines.
  • Strong understanding of cloud platforms (AWS, GCP, or Azure); bare metal experience is a plus.
  • Excellent problem-solving skills and a proactive, ownership-driven mindset.

Nice to Have:

  • Experience at a high-growth startup or in scaling large infrastructure systems.
  • Familiarity with GPU resource scheduling and performance optimization.
  • Hands-on experience with observability stacks (Prometheus, Grafana, Loki, Thanos).
  • A passion for automation, infrastructure design, and moving fast without breaking things.
