Chan Zuckerberg Biohub - San Francisco

San Francisco, California, United States

Posted on: 16 November 2023

AI/ML HPC Principal Engineer

The Opportunity

The Chan Zuckerberg Biohub Network has an immediate opening for an AI/ML High Performance Computing (HPC) Principal Engineer. The CZ Biohub Network is composed of several new institutes that the Chan Zuckerberg Initiative created to do great science that cannot be done in conventional environments. The CZ Biohub Network brings together researchers from across disciplines to pursue audacious, important scientific challenges. The Network consists of four institutes throughout the country: San Francisco, Silicon Valley, Chicago, and New York City. Each institute collaborates closely with the major universities in its local area. Along with the world-class engineering team at the Chan Zuckerberg Initiative, the CZ Biohub supports several hundred of the brightest, boldest engineers, data scientists, and biomedical researchers in the country, with the mission of understanding the mysteries of the cell and how cells interact within systems.

The Biohub is expanding its global scientific leadership, particularly in the area of AI/ML, with the acquisition of the largest GPU cluster dedicated to AI for biology. The AI/ML HPC Principal Engineer will be tasked with helping to realize the full potential of this capability, in addition to providing advanced computing capabilities and consulting support to science and technical programs. This position will work closely with many different science teams simultaneously to translate experimental descriptions into software and hardware requirements across all phases of the scientific lifecycle, including data ingest, analysis, management and storage, computation, authentication, tool development, and many other computing needs expressed by scientific projects.

This position reports to the Director for Scientific Computing and will be hired at a level commensurate with the skills, knowledge, and abilities of the successful candidate.

What You'll Do

Work with a wide community of scientific disciplinary experts to identify emerging and essential information technology needs and translate those needs into information technology requirements
Build an on-prem HPC infrastructure supplemented with cloud computing to support the expanding IT needs of the Biohub
Support the efficiency and effectiveness of capabilities for data ingest, data analysis, data management, data storage, computation, identity management, and many other IT needs expressed by scientific projects
Plan, organize, track and execute projects
Foster cross-domain community and knowledge-sharing between science teams with similar IT challenges
Research, evaluate and implement new technologies on a wide range of scientific compute, storage, networking, and data analytics capabilities
Promote the use of cloud compute services (primarily AWS and GCP), containerization tools, and related technologies to scientific clients and research groups, and assist researchers in using them
Work on problems of diverse scope where analysis of data requires evaluation of identifiable factors
Assist in cost & schedule estimation for the IT needs of scientists, as part of supporting architecture development and scientific program execution
Support Machine Learning capability growth at the CZ Biohub
Provide scientist support in deployment and maintenance of developed tools
Plan and execute all of the above responsibilities independently with minimal intervention

What You'll Bring

Essential -

Bachelor’s Degree in Biology or Life Sciences is preferred. Degrees in Computer Science, Mathematics, Systems Engineering, or a related field, or equivalent training/experience, are also acceptable.
A minimum of 8 years of experience designing and building web-based working projects using modern languages, tools, and frameworks
Experience building on-prem HPC infrastructure and capacity planning
Experience and expertise working on complex issues where analysis of situations or data requires an in-depth evaluation of variable factors
Experience supporting scientific facilities, and prior knowledge of scientific user needs, program management, data management planning or lab-bench IT needs
Experience with HPC and cloud computing environments
Ability to interact with a variety of technical and scientific personnel with varied academic backgrounds
Strong written and verbal communication skills to present and disseminate scientific software developments at group meetings
Demonstrated ability to reason clearly about load, latency, bandwidth, performance, reliability, and cost and make sound engineering decisions balancing them
Demonstrated ability to quickly and creatively implement novel solutions and ideas

Technical experience includes -

Proven ability to analyze, troubleshoot, and resolve complex problems that arise in HPC production compute, interconnect, storage hardware, software systems, and storage subsystems
Configuring and administering parallel and network-attached storage (Lustre, GPFS on ESS, NFS, Ceph) and storage subsystems (e.g. IBM, NetApp, DataDirect Networks, LSI, VAST)
Installing, configuring, and maintaining job management tools (such as SLURM, Moab, TORQUE, PBS) and implementing fairshare, node sharing, backfill, etc. for compute and GPUs
Red Hat Enterprise Linux, CentOS, or derivatives and Linux services and technologies like dnsmasq, systemd, LDAP, PAM, sssd, OpenSSH, cgroups
Scripting languages (including Bash, Python, or Perl)
OpenACC, nvhpc, and an understanding of CUDA driver compatibility issues
Virtualization (ESXi or KVM/libvirt), containerization (Docker or Singularity), configuration management and automation (tools like xCAT, Puppet, kickstart), and orchestration (Kubernetes, docker-compose, CloudFormation, Terraform)
High performance networking technologies (Ethernet and InfiniBand) and hardware (Mellanox and Juniper)
Configuring, installing, tuning and maintaining scientific application software (Modules, SPACK)
Familiarity with source control tools (Git or SVN)
Experience supporting the use of popular ML frameworks such as PyTorch and TensorFlow
Familiarity with cybersecurity tools, methodologies, and best practices for protecting systems used for science
Experience with movement, storage, backup, and archiving of large-scale data

Nice to have -

An advanced degree is strongly desired

The Chan Zuckerberg Biohub requires all employees, contractors, and interns, regardless of work location or type of role, to provide proof of full COVID-19 vaccination, including a booster vaccine dose, if eligible, by their start date. Those who are unable to get vaccinated or obtain a booster dose because of a disability, or who choose not to be vaccinated due to a sincerely held religious belief, practice, or observance must have an approved exception prior to their start date.

Compensation

$212,000 - $291,500

New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. To determine starting pay, we consider multiple job-related factors including a candidate’s skills, education and experience, market demand, business needs, and internal parity. We may also adjust this range in the future based on market data. Your recruiter can share more about the specific pay range during the hiring process.
