Role Description
Givebutter is hiring a New York City-based Site Reliability Team Lead to oversee the reliability, scalability, and performance of our systems. As a Lead SRE, you will be directly responsible for delivering world-class infrastructure to our users, maturing our operational practices, and leading a team of skilled engineers. You will report directly to our CTO and carry out our infrastructure vision while creating a scalable engineering culture that breeds innovation. You will ensure that we deliver excellent user experiences in a timely manner while maintaining top-notch security, design, and performance. You will cultivate a culture of high performance by creating systems that eliminate roadblocks, establishing processes that incentivize excellence, and being an expert in site reliability engineering. We have already built a great foundation, powering hundreds of millions of donations to 10k+ organizations, and you will take this impact much further.
Why join the Givebutter Engineering team?
Democracy of code - We are a group of engineers that values equal contribution and open discussion of architecture and ideas.
Not overburdened with meetings - Our engineers manage their own calendars and block off time so they can work uninterrupted.
Automated CI/CD - Our builds are reproducible and the pipeline is easy to manage. Shipping to production is hands-off, automated, and consistent. Our engineers are focused on solving problems with code.
Mission-driven, full stop - We work with amazing organizations, non-profits, and charities doing good all over the world.
Responsibilities
- Manage and hire in-house SREs and contractor resources.
- Handle and prioritize incidents, ensuring timely resolution and effective communication.
- Establish and manage key metrics for reliability; set up and maintain alerting systems.
- Automate tasks and manage infrastructure using Infrastructure as Code (IaC) tools and techniques.
- Ensure application scalability and identify performance bottlenecks to optimize system performance.
- Design and implement fault-tolerant and highly available systems to minimize downtime.
- Develop, implement, and regularly test disaster recovery plans to ensure business continuity.
- Conduct capacity planning to anticipate and manage future infrastructure needs.
- Define, measure, and maintain SLOs and SLAs to meet service performance expectations.
- Ensure the security of applications through best practices and conduct regular penetration tests to identify and mitigate vulnerabilities.
Requirements
- 5+ years of experience building and deploying production infrastructure at scale
- 5+ years of experience working with AWS
- Knowledge of PHP
- Aware of trends and best practices in SRE and cloud infrastructure
- 2+ years of experience managing system architecture, ensuring best practices for reliability, performance, and security
- Strong technical leadership, mentorship, and communication skills
- Experience working for a product-led growth company is beneficial
- Experience managing a remote engineering team
$170,000 - $190,000 a year
This is a remote position; however, candidates should be located in New York for occasional in-person meetings with the CTO and the Product and Engineering Team Leads.