Reliability Engineer
- R&D
- Tel-Aviv, Israel
- Full-time
Description
monday.com is looking for a Reliability Engineer to join our Reliability team. This role is integral to ensuring the robustness and dependability of our platform, which serves millions of users globally.
About The Role
- Maintain a comprehensive understanding of our service architecture and its dependencies.
- Identify and mitigate risks associated with tightly coupled services and complex interconnections.
- Lead service re-architecture initiatives to improve reliability and scalability.
- Review new services and ensure they meet our reliability standards.
- Advocate for Chaos Engineering: collaborate with R&D teams, build tools and environments, and improve system resilience.
- Manage the full lifecycle of reliability tools and services, adhering to our architectural guidelines.
- Collaborate with teams to define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) that align with business goals and user expectations.
- Our Stack: Kubernetes, Datadog, Chaos Mesh, AWS, Terraform, CDKTF
Requirements
- Proven experience with Kubernetes and Linux administration and internals.
- Proven experience with microservice architectures and reliability engineering.
- Deep understanding of reliability concepts (e.g., SLOs, SLIs, and service interconnections).
- Strong background in incident response and resilience efforts.
- Ability to collaborate across teams to drive reliability improvements.
- (Nice-to-have): Prior experience with chaos engineering.
Our Team
The R&D Team is passionate about building innovative and lovable products while tackling complex engineering problems at scale. We’re accountable for bringing the company’s vision to life by turning our plans into flawless execution and encouraging full ownership and independence in all projects. The Infra role is crucial as our company scales and our user base grows, covering all aspects of product and infrastructure challenges. We focus on development-flow productivity, application infrastructure, and production resilience. We face huge challenges related to the hyper-growth of our engineering organization, application, and data scale.