Your mission
We’re not firefighters: everyone on the team owns decisions, drives improvements, and gets plenty of room to make things better. We work proactively with developers, data engineers, and other stakeholders, not just to put out fires but to build better, more resilient infrastructure as we go.
In this role you will:
- Work with our core tech stack: AWS (multi-account, multi-region), Terraform, EKS (Kubernetes), complex GitLab CI pipelines, Helm, RDS, S3, RabbitMQ, Lambda, Python, and more, plus observability tools like Prometheus, Loki, Datadog, and OpenTelemetry
- Troubleshoot complex issues across distributed systems and apply SRE principles to drive root cause analysis, long-term fixes, and platform-wide reliability improvements
- Design and implement robust backup and disaster recovery strategies for both stateless and stateful services
- Collaborate with engineers, stakeholders, and DevOps teammates to design, evolve, and maintain a scalable and secure cloud platform
- Continuously improve our tooling, automation, and operational workflows to reduce friction, enhance developer experience, and enable faster, safer shipping
- Stay current with the evolving DevOps and cloud-native ecosystem, not just to grow your own skill set but to help elevate the team’s knowledge, challenge assumptions, and introduce better ways of thinking and working
Beyond your operational responsibilities, you will also embody and champion our core values every day:
Elevate Together – Collaboration is our strength. We support each other, embrace diversity, and grow as a team.
Xray Transparency – Open communication builds trust and strong relationships. We value honesty, inclusion, and accountability.
Master Feedback – We challenge and support each other through honest, caring feedback.
Own the Outcome – We take ownership, move fast, and turn ideas into results.
Xcute Smart – Keep it simple, act pragmatically, and focus on what truly matters.