Your mission
We’re not firefighters: everyone on the team owns decisions, drives improvements, and has plenty of room to make things better. We work proactively with developers, data engineers, and other stakeholders, not just to put out fires but to build better, more resilient infrastructure as we go.
What you will work on:
- Work with our core tech stack: AWS (multi-account, multi-region), Terraform, EKS (Kubernetes), complex GitLab CI pipelines, Helm, RDS, S3, RabbitMQ, Lambda, Python, and more, plus observability tools like Prometheus, Loki, Datadog, and OpenTelemetry
- Troubleshoot complex issues across distributed systems and apply SRE principles to drive root cause analysis, long-term fixes, and platform-wide reliability improvements
- Design and implement robust backup and disaster recovery strategies for both stateless and stateful services
- Collaborate with engineers, stakeholders, and DevOps teammates to design, evolve, and maintain a scalable and secure cloud platform
- Continuously improve our tooling, automation, and operational workflows to reduce friction, enhance developer experience, and enable faster, safer shipping
- Stay current with the evolving DevOps and cloud-native ecosystem, not just to grow your own skill set but to help elevate the team’s knowledge, challenge assumptions, and introduce better ways of thinking and working