SRE / Operations Engineer
Observability · incidents · cost and performance · agentic ops
The Role
SRE / Operations Engineers keep shipped systems healthy. You will design the observability, incident response, reliability, performance, security, and cost practices that make Lukla’s delivery model credible after launch.
This is not a ticket-queue role. It is engineering work focused on keeping production boring.
What You Will Do
- Build monitoring, alerting, logging, tracing, and operational dashboards for client systems.
- Define what should page a human and what should become backlog work.
- Improve deployment safety, rollback paths, incident response, and post-incident learning.
- Use agents to accelerate runbook generation, test coverage, log analysis, and operational cleanup.
- Tune performance and infrastructure cost without compromising reliability.
- Partner with delivery engineers to make systems operable before launch, not after failure.
- Help clients understand operational risk in plain language.
What We Are Looking For
- Production experience with cloud infrastructure, CI/CD, observability, incident response, and reliability engineering.
- Strong scripting or software engineering ability.
- Comfort with systems thinking across app code, databases, networks, queues, and third-party services.
- A calm operating style under pressure.
- Good judgment around alert fatigue, security exposure, and operational tradeoffs.
Success Looks Like
Systems fail less often, recover faster, and tell us what is wrong before clients do. Clients feel supported because production ownership is real, not just promised during sales.