Site Reliability Engineer (AWS, Azure)

Darwin Recruitment

9 months ago

NB: For this role we are considering engineers who are already located in the Netherlands.

Are you excited about developing software solutions that will enhance system reliability, scalability, and performance well into the future? Do you prefer to automate operational tasks rather than handle them manually, so that systems keep running smoothly? Are you skilled in leading technical recoveries and continuously improving how incidents are managed?

This large enterprise is in a pivotal growth phase, where optimizing complex distributed systems while maintaining high availability is essential. The most pressing challenge you’ll face is creating robust automation frameworks that ensure systems run smoothly under all conditions, even during periods of heavy usage.

I’m looking for someone who:

Has hands-on experience with public cloud platforms, specifically with AWS and Azure.
Knows containers very well (Docker, Kubernetes) and has hands-on experience managing complex, high-traffic clusters.
Is a pro at infrastructure-as-code, ideally with Terraform.
Is well versed in keeping mission-critical production systems online for large enterprises.
Understands storage architectures, databases, caching mechanisms, networking, and queuing systems.
Has a track record of successfully managing technical recoveries and diagnosing issues in distributed systems, including triage, postmortems, and more.
Is adept at troubleshooting, optimizing code, and automating day-to-day operational tasks.
Has strong experience with Linux or Windows administration.
Is familiar with monitoring and observability tools, such as Prometheus, Grafana, Kibana, or Elasticsearch.
Has a solid grasp of Service Level Agreements (SLAs) and Service Level Objectives (SLOs).
Is proficient in at least one relevant programming language (e.g., Python, Go, Bash).
Communicates effectively in English, both in writing and speaking.

Who is interested in:

Designing and implementing software that improves system stability, scalability, and availability.
Building reusable automation patterns for use across different teams.
Taking responsibility for key services and systems, ensuring they’re well-automated.
Responding quickly to outages, leading incident calls, and minimizing downtime.
Continuously refining incident management and leading technical recoveries for major incidents.
Mentoring junior engineers in automation and best operational practices.
Helping the team grow by participating in recruitment and onboarding processes.

For a company that:

Can offer you a base salary of up to €90,000 per year (including 8% holiday allowance), based on your experience.
Can also offer 30 vacation days, a commuting/lease allowance, a great pension scheme, discounts on several products, and more.
Focuses on achieving operational excellence through automation and scalability.
Encourages engineers to take ownership of their work and continuously improve systems.
Supports mentorship and personal growth for engineers at all levels.
Emphasizes long-term system reliability with proactive monitoring and observability.

We understand this role requires a broad range of skills, so we encourage you to apply even if you meet only 80% of the requirements.

Apply now! Prefer a chat first? Call Nezar Lourens at 020 305 8545 or e-mail nezar.lourens@darwinrecruitment.com to learn more about this exciting opportunity.

Keywords: AWS, Azure, Docker, Kubernetes, Terraform, CloudFormation, distributed systems, Linux, Windows, Prometheus, Grafana, Kibana, Elasticsearch, Python, Go, Bash, Site Reliability Engineer

Darwin Recruitment is acting as an Employment Agency in relation to this vacancy.

Nezar Lourens