DTG GLOBAL

Site Reliability Engineer

London
June 19, 2025
Application deadline closed.
£75,000 - £95,000 / year

Job Description

A highly respected software business in London is looking for an experienced Site Reliability Engineer (SRE) to join their platform team. This is an opportunity to play a key role in scaling and improving the reliability of critical systems used by thousands of businesses every day.

About the Company

This company builds and operates software platforms that are essential to businesses across finance, logistics, and e-commerce. Their systems handle high volumes of data and transactions in real time, with reliability, scalability, and performance being core to their success.

As the company continues to expand its platform capabilities, they are investing heavily in SRE practices to automate operations, enhance observability, and ensure smooth service delivery.

The Role

The SRE will focus on improving the performance, reliability, and resilience of production systems. This is a hands-on role that blends software engineering with systems operations, working closely with both developers and infrastructure teams.

Key responsibilities include building automated monitoring, alerting, and incident response systems, driving infrastructure as code, and enhancing CI/CD pipelines to ensure seamless deployments. There’s also a strong emphasis on performance tuning, capacity planning, and proactive fault detection.

Key Responsibilities

  • Build and improve observability tools (monitoring, logging, tracing).
  • Automate operational processes through code (IaC, self-healing systems).
  • Lead post-incident reviews and drive systemic improvements.
  • Collaborate with engineering teams to design for reliability from the start.
  • Improve deployment pipelines and reduce time to production.
  • Monitor performance and capacity across critical systems.
  • Champion best practices around SRE, DevOps, and platform engineering.

Ideal Candidate

  • Strong experience in a Site Reliability Engineer or Platform Engineer role.
  • Proficient with cloud platforms (AWS, GCP, or Azure).
  • Expertise in Kubernetes and container orchestration.
  • Solid coding/scripting skills in Golang, Python, or Bash.
  • Strong knowledge of monitoring & observability tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.).
  • Hands-on experience with Infrastructure as Code (Terraform preferred).
  • Familiar with modern CI/CD processes and tooling (GitHub Actions, ArgoCD, Jenkins, etc.).
  • Experience with incident management and running blameless post-mortems.
  • Strong understanding of distributed systems and performance optimisation.

What They Offer

  • Salary: £75,000 – £95,000 + Bonus + Benefits.
  • Hybrid working – 2-3 days per week in the London office.
  • Opportunity to shape SRE practices in a scaling environment.
  • Work with modern tooling and infrastructure.
  • Structured career development and personal growth budget.
  • A collaborative, engineering-led culture.