Sparagus

Site Reliability Coordinator

Belgium · Full-time · Mid-Senior

Job Context


As part of an operational restructuring, a Site Reliability Coordination (SRC) team has been established within LEV1 support to enhance reliability, responsiveness, and incident coordination in a multi-provider environment.

The Senior SysOps Engineer plays a key role in the technical analysis of incidents, event correlation, and performance optimization, applying Site Reliability Engineering (SRE) and ITIL practices. Acting as a technical facilitator, they liaise between the support teams (LEV1, LEV2, LEV3), providers, and technical governance.


Key Responsibilities


Technical Incident Supervision and Correlation

  • Perform in-depth analysis of logs, metrics, and alerts across different components (middleware, infrastructure, applications).
  • Ensure proactive monitoring of service performance and availability.
  • Facilitate root cause identification in collaboration with LEV2/LEV3 teams from providers.
  • Correlate incidents across different system layers (e.g., an application issue impacting infrastructure).
  • Escalate incidents to the appropriate teams when necessary.


Multi-Provider Technical Coordination

  • Participate in investigation meetings with technical experts from providers.
  • Ensure that all stakeholders comply with SLAs and contractual commitments.
  • Coordinate technical escalations and track actions clearly.
  • Centralize and document technical exchanges in a structured manner (runbooks, incident reports).


Continuous Improvement and Performance Optimization

  • Contribute to technical postmortems, analyzing root causes and suggesting improvements.
  • Recommend enhancements to monitoring and observability tools used by providers.
  • Track key performance indicators (SLI, SLO, SLA, MTTD, MTTR) to anticipate risks.
  • Stay up to date with SRE/DevOps tools and practices to enhance diagnostic capabilities.


Documentation and Knowledge Sharing

  • Maintain and enhance incident and escalation runbooks.
  • Write technical guides for LEV1 teams to improve initial diagnostics.
  • Support LEV1 team skill development through technical training sessions.
  • Assist in the training and development of junior SysOps engineers within the team.


Participation in ITSM Committees and Governance

  • Attend operational review committees (CAB, Incident Review, Performance Review) as a technical expert.
  • Provide recommendations on critical incident management and change management.
  • Suggest ITIL and SRE process adjustments to improve coordination efficiency.



Required Skills


Technical Skills

  • Systems: Strong expertise in Linux environments.
  • Virtualization & Containers: Experience with IaaS technologies, Kubernetes, Docker, and OpenShift.
  • Middleware & Messaging: Knowledge of solutions such as Kafka, JBoss, Spring Boot, and HAProxy.
  • Observability & Monitoring: Proficiency in Prometheus, Grafana, Loki.
  • Databases: Experience in troubleshooting Oracle and PostgreSQL.
  • Automation & Scripting: Strong knowledge of Bash, Python, Ansible, and Terraform for operations analysis and optimization.
  • SRE Methodology: Solid understanding of SLIs, SLOs, postmortems, and advanced monitoring.
  • ITIL v4: Good understanding of Incident, Problem, and Change Management processes.


Organizational & Interpersonal Skills

  • Analytical and synthesis skills to correlate technical incidents and anticipate risks.
  • Collaborative mindset to facilitate communication between technical teams and providers.
  • Strong verbal and written communication skills, particularly for documenting and explaining incidents.
  • Autonomy and proactivity in incident management and continuous improvement.
  • Stress resistance and the ability to handle critical incidents and prioritize effectively.



Experience & Qualifications

  • ITIL v3/v4 certification is a plus.
  • Certifications such as Kubernetes (CKA, CKAD), AWS/GCP/Azure, or Red Hat are a plus.
  • 5 to 10 years of experience in a similar role (SysOps, SRE, Operations Engineer, Incident Manager, Observability Engineer).
  • Experience in critical environments (high availability, high volume, SLA constraints).


Languages

  • Professional English (for interactions with some providers).
  • Bilingual work environment: French/Dutch.


Conditions & Work Modalities

  • Full-time position, with the possibility of on-call duties (24/7 support).
  • Remote work policy to be defined based on team needs, with potential on-site presence for team coordination.

Posted: Feb 11, 2025

Industries

Information Services Technology Information Media

Categories

Information Technology
