Site Reliability Coordinator

Sparagus

Belgium · Full-time · Mid-Senior

Job Context

As part of an operational restructuring, a Site Reliability Coordination (SRC) team has been established within LEV1 support to enhance reliability, responsiveness, and incident coordination in a multi-provider environment.

The Senior SysOps Engineer plays a key role in the technical analysis of incidents, event correlation, and performance optimization, applying Site Reliability Engineering (SRE) and ITIL practices. Acting as a technical facilitator, they liaise between support teams (LEV1, LEV2, LEV3), providers, and technical governance.

Key Responsibilities

Technical Incident Supervision and Correlation

Perform in-depth analysis of logs, metrics, and alerts across different components (middleware, infrastructure, applications).
Ensure proactive monitoring of service performance and availability.
Facilitate root cause identification in collaboration with LEV2/LEV3 teams from providers.
Correlate incidents across different system layers (e.g., an application issue impacting infrastructure).
Escalate incidents to the appropriate teams when necessary.

Multi-Provider Technical Coordination

Participate in investigation meetings with technical experts from providers.
Ensure that all stakeholders comply with SLAs and contractual commitments.
Coordinate technical escalations and track actions clearly.
Centralize and document technical exchanges in a structured manner (runbooks, incident reports).

Continuous Improvement and Performance Optimization

Contribute to technical postmortems, analyzing root causes and suggesting improvements.
Recommend enhancements to monitoring and observability tools used by providers.
Track key performance indicators (SLI, SLO, SLA, MTTD, MTTR) to anticipate risks.
Stay up to date with SRE/DevOps tools and practices to enhance diagnostic capabilities.

Documentation and Knowledge Sharing

Maintain and enhance incident and escalation runbooks.
Write technical guides for LEV1 teams to improve initial diagnostics.
Support LEV1 team skill development through technical training sessions.
Assist in the training and development of junior SysOps engineers within the team.

Participation in ITSM Committees and Governance

Attend operational review committees (CAB, Incident Review, Performance Review) as a technical expert.
Provide recommendations on critical incident management and change management.
Suggest ITIL and SRE process adjustments to improve coordination efficiency.

Required Skills

Technical Skills

Systems: Strong expertise in Linux environments.
Virtualization & Containers: Experience with IaaS technologies, Kubernetes, Docker, OpenShift.
Middleware & Messaging: Knowledge of solutions such as Kafka, JBoss, Spring Boot, and HAProxy.
Observability & Monitoring: Proficiency in Prometheus, Grafana, Loki.
Databases: Experience in troubleshooting Oracle and PostgreSQL.
Automation & Scripting: Strong knowledge of Bash, Python, Ansible, Terraform for operations analysis and optimization.
SRE Methodology: Solid understanding of SLI, SLO, postmortems, advanced monitoring.
ITIL v4: Good understanding of Incident, Problem, and Change Management processes.

Organizational & Interpersonal Skills

Analytical and synthesis skills to correlate technical incidents and anticipate risks.
Collaborative mindset to facilitate communication between technical teams and providers.
Strong verbal and written communication skills, particularly for documenting and explaining incidents.
Autonomy and proactivity in incident management and continuous improvement.
Resilience under stress, with the ability to handle critical incidents and prioritize effectively.

Experience & Qualifications

ITIL v3/v4 certification is a plus.
Certifications such as Kubernetes (CKA, CKAD), AWS/GCP/Azure, or Red Hat are a plus.
5 to 10 years of experience in a similar role (SysOps, SRE, Operations Engineer, Incident Manager, Observability Engineer).
Experience in critical environments (high availability, high volume, SLA constraints).

Languages

Professional English (for interactions with some providers).
Bilingual work environment: French/Dutch.

Conditions & Work Modalities

Full-time position, with the possibility of on-call duties (24/7 support).
Remote work policy to be defined based on team needs, with potential on-site presence for team coordination.

Key Skills

SLA, high availability, Kubernetes, Ansible, Grafana, Python, Docker, Oracle


Posted: Feb 11, 2025
Type: Full-time
Level: Mid-Senior
Location: Belgium
Company: Sparagus

Industries

Information Services; Technology, Information and Media
