Job Context
As part of an operational restructuring, a Site Reliability Coordination (SRC) team has been established within LEV1 support to enhance reliability, responsiveness, and incident coordination in a multi-provider environment.
The Senior SysOps Engineer plays a key role in the technical analysis of incidents, event correlation, and performance optimization by leveraging Site Reliability Engineering (SRE) and ITIL practices. Acting as a technical facilitator, they liaise between support teams (LEV1, LEV2, LEV3), providers, and technical governance.
Key Responsibilities
Technical Incident Supervision and Correlation
- Perform in-depth analysis of logs, metrics, and alerts across different components (middleware, infrastructure, applications).
- Ensure proactive monitoring of service performance and availability.
- Facilitate root cause identification in collaboration with the providers' LEV2/LEV3 teams.
- Correlate incidents across different system layers (e.g., an application issue impacting infrastructure).
- Escalate incidents to the appropriate teams when necessary.
Multi-Provider Technical Coordination
- Participate in investigation meetings with technical experts from providers.
- Ensure that all stakeholders comply with SLAs and contractual commitments.
- Coordinate technical escalations and track actions clearly.
- Centralize and document technical exchanges in a structured manner (runbooks, incident reports).
Continuous Improvement and Performance Optimization
- Contribute to technical postmortems, analyzing root causes and suggesting improvements.
- Recommend enhancements to monitoring and observability tools used by providers.
- Track key performance indicators (SLI, SLO, SLA, MTTD, MTTR) to anticipate risks.
- Stay up to date with SRE/DevOps tools and practices to enhance diagnostic capabilities.
Documentation and Knowledge Sharing
- Maintain and enhance incident and escalation runbooks.
- Write technical guides for LEV1 teams to improve initial diagnostics.
- Support LEV1 team skill development through technical training sessions.
- Assist in the training and development of junior SysOps engineers within the team.
Participation in ITSM Committees and Governance
- Attend operational review committees (CAB, Incident Review, Performance Review) as a technical expert.
- Provide recommendations on critical incident management and change management.
- Suggest ITIL and SRE process adjustments to improve coordination efficiency.
Required Skills
Technical Skills
- Systems: Strong expertise in Linux environments.
- Virtualization & Containers: Experience with IaaS technologies, Kubernetes, Docker, OpenShift.
- Middleware & Messaging: Knowledge of solutions such as Kafka, JBoss, Spring Boot, and HAProxy.
- Observability & Monitoring: Proficiency in Prometheus, Grafana, Loki.
- Databases: Experience in troubleshooting Oracle and PostgreSQL.
- Automation & Scripting: Strong knowledge of Bash, Python, Ansible, and Terraform for operations analysis and optimization.
- SRE Methodology: Solid understanding of SLIs, SLOs, postmortems, and advanced monitoring.
- ITIL v4: Good understanding of Incident, Problem, and Change Management processes.
Organizational & Interpersonal Skills
- Analytical and synthesis skills to correlate technical incidents and anticipate risks.
- Collaborative mindset to facilitate communication between technical teams and providers.
- Strong verbal and written communication skills, particularly for documenting and explaining incidents.
- Autonomy and proactivity in incident management and continuous improvement.
- Resilience under pressure, with the ability to handle critical incidents and prioritize effectively.
Experience & Qualifications
- ITIL v3/v4 certification is a plus.
- Certifications such as Kubernetes (CKA, CKAD), AWS/GCP/Azure, or Red Hat are a plus.
- 5 to 10 years of experience in a similar role (SysOps, SRE, Operations Engineer, Incident Manager, Observability Engineer).
- Experience in critical environments (high availability, high volume, SLA constraints).
Languages
- Professional English (for interactions with some providers).
- Bilingual work environment: French/Dutch.
Conditions & Work Modalities
- Full-time position, with the possibility of on-call duties (24/7 support).
- Remote work policy to be defined based on team needs, with potential on-site presence for team coordination.
- Posted: Feb 11, 2025
- Type: Full-time
- Level: Mid-Senior
- Location: Belgium
- Company: Sparagus