Site Reliability Engineer (SRE)
Role Overview: We are seeking a Site Reliability Engineer to own the "Production Readiness" of our cloud-based AI solutions. This hybrid role combines automated software testing and Site Reliability Engineering (SRE). You will build the automated frameworks that validate our AI outputs and ensure the underlying Azure/AWS infrastructure is resilient, performant, and compliant with banking standards.
Key Responsibilities
- Resiliency Engineering (SRE): Implement "Chaos Engineering" and load testing to ensure web/mobile backends can handle banking-scale traffic. Maintain high availability through automated recovery scripts.
- Automated Regression: Build CI/CD-integrated test suites using Python that validate both the application logic and the infrastructure state (IaC validation).
- Observability & SLIs: Define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Set up advanced alerting in Azure Monitor or AWS CloudWatch to catch performance degradation before users do.
- Security & Compliance Testing: Automate security scans and compliance checks to ensure all AI data handling meets strict banking data residency and privacy protocols.
- Automation Stack: High proficiency in Python (for AI testing) and test automation frameworks (PyTest, Selenium, or Robot Framework).
- Cloud Infrastructure: Strong hands-on experience with Azure or AWS, specifically regarding networking, scaling, and serverless reliability.
- AI/ML Understanding: Understanding of Prompt Engineering and how to evaluate AI model outputs (RAG evaluation, ROUGE/BLEU scores, or custom LLM benchmarks).
- Monitoring Tools: Experience with Grafana, Prometheus, or native cloud monitoring tools to build real-time reliability dashboards.
- FinOps Awareness: Ability to identify "expensive" failing tests or inefficient cloud resource usage during the testing phase.
- Languages: Python (Mandatory), Bash scripting.
- Tools: GitHub Actions (CI/CD), Terraform (reading/validating), k6 or JMeter (Performance).
- AI Frameworks: DeepEval, Ragas, or LangSmith (for automated AI evaluation).
Key Skills
Ranked by relevance
ai
python
cloud
cicd
aws
high availability
serverless
prometheus
terraform
selenium
grafana
bash
Related Jobs
3 roles aligned with this opportunity
Senior Engineer – Network Operations
2026-05-24
Full-time
Mid-Senior
United Arab Emirates
IT Services
Information Technology
System Engineer/Site Reliability Engineer (m/w/d)
2026-06-09
Full-time
Not Applicable
Germany
IT Services
Engineering
IT-Delivery-Project Manager
2026-05-24
Contract
Not Applicable
United Arab Emirates
IT Services
Information Technology
- Posted: May 16, 2026
- Type: Contract
- Level: Not Applicable
- Location: Abu Dhabi
- Company: Dicetek LLC
Industries
IT Services
IT Consulting
Categories
Engineering
Information Technology