Job Skills City - Job Board
Site Reliability Engineer
SeamlessHR
Contract
On-site
Lagos, Nigeria
Installation & Maintenance
Observability & Monitoring
Enhance and expand our existing observability stack (Grafana, Prometheus, Tempo, Loki).
Implement robust alerting mechanisms for logs, traces, and metrics to improve incident detection and response.
Establish Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
Automate dashboards and monitoring configurations for new services and infrastructure.
Reliability & Resilience
Drive reliability improvements across products by identifying weak points and implementing fault-tolerant designs.
Run resiliency reviews and chaos testing to ensure systems can withstand failures.
Partner with engineering teams to design for scalability, high availability, and disaster recovery.
Incident Response & Postmortems
Establish and refine incident management processes (on-call rotations, escalation policies, playbooks).
Lead blameless postmortems, turning incidents into learning opportunities and systemic improvements.
Automation & Tooling
Develop automation around monitoring, logging, and alerting configurations.
Implement self-service tools for developers to easily onboard new services into observability pipelines.
Optimize costs of observability tools while maintaining coverage and depth.
Collaboration & Enablement
Work closely with software engineers, QA, DevOps, and product teams to embed observability into the development lifecycle.
Mentor and guide teams on best practices for monitoring, instrumentation, and performance analysis.
Foster a culture of proactive monitoring and continuous improvement.
Proven experience as an SRE, DevOps Engineer, or in a similar role with a strong focus on observability and reliability.
Hands-on experience with Grafana, Prometheus, Tempo, Loki, and OpenTelemetry (or similar observability stacks).
Strong background in Linux systems, networking, and cloud platforms (AWS preferred).
Proficiency in infrastructure-as-code tools (Terraform, CloudFormation, or similar).
Solid programming/scripting skills (e.g., Python, Go, Bash).
Experience setting up alerting and incident response workflows.
Knowledge of CI/CD pipelines and modern software delivery practices.
Strong analytical, troubleshooting, and problem-solving skills.
Excellent communication and collaboration skills.
Experience with chaos engineering and resilience testing.
Knowledge of distributed systems design and scaling.
Familiarity with cost optimization strategies for observability and monitoring tools.
Exposure to security monitoring and compliance requirements.