Site Reliability Engineer

SeamlessHR
Contract
On-site
Lagos, Nigeria
Installation & Maintenance

Responsibilities

 

  • Observability & Monitoring
    • Enhance and expand our existing observability stack (Grafana, Prometheus, Tempo, Loki).
    • Implement robust alerting mechanisms for logs, traces, and metrics to improve incident detection and response.
    • Establish Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
    • Automate dashboards and monitoring configurations for new services and infrastructure.
  • Reliability & Resilience
    • Drive reliability improvements across products by identifying weak points and implementing fault-tolerant designs.
    • Run resilience reviews and chaos testing to ensure systems can withstand failures.
    • Partner with engineering teams to design for scalability, high availability, and disaster recovery.
  • Incident Response & Postmortems
    • Establish and refine incident management processes (on-call rotations, escalation policies, playbooks).
    • Lead blameless postmortems, turning incidents into learning opportunities and systemic improvements.
  • Automation & Tooling
    • Develop automation around monitoring, logging, and alerting configurations.
    • Implement self-service tools for developers to easily onboard new services into observability pipelines.
    • Optimize costs of observability tools while maintaining coverage and depth.
  • Collaboration & Enablement
    • Work closely with software engineers, QA, DevOps, and product teams to embed observability into the development lifecycle.
    • Mentor and guide teams on best practices for monitoring, instrumentation, and performance analysis.
    • Foster a culture of proactive monitoring and continuous improvement.

Requirements


 

  • Proven experience as an SRE, DevOps Engineer, or in a similar role with a strong focus on observability and reliability.
  • Hands-on experience with Grafana, Prometheus, Tempo, Loki, and OpenTelemetry (or similar observability stacks).
  • Strong background in Linux systems, networking, and cloud platforms (AWS preferred).
  • Proficiency in infrastructure-as-code tools (Terraform, CloudFormation, or similar).
  • Solid programming/scripting skills (e.g., Python, Go, Bash).
  • Experience setting up alerting and incident response workflows.
  • Knowledge of CI/CD pipelines and modern software delivery practices.
  • Strong analytical, troubleshooting, and problem-solving skills.
  • Excellent communication and collaboration skills.
  • Experience with chaos engineering and resilience testing.
  • Knowledge of distributed systems design and scaling.
  • Familiarity with cost optimization strategies for observability and monitoring tools.
  • Exposure to security monitoring and compliance requirements.

