· Ensure availability of UAT and production applications and support capacity
planning for production infrastructure. Monitor existing systems and
applications using monitoring tools
· Engage in and improve the whole lifecycle of services, from inception and
design through deployment and operations
· Troubleshoot problems that span systems, databases, storage, network, and
code, while suggesting and implementing security measures to protect systems,
networks, and information
· Scale systems sustainably through mechanisms like automation and evolve
systems by pushing for changes that improve reliability and velocity
· Minimize and mitigate the risk of reliability-related failures pertaining to
system availability, performance, and correctness. Investigate warnings and
alerts from monitoring systems, and handle incident response, diagnosis, and
follow-up on system outages
· Write and maintain process and procedure manuals
· Minimum of 3 years’ experience in a similar role
· Working knowledge of databases and SQL
· Comfortable with open-source configuration management and orchestration
tools (Chef, Puppet, Ansible, Terraform, etc.)
· Knowledge of Docker, Docker Swarm, Fargate, and Kubernetes
· Experience with caching and messaging systems such as Redis and Kafka
· Working experience with building monitoring tools and
setting measurement metrics
· Proficiency with shell scripting and a programming language used in an
SRE/operations engineering context (Python, Go, Ruby, etc.) will be an added
advantage
· Experience operating in a high-availability environment
· Excellent communication skills with a high level of emotional intelligence
· Experience working with remote teams
· Server administration skills (Red Hat, Windows, CentOS, Ubuntu)
You’ll receive competitive compensation and work with amazing people. You’ll
work in a beautiful environment with a flat structure and solve complex,
real-world challenges.