Summary
Join Experian's growing Site Reliability Engineering team as a Staff Engineer and contribute to the global uptime of Experian One, our Cloud SaaS offering. You will be responsible for monitoring, incident response, and improving system reliability. This role requires extensive experience in supporting complex systems, Linux, networking, cloud-native applications, and incident management. Proficiency in various tools and technologies, including Kubernetes and several programming languages, is essential. The position is permanent, home-based in Costa Rica, and offers a comprehensive benefits package including medical, life, and dental insurance, paid time off, and more.
Requirements
- 5+ years of direct experience supporting complex, scaled systems in production
- Linux knowledge, with experience troubleshooting issues and anticipating them in advance
- Networking, troubleshooting and monitoring
- Cloud-native application design for performance, scalability, and resilience
- Incident management and coordination, including blameless post-incident reviews (PIRs)
- Proficiency in one programming or scripting language and willingness to apply software development best practices to an operational role
- Knowledge of Kubernetes, Infrastructure as Code, High availability principles
- Experience with Kubernetes, Splunk, Dynatrace, ThousandEyes, ServiceNow, Jira, Jenkins, Python, and Prometheus
- Experience with Java, Cassandra, Redis, Rundeck, MongoDB, Apigee, Okta, PostgreSQL, and AWS
- Experience with Infrastructure as Code and GitOps
- Line management or mentoring experience
- Written and verbal fluency in English is required
Responsibilities
- Ensure uptime of Experian One, Experian's Cloud SaaS offering for Decision Analytics
- Monitor our platform and provide alerts
- Respond to incidents and restore service
- Gain sufficient understanding of the systems to assess issues and find owners for problem resolution
- Identify issues and manual processes and ensure that they do not recur
- Manage incidents: coordinate others, and be coordinated, during service disruptions with a focus on restoring availability
- Write complex queries using multiple tools
- Review system designs and implementations to identify resiliency, scalability, and monitoring issues before implementation
- Model good behaviors and provide technical leadership within the team
Preferred Qualifications
- Incident management skills, with the ability to manage rationally and calmly during a crisis
- Work across boundaries: geographic, team, language, and cultural
- Curious and willing to stay informed about relevant technology trends and developments
- Cloud aware: you understand how cloud technologies differ from other technical approaches and can explain these differences to others
- Previous job stability, including maintaining long-term work relationships with former employers
Benefits
- Medical, life, vision and dental insurance
- Asociacion Solidarista
- International Share Save Plan
- Flex Work/Work from home
- Paid time off
- Birthday day off
- Annual Performance Bonus
- Education Reimbursement
- Family Bonding
- Bereavement Leave
- Referral Program