Summary

Join Experian's growing Site Reliability Engineering team as a Staff Engineer and contribute to the global uptime of Experian One, our Cloud SaaS offering. You will be responsible for monitoring, incident response, and improving system reliability. This role requires extensive experience in supporting complex systems, Linux, networking, cloud-native applications, and incident management. Proficiency in various tools and technologies, including Kubernetes and several programming languages, is essential. The position is permanent, home-based in Costa Rica, and offers a comprehensive benefits package including medical, life, and dental insurance, paid time off, and more.

Requirements

5+ years of experience in:
Supporting complex, scaled systems in production
Linux, including troubleshooting and anticipating issues before they occur
Networking, troubleshooting, and monitoring
Cloud-native application design for high performance, scalability, and resilience
Incident management and coordination, including blameless post-incident reviews (PIRs)
Proficiency in at least one programming or scripting language and willingness to apply software development best practices to an operational role
Knowledge of Kubernetes, Infrastructure as Code, and high-availability principles
Experience with Kubernetes, Splunk, Dynatrace, ThousandEyes, ServiceNow, Jira, Jenkins, Python, and Prometheus
Experience with Java, Cassandra, Redis, Rundeck, MongoDB, Apigee, Okta, PostgreSQL, and AWS
Experience with Infrastructure as Code and GitOps
Line management or mentoring experience
Written and verbal fluency in English is required

Responsibilities

Ensure Uptime of Experian One – Experian's Cloud SaaS offering for Decision Analytics
Monitor our platform and provide alerting
Respond to incidents and restore service
Understand the systems well enough to assess issues and find owners for problem resolution
Identify issues and manual processes and ensure that they do not recur
Manage incidents: coordinate others, and be coordinated, during service disruptions, with a focus on restoring availability
Write complex queries using multiple tools
Review system designs and implementations to identify resiliency, scalability, and monitoring issues before implementation
Model good behaviors and provide technical leadership within the team

Preferred Qualifications

Incident management skills, with the ability to act rationally and calmly during a crisis
Work across boundaries: geographic, team, language, and cultural
Curious and willing to stay informed about relevant technology trends and developments
Cloud aware: you understand how cloud technologies differ from other technical approaches and can explain the differences to others
Previous job stability, including maintaining long-term work relationships with former employers

Benefits

Medical, life, vision and dental insurance
Asociacion Solidarista
International Share Save Plan
Flex Work/Work from home
Paid time off
Birthday day off
Annual Performance Bonus
Education Reimbursement
Family Bonding
Bereavement Leave
Referral Program

Staff Site Reliability Engineer

Experian

Remote

DevOps

Mid-level