Tech Holding is hiring a
Senior Site Reliability Engineer

Tech Holding

💵 ~$150k-$222k
📍 Remote - Mexico

Summary

Tech Holding is seeking a Senior Site Reliability Engineer to ensure the reliability, scalability, and performance of critical infrastructure and applications. The role involves cross-team collaboration, incident management, defining SLAs, automation, and mentorship. Requirements include 5-8 years of SRE experience, proficiency with GCP, experience with monitoring and alerting tools, knowledge of incident management best practices, scripting ability, strong communication and problem-solving skills, and a passion for building reliable systems.

Requirements

  • 5-8 years of experience as a Site Reliability Engineer (SRE) or related role
  • Experience with Google Cloud Platform (GCP)
  • Proven experience with monitoring tools like Prometheus and Grafana
  • Strong understanding of incident management best practices
  • Experience with alerting tools like PagerDuty
  • Experience with scripting languages like Python or Bash for automation
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Strong problem-solving and analytical skills
  • Passion for building reliable and scalable systems

Responsibilities

  • Ensure the reliability, scalability, and performance of critical infrastructure and applications
  • Partner with development teams to implement best practices for building reliable and scalable systems
  • Stay up-to-date on the latest SRE trends and technologies
  • Design, implement, and maintain robust monitoring solutions using tools like Prometheus and Grafana
  • Develop and configure alerts within tools like PagerDuty to ensure timely notification of potential issues
  • Analyze and troubleshoot issues using collected application and infrastructure metrics
  • Lead incident response, ensuring timely resolution and minimizing downtime
  • Document and communicate incident details effectively to stakeholders
  • Conduct post-incident reviews to identify root causes and implement preventative measures
  • Collaborate with product and engineering teams to define clear and measurable SLAs for SaaS offerings
  • Establish Service Level Objectives (SLOs) for key metrics based on SLA requirements
  • Define Service Level Indicators (SLIs) to track progress towards achieving SLOs
  • Monitor SLO compliance and proactively identify potential SLA breaches
  • Identify opportunities for automation to improve efficiency and reliability
  • Develop and implement automation scripts using languages like Python or Bash
  • Automate routine tasks and incident response workflows
  • Act as a liaison between SRE, Product, Security, Application Engineering, and Customer Operations teams
  • Facilitate communication and information sharing across teams to ensure smooth operations
  • Work collaboratively to define and implement solutions that meet the needs of all stakeholders

Preferred Qualifications

  • Experience with container orchestration platforms like Kubernetes
  • Experience with chaos engineering principles
  • Experience with configuration management tools like Ansible or Chef

Benefits

  • Remote Work Opportunities
  • Flexible Work Hours
This job is filled or no longer available
