Staff Site Reliability Engineer

Illumio

πŸ“Remote - Australia

Summary

Join Illumio, a leader in Zero Trust Segmentation, as a Product Site Reliability Engineer (SRE). This remote role in Australia requires expertise in AWS and Azure cloud platforms, application performance, and operational excellence. You will investigate and resolve production incidents, monitor application health, develop automation scripts, conduct root cause analyses, and collaborate with cross-functional teams. The ideal candidate possesses a Bachelor's degree or equivalent experience, 6+ years of relevant SRE experience, and proficiency in programming and scripting. Illumio offers a wide range of benefits, varying by location, including health insurance, paid time off, retirement savings, and more.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or related field; or equivalent work experience
  • 6+ years of relevant SRE experience
  • Strong hands-on experience with AWS and Azure
  • Familiarity with Kubernetes and containerized environments
  • Knowledge of networking concepts, such as DNS, load balancing, and firewalls
  • Proficiency in diagnosing and resolving complex issues in SaaS environments, including performance bottlenecks and application errors
  • Proficiency in at least one programming language (e.g., Python, Go, Java) and at least one scripting language (e.g., Bash, PowerShell)
  • Experience with tools like Datadog, New Relic, Prometheus, Grafana, ELK, or Azure Monitor
  • Familiarity with tools like Ansible, Terraform, or CloudFormation
  • Knowledge of debugging and optimizing relational databases (e.g., PostgreSQL, MySQL) and caching systems (e.g., Redis, Memcached)
  • Experience with incident management tools and processes, including conducting RCAs and improving on-call processes

Responsibilities

  • Investigate and resolve production incidents and escalations to minimize downtime and impact on customers
  • Work closely with engineering and support teams to troubleshoot application and infrastructure issues
  • Proactively monitor application health, performance, and reliability using modern observability tools
  • Analyze trends in system behavior and suggest performance improvements
  • Develop and maintain automation scripts and tools to improve operational efficiency and incident resolution
  • Create and enhance runbooks to streamline troubleshooting and reduce mean time to resolution (MTTR)
  • Conduct thorough post-incident reviews to identify root causes and implement preventive measures
  • Drive a culture of continuous improvement by documenting lessons learned and improving system designs
  • Partner with software engineers, QA, and product teams to improve application stability and user experience
  • Act as a bridge between development and operations, ensuring smooth and reliable service delivery

Benefits

  • Medical, Dental, Vision Coverage
  • Health and Dependent Savings Accounts
  • Life and Disability Programs
  • Paid Parental Leave
  • Voluntary Benefit Programs
  • Company Sponsored Wellness Program
  • Wellness Reimbursement Program
  • Retirement Savings
  • Equity Opportunities
  • Paid Time Off and Paid Holidays
  • Employee Incentive Program
