Site Reliability Engineer

Omniscius

📍 Remote - Worldwide

Job highlights

Summary

Join our team as a Site Reliability Engineer (SRE) and ensure the reliability, performance, and scalability of our software, websites, and applications. This role blends software engineering and systems administration, requiring expertise in cloud infrastructure and automation. You will monitor, control, and automate systems, respond to incidents, and optimize performance. Collaboration with development and cross-functional teams is crucial. The ideal candidate possesses deep Kubernetes and container expertise, along with strong problem-solving and communication skills. This position is vital for maintaining the overall health and efficiency of our platform.

Requirements

  • Possess deep expertise in Kubernetes and containers
  • Have a strong understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance
  • Have experience with monitoring and logging tools such as Loki and Grafana
  • Have a minimum of 3 years of experience in site reliability engineering, Kubernetes administration, or a related role
  • Possess excellent problem-solving skills and attention to detail
  • Possess strong communication and interpersonal skills, with the ability to work effectively with cross-functional teams

Responsibilities

  • Monitor the performance and reliability of Kubernetes clusters, software, websites, and applications
  • Automate routine maintenance tasks to ensure system stability and performance
  • Respond to and resolve incidents in a timely manner, minimizing downtime and impact on users
  • Conduct root cause analysis to identify and address underlying issues
  • Develop and implement strategies to prevent future incidents and improve system resilience
  • Design, build, and maintain automated systems and processes to improve efficiency and reduce manual intervention
  • Manage cloud infrastructure, including provisioning, scaling, and optimizing resources
  • Collaborate with development teams to ensure seamless deployment and integration of new features and updates
  • Analyze system performance and identify areas for improvement
  • Implement performance tuning and optimization techniques to enhance system efficiency
  • Collaborate with cross-functional teams to ensure optimal performance of all components
  • Ensure compliance with security best practices and industry standards
  • Implement and maintain security measures to protect systems and data
  • Conduct regular security audits and vulnerability assessments
  • Maintain accurate and up-to-date documentation of systems, processes, and procedures
  • Generate and analyze reports on system performance, incidents, and other key metrics
  • Provide regular updates to management and stakeholders on system health and performance
  • Identify opportunities for improving system reliability, performance, and scalability
  • Stay up to date with industry trends and best practices in site reliability engineering
  • Participate in training and development opportunities to enhance skills and knowledge
