Summary

Join our team as a Site Reliability Engineer (SRE) and ensure the reliability, performance, and scalability of our software, websites, and applications. This role blends software engineering and systems administration, requiring expertise in cloud infrastructure and automation. You will monitor, control, and automate systems, respond to incidents, and optimize performance. Collaboration with development and cross-functional teams is crucial. The ideal candidate possesses deep Kubernetes and container expertise, along with strong problem-solving and communication skills. This position is vital for maintaining the overall health and efficiency of our platform.

Requirements

Possess deep expertise in Kubernetes and containers
Have a strong understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance
Have experience with monitoring and logging tools such as Loki and Grafana
Have a minimum of 3 years of experience in site reliability engineering, Kubernetes administration, or a related role
Possess excellent problem-solving skills and attention to detail
Possess strong communication and interpersonal skills, with the ability to work effectively with cross-functional teams

Responsibilities

Monitor the performance and reliability of Kubernetes clusters, software, websites, and applications
Automate routine maintenance tasks to ensure system stability and performance
Respond to and resolve incidents in a timely manner, minimizing downtime and impact on users
Conduct root cause analysis to identify and address underlying issues
Develop and implement strategies to prevent future incidents and improve system resilience
Design, build, and maintain automated systems and processes to improve efficiency and reduce manual intervention
Manage cloud infrastructure, including provisioning, scaling, and optimizing resources
Collaborate with development teams to ensure seamless deployment and integration of new features and updates
Analyze system performance and identify areas for improvement
Implement performance tuning and optimization techniques to enhance system efficiency
Collaborate with cross-functional teams to ensure optimal performance of all components
Ensure compliance with security best practices and industry standards
Implement and maintain security measures to protect systems and data
Conduct regular security audits and vulnerability assessments
Maintain accurate and up-to-date documentation of systems, processes, and procedures
Generate and analyze reports on system performance, incidents, and other key metrics
Provide regular updates to management and stakeholders on system health and performance
Identify opportunities for improving system reliability, performance, and scalability
Stay up to date with industry trends and best practices in site reliability engineering
Participate in training and development opportunities to enhance skills and knowledge

Site Reliability Engineer

Omniscius

Remote

DevOps

Mid-level
