Site Reliability Engineer

Axiom Software Solutions Limited

πŸ“Remote - United States

Summary

Join us as a Site Reliability Engineer (ex-Fidelity experience) in a remote contract position. You will design, implement, and manage Kubernetes environments, and build and maintain scalable infrastructure using infrastructure as code. You will develop comprehensive monitoring solutions, analyze system performance, and implement improvements. You will also implement and maintain CI/CD pipelines, conduct incident response and root cause analysis, and create automation tools that leverage AI/ML. You will collaborate with development teams to enhance application reliability and performance. This role requires strong expertise in Kubernetes, Linux/Unix, and database administration, along with programming skills in Python, Go, Java, or Node.js.

Requirements

  • 5-7 years of experience in SRE or DevOps roles
  • Strong expertise with Kubernetes ecosystem and container orchestration
  • Deep understanding of Linux/Unix operating systems and performance analysis tools (NMON, etc.)
  • Experience with log analysis, monitoring systems, and observability tools
  • Proficiency in database administration and performance tuning (Oracle, SQL Server)
  • Strong programming skills in at least one of: Python, Go, Java, or Node.js
  • Experience developing automation tools and frameworks
  • Proven track record of proactive problem identification and resolution

Responsibilities

  • Design, implement, and manage Kubernetes environments from deployment to configuration, monitoring, and troubleshooting
  • Build and maintain scalable and reliable infrastructure using infrastructure as code principles
  • Develop comprehensive monitoring solutions and implement alerting strategies
  • Analyze system performance bottlenecks and implement improvements
  • Implement and maintain CI/CD pipelines for seamless deployments
  • Conduct incident response and root cause analysis, and implement preventative measures
  • Create and enhance automation tools leveraging AI/ML where applicable
  • Collaborate with development teams to improve application reliability and performance

Preferred Qualifications

  • Experience with AI/ML integration into operational workflows
  • Cloud platform experience (AWS, GCP, Azure)
  • Knowledge of service mesh technologies
  • Experience with distributed systems architecture
  • Familiarity with security best practices and compliance requirements
