Lead Site Reliability Engineer at Hitachi Solutions

Summary

Join Hitachi Solutions as a full-time Systems Design Expert in our product organization. You will be responsible for designing and implementing CI/CD tooling using Azure DevOps and related technologies, managing Azure Kubernetes Service (AKS) clusters, and ensuring the availability, performance, and reliability of critical business applications. This role requires strong experience in Azure, Kubernetes, and Infrastructure as Code (IaC), as well as expertise in SRE principles. You will collaborate with various teams, participate in Agile ceremonies, and mentor other engineers. The position offers a competitive salary, comprehensive benefits, and flexible work arrangements.

Requirements

Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider
Solid experience with monitoring/APM/observability tools (Datadog, Application Insights, Prometheus, Grafana, etc.)
Strong background with Azure resources such as Key Vault, Data Factory, Azure Databricks, and Storage Accounts
Experience implementing observability plans around logs, metrics, and traces
Experience developing software in an Agile development team
Experience implementing and exercising best practices for CI/CD
Experience with cloud infrastructure environments, preferably Azure, and Infrastructure as Code (Terraform, Bicep, ARM)
Experience designing, developing, and maintaining infrastructure using popular IaC tools and technologies such as Terraform, Helm, and others
Strong experience with containerization technology and/or Kubernetes
Experience with release automation, system administration, and configuration management
Experience with programming languages (Python, Go, etc.)
Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts
Strong interpersonal and teamwork skills, including the ability to set and enforce process and influence engineers who are not direct reports
Strong analytical and programming skills (Python, Go, etc.)

Responsibilities

Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, capacity planning, setting and maintaining SLOs, SLIs, and error budgets, and creating dashboards
Analyze, troubleshoot, and resolve operational challenges, contributing to defined SLOs
Manage site stability, performance, reliability, and maintain uptime for production environments
Develop a fully automated multi-environment observability stack based on the existing system and extend it to predict capacity needs based on the usage patterns
Strive for automation to reduce toil and increase development velocity
Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed
Identify product architecture changes from a reliability, performance, and availability perspective using a data-driven approach
Analyze and address complex technical challenges and issues that arise during the software development and run lifecycle. Debug, troubleshoot, and resolve technical problems efficiently
Create and maintain technical documentation, including design specifications, user guides, runbooks, and best practice guidelines
Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation
Collaborate with software development teams on the release management process, help shape the future roadmap, and establish strong operational readiness across teams
Participate in Agile ceremonies, such as sprint planning, stand-up meetings, and retrospectives
Collaborate with product managers, designers, and other engineers to ensure alignment and efficient project execution
Share your expertise and mentor engineers, helping them grow and develop their skills. Foster a culture of continuous learning and improvement within the team
Stay up to date with the latest technologies, tools, and developments in cloud computing. Proactively learn and adapt to new technologies to drive innovation
Collaborate with customers to understand their needs, gather feedback, and provide technical support and guidance as needed
Triage incoming Web Support escalation requests and route them to the applicable internal teams
Contribute to incident root cause analysis, service restoration, and serve as an incident commander during outage events

Preferred Qualifications

Experience with MLflow and other MLOps pipeline technologies

Benefits

Bonus Plan
Medical, Dental and Vision Coverage
Life Insurance and Disability Programs
Retirement Savings with Company Match
Paid Time Off
Flexible Work Arrangements including Remote Work

Remote

DevOps

Senior
