Senior Site Reliability Engineer

NICE
Summary
Join NICE, a market-leading global company, as a Senior Site Reliability Engineer. You will support large enterprise software clients, focusing on applications, servers, SQL, and networks. This role demands excellent problem-solving skills and the ability to deliver real-time insights from massive-scale data. You will collaborate with a cross-functional team to develop solutions and enhance user experiences. The position involves managing production environments, building software and systems, improving reliability and time-to-market, and optimizing system performance. You will also provide operational support, analyze metrics, partner with development teams, and participate in system design and capacity planning.
Requirements
- 6+ years of programming/scripting experience with any of the following: Go, Python, .NET (C#), Node.js
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
- 6-8 years of working experience in a similar role, with a focus on systems engineering, automation, and reliability
- Proficiency in at least one programming language (e.g., Python, Go, Java, C#) and experience with scripting languages (e.g., Bash, PowerShell)
- Deep understanding of cloud computing platforms (e.g., AWS) and of the behavior and reliability constraints of prominent services (e.g., EC2, ECS, Lambda, DynamoDB)
- Experience with infrastructure-as-code tools such as CloudFormation and Terraform
- Deep understanding of CI/CD concepts and experience with CI/CD tools such as Jenkins, GitLab CI/CD, or CircleCI
- Strong knowledge of containerization technologies (e.g., Docker, Kubernetes) and microservices architecture
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, CloudWatch)
- Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems
- Experience with incident management and blameless postmortems, including driving incident response during outages and other critical incidents, and handling resolution and communication in a cross-functional team setup
Responsibilities
- Run the production environment by monitoring availability and taking a holistic view of system health
- Build software and systems to manage platform infrastructure and applications
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Provide primary operational support and engineering for multiple large distributed software applications
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed and reliability with well-defined service level objectives
Preferred Qualifications
- Kubernetes (with certification), Grafana, AWS, Azure, DevOps experience
Benefits
- Enjoy NICE-FLEX!
- At NICE, we work according to the NICE-FLEX hybrid model, which enables maximum flexibility: 2 days working from the office and 3 days of remote work each week