Senior Site Reliability Engineer

Docplanner
Summary
Join Docplanner as a Site Reliability Engineer (SRE) and play a key role in ensuring the reliability and performance of our platform. You will operate production environments, optimize system performance, improve software solutions, and provide operational support for large-scale applications. Responsibilities include ensuring system reliability and availability, investigating and resolving incidents, defining and maintaining SLOs/SLIs, and collaborating with developers. The ideal candidate possesses experience with monitoring stacks (DataDog/OTEL/Prometheus), a detective mindset for troubleshooting, .NET and AWS experience, and Kubernetes expertise. Docplanner offers a competitive salary, share options, flexible work arrangements, paid time off, private healthcare, and wellness programs.
Requirements
- Monitoring and observability - Experience with a monitoring stack such as DataDog / OTEL / Prometheus
- Detective mindset - Strong investigative mindset with a detective-like approach to troubleshooting and resolving complex issues
- .NET experience - Familiarity with the .NET environment and the ability to code
- AWS experience - Experience working with AWS services and cloud-native architectures
- Kubernetes - Practical experience deploying, managing, and troubleshooting applications in Kubernetes; understanding of containers, Helm, and scaling strategies
- Think like an owner - Proactive approach to identifying problems, performance bottlenecks, and areas for improvement
- Communicator - Equally fluent when talking to humans or machines; clear, effective communication across teams and tools
Responsibilities
- Operate production environments by monitoring availability and taking a holistic view of system health
- Measure and optimize system performance to stay ahead of customer needs and drive continuous innovation
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Provide primary operational support and engineering expertise for multiple large-scale, distributed software applications
- Ensure reliability and availability of systems through monitoring, alerting, and incident response
- Investigate and resolve incidents, perform root cause analysis, and implement long-term fixes
- Define and maintain SLOs/SLIs to measure and drive service quality
- Continuously improve performance and optimize infrastructure cost and resource usage
- Collaborate with developers to build scalable, fault-tolerant systems and improve deployment practices
- Automate operational tasks to reduce manual toil and improve efficiency
Preferred Qualifications
- Proficiency in scripting or programming with languages such as Python or Go, to support automation and tooling development
- Hands-on experience in Site Reliability Engineering practices, including incident management and service-level objectives
- Understanding of microservices architecture, with experience in designing, observing, and troubleshooting distributed systems
Benefits
- A salary adequate to your experience and skills
- Share options plan after 6 months of working with us
- Remote or hybrid work model with our hub in Warsaw
- Flexible working hours (fully flexible, as in most cases you only have to attend a couple of meetings weekly)
- 20/26 days of paid time off (depending on your contract)
- Additional paid day off on your birthday or work anniversary (you choose what you want to celebrate)
- Private healthcare plan with Signal Iduna for you, with subsidized coverage for your family
- Co-financed Multisport card giving you access to sports facilities across Poland
- Access to iFeel, a technological platform for mental wellness offering online psychological support and counseling
