Senior Site Reliability Engineer

IntelliPro
Summary
Join a high-impact infrastructure team at a fast-growing global technology leader as a Senior Site Reliability Engineer. This role focuses on scaling reliable, high-performance systems in a cloud-native environment, working on large-scale, mission-critical applications used by millions. You will ensure 24/7 uptime, operate and maintain core systems, architect monitoring solutions, collaborate with engineering teams, develop automation tools, and troubleshoot infrastructure bottlenecks. The ideal candidate will have a Bachelor's degree in a related field, 5+ years of relevant experience, and deep expertise in Linux, distributed systems, and cloud architecture. This position offers a hybrid or remote work setup (in select states) and a competitive compensation and benefits package.
Requirements
- Bachelor's degree in Computer Science, Information Systems, or a related technical field
- 5+ years of experience supporting mission-critical, real-time, high-traffic systems in a cloud-based or hybrid production environment
- Deep expertise in Linux, distributed systems, cloud architecture, and containerized workloads (Docker, Kubernetes, etc.)
- Skilled in system-level debugging and end-to-end performance optimization
- Strong programming/scripting ability in Python, Go, or a similar language
- Experience managing OSS components such as Kafka, Elasticsearch, and Redis
- Proven ability to reduce incident rates and drive down MTTR through process improvements and tooling
- Excellent communication skills and experience working across distributed teams
Responsibilities
- Ensure 24/7 uptime by participating in a rotating on-call schedule and managing production incidents across distributed environments
- Operate and maintain core systems such as Elasticsearch, Kafka, RabbitMQ, and Redis, with a focus on reliability and performance
- Architect monitoring solutions, define SLOs/SLIs, and implement scalable observability tools (e.g., Grafana, Prometheus, Zabbix)
- Collaborate with engineering teams to optimize capacity, auto-scaling, and system utilization
- Develop and maintain automation tools and workflows to support a culture of minimal manual intervention
- Troubleshoot infrastructure bottlenecks and improve full-stack performance across services
- Own the design and execution of new infrastructure patterns to support continued scale and speed
- Maintain clear technical documentation including runbooks, incident response procedures, and architectural diagrams
Preferred Qualifications
- Experience with big data infrastructure (e.g., Hadoop, Spark, Hive, HBase)
- Background in data infrastructure, DBRE, or DBA responsibilities at scale
- Familiarity with service mesh technologies and zero-trust architectures
Benefits
- Full medical, dental, and vision insurance
- HSA with company contributions + FSA options
- 401(k) plan with discretionary company match and financial advising
- Company-paid life, AD&D, short-term & long-term disability insurance
- Paid holidays, generous PTO, and floating days
- Employee discounts and perks
- Weekly catered lunches, stocked snacks, and beverages
- Gym access & dog-friendly office (select locations)
- Swag, holiday parties, and internal community events
- Base Salary: $107,600–$180,200/year
- Compensation: Includes annual bonus + equity (RSU)