Summary
Join Sumo Logic as a Senior Site Reliability Engineer 2, focusing on cost optimization. You will own the availability of Sumo's planet-scale products, working with the global SRE and FinOps teams to monitor, alert on, and optimize cloud cost efficiency. You will collaborate with cross-functional teams to maximize the value of cloud infrastructure spending, develop data models for cost analysis and profitability forecasting, and lead cost-saving projects. This role requires strong technical expertise and leadership in a fast-paced environment.
Requirements
- Ability to understand and apply modern approaches to cloud financial operations (FinOps) for cloud-native software
- Experience monitoring, alerting, and forecasting cloud spend
- Experience developing and operating cloud-native applications using best practices
- Strong debugging and troubleshooting skills across the entire technology stack
- Deep understanding of AWS Networking, Compute, Storage, and managed services
- Competency with modern infrastructure and CI/CD tooling such as Kubernetes, Terraform, Ansible, and Jenkins
- Experience with full life cycle support of services, from creation to production support
- Versed in Infrastructure as Code practices using technologies such as Terraform or CloudFormation
- Ability to author production-ready code in at least one of the following: Java, Scala, or Go
- Experience with Linux systems and comfortable on the command line
- Experienced with agile frameworks, such as Scrum and Kanban, and how to operate within these frameworks to continually deliver value
- Flexible and willing to step into new roles and responsibilities
- Willingness to learn and use Sumo Logic products for solving reliability and security issues
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or another scientific or technical discipline
- 6-8 years of industry experience
Responsibilities
- Support the engineering teams in your product area by maintaining and executing a reliability roadmap of improvements to reliability, maintainability, security, efficiency, and velocity, and help realize the opportunities that drive efficiency and reduce costs
- Lead initiatives to implement cost-saving projects, leveraging data insights and technical expertise
- Develop and maintain data models and reporting for Sumo Logic cost and profitability, such as unit economics or cost per customer
- Ensure engineering teams have near-real-time visibility into their cloud spend and can be held accountable for it by providing them with cost dashboards, alerts, and insights
- Collaborate with cross-functional teams to identify opportunities for cost optimization and efficiency improvements
- Engage in capacity planning by designing a resource reservation strategy and participating in commitment purchases (CSP, RI, etc.)
- Drive continuous improvement and operational excellence for our FinOps tools
- Participate in on-call rotations to understand the operational workload, so you can continually improve the on-call experience and reduce the effort of running microservices and related components
- Complete projects to optimize and tune on-call experience for your engineering teams
- Write code and automation to reduce operational workload, increase efficiency, improve security posture, eliminate toil, and enable Sumo's developers to deliver features more rapidly
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability, velocity, and cost efficiency
- Facilitate blame-free root cause analysis meetings for incidents to learn and drive improvement
- Participate in and continually improve our global IRC (incident response coordination) for all products
- Drive root cause identification and issue resolution with the teams
- Work in a fast-paced, iterative environment
Preferred Qualifications
- Experience using Sumo Logic products or other observability products for reliability and security
- Experience with planet-scale product development
- Expert-level proficiency running and operating SaaS products on AWS
- 1+ years of experience with cloud financial management or the FinOps Framework
- Expert-level experience in one or more of: Java, Go, Scala, or Python
- Expert-level experience in one or more of: Terraform, Jenkins, or Kubernetes
- Extensive experience running and tuning JVM workloads at scale