Site Reliability Engineer

Logo of Glean

Glean

đź“ŤBangalore

Job highlights

We’re on a mission to bring people the knowledge they need to make a difference in the world. 

Glean was founded by a seasoned team of former Google search and Facebook engineers, who wondered why we don’t have an easier way of finding what we need at work. In our personal lives, we have tools to help us find pretty much whatever we need. Why don’t we have that at work? And that was the beginning of Glean.

Glean searches across all your company’s apps to help you find exactly what you need and discover the things you should know. We’re a diverse team of curious and creative people who want to help each other get big things done—so we can help other teams do the same. 

We're backed by some of the Valley's leading venture capitalists—including Sequoia, Kleiner Perkins, Lightspeed, and General Catalyst—and have assembled a world-class team with senior leadership experience at Google, Slack, Facebook, Dropbox, Rubrik, Uber, Intercom, Pinterest, Palantir, and others.

About the Role:

We are seeking a skilled and motivated Site Reliability Engineer (SRE) to become a valuable addition to our dynamic and innovative team. As a SRE, you will play a critical role in ensuring the reliability, availability, and performance of our cloud-based services and applications. You will work closely with our engineering teams to design, build, and maintain robust, scalable, and highly available cloud infrastructure.

Much of our software development focuses on building infrastructure to scale our operations in a hybrid cloud environment and eliminating work through automation. On the SRE team, you’ll have the opportunity to manage the complex challenges of scale and fast growth which are unique to Glean, while using your expertise in coding, algorithms, problem-solving, and SRE practices. We keep Glean applications up and running, ensuring our customers have the best and most reliable experience possible.

What you will do and achieve:

  • Ensure High Availability: Implement and maintain resilient cloud architectures, monitor system performance, and proactively identify and resolve potential bottlenecks or points of failure. 

  • Incident Management: Play an active role in production on-call, responding swiftly to troubleshoot and resolve production issues. Minimize service disruptions and downtime by conducting thorough triaging and debugging of product or system issues. Continuously optimize the on-call process for sustainability and efficiency.

  • Automation and Tooling: Develop and maintain automation scripts, tools, and processes to streamline system deployment, monitoring, and management tasks. Your contributions will be vital in efficiently scaling cloud operations.

  • Performance Optimization: Optimize cloud infrastructure and applications for performance, scalability, and cost-effectiveness.

  • Security and Compliance: Collaborate with security engineers to implement best practices and ensure compliance with security standards and policies.

  • Monitoring and Alerting: Design and configure advanced monitoring systems to gain insights into system behavior, set up alerts, and respond proactively to potential issues. Create and maintain comprehensive dashboards and playbooks for production on-call.

  • Software Development Consultation: Engage actively in the entire software development lifecycle. Participate in system design reviews and provide valuable SRE insights during launch reviews, influencing and enhancing system architecture.

Who you are:

  • Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.

  • 3+ years of experience as a Site Reliability Engineer or similar role, with a primary focus on managing cloud-based services and infrastructure.

  • 5+ years of experience with software development in one or more programming languages.

  • Strong knowledge of cloud platforms such as Google Cloud Platform, AWS, or Azure.

  • Practical experience with containerization technologies, including Docker and Kubernetes. Familiarity with infrastructure as code tools like Terraform is essential.

  • Solid understanding of networking, security principles, and best practices.

  • Proficiency in using monitoring and alerting tools to detect and respond to potential issues effectively.

Share this job:

Disclaimer: Please check that the job is real before you apply. Applying might take you to another website that we don't own. Please be aware that any actions taken during the application process are solely your responsibility, and we bear no responsibility for any outcomes.
Please let Glean know you found this job on JobsCollider. Thanks! 🙏