Site Reliability Engineer

Feedzai
Summary
Join Feedzai's Platform Engineering Performance & Reliability team and help optimize existing systems, build infrastructure, and expand automation. You will tackle the complex challenges of scale behind Feedzai's fraud detection mission, collaborating with talented platform engineers on complexity analysis and large-scale system design. The role involves developing automation, tooling, and platforms that support Feedzai's cloud service. You will provide recommendations on capacity allocation, work with product teams to improve system performance and reliability, and participate in incident response and root cause investigation. This position requires experience with distributed systems, cloud services, and programming languages such as Go or Python. Feedzai offers a fast-paced, collaborative environment with opportunities for continuous learning.
Requirements
- A bachelor's degree in Computer Science, Information Systems, or the equivalent combination of education, experience, and training
- Programming skills (Go, Python, or similar languages)
- 2+ years of experience with data structures, algorithms, programming, and asynchronous and multithreaded design
- 2+ years of experience with building scalable and distributed cloud services
- 2+ years operating production environments
- 1+ years of experience in cross-team collaboration in a supporting role
- Self-driven & motivated, with a strong work ethic and a passion for problem solving
- Systematic problem-solving approach, coupled with effective verbal and written communication skills
- Experience being on call
Responsibilities
- Provide recommendations on capacity allocation, considering cost, resilience, and performance
- Work with product teams to support best practices and drive improvements in system performance and reliability before and after systems go live
- Develop software in Go, Python, or similar languages
- Automate all aspects of cloud infrastructure and incident response
- Develop playbooks related to actionable alerts
- Participate in incident response, root cause investigation, and resolution
- Maintain and develop our infrastructure as code (IaC) to manage and operate end-to-end lifecycle operations (monitoring, alerting, security, cost optimization, configuration, backup, etc.) in production environments
- Use your experience and problem-solving skills to help prevent and investigate production issues
Preferred Qualifications
- Experience with monitoring and observability stacks such as Grafana and Prometheus
- Kubernetes, cloud, and HashiCorp experience is valued
- Knowledge of or experience with AWS or GCP