Site Reliability Engineer

Feedzai

πŸ“Remote - Portugal

Summary

Join Feedzai's Platform Engineering Performance & Reliability team as a Platform Engineer and contribute to the optimization and scalability of our cloud-based risk management platform. You will work with a talented team to build and maintain distributed systems, automate infrastructure, and resolve production issues. This role requires experience in cloud services, programming (Go, Python), and system design. You will be responsible for capacity planning, collaboration with product teams, and incident response. The ideal candidate is passionate about distributed systems, performance, and reliability. Feedzai offers a fast-paced, collaborative environment with opportunities for continuous learning.

Requirements

  • A bachelor's degree in Computer Science, Information Systems, or the equivalent combination of education, experience, and training
  • Programming skills (Go, Python or similar languages)
  • 3+ years of experience in data structures, algorithms, programming, and asynchronous and multithreaded designs
  • 3+ years of experience with building scalable and distributed cloud services
  • 3+ years operating production environments
  • 2+ years of experience in cross-team collaboration in a supporting role
  • Self-driven & motivated, with a strong work ethic and a passion for problem solving
  • Systematic problem-solving approach, coupled with effective verbal and written communication skills
  • Experience being on call

Responsibilities

  • Provide recommendations about capacity allocation considering cost, resilience and performance
  • Work with product teams to support best practices and drive improvements in system performance and reliability before and after systems go live
  • Development with Go, Python or similar languages
  • Automate all aspects of cloud infrastructure and incident response
  • Develop playbooks related to actionable alerts
  • Participate in incident response, root cause investigation and resolution
  • Maintain and develop our infrastructure as code (IaC) to manage and operate end-to-end lifecycle operations (monitoring, alerting, security, cost optimization, configuration, backup, etc.) in production environments
  • Use your experience and problem-solving skills to help prevent and investigate production issues

Preferred Qualifications

  • Experience with monitoring and observability stacks such as Grafana and Prometheus
  • Kubernetes, cloud, and HashiCorp experience is valued
  • Knowledge or experience with AWS or GCP
