Application SRE

Encora

πŸ“Remote - Brazil

Summary

Join Encora as a Senior Site Reliability Engineer (SRE) in Brazil and lead efforts to ensure the reliability, availability, and performance of applications and platforms. This full-time, work-from-home position involves overseeing production operations, managing incidents, performing root cause analysis, and implementing preventative measures. You will collaborate with development teams to enhance application performance and reliability and work with a global team providing 24/7 support for production applications. The role requires extensive experience with observability and monitoring tools, AWS, Kubernetes, DevOps practices, and Agile methodologies. You will also mentor team members and work with clients to investigate and escalate incidents.

Requirements

  • Experience in Tier 2 or Tier 3 product support in one of the following roles: business/systems analysis, technology/development, data/reporting, or project management
  • Ability to analyze logs and code to resolve Tier 2 support issues
  • Experience as a Site Reliability Engineer (SRE), preferably with a focus on applications instead of platforms
  • Extensive experience with observability and monitoring tools, especially OpenTelemetry, Splunk, AppDynamics, and Datadog
  • Experience with AWS and/or Kubernetes
  • Background in DevOps practices
  • Scripting experience with Python
  • Experience with L1 and L2 support, incident management, ITIL, and writing documentation
  • Experience with disaster recovery, business continuity planning, creating ServiceNow dashboards, Linux, and shell scripting
  • Extensive experience working within Agile methodologies
  • Knowledge of cloud-native application architecture design patterns
  • Experience using Postman or a similar tool to make and test API calls
  • Experience with MuleSoft

Responsibilities

  • Coach and mentor fellow team members
  • Use Splunk and other observability tools to monitor and troubleshoot application issues
  • Capture metrics and create dashboards using Splunk and other tools
  • Work with a global team to provide 24/7 support for production applications running on AWS and MuleSoft
  • Perform incident management, root cause analysis, and implement preventative measures
  • Work with team members and clients to investigate and escalate incidents
  • Respond proactively to early signs of issues and to customer complaints
  • Apply industry best practices throughout our processes

Benefits

Work from home
