Infrastructure Support Engineer
Thoughtworks
Job highlights
Summary
Join Thoughtworks as a Consultant Infrastructure Support Engineer and play a crucial role in ensuring technical excellence and operational efficiency in cloud environments. You will act as a first responder to production incidents, automate daily operations, and assist development teams in resolving issues. Responsibilities include monitoring product performance, documenting incident responses, automating operations, and conducting root cause analysis. The ideal candidate has strong technical skills in areas such as CI/CD, log aggregation, monitoring, and cloud platforms, along with excellent communication and problem-solving abilities. Thoughtworks offers a supportive learning and development environment that fosters career growth and provides opportunities to collaborate with experienced professionals.
Requirements
- Be familiar with CI/CD tools such as Jenkins, GitLab CI, CircleCI, etc
- Have had exposure to log aggregation systems, e.g.: EFK, Splunk, Datadog
- Have hands-on experience with monitoring, alerting and observability, e.g.: Prometheus, Grafana, Datadog
- Possess a good understanding of at least one Public Cloud, e.g.: AWS, Azure, GCP
- Have hands-on experience with the most common operations for managing workloads on a container ecosystem tech stack, e.g.: Docker, Kubernetes, OpenShift
- Have a basic understanding of API concepts such as request, response, headers, authentication, JSON payloads, etc
- Have a basic understanding of networking including concepts such as high availability, load balancing and proxies
- Have a basic understanding of traffic load management approaches such as horizontal and vertical scaling
- Have a basic understanding of availability concepts such as downtime, time to recover/restore, SLAs, etc
- Have experience running basic system administration operations in a Linux operating system such as RHEL or Ubuntu
- Have good communication skills and are proficient in English
- Be able to confidently hold a Q&A discussion
- Have a good attitude towards learning new technical skills and concepts
- Possess innovative thinking and confidence in suggesting ideas to the team
- Have strong drive and ownership to sign up and deliver work when called upon without being too concerned with role boundaries
- Be willing to be part of a rotation- and need-based 24x7 team
Responsibilities
- Keep a vigilant eye on the operations of shipped products and services following the agreed-upon "Eyes on glass/Follow the sun" engagement models
- Monitor product/service operations against key performance indicators defined by the business and take necessary actions in response to detected deviations
- Document the appropriate responses to various kinds of incident scenarios in collaboration with development teams and prepare runbooks
- Reduce the human effort in day-to-day operations by automating tasks and by configuring and tuning alerts and monitoring as necessary
- Respond to production incidents and execute well-defined responses, escalating the incident to higher levels of support wherever necessary
- Assist development teams in incident resolution as necessary, e.g.: as a pair, providing updates, handling communication, etc
- Assist in conducting incident root cause analysis (RCA), preparing incident postmortem reports, communicating the incident RCA to client stakeholders whenever necessary, and responding to queries about the incident and its resolution approaches
- Pair on implementing service/product reliability improvement by writing infrastructure/observability configuration code, in collaboration with service reliability engineers
Benefits
- Learning & Development opportunities
- Remote work (#LI-Remote)