Senior Production Operations Engineer at Index Exchange

Summary

Join Index Exchange, a leading ad tech company, and become a crucial member of the global Production Operations team. You will play a vital role in ensuring the operational stability and reliability of our worldwide 24/7 on-premises and hybrid-cloud environments. This position requires in-depth understanding of systems, network, and hardware fundamentals, along with experience in private-cloud infrastructure engineering and maintenance. You will be responsible for incident response, system maintenance, automation, and documentation. The ideal candidate possesses strong technical expertise, excellent communication skills, and a commitment to continuous improvement. Index Exchange offers a supportive and collaborative work environment with comprehensive benefits.

Requirements

In-depth understanding of the Linux operating environment: kernel tuning, network stack tuning, system observability & instrumentation, and security & access management
Solid understanding of layer 2-7 networking fundamentals and the relationship between servers & services, and the transit of their packets through network hardware
In-depth experience engineering and maintaining a private-cloud infrastructure: Bare-metal, vSphere, KVM, Kubernetes
Experience with tools like Ansible, Terraform, Docker, Kafka, Nexus
Experience with observability platforms: InfluxDB, Prometheus, ELK, Jaeger, Grafana, Nagios, Zabbix
Familiarity with Big Data tools: Hadoop, HDFS, Spark, HBase
Ability to write code in Go, Python, Bash, or Perl for automation
6-8 years of proven experience in one or more of the following roles: DevOps Engineer, Linux System Administrator, Site Reliability Engineer (SRE)
Built or maintained a private-cloud infrastructure running CentOS/Rocky Linux on a mix of bare-metal, virtualization, and containerization
Managed public cloud environments such as AWS, GCP, and Azure, and their federation into on-premises environments
Life-cycle management of bare-metal servers such as Dell and Supermicro in globally distributed data centers (e.g. break-fix, baseband/firmware updates)
Built or maintained on-premises and cloud Kubernetes clusters: kubeadm, kind, EKS, GKE
Built or operated automation & orchestration frameworks for deployment & maintenance pipelines, e.g. Kafka, StackStorm, Ansible, Argo CD, Terraform, to push out code or configuration updates and to build new infrastructure systems
Communication: Clear and effective communication within and across teams. While we place a huge premium on technical skill, we value just as much your ability to work with other people
Curiosity: things can (and will) break for different reasons; your curiosity will help drive you to identify and fix the things that go wrong
Alertness: we can never predict when things will go wrong so it is your job to be vigilant and prepared to respond when they do; you must be ready to reach out, ask questions and sound the alarm when necessary
Analytical Thinking: Monitor and analyze activity, collaborate with other departments to maintain technical defense
Reliability: Prioritize the reliability of our systems, ensuring our exchange customers can trust in our services 24/7. Adhere to operational procedures, best practices, and security protocols
Continuous Improvement: Embrace a culture of continuous learning and innovation, always seeking ways to enhance our operational efficiency
Customer-Centricity: Committed to providing the best possible experience for our customers, both internal and external
Accountability: Take ownership of our responsibilities and hold ourselves accountable for the quality of our work

Responsibilities

Maintain oversight on internal metrics, including the health, security, and performance of on-premises & hybrid-cloud network and systems infrastructure environments
Execute timely and effective incident response, identifying and mitigating issues to minimize downtime
Respond to alerts within our established SLOs and assist in incident triage, ensuring that the right teams are engaged to address issues promptly
Help ensure that system backups, disaster recovery plans, and security protocols are in place and maintained
Serve as a point of contact for operational issues, providing both internal and external teams with technical support and ensuring each issue remains in custody until resolution
Collaborate with product and software engineering teams to relay operational insights and requirements
Continuously identify opportunities for optimization and present findings to technical leads and management
Research and implement improvements enhancing systems performance and scalability
Continuously research and embrace technological advancements and industry best practices to deliver exceptional service
Actively identify and mitigate risks and escalate them so the team can proactively address present or anticipated operational challenges
Develop, implement, and maintain automation frameworks streamlining operational processes, reducing time spent on manual tasks
Identify catalysts for future optimization, including provisioning techniques, deployment optimization, ancillary services, pipelines, Ansible playbooks, power usage, bandwidth, etc.
Draft comprehensive documentation for system configurations, processes, and incident resolution procedures
Participate in knowledge sharing within the team and, with support provided on content and delivery, provide cross-training to other relevant departments
Create and maintain runbooks and technical documentation, in addition to being familiar with internal and external escalation pathways
Join a globally distributed team that maintains 24x7 coverage. As a member of this team and broader group, you may occasionally be required to work some weekends, holidays, and after hours to respond to high-urgency or emergency events outside of your local time zone

Benefits

Company-paid comprehensive health and life insurance plans
Paid time off and flexible work schedules
Company contribution to Provident Fund
Participation in our company stock options plan
Company-paid parental leave
Monthly internet stipend
Quarterly Wellness allowance
Community engagement opportunities and donation-matching program
Volunteer paid day off
Annual virtual company retreats and regular community-led team events
A workplace that supports a diverse, equitable, and inclusive environment

Remote

DevOps

Senior