Senior Production Operations Engineer
Index Exchange
Summary
Join Index Exchange, a leading ad tech company, and become a crucial member of the global Production Operations team. You will play a vital role in ensuring the operational stability and reliability of our worldwide 24/7 on-premises and hybrid-cloud environments. This position requires an in-depth understanding of systems, network, and hardware fundamentals, along with experience in private-cloud infrastructure engineering and maintenance. You will be responsible for incident response, system maintenance, automation, and documentation. The ideal candidate possesses strong technical expertise, excellent communication skills, and a commitment to continuous improvement. Index Exchange offers a supportive and collaborative work environment with comprehensive benefits.
Requirements
- In-depth understanding of the Linux operating environment: kernel tuning, network stack tuning, system observability & instrumentation, and security & access management
- Solid understanding of layer 2-7 networking fundamentals and the relationship between servers & services, and the transit of their packets through network hardware
- In-depth experience engineering and maintaining a private-cloud infrastructure: Bare-metal, vSphere, KVM, Kubernetes
- Experience with tools like Ansible, Terraform, Docker, Kafka, Nexus
- Experience with observability platforms: InfluxDB, Prometheus, ELK, Jaeger, Grafana, Nagios, Zabbix
- Familiarity with Big Data tools: Hadoop, HDFS, Spark, HBase
- Ability to write code in Go, Python, Bash, or Perl for automation
- 6-8 years of proven experience in one or more of the following roles: DevOps Engineer, Linux System Administrator, Site Reliability Engineer (SRE)
- Built or maintained a private-cloud infrastructure running CentOS/Rocky Linux on a mix of bare-metal, virtualization, and containerization
- Managed public cloud environments such as AWS, GCP, and Azure, and their federation into on-premises environments
- Managed the life cycle of bare-metal servers such as Dell and Supermicro in globally distributed data centers (e.g. break-fix, baseband/firmware updates)
- Built or maintained on-premises and cloud Kubernetes clusters: kubeadm, kind, EKS, GKE
- Built or operated automation & orchestration frameworks for deployment & maintenance pipelines, e.g. Kafka, StackStorm, Ansible, Argo CD, and Terraform, to push out code or configuration updates and to build new infrastructure systems
- Communication: Clear and effective communication within and across teams. While we place a huge premium on technical skill, we value just as much your ability to work with other people
- Curiosity: Things can (and will) break for different reasons; your curiosity will help drive you to identify and fix the things that go wrong
- Alertness: We can never predict when things will go wrong, so it is your job to be vigilant and prepared to respond when they do; you must be ready to reach out, ask questions, and sound the alarm when necessary
- Analytical Thinking: Monitor and analyze activity, collaborate with other departments to maintain technical defense
- Reliability: Prioritize the reliability of our systems, ensuring our exchange customers can trust in our services 24/7. Adhere to operational procedures, best practices, and security protocols
- Continuous Improvement: Embrace a culture of continuous learning and innovation, always seeking ways to enhance our operational efficiency
- Customer-Centricity: Committed to providing the best possible experience for our customers, both internal and external
- Accountability: Take ownership of your responsibilities and hold yourself accountable for the quality of your work
Responsibilities
- Maintain oversight on internal metrics, including the health, security, and performance of on-premises & hybrid-cloud network and systems infrastructure environments
- Execute timely and effective incident response, identifying and mitigating issues to minimize downtime
- Respond to alerts within our established SLOs and assist in incident triage, ensuring that the right teams are engaged to address issues promptly
- Help ensure that system backups, disaster recovery plans, and security protocols are in place and maintained
- Serve as a point of contact for operational issues, providing both internal and external teams with technical support and ensuring each issue has a clear owner until resolution
- Collaborate with product and software engineering teams to relay operational insights and requirements
- Continuously identify opportunities for optimization and present findings to technical leads and management
- Research and implement improvements enhancing systems performance and scalability
- Continuously research and embrace technological advancements and industry best practices to deliver exceptional service
- Actively identify and mitigate risks and escalate them so the team can proactively address present or anticipated operational challenges
- Develop, implement, and maintain automation frameworks streamlining operational processes, reducing time spent on manual tasks
- Identify catalysts for future optimization, including provisioning techniques, deployment optimization, ancillary services, pipelines, Ansible playbooks, power usage, bandwidth, etc.
- Draft comprehensive documentation for system configurations, processes, and incident resolution procedures
- Participate in knowledge sharing within the team and, with support on content and delivery, provide cross-training to other relevant departments
- Create and maintain runbooks and technical documentation, in addition to being familiar with internal and external escalation pathways
- Join a globally distributed team that maintains 24/7 coverage. As a member of this team and broader group, you may occasionally be required to work some weekends, holidays, and after hours to respond to high-urgency or emergency events outside of your local time zone
Benefits
- Company-paid comprehensive health and life insurance plans
- Paid time off and flexible work schedules
- Company contribution to Provident Fund
- Participation in our company stock options plan
- Company-paid parental leave
- Monthly internet stipend
- Quarterly wellness allowance
- Community engagement opportunities and donation-matching program
- Paid volunteer day off
- Annual virtual company retreats and regular community-led team events
- A workplace that supports a diverse, equitable, and inclusive environment