Senior Production Operations Engineer

Index Exchange

📍 Remote - India

Summary

Join Index Exchange, a leading ad tech company, and become a crucial member of the global Production Operations team. You will play a vital role in ensuring the operational stability and reliability of our worldwide 24/7 on-premises and hybrid-cloud environments. This position requires in-depth understanding of systems, network, and hardware fundamentals, along with experience in private-cloud infrastructure engineering and maintenance. You will be responsible for incident response, system maintenance, automation, and documentation. The ideal candidate possesses strong technical expertise, excellent communication skills, and a commitment to continuous improvement. Index Exchange offers a supportive and collaborative work environment with comprehensive benefits.

Requirements

  • In-depth understanding of the Linux operating environment: kernel tuning, network stack tuning, system observability & instrumentation, and security & access management
  • Solid understanding of layer 2-7 networking fundamentals and the relationship between servers & services, and the transit of their packets through network hardware
  • In-depth experience engineering and maintaining a private-cloud infrastructure: Bare-metal, vSphere, KVM, Kubernetes
  • Experience with tools like Ansible, Terraform, Docker, Kafka, Nexus
  • Experience with observability platforms: InfluxDB, Prometheus, ELK, Jaeger, Grafana, Nagios, Zabbix
  • Familiarity with Big Data tools: Hadoop, HDFS, Spark, HBase
  • Ability to write code in Go, Python, Bash, or Perl for automation
  • 6-8 years of proven experience in one of the following roles: DevOps Engineer, Linux System Administrator, or Site Reliability Engineer (SRE)
  • Built or maintained a private-cloud infrastructure running CentOS/Rocky Linux on a mix of bare-metal, virtualization, and containerization
  • Managed public cloud environments such as AWS, GCP, and Azure, and their federation into on-premises environments
  • Life-cycle management of bare-metal servers such as Dell and Supermicro in globally distributed data centers (e.g. break-fix, baseband/firmware updates)
  • Built or maintained on-premises and cloud Kubernetes clusters: kubeadm, kind, EKS, GKE
  • Built or operated automation & orchestration frameworks for deployment & maintenance pipelines, e.g. using Kafka, StackStorm, Ansible, Argo CD, and Terraform to push out code or configuration updates and to build new infrastructure systems
  • Communication: Clear and effective communication within and across teams. While we place a huge premium on technical skill, we value just as much your ability to work with other people
  • Curiosity: things can (and will) break for different reasons; your curiosity will help drive you to identify and fix the things that go wrong
  • Alertness: we can never predict when things will go wrong so it is your job to be vigilant and prepared to respond when they do; you must be ready to reach out, ask questions and sound the alarm when necessary
  • Analytical Thinking: Monitor and analyze activity, collaborate with other departments to maintain technical defense
  • Reliability: Prioritize the reliability of our systems, ensuring our exchange customers can trust in our services 24/7. Adhere to operational procedures, best practices, and security protocols
  • Continuous Improvement: Embrace a culture of continuous learning and innovation, always seeking ways to enhance our operational efficiency
  • Customer-Centricity: Committed to providing the best possible experience for our customers, both internal and external
  • Accountability: Take ownership of your responsibilities and hold yourself accountable for the quality of your work

Responsibilities

  • Maintain oversight on internal metrics, including the health, security, and performance of on-premises & hybrid-cloud network and systems infrastructure environments
  • Execute timely and effective incident response, identifying and mitigating issues to minimize downtime
  • Respond to alerts within our established SLOs and assist in incident triage, ensuring that the right teams are engaged to address issues promptly
  • Help ensure that system backups, disaster recovery plans, and security protocols are in place and maintained
  • Serve as a point of contact for operational issues, providing both internal and external teams with technical support and retaining ownership of each issue until resolution
  • Collaborate with product and software engineering teams to relay operational insights and requirements
  • Continuously identify opportunities for optimization and present findings to technical leads and management
  • Research and implement improvements enhancing systems performance and scalability
  • Continuously research and embrace technological advancements and industry best practices to deliver exceptional service
  • Actively identify and mitigate risks and escalate them so the team can proactively address present or anticipated operational challenges
  • Develop, implement, and maintain automation frameworks streamlining operational processes, reducing time spent on manual tasks
  • Identify catalysts for future optimization, including provisioning techniques, deployment optimization, ancillary services, pipelines, Ansible playbooks, power usage, bandwidth, etc.
  • Draft comprehensive documentation for system configurations, processes, and incident resolution procedures
  • Participate in knowledge sharing within the team and, with support provided on content and delivery, provide cross-training to other relevant departments
  • Create and maintain runbooks and technical documentation, in addition to being familiar with internal and external escalation pathways
  • Join a globally distributed team that maintains 24x7 coverage. As a member of this team and broader group, you may occasionally be required to work some weekends, holidays, and after hours to respond to high-urgency or emergency events outside of your local time zone

Benefits

  • Company-paid comprehensive health and life insurance plans
  • Paid time off and flexible work schedules
  • Company contribution to Provident Fund
  • Participation in our company stock options plan
  • Company-paid parental leave
  • Monthly internet stipend
  • Quarterly wellness allowance
  • Community engagement opportunities and donation-matching program
  • Volunteer paid day off
  • Annual virtual company retreats and regular community-led team events
  • A workplace that supports a diverse, equitable, and inclusive environment
