Senior DevOps Engineer

Binance

📍 Remote - Taiwan, Australia

Summary

Join Binance, a leading global blockchain ecosystem, and become a key player in maintaining our ultra-low-latency infrastructure. You will own and optimize EC2 fleets, ensuring high-throughput messaging and network integrity. Responsibilities include performance tuning, building immutable infrastructure, and implementing robust observability measures. You will also participate in reliability testing, incident response, and capacity planning. Collaboration with cross-functional teams is crucial for identifying and resolving performance bottlenecks. This role requires extensive experience in Linux low-latency tuning, AWS operations, and high-throughput messaging systems.

Requirements

  • Linux low-latency tuning – CPU pinning, NUMA awareness, IRQ affinity, TCP/UDP stack tweaks, hugepages
  • AWS operations at scale – EKS, EC2, VPC, NLB/ALB, Auto Scaling, multi-AZ fail-over, cost & quota management
  • Infrastructure as Code / GitOps – Terraform (modular state)
  • CI/CD pipelines – GitLab CI or Jenkins; blue-green / canary deploys, sub-2-minute rollbacks, latency smoke-test gates
  • Observability – Prometheus + Grafana, Alertmanager, high-cardinality metrics, centralized log aggregation, eBPF tracing for µs-level hotspots
  • High-throughput messaging – Kafka cluster operations (partition strategy, ISR tuning, < 3 ms end-to-end latency), Nginx WebSocket termination
  • Trading-grade networking – ENA/SR-IOV, packet-loss analysis, security-group hardening
  • Performance & reliability engineering – perf, FlameGraph, chaos/load testing, p95/p99 latency SLO ownership
  • Automation & scripting – Python or Go for tooling, incident remediation, environment bootstrap

Responsibilities

  • Own ultra-low-latency EC2 fleets - Design cluster placement groups with ENA / SR-IOV networking
  • Kernel-level performance tuning - Apply CPU pinning, NUMA alignment, IRQ affinity, hugepages, and TCP/UDP sysctl tweaks to flatten tail latency
  • Immutable infrastructure & automated rollouts - Build Packer AMIs and Terraform Auto Scaling Groups; run GitLab/Jenkins pipelines with blue-green or canary deploys and sub-2-minute automatic rollbacks
  • High-throughput messaging & gateways - Operate Kafka clusters (partition/ISR tuning, rack awareness) and Nginx WebSocket edges serving 100k+ clients with single-digit-ms fan-out
  • Network integrity - Run packet-loss analysis and MTU/ECN/queue-depth tuning; enforce least-privilege security-group micro-segmentation
  • Observability & SLO stewardship - Instrument Prometheus/Grafana dashboards for order-ack latency, queue depth, reject rate; write Alertmanager rules driven by p95/p99 error-budget burn
  • Reliability testing & incident response - Schedule chaos/load drills; take part in 24×7 on-call; use perf/eBPF/FlameGraphs/tcpdump for µs-level RCA; and publish post-mortems with remediation actions
  • Capacity planning around macro events - Pre-warm spot pools and leverage Savings Plans to balance headroom and cost
  • Automation & tooling - Write Go/Python scripts for bootstrap, health probes, latency regression tests, and one-click remediation
  • Cross-team collaboration - Pair with Java/Rust engineers and quants to profile hot-path code and eliminate bottlenecks without trading downtime

Preferred Qualifications

  • Rust/Go code familiarity
  • CNCF/AWS certifications
  • XDP/DPDK experience for kernel-bypass networking

Benefits

  • Competitive salary and company benefits
  • Work-from-home arrangement (the arrangement may vary depending on the work nature of the business team)
