Summary

Join ServiceNow's PLATO organization as a Senior Staff Machine Learning Engineer - Site Reliability Engineer and contribute to the design, development, and implementation of infrastructure and platform features for AI workloads. Collaborate with researchers and infrastructure teams to ensure efficient and reliable GPU cluster performance. Continuously improve SRE practices by transforming operational use cases into software tooling requirements. Execute deployment and support activities for AI/ML developers, building high-quality, scalable, and reusable code. Work with product owners to understand requirements and own code from design to delivery. Mentor colleagues and promote knowledge sharing. This position requires passing a ServiceNow background screening, including a credit check, criminal/misdemeanor check, and drug test. Due to Federal requirements, only US citizens, US naturalized citizens, or US Permanent Residents are considered.

Requirements

Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry
Proficient in prompt engineering and developing LLM-based features
Experience with methods of training and fine-tuning large language models, such as distillation, supervised fine-tuning and policy optimization
Experience in using AI productivity tools such as Cursor, Windsurf, etc.
8+ years of experience with infrastructure and platform operations, deployments, SRE, and DevOps with a continued focus on improving Platform health
6+ years of experience operating highly-available distributed workloads on Kubernetes following a DevOps approach
6+ years of development experience with Python, GoLang, Java or similar languages
Experience with DevOps tooling (e.g. Helm / Ansible / Kubernetes / Prometheus / Splunk / GitLab CI)
Strong working experience operating distributed systems built on Linux and J2EE
Experience with software-defined networking, infrastructure as code and configuration management
Experience building software for compliance and security in regulated environments
Ability to drive outcomes in projects with material technical risk
This position requires passing a ServiceNow background screening, USFedPASS (US Federal Personnel Authorization Screening Standards). This includes a credit check, a criminal/misdemeanor check and a drug test. Any employment is contingent upon passing the screening
Due to Federal requirements, only US citizens, US naturalized citizens or US Permanent Residents, holding a green card, will be considered

Responsibilities

Contribute to the design, development and implementation of infrastructure, platform, deployment and observability features that power AI workloads
Collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable
Contribute to the continuous improvement of the SRE practice by turning operational use cases into requirements for software tooling
Contribute to the execution of deployment and support activities for AI/ML developers
Build high-quality, clean, scalable and reusable code by enforcing best practices around software engineering architecture and processes (code reviews, unit testing, etc.)
Work with the product owners to understand detailed requirements and own your code from design through implementation, test automation and delivery of a high-quality product to our users
Operate LLMs on NVIDIA GPUs
Be a mentor for colleagues and help promote knowledge-sharing

Benefits

Health plans, including flexible spending accounts
A 401(k) Plan with company match
ESPP
Matching donations
A flexible time away plan
Family leave programs

Senior Staff Machine Learning Engineer-DevOps/Site Reliability Engineer

ServiceNow

Remote

DevOps

Senior