Senior Software Engineer, Data

The Allen Institute for AI

💵 $146k-$220k
📍 Remote - United States

Summary

Join the Allen Institute for AI (Ai2) as a Data Engineer to integrate a large U.S. patent corpus into the Semantic Scholar platform. This NSF-funded, 2-year fixed-term position (with renewal possibility) involves high-impact data engineering, focusing on linking patent and academic research data, resolving citations, disambiguating inventors and authors, applying topic models, and extending data products and APIs. You will work in a high-performing engineering environment, handling full-stack data tasks such as building pipelines, integrating or training practical ML models, and deploying production services. The role requires strong Python engineering skills, experience with SQL and schema design, familiarity with ML workflows, and experience with workflow orchestration tools and cloud infrastructure. The position offers a competitive compensation package including a base salary range of $146,880 - $220,320 and generous bonus plans. Remote work from any US state is allowed.

Requirements

  • Bachelor's degree and 8+ years of technical experience; relevant experience may substitute for education
  • Strong Python engineering skills, especially for building and maintaining data pipelines
  • Experience with SQL and schema design in production settings (PostgreSQL preferred)
  • Familiarity with common ML workflows (training classifiers, tuning models, and deploying for inference), particularly for large-scale or ambiguous structured datasets
  • Comfortable working with structured datasets (XML/JSON/Parquet) and writing ETL code
  • Experience with workflow orchestration tools (Airflow or similar) and cloud infrastructure (e.g. AWS, S3, Docker)
  • Strong communication skills and a strong sense of ownership for results

Responsibilities

  • Build scalable data pipelines (Airflow) for citation resolution and corpus integration
  • Develop and deploy lightweight ML models for inventor disambiguation and author linking
  • Train or adapt a topic model to classify patents using titles, abstracts, claims, and specs
  • Extend REST APIs to expose linked metadata and topic classifications
  • Contribute to dashboards and tools for evaluating data quality and model precision
  • Collaborate with Ai2 engineers to ensure maintainability, test coverage, and robust deployment
  • Produce reliable, well-documented code and contribute technical designs that support long-term maintainability

Preferred Qualifications

  • Experience with author disambiguation, entity resolution, or record linkage problems
  • Experience applying vector-based similarity or topic modeling techniques to real-world corpora at scale
  • Exposure to citation networks or scholarly data systems (e.g., arXiv, OpenAlex, USPTO)
  • Comfort building internal APIs and dashboards to support ML and data quality review

Benefits

  • Team members and their families are covered by medical, dental, vision, and an employee assistance program
  • Team members are able to enroll in our health savings account plan, our healthcare reimbursement arrangement plan, and our health care and dependent care flexible spending account plans
  • Team members are able to enroll in our company's 401k plan
  • Team members will receive $125 per month to assist with commuting or internet expenses and will also receive $200 per month for fitness and wellbeing expenses
  • Team members will also receive up to 10 sick days, up to 7 personal days, and up to 20 vacation days per year, plus 12 paid holidays throughout the calendar year
  • Team members will be able to receive annual bonuses
  • Our base salary range is $146,880 - $220,320, and in addition we have generous bonus plans to provide a competitive compensation package
  • Persons in these roles are welcome to work remotely from any state in the US

