Data Engineer Intern

Sayari

💵 $41k-$52k
📍 Remote - United States

Summary

Join Sayari's Data Engineering team as a Data Engineer Intern specializing in web crawling! This remote, paid internship (20-30 hours/week) focuses on maintaining and improving Sayari's web crawling framework, emphasizing scalability and reliability. You'll collaborate with Product and Software Engineering teams to ensure crawling deployments meet product requirements and integrate efficiently with ETL pipelines. The internship involves investigating and implementing web crawlers for new sources, maintaining existing infrastructure, improving metrics and reporting, and contributing to Sayari's data product development. This role offers valuable experience in large-scale web crawling and data engineering.

Requirements

  • Experience with Python
  • Experience managing web crawling at scale with any framework (Scrapy is a plus)
  • Experience working with Kubernetes
  • Experience working collaboratively with git
  • Experience working with selectors such as XPath, CSS, and JMESPath
  • Experience with browser developer tools (Chrome/Firefox)

Responsibilities

  • Investigate and implement web crawlers for new sources
  • Maintain and improve existing crawling infrastructure
  • Improve metrics and reporting for web crawling
  • Help improve and maintain ETL processes
  • Contribute to the development and design of Sayari's data product

Preferred Qualifications

  • Experience with Apache projects such as Spark, Avro, NiFi, and Airflow
  • Experience with datastores Postgres and/or RocksDB
  • Experience working on a cloud platform like GCP, AWS, or Azure
  • Working knowledge of API frameworks, primarily REST
  • Understanding of or interest in knowledge graphs
  • Experience with *nix environments
  • Experience with reverse engineering
  • Proficiency in bypassing anti-crawling techniques
  • Experience with JavaScript

Benefits

  • This is a remote, paid internship with an expected workload of 20-30 hours a week
  • $20 - $25 an hour
