Summary

Join our team at Legalist as a Data Acquisition Engineer and contribute to designing and implementing the architecture of a large-scale crawling system.

Requirements

3+ years of experience with Python for data wrangling and cleaning
2+ years of experience with data crawling and scraping at scale (at least 100 spiders)
Production experience with Scrapy is mandatory. Distributed crawling and advanced Scrapy experience are a plus
Familiarity with scraping libraries and monitoring tools is highly recommended (BeautifulSoup, XPath, Selenium, Puppeteer, Splash)
Familiarity with data pipelining to integrate scraped items into existing data pipelines
Experience extracting data from multiple disparate sources including HTML, XML, REST, GraphQL, PDF, and spreadsheets
Experience running, monitoring and maintaining a large set of broad crawlers (100+ spiders)
Sound knowledge of bypassing bot detection techniques
Experience using techniques to protect web scrapers against site bans, IP leaks, browser crashes, CAPTCHAs, and proxy failures
Experience with cloud environments such as GCP and AWS, containerization tools such as Docker, and orchestration tools such as Kubernetes
Ability to maintain all aspects of a scraping pipeline end to end (building and maintaining spiders, avoiding bot prevention techniques, data cleaning and pipelining, monitoring spider health and performance)
OOP, SQL and Django ORM basics

Responsibilities

Help to design and implement the architecture of a large-scale crawling system
Design, implement, and maintain various components of our data acquisition infrastructure (building new crawlers, maintaining existing crawlers, data cleaners and loaders)
Develop tools to facilitate scraping at scale, monitor the health of crawlers, and ensure the data quality of scraped items
Collaborate with our product and business teams to understand and anticipate requirements, striving for greater functionality and impact in our data gathering systems

Preferred Qualifications

Experience with microservices architecture would be a plus
Familiarity with message brokers such as Kafka and RabbitMQ
Experience with DevOps
Expertise in data warehouse maintenance, specifically with Google BigQuery (ETLs, data sourcing, modeling, cleansing, documentation, and maintenance)
Familiarity with job scheduling and orchestration frameworks (e.g. Jenkins, Dagster, Prefect)

Remote Web Scraping Engineer

Legalist

Remote

Software Development

Mid-level
