Senior Data Collection Engineer


Centric Software

πŸ“Remote - Worldwide

Summary

Centric Software is hiring a Senior Data Collection Engineer to build scalable, high-quality data collection systems. You will design and maintain robust web crawlers using Scrapy, with an emphasis on modularity and maintainability; enhance supporting infrastructure, including CI/CD pipelines and Scrapyd; and integrate performance monitoring systems backed by regular spider audits. The role also involves building data validation mechanisms and collaborating with internal consumers to ensure data quality, as well as upholding coding standards, conducting code reviews, and mentoring junior engineers. Working cross-functionally, you will help promote a culture of knowledge sharing and continuous improvement. The role requires expertise in web technologies, cloud infrastructure, and data pipeline development; the ideal candidate also brings strong communication and problem-solving skills.

Requirements

  • Comfort with Git workflows, code reviews, and CI/CD pipelines
  • Experience with cloud infrastructure like AWS
  • Experience with monitoring/observability systems like Grafana and Sentry
  • Knowledge of the Web environment (model, standards, DOM, Request-Response, Cookies, JavaScript, Browsers, Headers, XHR, etc.)
  • Excellent communication skills in English, both written and spoken
  • A collaborative mindset with a proactive approach to knowledge sharing
  • Strong analytical thinking and problem-solving abilities
  • Commitment to continuous improvement, mentoring, and agile team dynamics
  • Commitment to staying up to date with technology trends to keep our software as innovative as possible

Responsibilities

Design and Build Robust Web Crawlers
  • Develop and maintain spiders for high-scale data extraction using Scrapy
  • Ensure spiders are modular, reusable, and easy to maintain with components such as loaders, middlewares, and pipelines
  • Apply advanced techniques to bypass anti-bot mechanisms, including rotating proxies, captcha-solving strategies, and fingerprinting

Enhance and Maintain Infrastructure
  • Build scalable CI/CD pipelines for automated testing, deployment, and monitoring of spiders
  • Leverage tools like Scrapyd for centralized spider scheduling and lifecycle management
  • Ensure efficient parallelization and cloud deployment for high-throughput crawling

Code Quality and Consistency
  • Uphold coding standards and implement consistent practices across teams
  • Conduct thorough code reviews and mentor junior engineers on clean code principles
  • Maintain version control and detailed change logs for spider development

Monitoring, Maintenance & Reliability
  • Integrate performance monitoring systems to provide real-time alerts and health checks
  • Schedule periodic spider audits to handle site structure changes and improve reliability
  • Troubleshoot failures and optimize resource usage (CPU/network) for crawling efficiency

Data Integrity and Accuracy
  • Build robust data validation mechanisms to guarantee quality outputs
  • Collaborate with internal consumers to ensure the data collected aligns with business requirements
  • Continuously track data anomalies and automate recovery strategies

Collaboration and Knowledge Sharing
  • Work cross-functionally with product, engineering, and other data teams
  • Promote a culture of documentation, onboarding tools, and internal knowledge bases
  • Contribute to training initiatives, helping the team stay current on scraping techniques and technologies
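As a flavor of the data-validation work described above, a Scrapy-style item pipeline that rejects incomplete or malformed records is a common pattern. The sketch below is a framework-agnostic illustration, not Centric's actual implementation: the field names and checks are assumptions, and the local DropItem class stands in for scrapy.exceptions.DropItem so the example runs standalone.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so this sketch runs standalone."""


class ValidationPipeline:
    """Illustrative item pipeline: rejects records that are missing
    required fields or carry an unparseable price (field names assumed)."""

    required_fields = ("title", "price", "url")

    def __init__(self):
        self.dropped = 0  # simple counter, useful for anomaly tracking

    def process_item(self, item, spider=None):
        # Reject items missing any required field.
        missing = [f for f in self.required_fields if not item.get(f)]
        if missing:
            self.dropped += 1
            raise DropItem(f"missing fields: {missing}")
        # Reject items whose price cannot be parsed as a number.
        try:
            price = float(str(item["price"]).lstrip("$"))
        except ValueError:
            self.dropped += 1
            raise DropItem(f"unparseable price: {item['price']!r}")
        if price < 0:
            self.dropped += 1
            raise DropItem(f"negative price: {price}")
        return item
```

In a real Scrapy project, the same class (minus the local exception) would be enabled through the ITEM_PIPELINES setting, and the drop counter could feed the monitoring and alerting mentioned above.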

Preferred Qualifications

  • Familiarity with TLS/SSL, the TCP/IP stack, and low-level web networking (a strong plus)
  • Proficiency in designing fault-tolerant systems and deploying them at scale
  • Familiarity with containerized deployments
  • Proficiency in developing scalable web crawlers and data pipelines using Python and Scrapy
  • Experience building resilient scraping systems across diverse web architectures
  • Prior experience mentoring or leading junior developers
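One concrete building block behind the fault-tolerance and resilience themes above is retrying transient fetch failures with exponential backoff and jitter. The sketch below is a generic illustration under assumed names and defaults, not a description of Centric's stack; the fetch callable and sleep parameter are injectable so the behavior can be tested without network access or real waiting.

```python
import random
import time


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with exponential backoff.

    fetch is any callable that raises on error; sleep is injectable so
    tests can avoid real waiting. Names and defaults are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with jitter to avoid synchronized retries
            # hammering the target site in lockstep.
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            sleep(delay)
```

In a Scrapy deployment this concern is usually handled by the built-in RetryMiddleware, but the same backoff-with-jitter idea applies wherever a crawler talks to flaky upstreams.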

