Senior Big Data Engineer


Definitive Healthcare

πŸ“Remote - Worldwide

Summary

Join Definitive Healthcare, a leading healthcare commercial intelligence company, and contribute to our mission of transforming data and analytics. We offer a collaborative and inclusive work environment where employees are valued and empowered to make a difference. As a Senior Big Data Engineer, you will design, develop, and maintain scalable data pipelines while ensuring data quality and integrity. You will work with technologies including Python, Spark, Databricks, and cloud platforms. We provide a competitive benefits package and opportunities for professional growth. Our company culture prioritizes community service and employee well-being, making it a rewarding place to work.

Requirements

  • Technical Skills:
      • Hands-on Python or Scala programming
      • Strong experience with Apache Spark and Databricks
      • Hands-on experience with Apache Airflow or similar workflow orchestration tools
      • Data modeling and processing fundamentals for large-scale data volumes
      • Knowledge of data cleansing and curation techniques
      • Familiarity with Unity Catalog or other metadata management tools
      • Understanding of data governance principles and best practices
      • Experience with cloud platforms (AWS and GCP)
      • Strong understanding of normalization and denormalization
      • Proficiency in CI/CD tools and practices (e.g., Jenkins, GitLab CI)
      • Experience with JVM tuning and Spark job performance investigation
      • Experience with Medallion architecture for data maturity lifecycle
      • Familiarity with containerization
  • Excellent problem-solving and analytical skills
  • Strong communication and collaboration skills
  • Ability to work independently and as part of a team
  • Detail-oriented with a focus on delivering high-quality work

Responsibilities

  • Design and Develop Data Pipelines:
      • Build and maintain scalable data pipelines using Python, Spark, and Databricks
      • Implement data workflows and ETL processes using Apache Airflow
  • Data Integration and Management:
      • Integrate data from various sources (AWS, GCP, on-premises) into a unified data warehouse
      • Handle a variety of data formats, such as CSV, text, XML, Parquet, and Delta
      • Ensure data quality and integrity through effective data cleansing and curation practices
      • Manage and optimize data storage solutions, ensuring high availability and performance
      • Automate observability of data and workloads
  • Metadata Management and Governance:
      • Implement and manage Unity Catalog for metadata management
      • Ensure data governance policies are followed, including data security, privacy, and compliance
      • Develop and maintain data documentation and data dictionaries
      • Automate data observability across pipelines
  • Performance Tuning and Troubleshooting:
      • Optimize Spark jobs for performance and efficiency
      • Investigate and resolve performance bottlenecks in Spark applications
      • Use JVM tuning techniques to improve application performance
  • Data Maturity Lifecycle:
      • Implement and manage the Medallion architecture for data maturity lifecycle
      • Ensure data is appropriately processed and categorized at each stage (bronze, silver, gold) to maximize its usability and value
  • Collaboration and Continuous Improvement:
      • Work closely with data scientists, analysts, and other stakeholders to understand data needs and deliver solutions
      • Implement CI/CD pipelines to automate deployment and testing of data infrastructure
      • Stay up to date with the latest industry trends and technologies to continuously improve data engineering practices

Preferred Qualifications

  • Certification in cloud platforms (e.g., AWS Certified Data Analytics, Google Cloud Professional Data Engineer)
  • Familiarity with SQL and NoSQL databases
  • Experience in a similar role within a fast-paced, data-driven environment

Benefits

  • Competitive benefits package including great healthcare benefits and a 401(k) match
  • Flexible and dynamic culture

This job is filled or no longer available.