Data Scientist

Everstream Analytics
Summary
Join our Natural Language Processing (NLP) and Generative AI Data Science team as a Data Science Intern. You will gain hands-on experience collecting and working with real-world, publicly available data from online sources, including news outlets and company websites. The role involves writing Python scripts, digging into the structure of web pages, and building clean, usable datasets for high-impact AI models. You will contribute to data projects that support NLP and generative AI initiatives, collaborate with a talented team, and build a portfolio, all in a flexible, fully remote work environment. This is an opportunity to work with modern industry tools and techniques and to level up your Python and web scraping skills.
Requirements
- Be currently pursuing a degree in Computer Science, Data Science, Information Technology, or a related field
- Demonstrate familiarity with Python and libraries such as BeautifulSoup, Scrapy, or Selenium for data collection tasks
- Show understanding of HTML, CSS, and JavaScript to navigate and parse web content effectively
- Possess basic knowledge of data storage formats and databases (e.g., CSV, JSON, SQL)
- Possess strong problem-solving skills and attention to detail
- Demonstrate excellent communication skills, both written and verbal
Responsibilities
- Develop and maintain scripts to automate the collection of publicly available data from online sources, ensuring compliance with each website's terms of service and robots.txt directives
- Clean, validate, and organize collected data to ensure accuracy and usability for downstream tasks
- Store extracted data in structured formats such as CSV, JSON, or databases, ensuring efficient retrieval and analysis
- Work closely with data scientists and analysts to understand data requirements and ensure legal compliance
- Document data collection processes, data dictionaries, and any challenges encountered to facilitate knowledge sharing and future maintenance
Preferred Qualifications
- Demonstrate familiarity with AI-powered data collection tools (e.g., Firecrawl)
- Demonstrate familiarity with web concepts such as sitemaps, robots.txt, and RSS feeds
- Demonstrate experience with data visualization tools or libraries (e.g., Matplotlib, Seaborn)
- Demonstrate familiarity with version control systems like Git
- Show understanding of ethical considerations and legal guidelines related to data collection
- Demonstrate ability to work independently and manage time effectively in a remote or hybrid work environment
Benefits
- Fully remote work environment