This repo contains supplementary materials for the book Data Science Methods and Practice, published by China Machine Press in 2024/2025.
This repo is organized as follows:
- Data Science Interview Questions & Practical Exercises.
- Case study A: Personalized content recommendation based on a LinkedIn profile.
- Case study B: Data pipeline for a conceptual convenience store example.
Interview questions & exercises
The project aims to build a system that gives users personalized content recommendations. It works in three stages: collecting relevant articles from websites based on predefined topics (RSS Crawling), grouping similar articles using embeddings and clustering (Embedding and Clustering), and analyzing a user's LinkedIn profile to understand their interests and recommend matching content (Profile Understanding and Recommendation).
The crawling module fetches content from RSS-enabled websites and saves it, with the following steps:
- Determine the mapping between predefined interests and RSS-enabled websites.
- Crawl the RSS feeds of the identified websites on a daily or weekly basis and save the crawled content.
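
The two crawling steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the repo's actual crawler: the `INTEREST_FEEDS` mapping, the example.com URLs, and the helper names `fetch_feed`, `parse_rss`, and `save_articles` are all hypothetical, and the sample feed is inlined so the parsing logic can be tried without network access.

```python
import json
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Hypothetical mapping from predefined interests to RSS feed URLs (placeholders).
INTEREST_FEEDS = {
    "machine learning": ["https://example.com/ml/rss.xml"],
    "data engineering": ["https://example.com/de/rss.xml"],
}

def fetch_feed(url, timeout=10):
    """Download the raw RSS document (not called in this offline sketch)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()

def parse_rss(xml_text, topic):
    """Extract title, link, and summary from each <item> of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "topic": topic,
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
        })
    return articles

def save_articles(articles, path):
    """Persist crawled articles as a JSON file for the downstream modules."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)

# Inline sample feed standing in for a real download.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Intro to embeddings</title><link>https://example.com/a1</link>
        <description>What embeddings are.</description></item>
  <item><title>Clustering basics</title><link>https://example.com/a2</link></item>
</channel></rss>"""

articles = parse_rss(SAMPLE_FEED, "machine learning")
```

In a scheduled daily or weekly run, one would loop over `INTEREST_FEEDS`, pass each `fetch_feed(url)` result to `parse_rss`, and write the combined list with `save_articles`.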
This module processes the content gathered by the crawling module and organizes it using embeddings and clustering techniques, with the following steps:
- Generate an embedding representation for each article by calling the OpenAI API.
- Apply a clustering algorithm to group similar articles together based on their embedding representations.
- Load the embeddings and clustering results into Weaviate, a vector database.
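
To make the clustering step concrete, here is a small sketch that groups articles by their embedding vectors. It assumes plain k-means, which is only one possible choice since the text does not name the algorithm, and it uses hand-written three-dimensional toy vectors in place of real embeddings returned by an API. The function names and article ids are illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns one cluster index per input vector."""
    # Deterministic initialization for the sketch: the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

# Toy stand-ins for article embeddings (real ones have far more dimensions).
toy_embeddings = {
    "article-ml-1": [0.9, 0.1, 0.0],
    "article-retail-1": [0.0, 0.1, 0.9],
    "article-ml-2": [0.8, 0.2, 0.1],
    "article-retail-2": [0.1, 0.0, 0.8],
}

labels = kmeans(list(toy_embeddings.values()), k=2)
clusters = dict(zip(toy_embeddings.keys(), labels))
```

The resulting `clusters` mapping (article id to cluster index) is the kind of record that would be stored alongside each embedding in the vector database.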
This module infers a user's interests from their LinkedIn profile and provides personalized recommendations, with the following steps:
- Use a large language model such as GPT to analyze users’ LinkedIn profiles and identify their professional interests.
- Use Weaviate’s hybrid search approach to provide personalized offline recommendations.
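
Hybrid search combines keyword matching with vector similarity. The sketch below illustrates that idea in plain Python so the ranking logic is visible. It is not Weaviate's implementation or client API: the token-overlap `keyword_score` is a crude stand-in for BM25, the linear blend controlled by `alpha` is a simplification, and the article titles and vectors are made-up toy data.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query, text):
    """Fraction of query tokens found in the text (simplified, not BM25)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, articles, alpha=0.5):
    """Rank titles by alpha * vector score + (1 - alpha) * keyword score."""
    scored = []
    for art in articles:
        score = (alpha * cosine_similarity(query_vec, art["vector"])
                 + (1 - alpha) * keyword_score(query, art["title"]))
        scored.append((score, art["title"]))
    return [title for _, title in sorted(scored, reverse=True)]

# Toy data: an interest extracted from a profile, plus three stored articles.
interest = "machine learning"
interest_vec = [0.9, 0.1, 0.0]
stored_articles = [
    {"title": "Retail inventory tips", "vector": [0.0, 0.1, 0.9]},
    {"title": "Machine learning in production", "vector": [0.8, 0.2, 0.1]},
    {"title": "Learning SQL window functions", "vector": [0.3, 0.7, 0.2]},
]

ranked = hybrid_rank(interest, interest_vec, stored_articles)
```

With `alpha` closer to 1 the ranking leans on semantic similarity, and closer to 0 it leans on exact keyword overlap, which is the trade-off a hybrid query lets you tune.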
(Case study B is used to introduce the data schema, real-time and offline data flows, and transaction analysis.)
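This repo contains the companion resources for the book 数据科学方法与实践 (Data Science Methods and Practice), published by 机械工业出版社 (China Machine Press) in 2024/2025. This repo is organized as follows:
- Detailed answers or hints for the thinking questions in the book.
- Case study A: Personalized content recommendation based on LinkedIn profiles.
- Case study B: Data pipeline for a convenience store example.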
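The overall project aims to create a system that provides users with personalized content recommendations. It crawls relevant articles from various websites based on predefined topics (RSS Crawling), uses techniques such as embeddings and clustering to group similar articles together (Content Embedding and Clustering), and analyzes a user's LinkedIn profile to build a picture of the user and make recommendations (User Understanding and Recommendation).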
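The RSS crawling module crawls content from RSS-enabled websites, with the following steps:
- Determine the mapping between interest topics and popular RSS-enabled websites.
- Crawl the RSS feeds of the identified websites and save the crawled articles.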
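This module processes the crawled content and organizes it using embeddings and clustering techniques, with the following steps:
- Generate embedding representations by calling the OpenAI API. An embedding is a numerical representation of text that captures the semantic meaning of the content.
- Apply a clustering algorithm to group similar articles together based on their embedding representations.
- Load the embeddings and clustering results into Weaviate, a vector database.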
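This module infers a user's interests from their LinkedIn profile and makes recommendations, with the following steps:
- Use the large language model GPT to analyze users' LinkedIn profiles and identify their professional interests.
- Use Weaviate's hybrid search approach to provide personalized offline recommendations.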
(Used to introduce the data schema, real-time and offline data flows, and transaction analysis.)