🚀 Extract, Transform, and Load (ETL) real-time Reddit data, analyze sentiment, and store it in a star-schema data warehouse.
This project extracts trending Reddit posts from a subreddit, performs sentiment analysis, and stores structured data in a PostgreSQL data warehouse using a star schema.
✅ ETL Pipeline: Extracts, cleans, and loads real-time Reddit data.
✅ Sentiment Analysis: Uses AI (TextBlob) to categorize sentiment (Positive, Neutral, Negative).
✅ Data Warehouse: Stores posts in a structured star schema for optimized querying.
✅ SQL Optimization: Enables fast queries for insights into top trends and discussions.
🔹 Python – For data extraction, transformation, and loading
🔹 Reddit API (PRAW) – Fetches live Reddit data
🔹 PostgreSQL – Stores structured data
🔹 SQLAlchemy – Connects Python to PostgreSQL
🔹 TextBlob – Performs sentiment analysis
🔹 Pandas – Data transformation
🔹 Git – Version control
┌──────────────────────────┐
│  Reddit API (PRAW)       │
└──────────┬──────────────┘
           ▼  
┌──────────────────────────┐
│  Python ETL Script       │
│  - Extract: Fetch posts  │
│  - Transform: Clean data │
│  - Sentiment Analysis    │
└──────────┬──────────────┘
           ▼  
┌──────────────────────────┐
│  PostgreSQL (Star Schema)│
│  - fact_posts           │
│  - dim_authors
│  - dim_subreddit        │
│  - dim_date             │
└──────────┬──────────────┘
           ▼  
┌──────────────────────────┐
│  SQL Queries & Insights  │
│  - Top trending posts    │
│  - Sentiment breakdown   │
│  - Most active users     │
└──────────────────────────┘
git clone https://github.com/yourusername/reddit-trend-tracker.git
cd reddit-trend-trackerEnsure you have Python 3.8+ installed. Then, install the required libraries:
pip install praw pandas textblob sqlalchemy psycopg2Make sure to setup the praw configuration from praw's docs.
- Open PostgreSQL and create a new database:
CREATE DATABASE reddit_dw;- Run the SQL schema script to set up tables:
psql -U your_user -d reddit_dw -f setup_schema.sql| Column | Type | Description | 
|---|---|---|
| post_id | VARCHAR(20) | Reddit post ID | 
| author_id | INT | Foreign key to dim_authors | 
| subreddit_id | INT | Foreign key to dim_subreddit | 
| score | INT | Upvotes for the post | 
| num_comments | INT | Number of comments on the post | 
| sentiment | VARCHAR(20) | Positive, Negative, or Neutral | 
| date_id | INT | Foreign key to dim_date | 
| Column | Type | Description | 
|---|---|---|
| id | SERIAL (PK) | Unique author ID | 
| author_name | VARCHAR(255) | Reddit username | 
| Column | Type | Description | 
|---|---|---|
| id | SERIAL (PK) | Unique subreddit ID | 
| subreddit_name | VARCHAR(255) | Subreddit name | 
| Column | Type | Description | 
|---|---|---|
| date_id | SERIAL (PK) | Unique date ID | 
| date_value | DATE | Actual date | 
| year | INT | Year of post | 
| month | INT | Month of post | 
| day_of_week | INT | Day of the week | 
Run the Python script to fetch and process Reddit posts:
python extract_transform.pyAfter extracting data, load it into your database:
python load_to_postgres.pySELECT p.post_id, a.author_name, p.score
FROM fact_posts p
JOIN dim_authors a ON p.author_id = a.author_id
ORDER BY p.score DESC
LIMIT 5;SELECT sentiment, COUNT(*) as post_count
FROM fact_posts
GROUP BY sentiment
ORDER BY post_count DESC;SELECT a.author_name, COUNT(*) as post_count
FROM fact_posts p
JOIN dim_authors a ON p.author_id = a.author_id
GROUP BY a.author_name
ORDER BY post_count DESC
LIMIT 10;- Enhance Sentiment Analysis: Use NLTK or a machine learning model instead of TextBlob.
- Streamline with Apache Airflow: Automate ETL pipeline scheduling.
- Deploy on AWS: Store processed data in Amazon S3, run queries using AWS Athena.
- Create a Web Dashboard: Use Tableau or Power BI to visualize Reddit trends.
🌟 Ready to dive into the data? Clone, run, and explore the trends!
📌 GitHub Repository: 🔗 Link to Repo