This repository implements a full ETL pipeline and subsequent data analysis for Exsport Bag products on Tokopedia. The pipeline extracts raw data via web scraping, transforms and enriches it, loads the results into a PostgreSQL database and an Excel workbook, and performs exploratory and trend analyses in a Jupyter Notebook.
e-commerce-analyst-portfolio/
│
├── extract-raw-data/ # Raw CSVs from scraping tasks
│ ├── exsport_tokped_raw.csv
│ └── exsport_stock_tokped.csv
│
├── clean-data/ # Cleaned & transformed CSVs
│ ├── data_product_exsport.csv
│ ├── data_stock_exsport.csv
│ ├── data_name_exsport.csv
│ ├── data_color_exsport.csv
│ ├── data_category_exsport.csv
│ ├── data_order_tokped.csv
│ ├── data_review_order_tokped.csv
│ └── data_user_tokped.csv
│
├── database/ # Final exports
│ ├── ecommerce_exsport.xlsx
│ └── ecommerce_exsport_db.sql
│
├── source/ # ETL source code
│ ├── extract_exsport.py # Extract raw product and stock data
│ ├── func.py # Helper functions (categorization, ID generation)
│ ├── generate_data.py # Generate synthetic user/order/review data
│ ├── transform_exsport.py # Clean, categorize, and transform data
│ └── load_exsport.py # Load data into PostgreSQL & Excel
│
├── validate-raw-data/ # Validation logs for raw data
│ └── validate_data.txt
│
├── ecommerce_data_analysis.ipynb # EDA and trend analysis notebook
└── README.md # This file
-
Extract (
extract_exsport.py
via Luigi + Selenium)- Scrapes dynamic product listings (name, link, price, discount, rating, sold count, image URL) across multiple pages.
- Handles infinite scrolling and JS-rendered content to ensure complete data capture.
- Outputs raw CSVs under
extract-raw-data/
.
-
Generate Synthetic Data (
generate_data.py
)- Enriches product data with synthetic stock levels, customer, order, and review datasets using Faker.
- Outputs additional CSVs simulating user behavior.
-
Transform (
transform_exsport.py
via Luigi + Pandas)- Cleans and normalizes text and numeric fields (prices, sold counts).
- Builds dimension tables (product, color, category) and fact tables following a star schema.
- Outputs cleaned CSVs under
clean-data/
.
-
Load & Export (
load_exsport.py
via Luigi, psycopg2, OpenPyXL)- Creates
ecommerce_exsport_db
and corresponding schemas in PostgreSQL. - Bulk loads cleaned CSVs with upsert logic to ensure idempotence.
- Exports all tables into
ecommerce_exsport.xlsx
and a SQL dump for database portability.
- Creates
The ecommerce_data_analysis.ipynb
notebook performs exploratory and trend analyses on the cleaned data:
- Objective: Identify top-selling product categories by total quantity sold.
- Approach: Merge orders, stock, and category tables, then aggregate and sort by quantity.
- Visualization: Horizontal bar chart highlighting the volume for each category.
- Insight: Reveals which product categories drive the highest sales volumes.
- Objective: Examine how sales volumes per category evolve over time.
- Approach: Convert order dates to
year_month
, merge with product and category data, and group by period. - Visualization: Line plot comparing monthly sales trends across categories.
- Insight: Highlights seasonal patterns and growth opportunities.
- Objective: Calculate total revenue generated by each category.
- Approach: Compute
revenue = quantity * price_sale
, merge with category metadata, and summarize. - Visualization: Highlighted horizontal bar chart with category revenue, emphasizing the top performer.
- Insight: Pinpoints the most profitable categories and informs inventory prioritization.
All charts are generated using Matplotlib and Seaborn, showcasing the capability to transform raw sales data into actionable business insights.
- Python & Luigi for ETL orchestration
- Selenium for robust web scraping
- Pandas for data manipulation & cleaning
- psycopg2 & PostgreSQL for relational storage
- OpenPyXL for Excel exports
- Matplotlib & Seaborn for data visualization
- Ensure dependencies are installed: install required Python packages:
pip install luigi selenium pandas psycopg2 openpyxl faker
- Run the full pipeline:
luigi --module extract_exsport ExtractTokpedExsportData luigi --module generate_data TransformTokpedExportData luigi --module transform_exsport TransformTokpedExportData luigi --module load_exsport LoadExsportDatabase
- Explore analysis: Open
ecommerce_data_analysis.ipynb
in Jupyter Notebook. - Check outputs in
clean-data/
,database/
, and database engine.
by Zulfa Nurfajar