AWS Intelligent Service Recommender

A dual-engine natural language recommendation system designed to map user requirements (e.g., "quickly deliver content globally") to specific AWS services (e.g., Amazon CloudFront) by analyzing technical whitepapers.

This project implements an A/B Architecture comparing a statistical baseline (TF-IDF) against a neural semantic model (BERT) to demonstrate the shift from keyword matching to context understanding.

Architecture Overview

The project is structured into two isolated engines sharing a common data source. Both are containerized using Docker to ensure reproducibility across environments.

Statistical Engine (Baseline)

  • Core Logic: TF-IDF (Term Frequency-Inverse Document Frequency).
  • Features: Unigrams + Bigrams (N-Grams) to capture phrases like "content delivery."
  • Tech Stack: PySpark (for distributed text processing), Java/OpenJDK.
  • Interface: Command Line Interface (CLI).
  • Why used: Establishes a baseline for keyword-based search performance.
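
A minimal sketch of this pipeline using PySpark MLlib; column and file names are illustrative assumptions, and the actual train_model.py may differ:

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import Tokenizer, NGram, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf_engine").getOrCreate()
docs = spark.read.csv("raw_data/whitepapers.csv", header=True)          # assumed output of convert_pdf_to_csv.py

tokens  = Tokenizer(inputCol="text", outputCol="tokens").transform(docs)
bigrams = NGram(n=2, inputCol="tokens", outputCol="bigrams").transform(tokens)
terms   = bigrams.withColumn("terms", F.concat("tokens", "bigrams"))    # unigrams + bigrams in one column

tf        = HashingTF(inputCol="terms", outputCol="tf", numFeatures=1 << 18).transform(terms)
idf_model = IDF(inputCol="tf", outputCol="tfidf").fit(tf)
vectors   = idf_model.transform(tf)   # one sparse TF-IDF vector per document; cli_recommend.py ranks queries against these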

Neural Engine (Production)

  • Core Logic: Semantic Search using Dense Vector Embeddings.
  • Model: all-MiniLM-L6-v2 (BERT-based Transformer).
  • Tech Stack: PyTorch, Sentence-Transformers, Pandas.
  • Interface: Interactive Web App (Streamlit).
  • Why used: Captures intent and context (e.g., knowing that "latency" relates to "speed"), which keyword matching misses.
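
A minimal sketch of the semantic-search flow with sentence-transformers; file and column names are assumptions, and the actual generate_brain.py / app.py may differ:

import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = pd.read_csv("raw_data/whitepapers.csv")                           # assumed output of convert_pdf_to_csv.py

# Pre-compute normalized document embeddings once ("the brain"), then reuse them at query time.
doc_emb = model.encode(docs["text"].tolist(), convert_to_tensor=True, normalize_embeddings=True)

query = "quickly deliver content globally"
q_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(q_emb, doc_emb)[0]                                 # cosine similarity against every passage
top = torch.topk(scores, k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{score:.3f}  {docs.iloc[int(idx)]['text'][:80]}")            # show the top-matching passages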

Project Structure

AWS_Recommender/
│
├── convert_pdf_to_csv.py       # Master Data Pipeline (PDF Extraction)
├── raw_data/                   # Source of Truth (AWS Whitepapers)
│
├── tfidf_engine/               # [Spark Engine]
│   ├── train_model.py          # PySpark pipeline (Tokenization -> HashingTF -> IDF)
│   ├── cli_recommend.py        # CLI tool for querying the Spark model
│   └── Dockerfile              # Java + Python environment
│
└── bert_engine/                # [BERT Engine]
    ├── generate_brain.py       # Embedding generation & noise cleaning
    ├── app.py                  # Streamlit Web Application
    └── Dockerfile              # Lightweight Python-slim environment
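
A rough sketch of the PDF-to-CSV step, assuming the pypdf library; the real convert_pdf_to_csv.py may use a different extraction library or add cleaning steps:

import csv
from pathlib import Path
from pypdf import PdfReader

with open("whitepapers.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["source", "page", "text"])
    for pdf_path in sorted(Path("raw_data").glob("*.pdf")):
        reader = PdfReader(pdf_path)
        for page_number, page in enumerate(reader.pages, start=1):
            text = (page.extract_text() or "").strip()   # extract_text() can return None for image-only pages
            if text:
                writer.writerow([pdf_path.name, page_number, text])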

Next Steps

  • Improve accuracy via sliding-window chunking of the source documents
  • Implement hybrid search that combines TF-IDF and BERT scores (see the sketch below)
  • Improve the Streamlit UI
  • Deploy for testing with interested users
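
A hedged sketch of the planned hybrid ranking: min-max normalize each engine's scores over the same candidate set, then blend them with a tunable weight. The function name and the 0.3/0.7 split are illustrative only:

import numpy as np

def hybrid_scores(tfidf_scores: np.ndarray, bert_scores: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Blend keyword relevance (TF-IDF) with semantic relevance (BERT) per candidate document."""
    def normalize(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span else np.zeros_like(x)
    return alpha * normalize(tfidf_scores) + (1 - alpha) * normalize(bert_scores)

# Example: rank four candidate services by the blended score (highest first).
ranking = np.argsort(-hybrid_scores(np.array([0.2, 0.9, 0.1, 0.4]),
                                    np.array([0.7, 0.3, 0.8, 0.6])))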
