A dual-engine natural language recommendation system that maps user requirements (e.g., "quickly deliver content globally") to specific AWS services (e.g., Amazon CloudFront) by analyzing technical whitepapers.
This project implements an A/B architecture comparing a statistical baseline (TF-IDF) against a neural semantic model (BERT) to demonstrate the shift from keyword matching to context understanding.
The project is structured into two isolated engines sharing a common data source. Both are containerized using Docker to ensure reproducibility across environments.
Engine A: TF-IDF Baseline

- Core Logic: TF-IDF (Term Frequency-Inverse Document Frequency).
- Features: Unigrams + Bigrams (N-Grams) to capture phrases like "content delivery."
- Tech Stack: PySpark (for distributed text processing), Java/OpenJDK.
- Interface: Command Line Interface (CLI).
- Why used: Establishes a baseline for keyword-based search performance; a pipeline sketch follows this list.
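A minimal sketch of the Spark stage, assuming the whitepapers have already been flattened into a CSV with a `text` column (the file name and column names here are illustrative, not the project's actual ones):

```python
# Minimal TF-IDF sketch; assumes a CSV with a "text" column (names illustrative).
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import Tokenizer, NGram, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf_engine").getOrCreate()
docs = spark.read.csv("aws_services.csv", header=True)  # hypothetical pipeline output

# Tokenize, build bigrams, and merge both token sets so phrases like
# "content delivery" survive as single features.
docs = Tokenizer(inputCol="text", outputCol="unigrams").transform(docs)
docs = NGram(n=2, inputCol="unigrams", outputCol="bigrams").transform(docs)
docs = docs.withColumn("terms", F.concat("unigrams", "bigrams"))

# Hash terms into fixed-size count vectors, then reweight by inverse document frequency.
tf = HashingTF(inputCol="terms", outputCol="raw_tf", numFeatures=1 << 18).transform(docs)
tfidf = IDF(inputCol="raw_tf", outputCol="features").fit(tf).transform(tf)
```

HashingTF avoids building a driver-side vocabulary, which keeps the pipeline scalable in distributed mode; the trade-off is the possibility of rare hash collisions between terms.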
Engine B: BERT Semantic Engine

- Core Logic: Semantic Search using Dense Vector Embeddings.
- Model: `all-MiniLM-L6-v2` (BERT-based Transformer).
- Tech Stack: PyTorch, Sentence-Transformers, Pandas.
- Interface: Interactive Web App (Streamlit).
- Why used: Captures intent and context that keyword matching misses (e.g., knowing that "latency" relates to "speed"); a query sketch follows this list.
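A minimal sketch of the query path, assuming embeddings are computed in memory; the two corpus snippets below are hypothetical stand-ins for the extracted whitepaper text:

```python
# Semantic search sketch; corpus snippets are hypothetical stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Amazon CloudFront is a content delivery network that speeds up distribution of content.",
    "Amazon S3 is an object storage service offering scalability and durability.",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Embed the user's requirement and rank documents by similarity.
query_emb = model.encode("quickly deliver content globally", convert_to_tensor=True)
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```

`util.semantic_search` ranks by cosine similarity by default, so a query about delivering content globally surfaces the CloudFront passage even though no keywords overlap.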
```
AWS_Recommender/
│
├── convert_pdf_to_csv.py     # Master data pipeline (PDF extraction)
├── raw_data/                 # Source of truth (AWS whitepapers)
│
├── tfidf_engine/             # [Spark Engine]
│   ├── train_model.py        # PySpark pipeline (Tokenization -> HashingTF -> IDF)
│   ├── cli_recommend.py      # CLI tool for querying the Spark model
│   └── Dockerfile            # Java + Python environment
│
└── bert_engine/              # [BERT Engine]
    ├── generate_brain.py     # Embedding generation & noise cleaning
    ├── app.py                # Streamlit web application
    └── Dockerfile            # Lightweight python-slim environment
```
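The internals of `convert_pdf_to_csv.py` are not shown above; the following is a hypothetical sketch of that step, assuming `pypdf` as the extractor, with illustrative file and column names:

```python
# Hypothetical sketch of the PDF-to-CSV step; the real convert_pdf_to_csv.py
# may use a different extractor. Assumes pypdf; names are illustrative.
import csv
from pathlib import Path
from pypdf import PdfReader

with open("aws_services.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["service", "text"])
    for pdf_path in sorted(Path("raw_data").glob("*.pdf")):
        # Concatenate the text of every page; extract_text() can return None.
        text = " ".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
        writer.writerow([pdf_path.stem, text])
```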
Planned improvements:

- Improve accuracy via sliding-window chunking techniques
- Implement hybrid search, combining TF-IDF and BERT scores to pair keyword precision with semantic recall (a fusion sketch follows this list)
- Improve the Streamlit UI
- Deploy for testing with interested individuals
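For the hybrid-search item, one plausible design (an assumption, not a committed implementation) is weighted score fusion after per-engine min-max normalization:

```python
# Hypothetical score fusion for hybrid search: min-max normalize each
# engine's scores so they share a [0, 1] scale, then blend with weight alpha.
def hybrid_scores(tfidf_scores, bert_scores, alpha=0.5):
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    return [
        alpha * b + (1 - alpha) * t
        for t, b in zip(minmax(tfidf_scores), minmax(bert_scores))
    ]

# Example: blend scores for three candidate services, weighting BERT at 0.7.
print(hybrid_scores([0.12, 0.40, 0.05], [0.61, 0.58, 0.90], alpha=0.7))
```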