Quora Duplicate Question Pairs

Demo Screenshots

Below are example screenshots of the web app in action:

Not Duplicate Prediction Example

Duplicate Prediction Example

🚀 Project Overview

This project detects whether two questions asked on Quora are duplicates. It uses machine learning and natural language processing (NLP) to extract features from each question pair and predict whether both questions ask the same thing. Duplicate detection can improve search results and help organize large Q&A platforms.

📦 Dataset Details

train.csv: The main dataset contains thousands of question pairs. Each row has:
- qid1, qid2: Unique IDs for each question.
- question1, question2: The actual questions in text.
- is_duplicate: 1 if the questions are duplicates, 0 otherwise.
How question IDs work: Each question has a unique ID, but the same question can appear in many pairs. For example, if question ID 2561 appears 88 times, it means that question was compared with 88 other questions.

🧩 Main Features & Workflow

Exploratory Data Analysis (EDA)
- Understand the data: missing values, duplicate rows, repeated questions.
- Visualize distributions and relationships using Jupyter notebooks.
Feature Engineering
- Extract features like word overlap, token similarity, and bag-of-words (BoW) representations.
- Use NLP libraries (nltk, fuzzywuzzy, and distance are among the dependencies) to compare the wording and structure of questions.
Model Training & Evaluation
- Train machine learning models (XGBoost, scikit-learn, etc.) to classify question pairs.
- Evaluate model performance and tune for best results.
Interactive Web App
- Use the Streamlit app to enter any two questions and instantly see if they are duplicates.
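The bag-of-words step mentioned above can be sketched without any dependencies. This is only an illustration of the idea, not the repository's code: the notebooks most likely use a scikit-learn vectorizer (the cv.pkl file suggests one), and the function names `tokenize`, `build_vocabulary`, `bow_vector`, and `pair_features`, along with the sample questions, are made up for this example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real preprocessing would do more."""
    return text.lower().split()


def build_vocabulary(questions, max_features=None):
    """Map the most frequent tokens to column indices."""
    counts = Counter(tok for q in questions for tok in tokenize(q))
    kept = [tok for tok, _ in counts.most_common(max_features)]
    return {tok: idx for idx, tok in enumerate(sorted(kept))}


def bow_vector(text, vocabulary):
    """Count how often each vocabulary token occurs in the text."""
    vec = [0] * len(vocabulary)
    for tok in tokenize(text):
        idx = vocabulary.get(tok)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vec[idx] += 1
    return vec


def pair_features(q1, q2, vocabulary):
    """Concatenate the two question vectors into one feature row."""
    return bow_vector(q1, vocabulary) + bow_vector(q2, vocabulary)


# Made-up question pairs, only for demonstration.
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?"),
    ("What is the capital of India?", "How do I cook rice?"),
]
all_questions = [q for pair in pairs for q in pair]
vocab = build_vocabulary(all_questions)
features = [pair_features(q1, q2, vocab) for q1, q2 in pairs]

print("vocabulary size:", len(vocab))
print("features per pair:", len(features[0]))
```

Each pair becomes one row whose length is twice the vocabulary size, which a classifier such as XGBoost can then consume alongside hand-crafted features.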

🗂️ Project Structure

hello.py: Simple script to test your environment.
streamlit_app/: Contains the interactive web app and helper functions.
- app.py: Main Streamlit interface for predictions.
- helper.py: Functions for text cleaning, feature extraction, and more.
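Jupyter notebooks for EDA, feature engineering, and model building: initial_EDA.ipynb, only-bow.ipynb, bow-with-basic-features.ipynb, and bow-with-preprocessing-and-advanced-features.ipynb.
requirements.txt & pyproject.toml: List the required Python packages.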
Pre-trained model and vectorizer files (model.pkl, cv.pkl).

🛠️ Installation & Setup

Clone the repository

git clone https://github.com/Code-With-Samuel/Quora-Duplicate-Question-Pairs.git
cd Quora-Duplicate-Question-Pairs

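(Optional) Create and activate a virtual environment before installing dependencies

python -m venv venv
venv\Scripts\activate  # On Windows
source venv/bin/activate  # On Mac/Linux

Run only the activation command that matches your platform.

Install dependencies
```
pip install -r requirements.txt
```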

💡 How to Use

1. Explore the Data

Open Jupyter notebooks (like initial_EDA.ipynb) to understand and visualize the dataset.
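As a rough idea of the first checks an EDA pass might run (missing values, duplicate rows, class balance), here is a dependency-free sketch. It assumes the column names listed under Dataset Details and uses a tiny made-up sample instead of the real train.csv; the notebooks themselves most likely do the same with pandas.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample with the same columns as train.csv.
SAMPLE = """qid1,qid2,question1,question2,is_duplicate
1,2,How do I learn Python?,What is the best way to learn Python?,1
3,4,What is the capital of India?,How do I cook rice?,0
1,2,How do I learn Python?,What is the best way to learn Python?,1
5,6,Why is the sky blue?,,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Rows where either question text is empty.
missing = sum(1 for r in rows if not r["question1"] or not r["question2"])

# Rows that repeat an earlier row exactly (repeats beyond the first copy).
row_counts = Counter(tuple(r.values()) for r in rows)
duplicate_rows = sum(n - 1 for n in row_counts.values())

# How many pairs carry each label.
label_counts = Counter(r["is_duplicate"] for r in rows)

print("rows:", len(rows))
print("rows with a missing question:", missing)
print("duplicate rows:", duplicate_rows)
print("label counts:", dict(label_counts))
```

To try it on the real data, replace `io.StringIO(SAMPLE)` with an open file handle for train.csv.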

2. Train & Test Models

Use the provided notebooks to train your own models, or use the pre-trained model and vectorizer (model.pkl and cv.pkl).

3. Try the Web App

Run the Streamlit app:
```
streamlit run streamlit_app/app.py
```
Enter any two questions and get an instant prediction: "Duplicate" or "Not Duplicate".

🔍 Key Concepts Explained Simply

Question IDs: Each question has a unique number, but popular questions can appear in many pairs. This helps us find which questions are asked most often.
Feature Extraction: The code compares the two questions using features such as the number of shared words and token-level similarity scores.
Value Counts: Shows how many times each question is used. If a question is repeated a lot, it might be a common or trending topic.
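To make the feature extraction idea concrete, here is a small sketch of two word-overlap features and one similarity score. The function names are hypothetical and are not taken from helper.py. The similarity score uses `difflib.SequenceMatcher` from the standard library as a stand-in for the fuzzywuzzy scores the project depends on, so the exact numbers will differ from what the notebooks produce.

```python
from difflib import SequenceMatcher


def word_set(text):
    """Unique lowercase words in a question."""
    return set(text.lower().split())


def common_word_count(q1, q2):
    """Number of distinct words the two questions share."""
    return len(word_set(q1) & word_set(q2))


def word_share(q1, q2):
    """Shared words divided by the combined number of distinct words."""
    w1, w2 = word_set(q1), word_set(q2)
    total = len(w1) + len(w2)
    return len(w1 & w2) / total if total else 0.0


def char_similarity(q1, q2):
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, q1.lower(), q2.lower()).ratio()


q1 = "How do I learn Python"
q2 = "How can I learn Java"
print("common words:", common_word_count(q1, q2))
print("word share:", word_share(q1, q2))
print("similarity:", round(char_similarity(q1, q2), 2))
```

Features like these are usually appended to the bag-of-words vectors, giving the model explicit signals about how much the two questions overlap.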

📊 Example

Suppose you see this output:

2561      88
30782     120
4044      111
...

This means question ID 2561 was involved in 88 different pairs. It does NOT mean there are 88 different questions with that ID—just that it was compared with 88 other questions.
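Counts like these come from tallying how often each ID appears across both ID columns. Below is a minimal sketch with made-up pairs and the standard library's `Counter`. With pandas, a comparable result would presumably come from something like `pd.concat([df["qid1"], df["qid2"]]).value_counts()`, though that is an assumption about how the output above was produced.

```python
from collections import Counter

# Made-up (qid1, qid2) pairs; question 1 takes part in three of them.
pairs = [(1, 2), (1, 3), (4, 1), (2, 5)]

# Count every appearance of an ID, whether it sits in qid1 or qid2.
appearances = Counter(qid for pair in pairs for qid in pair)

for qid, count in appearances.most_common(3):
    print(qid, count)

print("distinct questions:", len(appearances))
```

Question 1 appears in three pairs here, yet it is still a single question: the number of distinct questions is the number of keys in the counter, not the sum of the counts.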

🧪 Dependencies

Main Python libraries used:

numpy, pandas, seaborn, matplotlib, scikit-learn, xgboost, distance, fuzzywuzzy, nltk, BeautifulSoup4, plotly, streamlit

🌟 Why This Project?

Helps Quora and similar platforms keep their content clean and organized.
Saves users time by reducing duplicate answers.
Demonstrates practical use of NLP and machine learning for real-world problems.

📄 License

MIT License

This project is designed to be beginner-friendly, well-documented, and easy to extend. Whether you want to learn about NLP, build your own duplicate detection system, or just explore a cool dataset, this repository is a great place to start!

