updated readme.
A robust FastAPI application that classifies the type of a given website (e.g., E-commerce, News, Blog, Corporate) using a hybrid approach. It combines powerful Machine Learning models with intelligent heuristic rules for accurate and reliable classification.
- Hybrid Classification Engine: Utilizes a trained scikit-learn model (Logistic Regression with TF-IDF features) alongside a comprehensive set of heuristic rules (keywords, domain suffixes, structural checks) for enhanced accuracy.
- FastAPI Backend: Provides a high-performance, asynchronous API for website classification.
- Simple Web UI: Includes a basic, embedded React-based frontend for easy testing and demonstration directly from your browser.
- Extensible: Easily extendable with new website types, refined heuristics, or updated ML models by training with more diverse data.
- Robust Web Scraping: Handles common issues like missing URL schemes and uses appropriate headers for fetching website content.
Follow these steps to set up and run the Website Type Classifier API on your local machine.
- Python 3.10 or higher (as seen in your screenshot,
python 3.10.11) pip(Python package installer)
-
Clone the repository:
git clone [https://github.com/your-username/website-type-classifier.git](https://github.com/your-username/website-type-classifier.git) cd website-type-classifier(Replace
https://github.com/your-username/website-type-classifier.gitwith your actual repository URL) -
Create a Virtual Environment (Recommended): It's good practice to use a virtual environment to manage dependencies.
python -m venv venv
-
Activate the Virtual Environment:
- On Windows:
.\venv\Scripts\activate
- On macOS/Linux:
source venv/bin/activate
- On Windows:
-
Install Dependencies: First, you'll need a
requirements.txtfile (see the "Steps to take" section in the previous response). Once you have it:pip install -r requirements.txt
This project relies on the following Python libraries for its functionality, as listed in requirements.txt:
fastapi: A modern, fast (high-performance) web framework for building APIs with Python 3.8+ based on standard Python type hints. It's the core framework powering your API.uvicorn: An ASGI (Asynchronous Server Gateway Interface) server, used to run FastAPI applications and handle incoming requests.pydantic: Data validation and settings management using Python type hints. Used for defining yourWebsiterequest body, ensuring valid input.requests: An elegant and simple HTTP library for Python, used for making HTTP requests to fetch website content (i.e., sending GET requests to the URLs you want to classify).beautifulsoup4(often imported asbs4): A library for pulling data out of HTML and XML files. It's used to parse the HTML content fetched from websites, allowing you to extract text, titles, meta descriptions, and find specific HTML elements.tldextract: Accurately separates a URL into its subdomain, domain, and top-level domain (TLD). This is crucial for your domain-based heuristics inmain.py.scikit-learn(often imported assklearn): A comprehensive machine learning library for Python, providing various classification, regression, and clustering algorithms. It's used intrain_model.pyfor TF-IDF vectorization, Logistic Regression model training, and evaluation metrics.joblib: A set of tools to provide lightweight pipelining in Python, primarily used for efficiently saving and loading Python objects to/from disk. In your project, it's essential for persisting your trained TF-IDF vectorizer and the ML classification model (.pklfiles).numpy: The fundamental package for numerical computing with Python. It's essential for handling arrays and performing mathematical operations, especially when dealing with the numerical outputs and probabilities from your machine learning model.pandas: A powerful and flexible open-source data analysis and manipulation library. It's used intrain_model.pyfor handling and processing your training datasets efficiently.
The API relies on a pre-trained ML model (website_classifier_model.pkl) and its corresponding vectorizer (tfidf_vectorizer.pkl). These files are not included in the repository and must be generated by running the training script.
-
Run the training script:
python train_model.py
This script will download some sample data, train the
LogisticRegressionmodel, and save thetfidf_vectorizer.pklandwebsite_classifier_model.pklfiles in your project root directory.Note: The
train_model.pyyou provided has a very small dataset for demonstration. For a truly robust and accurate classifier, you'll need significantly more diverse and larger datasets with many examples per category.
Once the ML model files are generated, you can start the FastAPI server.
-
Ensure your virtual environment is active. (If not, activate it as shown in "Installation" step 3).
-
Start the Uvicorn server:
uvicorn main:app --reload --port 8000
(The
--reloadflag is useful for development, as it restarts the server automatically on code changes.)You should see output indicating that the server is running, typically on
http://127.0.0.1:8000.
Here are some key screenshots from the project:
Open your web browser and navigate to:
http://127.0.0.1:8000/
You will see a simple input field where you can enter a website URL and click "Classify Website" to see the results.
You can also interact with the API directly using curl or any HTTP client (like Postman, Insomnia, or a Python requests script).
Endpoint: POST /classify
Content-Type: application/json
Example Request:
curl -X POST "[http://127.0.0.1:8000/classify](http://127.0.0.1:8000/classify)" \
-H "Content-Type: application/json" \
-d '{"url": "[https://www.nytimes.com/](https://www.nytimes.com/)"}'


