LizaanB/Classification-Project

Please note: Download the repository as a zipped folder, create a private repository, and upload the content to it. This way, you can collaborate with your teammates effectively.

2401PTDS_Classification_Project

Analysing News Articles Dataset

1. Project Overview

Your team has been hired as data science consultants for a news outlet to build classification models in Python and deploy them as a web application with Streamlit. The aim is to give you hands-on experience of applying machine learning techniques to natural language processing tasks. This end-to-end project covers the entire workflow: data loading, preprocessing, model training, evaluation, and final deployment. The primary stakeholders could include the editorial team, IT/tech support, management, and readers. These groups are interested in improved content categorization, operational efficiency, and enhanced user experience.

2. Dataset

The dataset consists of news articles that need to be classified by content into one of five categories: Business, Technology, Sports, Education, and Entertainment. You can find both the train.csv and test.csv datasets here.

Dataset Features:

| Column | Description |
| --- | --- |
| Headlines | The headline or title of the news article. |
| Description | A brief summary or description of the news article. |
| Content | The full text content of the news article. |
| URL | The URL link to the original source of the news article. |
| Category | The category or topic of the news article (e.g., business, education, entertainment, sports, technology). |
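
Before training, it can help to confirm that a file has the columns described above and to look at how the categories are distributed. The following is a minimal sketch using only the Python standard library, so it runs without the project environment; in a notebook, `pandas.read_csv` would be the natural equivalent. The function name `summarise_categories` and the two sample rows are hypothetical, and the lowercase column names are an assumption about how the CSV headers are spelled.

```python
import csv
import io
from collections import Counter

# Column names taken from the table above, lowercased for comparison.
# Assumption: the CSV headers match these names apart from letter case.
EXPECTED_COLUMNS = ["headlines", "description", "content", "url", "category"]


def summarise_categories(csv_file) -> Counter:
    """Check that the expected columns exist and count articles per category."""
    reader = csv.DictReader(csv_file)
    found = {name.strip().lower() for name in (reader.fieldnames or [])}
    missing = [column for column in EXPECTED_COLUMNS if column not in found]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    counts = Counter()
    for row in reader:
        # Normalise header case so "Category" and "category" are treated alike.
        normalised = {k.strip().lower(): v for k, v in row.items() if k is not None}
        counts[(normalised.get("category") or "").strip().lower()] += 1
    return counts


# Two made-up rows for illustration only; they are not from the real dataset.
SAMPLE = io.StringIO(
    "headlines,description,content,url,category\n"
    "Example headline A,Short summary,Full text,https://example.com/a,sports\n"
    "Example headline B,Short summary,Full text,https://example.com/b,business\n"
)

sample_counts = summarise_categories(SAMPLE)
print(sample_counts)
```

To run the same check on the real data, open the file and pass the handle in, for example `with open("train.csv", newline="", encoding="utf-8") as f: summarise_categories(f)`.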

3. Packages

To carry out all the objectives for this repo, the following necessary dependencies were loaded:

  • Pandas 2.2.2 and NumPy 1.26
  • Matplotlib 3.8.4

4. Environment

It's highly recommended to use a virtual environment for your projects. There are many ways to do this; we've outlined one such method below. Make sure to update this section regularly so that anyone who clones your repository knows exactly which steps to follow to prepare the necessary environment. The instructions provided here should enable a person to clone your repo and get started quickly.

Create the new environment (you only need to do this once):

# create the conda environment
conda create --name <env>

Activate the virtual environment in a terminal and install the project dependencies:

# activate the virtual environment
conda activate <env>
# install the pip package
conda install pip
# install the requirements for this project
pip install -r requirements.txt
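
Once the dependencies are installed, a quick check that the active environment matches the versions listed in the Packages section can save debugging time. The sketch below is one possible approach using the standard-library `importlib.metadata` module. The helper name `check_environment` is hypothetical, and treating "1.26" as matching any 1.26.x release is an assumption about how strictly the versions are meant to be pinned.

```python
from importlib import metadata

# Versions as listed in the Packages section of this README.
REQUIRED = {"pandas": "2.2.2", "numpy": "1.26", "matplotlib": "3.8.4"}


def check_environment(required, lookup=metadata.version):
    """Return a list of human-readable problems; an empty list means all good.

    `lookup` is injectable so the function can be exercised without
    the real packages being installed.
    """
    problems = []
    for package, wanted in required.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        # Accept an exact match or a more specific release of the same series,
        # e.g. "1.26.4" satisfies "1.26".
        if not (installed == wanted or installed.startswith(wanted + ".")):
            problems.append(f"{package}: found {installed}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_environment(REQUIRED):
        print(problem)
```

Running the script inside the activated environment prints one line per mismatch and nothing when everything lines up.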

5. MLflow

MLOps, which stands for Machine Learning Operations, is a practice focused on managing and streamlining the lifecycle of machine learning models. MLflow is an MLOps tool designed to facilitate collaboration on data projects, enabling teams to track experiments, manage models, and streamline deployment. In this project, you will use MLflow for experimentation, testing, and reproducibility of the machine learning models. MLflow helps track hyperparameter tuning by logging each model configuration alongside its metrics, so that you can compare runs and select the best-performing model.
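
As a rough illustration of that logging-and-comparing loop, the sketch below records each configuration as its own MLflow run and then picks the one with the highest metric. It is not the project's actual tracking code: the experiment name, run names, parameter `C`, and the `f1` values are placeholders invented for the example, not results from this project. `mlflow` is imported inside `log_runs` so that the selection helper can be used without MLflow installed.

```python
def select_best(runs, metric="f1"):
    """Return the run dictionary with the highest value for `metric`."""
    return max(runs, key=lambda run: run["metrics"][metric])


def log_runs(runs, experiment_name="news-classification"):
    """Log each configuration as a separate MLflow run.

    Assumes MLflow is installed; with no tracking URI configured,
    runs are written to a local store.
    """
    import mlflow  # imported lazily so select_best works without MLflow

    mlflow.set_experiment(experiment_name)
    for run in runs:
        with mlflow.start_run(run_name=run["name"]):
            mlflow.log_params(run["params"])
            mlflow.log_metrics(run["metrics"])


# Placeholder configurations and scores, for illustration only.
EXAMPLE_RUNS = [
    {"name": "logreg_C1", "params": {"C": 1.0}, "metrics": {"f1": 0.80}},
    {"name": "logreg_C10", "params": {"C": 10.0}, "metrics": {"f1": 0.85}},
]

best_run = select_best(EXAMPLE_RUNS)
print(best_run["name"])
```

Calling `log_runs(EXAMPLE_RUNS)` and then starting `mlflow ui` in the same directory should list the runs side by side for comparison.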

6. Streamlit

What is Streamlit?

Streamlit is a Python framework that serves your app through its own web server, with support for dynamic visuals, multiple responsive pages, and deployment of your models.

In its own words:

Streamlit ... is the easiest way for data scientists and machine learning engineers to create beautiful, performant apps in only a few hours! All in pure Python. All for free.

It’s a simple and powerful app model that lets you build rich UIs incredibly quickly.

Streamlit takes away much of the background work needed to get a platform that can deploy your models to clients and end users. This means that you get to focus on the important stuff (related to the data) and can largely ignore the rest, which will allow you to become a lot more productive.

Description of files

For this repository, we are only concerned with a single file:

| File Name | Description |
| --- | --- |
| base_app.py | Streamlit application definition. |
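
To give a feel for what a Streamlit application definition can look like, here is a minimal sketch of a single-page classifier. It is not the contents of base_app.py: the `KEYWORDS` lookup and `predict_category` function are hypothetical stand-ins for a trained model, and the page title and widget labels are made up. The Streamlit import sits inside `main` and the entry point checks that Streamlit is available, so the helper can be imported and tested on its own.

```python
import importlib.util
import re

# Hypothetical keyword lists standing in for a trained model.
KEYWORDS = {
    "business": {"market", "shares", "profit"},
    "education": {"school", "students", "exam"},
    "entertainment": {"film", "actor", "music"},
    "sports": {"match", "team", "goal"},
    "technology": {"software", "smartphone", "app"},
}


def predict_category(text: str) -> str:
    """Return the category whose keywords overlap most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {category: len(words & keywords) for category, keywords in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def main() -> None:
    import streamlit as st  # imported here so predict_category works without Streamlit

    st.title("News Article Classifier")
    text = st.text_area("Paste the article text")
    # Only classify once the button is pressed and some text was entered.
    if st.button("Classify") and text.strip():
        st.success(f"Predicted category: {predict_category(text)}")


# `streamlit run` executes the file as a script, so this guard triggers the UI.
if __name__ == "__main__" and importlib.util.find_spec("streamlit") is not None:
    main()
```

In a real app, `predict_category` would be replaced by a call to the trained model's prediction method, while the layout code in `main` could stay largely the same.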

6.1 Running the Streamlit web app on your local machine

As a first step to becoming familiar with how our web app works, we recommend setting up a running instance on your own local machine. To do this, run the commands below in Git Bash (Windows) or a terminal (Mac/Linux):

  • Ensure that you have the prerequisite Python libraries installed on your local machine:
pip install -U streamlit numpy pandas scikit-learn
  • Navigate to the base of your repo where your base_app.py is stored, and start the Streamlit app.
cd 2401PTDS_Classification_Project/Streamlit/
streamlit run base_app.py

If the web server was able to initialise successfully, the following message should be displayed within your bash/terminal session:

  You can now view your Streamlit app in your browser.

    Local URL: http://localhost:8501
    Network URL: http://192.168.43.41:8501

Your browser should also open automatically on the base page of your web app.

Congratulations! You've now officially deployed your first web application!

6.2 Deploying your Streamlit web app

  • To deploy your app for all to see, click on deploy.

  • Please note: If it's your first time deploying, you will be redirected to set up an account first. Please follow the instructions.

7. Team Members

| Name | Email |
| --- | --- |
| Jana Liebenberg-Fouche | jliebenberg-fouche@sandtech.com |
| Edmund Dotsey | edotsey@sandtech.com |
| Farayi Myambo | fmyambo@sandtech.com |
