An implementation of the Naive Bayes classifier, built from scratch!

theEmperorofDaiViet/naive_bayes

Table of Contents
  1. About The Project
  2. Getting Started
  3. API Documentation
  4. Usage
  5. Contact

About The Project

This project is part of an assignment for my Introduction to Data Science course at university. In it, I wrote my own implementation of a Naive Bayes classifier from scratch!

Built With

  • NumPy


Getting Started

Prerequisites

To use this module, your system needs to have:

  • numpy
    pip install numpy

Installation

You can install this module by cloning this repository into your current working directory:

git clone https://github.com/theEmperorofDaiViet/naive_bayes.git


API Documentation

The Naive_Bayes module implements Naive Bayes algorithms. These are supervised learning methods based on applying Bayes’ theorem with strong (naive) feature independence assumptions.
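For reference, the decision rule behind this family of classifiers can be written as follows. The symbols are illustrative assumptions rather than names from the module: c is a class label, x_i is the value of feature i, and n is the number of features.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The product over features is where the "naive" independence assumption enters: each feature is treated as conditionally independent of the others given the class.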

Naive_Bayes.Gaussian_Naive_Bayes

This model is mainly used when dealing with continuous data.

fit(X, y)

Fit Gaussian Naive Bayes according to X, y.

Parameters X: np.array of shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y: np.array of shape (n_samples)

Target values.

Returns None
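As a rough illustration of what fitting a Gaussian Naive Bayes model typically involves, here is a minimal sketch. It is not the module's actual code: the function name fit_gaussian_nb, the returned tuple, and the use of the population variance (ddof=0) are all assumptions made for this example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Hypothetical sketch: estimate per-class priors, means, and variances."""
    classes = np.unique(y)
    # Prior of each class = fraction of training samples with that label.
    priors = np.array([np.mean(y == c) for c in classes])
    # Per-class, per-feature mean and (population) variance.
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

# Tiny made-up dataset: two classes, two features.
X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [14.0, 24.0]])
y = np.array(["a", "a", "b", "b"])
classes, priors, means, variances = fit_gaussian_nb(X, y)
print(priors)     # one prior per class
print(means)      # shape (n_classes, n_features)
print(variances)  # shape (n_classes, n_features)
```

A feature that is constant within a class would get zero variance here, so a real implementation would likely need some form of variance smoothing.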

gaussian_density(x, mean, var)

Calculate the value(s) of the Gaussian probability density function for a given sample, given the mean(s) and the variance(s).

Parameters x: float or np.array(dtype = float) of shape (n_features)

Value(s) of a feature or each feature of a certain sample.

mean: float or np.array(dtype = float) of shape (n_features)

Mean(s) of a feature or each feature.

var: float or np.array(dtype = float) of shape (n_features)

Variance(s) of a feature or each feature.

Returns C: float or np.array(dtype = float) of shape (n_features)

The probability density value(s) of a feature, or of each feature, of the sample.
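The Gaussian density described above can be sketched as a standalone function. This is an illustration that mirrors the documented signature, not necessarily the module's exact code.

```python
import numpy as np

def gaussian_density(x, mean, var):
    """Gaussian PDF evaluated element-wise; works for scalars and arrays."""
    x = np.asarray(x, dtype=float)
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Scalar case: standard normal evaluated at its mean, i.e. 1 / sqrt(2 * pi).
single = gaussian_density(0.0, 0.0, 1.0)

# Vector case: one density value per feature.
per_feature = gaussian_density(
    np.array([0.0, 1.0]), np.array([0.0, 1.0]), np.array([1.0, 4.0])
)
print(single, per_feature)
```

These are density values, not probabilities, so they can exceed 1 when the variance is small.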


class_probability(x)

Calculate the probability that a given sample belongs to each class, then choose the class with the maximum probability.

Parameters x: np.array(dtype = float) of shape (n_features)

A certain sample.

Returns C: str or int

The class to which the input sample most probably belongs.
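One way to picture this step is the sketch below. It departs from the description above in one respect: it compares log-probabilities instead of raw probabilities, a common substitution that avoids numerical underflow when many small densities are multiplied. The extra parameters (classes, priors, means, variances) are hypothetical stand-ins for whatever state the fitted model stores.

```python
import numpy as np

def class_probability(x, classes, priors, means, variances):
    """Hypothetical sketch: return the class with the highest log-posterior."""
    x = np.asarray(x, dtype=float)
    # Sum of per-feature log Gaussian densities, one value per class.
    log_likelihood = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    # Unnormalized log-posterior = log prior + log likelihood.
    log_posterior = np.log(priors) + log_likelihood
    return classes[np.argmax(log_posterior)]

# Made-up parameters for two classes and two features.
classes = np.array(["a", "b"])
priors = np.array([0.5, 0.5])
means = np.array([[2.0, 3.0], [12.0, 22.0]])
variances = np.array([[1.0, 1.0], [4.0, 4.0]])

near_a = class_probability([2.5, 3.5], classes, priors, means, variances)
near_b = class_probability([11.0, 21.0], classes, priors, means, variances)
print(near_a, near_b)
```

Because the logarithm is monotonic, the argmax over log-posteriors selects the same class as the argmax over the posteriors themselves.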


predict(X)

Perform classification on an array of test vectors X.

Parameters X: np.array of shape (n_samples, n_features)

The input samples.

Returns C: np.array of shape (n_samples)

Predicted target values for X.
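To illustrate the shape contract of predict, the sketch below scores all samples at once with NumPy broadcasting. This is an assumption-laden illustration: the module may instead loop over samples, and the explicit parameter list is hypothetical.

```python
import numpy as np

def predict(X, classes, priors, means, variances):
    """Hypothetical sketch: classify every row of X in one vectorized pass."""
    X = np.asarray(X, dtype=float)
    # diff has shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_density = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                          + diff ** 2 / variances[None, :, :])
    # scores has shape (n_samples, n_classes).
    scores = np.log(priors)[None, :] + log_density.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

# Made-up fitted parameters and test samples.
classes = np.array(["a", "b"])
priors = np.array([0.5, 0.5])
means = np.array([[2.0, 3.0], [12.0, 22.0]])
variances = np.array([[1.0, 1.0], [4.0, 4.0]])
X_new = np.array([[2.5, 3.5], [11.0, 21.0]])

y_new = predict(X_new, classes, priors, means, variances)
print(y_new.shape, y_new)
```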



Usage

Here is an example of how this module can be used to perform data classification.

In this example, I use the dry bean dataset from Kaggle.

Import libraries, modules and load data

>>> from Naive_Bayes import Gaussian_Naive_Bayes
>>> import correctness
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split

>>> df = pd.read_excel('Dry_Bean_Dataset.xlsx')
>>> df.shape
(13611, 17)

The correctness module imported above is another module I built from scratch. It is used for evaluating the performance of classification models. You'll see its effect below, or you can take a look at it here.

Preprocess and split data

>>> data = df.drop(['ConvexArea','EquivDiameter','AspectRation','Eccentricity','Class','Area','Perimeter','ShapeFactor2','ShapeFactor3','ShapeFactor1','ShapeFactor4'],axis = 1)
>>> target = df['Class']

>>> X = np.array(data)
>>> y = np.array(target)

>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
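The split above is random, so the reported scores will vary from run to run. If reproducibility matters, one option is to fix the seed and stratify by class, as in the sketch below. The synthetic data and the seed value 0 are assumptions for illustration only; random_state and stratify are standard train_test_split parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 6 features, 2 balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))
y_demo = np.array([0] * 50 + [1] * 50)

# A fixed random_state gives the same split every run;
# stratify=y_demo keeps the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```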

Perform classification using this module and evaluate the model performance

>>> nb = Gaussian_Naive_Bayes()
>>> nb.fit(X_train, y_train)
>>> y_pred = nb.predict(X_test)

>>> cm = correctness.confusion_matrix(y_test, y_pred)
>>> scratch = correctness.accuracy(cm)
>>> print(correctness.report(cm))
CLASSIFICATION REPORT:
   precision    recall  f1-score  support
0   0.814394  0.911017  0.860000      264
1   1.000000  1.000000  1.000000      106
2   0.920245  0.874636  0.896861      326
3   0.879501  0.940741  0.909091      722
4   0.964770  0.922280  0.943046      369
5   0.957286  0.929268  0.943069      398
6   0.866171  0.821869  0.843439      538
          precision    recall  f1-score  support
                                                
macro      0.914624  0.914259  0.913644     2723
micro      0.903048  0.903048  0.903048     2723
weighted   0.904705  0.903048  0.903094     2723
accuracy    0.903048
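The numbers in a report like this can all be derived from the confusion matrix. The sketch below shows how, under the assumption that rows are true classes and columns are predicted classes. It is a hypothetical illustration, not the correctness module's actual code, and the labels are made up.

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """Rows = true class, columns = predicted class (assumed convention)."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

def accuracy(cm):
    # Correct predictions sit on the diagonal.
    return np.trace(cm) / cm.sum()

def precision_recall(cm):
    # Precision divides by column sums (predicted counts), recall by row
    # sums (true counts). A class that is never predicted divides by zero.
    precision = np.diag(cm) / cm.sum(axis=0)
    recall = np.diag(cm) / cm.sum(axis=1)
    return precision, recall

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])
cm_demo = confusion_matrix(y_true, y_pred)
acc_demo = accuracy(cm_demo)
precision_demo, recall_demo = precision_recall(cm_demo)
print(cm_demo, acc_demo, precision_demo, recall_demo)
```

For single-label classification, micro-averaged precision and recall both equal the accuracy, which is why those three rows of the report show the same value.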

Perform the same classification using sklearn.naive_bayes.GaussianNB and evaluate the model performance

>>> from sklearn.naive_bayes import GaussianNB

>>> sknb = GaussianNB()
>>> sknb.fit(X_train, y_train)
>>> y_sk = sknb.predict(X_test)

>>> skcm = correctness.confusion_matrix(y_test, y_sk)
>>> sklearn = correctness.accuracy(skcm)
>>> print(correctness.report(skcm))
CLASSIFICATION REPORT:
   precision    recall  f1-score  support
0   0.814394  0.907173  0.858283      264
1   1.000000  1.000000  1.000000      106
2   0.920245  0.879765  0.899550      326
3   0.876731  0.939169  0.906877      722
4   0.964770  0.924675  0.944297      369
5   0.957286  0.927007  0.941904      398
6   0.862454  0.815466  0.838302      538
          precision    recall  f1-score  support
                                                
macro      0.913697  0.913322  0.912745     2723
micro      0.901579  0.901579  0.901579     2723
weighted   0.903176  0.901579  0.901603     2723
accuracy    0.901579 

Compare the accuracy of the two models:

>>> Naive_Bayes_report = pd.DataFrame([[sklearn, scratch]])
>>> Naive_Bayes_report.columns = ['sklearn NB', 'scratch NB']
>>> Naive_Bayes_report
  sklearn NB	scratch NB
  0.901579	  0.903048

As you can see, the two models, my "scratch" Gaussian_Naive_Bayes and sklearn's GaussianNB, achieve approximately the same accuracy (0.903048 vs. 0.901579 on this split). And with a little luck, my module's accuracy is slightly higher.



Contact

You can contact me via:



Style Sheets

GitHub's Markdown processor cannot render <style> sheets, so you may see the following one lying here as plain text:

<style> table, th, td { border: 1px solid black; border-collapse: collapse; } th { align: left; vertical-align: top; width: 12% } mark { background-color: gray; color: black; } </style>

For the best reading experience, open this file in another text editor, e.g. Visual Studio Code's Open Preview mode (Ctrl+Shift+V).