This is part of my Introduction to Data Science assignment at university. In this part, I tried to write my own implementation of a Naive Bayes classifier from scratch!
To use this module, your system needs to have:
- numpy

```bash
pip install numpy
```

You can install this module by cloning this repository into your current working directory:

```bash
git clone https://github.com/theEmperorofDaiViet/naive_bayes.git
```
The Naive_Bayes module implements Naive Bayes algorithms. These are supervised learning methods based on applying Bayes’ theorem with strong (naive) feature independence assumptions.
The Gaussian_Naive_Bayes class implements the Gaussian variant of the algorithm. This model is mainly used when dealing with continuous data: the likelihood of each feature within a class is assumed to follow a normal (Gaussian) distribution.
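Concretely, for a sample $x = (x_1, \dots, x_n)$, the classifier picks the class $\hat{y}$ with the highest posterior, where each per-feature likelihood is a Gaussian whose per-class mean $\mu_{y,i}$ and variance $\sigma_{y,i}^2$ are estimated from the training data:

$$\hat{y} = \arg\max_{y}\; P(y)\prod_{i=1}^{n} P(x_i \mid y), \qquad P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}\exp\!\left(-\frac{(x_i-\mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)$$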
fit(X, y)
Fit Gaussian Naive Bayes according to X, y.
| Parameters | X: np.array of shape (n_samples, n_features)<br>Training vectors, where n_samples is the number of samples and n_features is the number of features.<br>y: np.array of shape (n_samples,)<br>Target values. |
|---|---|
| Returns | None |
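Fitting a Gaussian Naive Bayes model boils down to estimating, for each class, the per-feature means and variances plus the class prior. Here is a minimal standalone sketch of that computation (the helper name fit_gaussian_nb is hypothetical, for illustration only; the module's actual fit may differ in detail):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class feature means, variances, and priors from training data."""
    classes = np.unique(y)
    means  = {c: X[y == c].mean(axis=0) for c in classes}  # per-feature mean of class c
    vars_  = {c: X[y == c].var(axis=0)  for c in classes}  # per-feature variance of class c
    priors = {c: np.mean(y == c)        for c in classes}  # relative frequency of class c
    return classes, means, vars_, priors
```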
gaussian_density(x, mean, var)
Calculate the probability density function of the Gaussian distribution for a given sample, knowing the mean(s) and the variance(s).
| Parameters | x: float or np.array(dtype=float) of shape (n_features,)<br>Value(s) of a feature or of each feature of a certain sample.<br>mean: float or np.array(dtype=float) of shape (n_features,)<br>Mean(s) of a feature or of each feature.<br>var: float or np.array(dtype=float) of shape (n_features,)<br>Variance(s) of a feature or of each feature. |
|---|---|
| Returns | C: float or np.array(dtype=float) of shape (n_features,)<br>The probability density value(s) of a feature or of each feature of the sample. |
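The formula behind this method is the standard normal probability density function. A minimal numpy sketch of it (the module's actual implementation may differ in detail):

```python
import numpy as np

def gaussian_density(x, mean, var):
    """Normal pdf, evaluated element-wise; works for scalars and 1-D arrays alike."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
```

For example, `gaussian_density(0.0, 0.0, 1.0)` returns roughly 0.3989, the peak of the standard normal curve.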
class_probability(x)
Calculate the probability that a given sample belongs to each class, then choose the class with the maximum probability.
| Parameters | x: np.array(dtype=float) of shape (n_features,)<br>A certain sample. |
|---|---|
| Returns | C: str or int<br>The class to which the input sample most probably belongs. |
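Multiplying many small densities together underflows quickly, so a common trick is to sum log-densities instead; the argmax is unaffected because the logarithm is monotonic. A sketch of the idea, reusing the hypothetical helpers from the snippets above (the module's actual method may instead work on stored attributes):

```python
import numpy as np

def class_probability(x, classes, means, vars_, priors):
    """Pick the class with the highest log-posterior for a single sample x."""
    log_posteriors = [
        np.log(priors[c]) + np.sum(np.log(gaussian_density(x, means[c], vars_[c])))
        for c in classes
    ]
    return classes[int(np.argmax(log_posteriors))]
```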
predict(X)
Perform classification on an array of test vectors X.
| Parameters | X: np.array of shape (n_samples, n_features)<br>The input samples. |
|---|---|
| Returns | C: np.array of shape (n_samples,)<br>Predicted target values for X. |
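Prediction is then just the per-sample rule applied row by row. A short sketch under the same assumptions as the snippets above:

```python
import numpy as np

def predict(X, classes, means, vars_, priors):
    """Classify every row of X independently."""
    return np.array([class_probability(x, classes, means, vars_, priors) for x in X])
```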
Here is an example of how this module can be used to perform data classification.
In this example, I use the dry bean dataset from Kaggle.
```python
>>> from Naive_Bayes import Gaussian_Naive_Bayes
>>> import correctness
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split

>>> df = pd.read_excel('Dry_Bean_Dataset.xlsx')
>>> df.shape
(13611, 17)
```
The `correctness` module I import is another module I built from scratch, used for evaluating the performance of classification models. You'll see its effect below, or you can take a look at it here.
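For context, overall accuracy falls straight out of a confusion matrix: correct predictions lie on the diagonal. A minimal sketch of that one metric (`correctness` itself computes precision, recall, and more, and its internals may differ):

```python
import numpy as np

def accuracy(cm):
    """Fraction of correctly classified samples: diagonal sum over total."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()
```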
```python
>>> data = df.drop(['ConvexArea','EquivDiameter','AspectRation','Eccentricity','Class','Area','Perimeter','ShapeFactor2','ShapeFactor3','ShapeFactor1','ShapeFactor4'], axis=1)
>>> target = df['Class']
>>> X = np.array(data)
>>> y = np.array(target)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

>>> nb = Gaussian_Naive_Bayes()
>>> nb.fit(X_train, y_train)
>>> y_pred = nb.predict(X_test)

>>> cm = correctness.confusion_matrix(y_test, y_pred)
>>> scratch = correctness.accuracy(cm)
>>> print(correctness.report(cm))
CLASSIFICATION REPORT:
            precision    recall  f1-score  support

         0   0.814394  0.911017  0.860000      264
         1   1.000000  1.000000  1.000000      106
         2   0.920245  0.874636  0.896861      326
         3   0.879501  0.940741  0.909091      722
         4   0.964770  0.922280  0.943046      369
         5   0.957286  0.929268  0.943069      398
         6   0.866171  0.821869  0.843439      538

            precision    recall  f1-score  support

     macro   0.914624  0.914259  0.913644     2723
     micro   0.903048  0.903048  0.903048     2723
  weighted   0.904705  0.903048  0.903094     2723

  accuracy   0.903048
```
```python
>>> from sklearn.naive_bayes import GaussianNB
>>> sknb = GaussianNB()
>>> sknb.fit(X_train, y_train)
>>> y_sk = sknb.predict(X_test)

>>> skcm = correctness.confusion_matrix(y_test, y_sk)
>>> sklearn = correctness.accuracy(skcm)
>>> print(correctness.report(skcm))
CLASSIFICATION REPORT:
            precision    recall  f1-score  support

         0   0.814394  0.907173  0.858283      264
         1   1.000000  1.000000  1.000000      106
         2   0.920245  0.879765  0.899550      326
         3   0.876731  0.939169  0.906877      722
         4   0.964770  0.924675  0.944297      369
         5   0.957286  0.927007  0.941904      398
         6   0.862454  0.815466  0.838302      538

            precision    recall  f1-score  support

     macro   0.913697  0.913322  0.912745     2723
     micro   0.901579  0.901579  0.901579     2723
  weighted   0.903176  0.901579  0.901603     2723

  accuracy   0.901579
```
```python
>>> Naive_Bayes_report = pd.DataFrame([[sklearn, scratch]])
>>> Naive_Bayes_report.columns = ['sklearn NB', 'scratch NB']
>>> Naive_Bayes_report
   sklearn NB  scratch NB
0    0.901579    0.903048
```
As you can see, the accuracies of the two models, my from-scratch Gaussian_Naive_Bayes and sklearn's GaussianNB, are approximately the same. And with a little luck, my module's accuracy is slightly higher.
You can contact me via:
GitHub's markdown processor cannot render `<style>` sheets, so you may see one lying here. For the best reading experience, open this file in another editor, e.g. Visual Studio Code's Open Preview mode (Ctrl+Shift+V).