Implementation of a K nearest neighbors (KNN) model in Python. Supports built-in tuning of the `k` hyper-parameter using k-fold cross validation.
Example usage can be found in the `example_usage.ipynb` notebook. The package is contained within the `KNN` directory.
- `KNN_data_collection`
  - `__init__` constructor: Creates an instance of a KNN model. Requires input `model_type`, indicating whether the model is a classifier or regressor. Optional input `k` sets the number of nearest neighbors to consider when generating predictions (default is 3).
  - `load_csv`: Loads a CSV dataset using the `csv` module and splits it into predictor and response variables. Requires inputs `path` (string), the path of the CSV file to load, and `response` (string), the name of the column in the CSV to be set as the response variable (all other columns in the CSV are set as predictors). Expects the first row of the CSV to contain column names.
  - `train_test_split`: Performs a train/test split on the loaded dataset, storing the resulting arrays as instance attributes to be used by other modules. Test indices are selected at random using `np.random.choice` to avoid potential bias. Optional input `test_size` sets the proportion of the dataset that is put into the test set (default is 0.3).
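
The random split described for `train_test_split` can be sketched as follows. This is a minimal illustration, not the package's code: the helper name `split_indices` and the toy arrays are hypothetical, and it assumes the test indices are drawn without replacement so that no row appears in the test set twice.

```python
import numpy as np

def split_indices(n_rows, test_size=0.3):
    """Hypothetical helper: return (train_idx, test_idx) for n_rows rows."""
    n_test = int(round(n_rows * test_size))
    # Assumption: replace=False, so each row is drawn at most once.
    test_idx = np.random.choice(n_rows, size=n_test, replace=False)
    # Every index that was not drawn goes to the training set.
    train_idx = np.setdiff1d(np.arange(n_rows), test_idx)
    return train_idx, test_idx

# Toy data: 10 observations with 2 predictors each, plus a response.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

train_idx, test_idx = split_indices(len(X), test_size=0.3)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

With `test_size=0.3` and 10 rows, 3 rows land in the test set and the remaining 7 in the training set; which rows are chosen varies from run to run.
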
- `generate_predictions`
  - `euclidean_distance`: Computes the Euclidean distance between two points; used as the distance function for the KNN implementation.
  - `generate_prediction`: Generates a single prediction using a KNN model. Requires inputs `knn_model`, an instance of a KNN model (with a pre-loaded CSV); `new_obs`, a numpy array or list containing the sample for which to generate a prediction; and `subset`, one of 'train' or 'all', which determines whether to compute the prediction using only the training set or the entire dataset. Predictions are generated by computing the Euclidean distance between the input observation and all points in either the training set or the entire dataset (depending on the value of `subset`), sorting these distances, and selecting the `k` closest. For a regression model, the mean response of these `k` closest points is returned; for a classification model, the most frequently observed class among these `k` closest points is returned. Note that in the case of a tie, the classifier prediction is determined by the order of the training set (a warning is displayed if this occurs).
  - `generate_predictions`: Generates multiple predictions using a KNN model, returning the results as a numpy array. Requires inputs `knn_model`, an instance of a KNN model (with a pre-loaded CSV); `new_array`, a multi-dimensional numpy array containing the samples for which to generate predictions; and `subset`, one of 'train' or 'all', which determines whether to compute the predictions using only the training set or the entire dataset. Predictions are generated by applying the `generate_prediction` function to `new_array` row-wise using `np.apply_along_axis`.
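
The prediction steps above (distance, sort, take the `k` closest, then average or vote) can be sketched roughly as below. This is an illustration under assumptions, not the package's implementation: `predict_one` and `predict_many` are hypothetical stand-ins that take plain arrays instead of a `knn_model` instance, ties are broken by the order of the sorted neighbors, and class labels are integers (string labels can be truncated by `np.apply_along_axis`).

```python
import warnings
from collections import Counter

import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two points given as lists or arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def predict_one(new_obs, X, y, k=3, model_type="regressor"):
    """Hypothetical single-observation KNN prediction."""
    distances = np.array([euclidean_distance(new_obs, row) for row in X])
    # Stable sort: points at equal distance keep their dataset order.
    nearest = np.argsort(distances, kind="stable")[:k]
    neighbors = [y[i] for i in nearest]
    if model_type == "regressor":
        return float(np.mean(neighbors))
    ranked = Counter(neighbors).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # Assumption: on a tie, the class seen first among the neighbors wins.
        warnings.warn("Tie between classes; using neighbor order.")
    return ranked[0][0]

def predict_many(new_array, X, y, k=3, model_type="regressor"):
    """Apply predict_one to every row of new_array."""
    rows = np.asarray(new_array, dtype=float)
    return np.apply_along_axis(predict_one, 1, rows, X, y, k, model_type)

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_cls = np.array([0, 0, 0, 1, 1, 1])

single = predict_one([0.2, 0.2], X, y_reg, k=3)
batch = predict_many([[0.2, 0.2], [5.1, 5.1]], X, y_cls, k=3,
                     model_type="classifier")
```

Here `single` is the mean response of the three points nearest the origin, and `batch` holds one class label per input row.
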
- `model_metrics`
  - Contains a series of functions to assess a KNN model's performance.
    - Require arguments `actual`, a numpy array or list containing the "true" values; and `predicted`, a numpy array or list containing the "predicted" values. Error metrics are computed for these inputs using their respective mathematical formulas.
    - Metric functions for KNN classifier:
      - `model_accuracy`: Computes accuracy
      - `model_misclassification`: Computes misclassification rate
      - `model_num_correct`: Computes the number of correct predictions
      - `model_num_incorrect`: Computes the number of incorrect predictions
    - Metric functions for KNN regressor:
      - `model_rmse`: Computes the root mean squared error
      - `model_mse`: Computes the mean squared error
      - `model_mae`: Computes the mean absolute error
      - `model_mape`: Computes the mean absolute percent error
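
For reference, the standard formulas behind these metrics can be written in a few lines of numpy. The function names below are hypothetical stand-ins rather than the package's own functions, and the sketch assumes MAPE is reported as a percentage (it is undefined when `actual` contains zeros).

```python
import numpy as np

def accuracy(actual, predicted):
    """Fraction of predictions that match the true class."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.mean(actual == predicted))

def misclassification_rate(actual, predicted):
    """Fraction of predictions that do not match the true class."""
    return 1.0 - accuracy(actual, predicted)

def mse(actual, predicted):
    """Mean squared error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(mse(actual, predicted)))

def mae(actual, predicted):
    """Mean absolute error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def mape(actual, predicted):
    """Mean absolute percent error (assumption: scaled by 100)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

cls_acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
reg_mae = mae([2.0, 4.0, 5.0], [1.0, 4.0, 7.0])
```
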
- `cross_validation`
  - `CvKNN` class inherits from the `KNN` class (defined in the `modelling` module). The model must load a CSV and perform a train/test split using the functions documented in the `modelling` module before cross validation can be performed.
  - `__init__` constructor: Creates an instance of a KNN model for performing cross validation. Requires input `model_type`, indicating whether the model is a classifier or regressor. Optional input `num_folds` indicates the number of folds used in k-fold cross validation (default is 5).
  - `perform_cv`: Performs k-fold cross validation on the training set. Requires argument `k_values`, a list or numpy array of `k` values on which to perform CV. The indices needed to split the training set into `num_folds` folds are created first. Then, for every value of `k` in `k_values`, predictions are generated for each 1/`num_folds` portion of the training set in turn (the CV test set), using the remaining (`num_folds`-1)/`num_folds` of the training set as the CV training set. The model's average performance across all folds (`mse` is used as the performance metric for regressors, while `misclassification_rate` is used for classifiers) is recorded for each value of `k` in `k_values`.
  - `get_cv_results`: Displays the results from `perform_cv`. The average performance across all folds, using the appropriate error metric, is printed for each value of `k` provided in `perform_cv`. A line plot created using the `seaborn` library is displayed, offering a visual representation of how the various `k` values influence the model's performance.
  - `get_best_k`: Prints the "best" k value from `perform_cv`, defined as the value of k which produced the lowest cross validation loss (MSE for regressors, misclassification rate for classifiers). Sets this value of k as an instance attribute, allowing the `CvKNN` instance to be passed into the `assessment_metrics` function in the `generate_predictions` module in order to assess the KNN model's performance with the tuned k hyper-parameter.
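
The cross validation loop described for `perform_cv` and `get_best_k` can be sketched as below for the regression case. This is a simplified illustration under assumptions, not the `CvKNN` code: `knn_regress`, `cross_validate`, and `best_k` are hypothetical names, folds are built from a shuffled permutation with `np.array_split`, and the data is a synthetic noise-free line.

```python
import numpy as np

def knn_regress(X_train, y_train, X_new, k):
    """Hypothetical KNN regressor: mean response of the k nearest points."""
    # Pairwise Euclidean distances, shape (n_new, n_train).
    diffs = X_new[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]
    return y_train[nearest].mean(axis=1)

def cross_validate(X, y, k_values, num_folds=5, seed=0):
    """Return {k: mean MSE across folds} for each k in k_values."""
    rng = np.random.RandomState(seed)
    # Shuffle the row indices, then cut them into num_folds pieces.
    folds = np.array_split(rng.permutation(len(X)), num_folds)
    results = {}
    for k in k_values:
        fold_losses = []
        for i, val_idx in enumerate(folds):
            # All other folds together form the CV training set.
            train_idx = np.concatenate(
                [fold for j, fold in enumerate(folds) if j != i])
            preds = knn_regress(X[train_idx], y[train_idx], X[val_idx], k)
            fold_losses.append(np.mean((y[val_idx] - preds) ** 2))
        results[k] = float(np.mean(fold_losses))
    return results

def best_k(results):
    """The k with the lowest average cross validation loss."""
    return min(results, key=results.get)

# Synthetic training set: 50 points on the line y = 2x.
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X[:, 0]

results = cross_validate(X, y, k_values=[1, 3, 5, 25], num_folds=5)
chosen_k = best_k(results)
```

On this toy data a very large `k` averages over too wide a neighborhood, so its loss is higher than that of a small `k`; for a classifier the same loop would record the misclassification rate instead of the MSE.
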