Implementation of a k-nearest neighbors (KNN) model in Python. Supports built-in tuning of the k hyper-parameter using k-fold cross validation.
Example usage can be found in the example_usage.ipynb notebook. The package is contained within the KNN directory.
- KNN_data_collection
  - `__init__` constructor: Creates an instance of a KNN model. Requires input `model_type`, indicating whether the model is a classifier or regressor. Optional input `k` sets the number of nearest neighbors to consider when generating predictions (default is 3).
  - `load_csv`: Loads a CSV dataset using the `csv` module and splits it into predictor and response variables. Requires inputs `path` (string), the path of the CSV file to load, and `response` (string), the name of the column in the CSV to be set as the response variable (all other columns in the CSV are set as predictors). Expects the first row of the CSV to contain column names.
  - `train_test_split`: Performs a train/test split on the loaded dataset, storing the resulting arrays as instance attributes to be used by other modules. Randomly selects test indices using `np.random.choice` to avoid potential bias. Optional input `test_size` determines the proportion of the dataset that is put into the test set (default is 0.3).
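The loading and splitting steps described above can be sketched roughly as follows. This is a minimal illustration and not the package's own code: the standalone helpers `load_csv_text` and `split_train_test` are hypothetical names, the CSV is read from a string rather than a path, and every column is assumed to be numeric.

```python
import csv
import io

import numpy as np


def load_csv_text(text, response):
    """Split CSV text into predictors X and response y (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # first row is expected to hold column names
    # Assumption: every value is numeric; empty rows are skipped
    rows = np.array([[float(v) for v in row] for row in reader if row])
    resp_idx = header.index(response)
    y = rows[:, resp_idx]
    X = np.delete(rows, resp_idx, axis=1)  # all other columns are predictors
    return X, y


def split_train_test(X, y, test_size=0.3):
    """Randomly hold out a `test_size` proportion of rows as the test set."""
    n = len(X)
    n_test = int(round(n * test_size))
    # Sample test indices without replacement so no row is selected twice
    test_idx = np.random.choice(n, size=n_test, replace=False)
    train_mask = np.ones(n, dtype=bool)
    train_mask[test_idx] = False
    return X[train_mask], X[test_idx], y[train_mask], y[test_idx]
```

Sampling without replacement (`replace=False`) is what keeps the train and test sets disjoint; whether the package does exactly this is not stated in this README.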
- generate_predictions
  - `euclidean_distance`: Computes the Euclidean distance between two points, used as the distance function for the KNN implementation.
  - `generate_prediction`: Generates a single prediction using a KNN model. Requires inputs `knn_model`, an instance of a KNN model (with pre-loaded CSV); `new_obs`, a numpy array or list containing the sample for which to generate a prediction; and `subset`, one of 'train' or 'all', which determines whether to compute the prediction using only the training set or the entire dataset. Predictions are generated by computing the Euclidean distance between the input observation and all points in either the training set or the entire dataset (depending on the value of `subset`), sorting these distances, and selecting the `k` closest. For a regression model, the mean response of these `k` closest points is returned; for a classification model, the most frequent class among these `k` closest points is returned. Note that in the case of a tie, the classifier prediction is determined by the order of the training set (a warning is displayed if this occurs).
  - `generate_predictions`: Generates multiple predictions using a KNN model, returning the results as a numpy array. Requires inputs `knn_model`, an instance of a KNN model (with pre-loaded CSV); `new_array`, a multi-dimensional numpy array containing the samples for which to generate predictions; and `subset`, one of 'train' or 'all', which determines whether to compute the predictions using only the training set or the entire dataset. Predictions are generated by applying the `generate_prediction` function to `new_array` row-wise using `np.apply_along_axis`.
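The prediction logic described above can be illustrated with a short sketch. It is written under assumptions: `knn_predict` and `knn_predict_many` are hypothetical standalone functions that take arrays directly instead of a `knn_model` instance, ties are broken by whichever class appears first among the sorted neighbors, and no tie warning is emitted.

```python
from collections import Counter

import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


def knn_predict(X, y, new_obs, k=3, model_type="classifier"):
    """Predict one observation from its k nearest rows in X (hypothetical helper)."""
    distances = np.array([euclidean_distance(row, new_obs) for row in X])
    # Stable sort keeps the original row order among equal distances
    nearest = np.argsort(distances, kind="stable")[:k]
    neighbors = [y[i] for i in nearest]
    if model_type == "regressor":
        return float(np.mean(neighbors))  # mean response of the k neighbors
    # Most common class; ties go to the class encountered first
    return Counter(neighbors).most_common(1)[0][0]


def knn_predict_many(X, y, new_array, k=3, model_type="classifier"):
    """Apply knn_predict to each row of new_array."""
    new_array = np.asarray(new_array, dtype=float)
    return np.apply_along_axis(
        lambda row: knn_predict(X, y, row, k, model_type), 1, new_array
    )


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

Note that `np.apply_along_axis` sizes its output from the first result, so this sketch is best suited to numeric class labels.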
- model_metrics
  - Contains a series of functions to assess a KNN model's performance.
  - All functions require arguments `actual`, a numpy array or list containing the "true" values, and `predicted`, a numpy array or list containing the "predicted" values. Error metrics are computed for these inputs using their respective mathematical formulas.
  - Metric functions for KNN classifier:
    - `model_accuracy`: Computes accuracy
    - `model_misclassification`: Computes misclassification rate
    - `model_num_correct`: Computes the number of correct predictions
    - `model_num_incorrect`: Computes the number of incorrect predictions
  - Metric functions for KNN regressor:
    - `model_rmse`: Computes the root mean squared error
    - `model_mse`: Computes the mean squared error
    - `model_mae`: Computes the mean absolute error
    - `model_mape`: Computes the mean absolute percentage error
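For reference, the standard formulas behind these metrics can be written in a few lines of numpy. The function names below mirror the ones listed above, but the bodies are an illustrative sketch and not the package's code. In particular, returning MAPE as a percentage (multiplied by 100) is an assumption, and MAPE is undefined when any actual value is zero.

```python
import numpy as np


def model_accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(actual) == np.asarray(predicted)))


def model_misclassification(actual, predicted):
    """Fraction of predictions that differ from the true labels."""
    return 1.0 - model_accuracy(actual, predicted)


def model_mse(actual, predicted):
    """Mean of squared errors."""
    err = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(err ** 2))


def model_rmse(actual, predicted):
    """Square root of the mean squared error."""
    return float(np.sqrt(model_mse(actual, predicted)))


def model_mae(actual, predicted):
    """Mean of absolute errors."""
    err = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(err)))


def model_mape(actual, predicted):
    """Mean absolute error relative to the true value, as a percentage (assumed)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)
```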
- cross_validation
  - The `CvKNN` class inherits from the `KNN` class (defined in the `modelling` module). The model must load a CSV and perform a train/test split using the functions documented in the `modelling` module in order to perform cross validation.
  - `__init__` constructor: Creates an instance of a KNN model for performing cross validation. Requires input `model_type`, indicating whether the model is a classifier or regressor. Optional input `num_folds` indicates the number of folds used in k-fold cross validation (default is 5).
  - `perform_cv`: Performs k-fold cross validation on the training set. Requires argument `k_values`, a list or numpy array of `k` values for which to perform CV. This is done by first creating the indices needed to split the training set into `num_folds` folds. For every value of `k` in `k_values`, predictions are generated for 1/`num_folds` of the training set (becoming the CV test set), using the other (`num_folds`-1)/`num_folds` of the values as the CV training set. The model's average performance across all folds (`mse` is used as the performance metric for regressors, while `misclassification_rate` is used for classifiers) is recorded for each value of `k` in `k_values`.
  - `get_cv_results`: Displays the results from `perform_cv`. Average performance using the appropriate error metric across all folds is printed for each value of `k` provided in `perform_cv`. A line plot created using the `seaborn` library is displayed, offering a visual representation of how the various `k` values influence the model's performance.
  - `get_best_k`: Prints the "best" k value from `perform_cv`, defined as the value of k which produced the lowest cross validation loss (MSE for regressors, and misclassification rate for classifiers). Sets this value of k as an instance attribute, allowing the `CvKNN` instance to be passed into the `assessment_metrics` function in the `generate_predictions` module in order to assess the KNN model's performance with the tuned k hyper-parameter.
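The cross-validation procedure described above can be sketched as follows. This is a simplified, standalone illustration, not the `CvKNN` implementation: `cross_validate_k` and `best_k` are hypothetical names, `X` and `y` stand in for the training set, folds are drawn from a seeded random permutation, and classifier ties are broken by sorted label order.

```python
import numpy as np


def knn_predict(X_train, y_train, x, k, model_type):
    """Predict one observation from its k nearest training rows."""
    dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dist, kind="stable")[:k]
    if model_type == "regressor":
        return y_train[nearest].mean()
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]  # ties go to the smallest label


def cross_validate_k(X, y, k_values, num_folds=5, model_type="regressor", seed=0):
    """Return {k: average loss across folds} for each k in k_values."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    # Shuffle row indices once, then cut them into num_folds groups
    folds = np.array_split(rng.permutation(len(X)), num_folds)
    results = {}
    for k in k_values:
        fold_losses = []
        for i, val_idx in enumerate(folds):
            # Every fold except the i-th forms the CV training set
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            preds = np.array([
                knn_predict(X[train_idx], y[train_idx], x, k, model_type)
                for x in X[val_idx]
            ])
            if model_type == "regressor":
                loss = np.mean((y[val_idx] - preds) ** 2)  # MSE
            else:
                loss = np.mean(y[val_idx] != preds)  # misclassification rate
            fold_losses.append(loss)
        results[k] = float(np.mean(fold_losses))
    return results


def best_k(results):
    """k with the lowest average cross validation loss."""
    return min(results, key=results.get)
```

Reusing the same fold indices for every `k` means each candidate is scored on identical splits, which makes the losses directly comparable.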