knn

Implementation of a K-nearest neighbors model in Python. Supports built-in tuning of the k hyper-parameter using k-fold cross validation.

Example usage can be found in the example_usage.ipynb notebook. The package is contained within the KNN directory.
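
Based on the subpackages and modules documented below, the package layout is roughly as follows (the file names are inferred from the module names and have not been verified against the repository):

```
KNN/
├── modelling/
│   ├── KNN_data_collection.py
│   └── generate_prediction.py
└── assessment/
    ├── model_metrics.py
    └── cross_validation.py
```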

modelling subpackage: KNN_data_collection and generate_prediction modules (usage and algorithm sketches follow this list)

  • KNN_data_collection

    • __init__ constructor: Creates an instance of a KNN model. Requires input model_type, indicating whether the model is a classifier or regressor. Optional input k sets the number of nearest neighbors to consider when generating predictions (default is 3).
    • load_csv: Loads a CSV dataset using the csv module and splits predictor and response variables. Requires input path (string), the path of the CSV file to load, and response (string), the name of the column in the CSV to be set as the response variable (all other columns are set as predictors). Expects the first row of the CSV to contain column names.
    • train_test_split: Performs a train/test split on the loaded dataset, storing the resulting arrays as instance attributes to be used by other modules. Randomly selects test indices using np.random.choice to avoid potential bias. Optional input test_size determines the proportion of the dataset that is put into the test set (default is 0.3).
  • generate_prediction

    • euclidean_distance: Computes the Euclidean distance between two points; used as the distance function for the KNN implementation.
    • generate_prediction: Generates a single prediction using a KNN model. Requires input knn_model, an instance of a KNN model (with pre-loaded CSV); new_obs, a numpy array or list containing the sample for which to generate a prediction; and subset, one of 'train' or 'all', which determines whether to compute the prediction using only the training set or the entire dataset. Predictions are generated by computing the Euclidean distance between the input observation and all points in either the training set or the entire dataset (depending on the value of subset), sorting these distances, and selecting the k closest points. For a regression model, the mean of the response values of these k closest points is returned; for a classification model, the most frequently observed class among them is returned. Note that in the case of a tie, the classifier prediction is determined by the order of the training set (a warning is displayed if this occurs).
    • generate_predictions: Generates multiple predictions using a KNN model, returning the results as a numpy array. Requires input knn_model, an instance of a KNN model (with pre-loaded CSV); new_array, a multi-dimensional numpy array containing the samples for which to generate predictions; and subset, one of 'train' or 'all', which determines whether to compute the predictions using only the training set or the entire dataset. Predictions are generated by applying the generate_prediction function to new_array row-wise using np.apply_along_axis.
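
A minimal usage sketch of the modelling subpackage, assuming the import paths, model_type string, and attribute names shown below (see example_usage.ipynb for the canonical example):

```python
# Sketch only: import paths, the model_type value, the CSV file, and the
# X_test attribute name are assumptions based on the documentation above.
from KNN.modelling.KNN_data_collection import KNN
from KNN.modelling.generate_prediction import generate_prediction, generate_predictions

model = KNN(model_type='classifier', k=5)       # classifier considering the 5 nearest neighbors
model.load_csv('iris.csv', response='species')  # first row of the CSV must hold column names
model.train_test_split(test_size=0.3)           # hold out 30% of the rows as a test set

# Single prediction for one new observation, using only the training set
single_pred = generate_prediction(model, new_obs=[5.1, 3.5, 1.4, 0.2], subset='train')

# Row-wise predictions for a 2-D array of observations (here, the held-out test set)
test_preds = generate_predictions(model, new_array=model.X_test, subset='train')
```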
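
The neighbor-selection logic described above can also be illustrated with a short, self-contained sketch (not the package's actual source):

```python
import numpy as np
from collections import Counter

def euclidean_distance(a, b):
    # Straight-line distance between two points
    return np.sqrt(np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def knn_predict(X, y, new_obs, k=3, model_type='regressor'):
    # Distance from the new observation to every stored point
    distances = np.array([euclidean_distance(row, new_obs) for row in X])
    # Indices of the k closest points; ties are effectively broken by position,
    # mirroring the order-of-training-set behaviour noted above
    nearest = np.argsort(distances)[:k]
    if model_type == 'regressor':
        return np.mean(np.asarray(y, dtype=float)[nearest])       # mean response of the k neighbors
    return Counter(np.asarray(y)[nearest]).most_common(1)[0][0]   # most frequent class
```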

assessment subpackage: model_metrics and cross_validation modules (a usage sketch follows this list)

  • model_metrics

    • Contains a series of functions to assess a KNN model's performance.
      • All require arguments actual, a numpy array or list containing the "true" values, and predicted, a numpy array or list containing the predicted values. Error metrics are computed from these inputs using their respective mathematical formulas.
    • Metric functions for KNN classifier:
      • model_accuracy: Computes accuracy
      • model_misclassification: Computes misclassification rate
      • model_num_correct: Computes the number of correct predictions
      • model_num_incorrect: Computes the number of incorrect predictions
    • Metric functions for KNN regressor:
      • model_rmse: Computes the root mean squared error
      • model_mse: Computes the mean squared error
      • model_mae: Computes the mean absolute error
      • model_mape: Computes the mean absolute percent error
  • cross_validation

    • The CvKNN class inherits from the KNN class (defined in the modelling module). The model must load a CSV and perform a train/test split using the functions documented in the modelling module in order to perform cross validation.
    • __init__ constructor: Creates an instance of a KNN model for performing cross validation. Requires input model_type, indicating whether the model is a classifier or regressor. Optional input num_folds indicates the number of folds used in k-fold cross validation (default is 5).
    • perform_cv: Performs k-fold cross validation on the training set. Requires argument k_values, a list or numpy array of k values for which to perform CV. This is done by first creating the indices needed to split the training set into num_folds folds. For every value of k in k_values, predictions are generated for 1/num_folds of the training set (the CV test set), using the remaining (num_folds-1)/num_folds of the data as the CV training set. The model's average performance across all folds (MSE is used as the performance metric for regressors, misclassification rate for classifiers) is recorded for each value of k in k_values.
    • get_cv_results: Displays the results from perform_cv. Average performance using the appropriate error metric across all folds is printed for each value of k provided in perform_cv. A lineplot created using the seaborn library is displayed, offering a visual representation of how the various k values influence the model's performance.
    • get_best_k: Prints the "best" k value from perform_cv, defined as the value of k that produced the lowest cross validation loss (MSE for regressors, misclassification rate for classifiers). Sets this value of k as an instance attribute, allowing the tuned CvKNN instance to be used with the prediction functions in the generate_prediction module and the metric functions in the model_metrics module to assess the model's performance with the tuned k hyper-parameter.
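
A minimal sketch of the assessment workflow, assuming the same import-path and attribute-name conventions as the modelling sketch above:

```python
# Sketch only: import paths, the model_type/num_folds values, and the
# X_test / y_test attribute names are assumptions, not a confirmed API.
from KNN.assessment.cross_validation import CvKNN
from KNN.assessment.model_metrics import model_misclassification
from KNN.modelling.generate_prediction import generate_predictions

cv_model = CvKNN(model_type='classifier', num_folds=5)
cv_model.load_csv('iris.csv', response='species')   # CvKNN inherits loading/splitting from KNN
cv_model.train_test_split(test_size=0.3)

cv_model.perform_cv(k_values=[1, 3, 5, 7, 9])  # 5-fold CV over candidate k values
cv_model.get_cv_results()                      # prints per-k CV loss and shows a seaborn line plot
cv_model.get_best_k()                          # prints and stores the k with the lowest CV loss

# Assess the tuned model on the held-out test set
preds = generate_predictions(cv_model, new_array=cv_model.X_test, subset='train')
print(model_misclassification(cv_model.y_test, preds))
```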
