-
Importing Libraries: The code begins by importing necessary libraries such as pandas for data manipulation, numpy for numerical operations, matplotlib.pyplot for plotting graphs, and various modules from scikit-learn for machine learning tasks.
-
Loading Data: The code loads the cancer data from a CSV file named "cancerdata.csv" into a pandas DataFrame named cancerdata.
-
Data Preprocessing & EDA:
Converting the diagnosis column from 'B' to 'Benign' and from 'M' to 'Malignant'. Dropping the 'id' column. Printing information about the DataFrame using info() method. Describing the DataFrame using describe() method. Splitting Data: Splitting the DataFrame into features (cancerdata_X) and target (cancerdata_y).
4.Pipeline Creation:
Defining numeric and categorical features. Creating pipelines for numerical and categorical preprocessing. Combining the pipelines using ColumnTransformer to handle different types of features. Fitting and transforming the data using fit() and transform() methods. Scaling Features: Creating a pipeline to scale the features using MinMaxScaler().
Splitting Data into Train and Test Sets: Using train_test_split() function from scikit-learn to split the data into training and testing sets.
- Model Training:
Iterating over different numbers of neighbors for KNN classifier. Training KNN classifiers for each number of neighbors. Calculating training and testing accuracies and storing them in a list. Plotting Accuracy: Plotting the training and testing accuracies against the number of neighbors.
- Model Evaluation:
Training a KNN classifier with a chosen number of neighbors. Predicting the target labels for the test data. Calculating accuracy using accuracy_score() function. Creating a confusion matrix using crosstab() function from pandas.