Stanford Lecture - CS231n theory implementation using NumPy for deep learning fundamentals
Implemented a complete k-NN classifier from scratch using NumPy.
- Vectorization: Optimized performance by replacing nested loops with matrix operations (broadcasting), achieving a significant speedup.
- Hyperparameter Tuning: Conducted k-fold cross-validation to find the optimal $k$.
- Modules:
  - `k_nn_utils.py`: Core logic for distance calculation (L2) and prediction.
  - `knn_cifar10.py`: Script for training, testing, and visualization.
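The fully vectorized distance computation can be sketched as follows (a minimal illustration, not the repo's exact `k_nn_utils.py` code): the pairwise L2 distance matrix is obtained from a single matrix multiply plus broadcasting, using the identity $\|a-b\|^2 = \|a\|^2 - 2a \cdot b + \|b\|^2$.

```python
import numpy as np

def l2_distances(X_test, X_train):
    """Pairwise L2 distances with no explicit loops.

    Expands ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, so the whole
    distance matrix comes from one matmul plus broadcasting.
    """
    test_sq = np.sum(X_test ** 2, axis=1, keepdims=True)   # (num_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)                # (num_train,)
    cross = X_test @ X_train.T                             # (num_test, num_train)
    # Broadcasting adds the row and column squared norms to every entry.
    d_sq = test_sq - 2 * cross + train_sq
    return np.sqrt(np.maximum(d_sq, 0))                    # clamp tiny negatives

# Tiny usage example:
X_train = np.array([[0.0, 0.0], [3.0, 4.0]])
X_test = np.array([[0.0, 0.0]])
print(l2_distances(X_test, X_train))  # [[0. 5.]]
```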
-
Investigated the relationship between dataset size and model performance to verify the effectiveness of Pixel-wise L2 distance in high-dimensional space (CIFAR-10).
- Setup: Compared accuracy between a small dataset ($N=5,000$) and the full dataset ($N=50,000$).
- Result: Accuracy saturated around 33% despite a 10x increase in data.
1. The Curse of Dimensionality
Even with 50,000 samples, the data remains sparse in the 3,072-dimensional space. The distance to the nearest neighbor does not decrease significantly, leading to diminishing returns in performance (logarithmic growth).
2. Semantic Gap in L2 Distance
- Observation: The model often misclassifies images based on dominant background colors rather than object shapes.
- Analysis: L2 distance calculates the sum of independent pixel differences. It is sensitive to global color distributions (e.g., green background) but fails to capture local semantic features (e.g., edges, shapes).
- Conclusion: Pure data scaling cannot overcome the structural limitations of pixel-based distance metrics. This necessitates the use of feature-extraction-based models like Linear Classifiers or CNNs.

Implemented a Multiclass SVM (Hinge Loss) classifier to overcome the memory and prediction speed limitations of k-NN.
- Parametric Approach: Transitioned from memory-based (k-NN) to model-based learning ($f(x, W) = Wx + b$), compressing the knowledge of the entire dataset into a weight matrix $W$.
- Fully Vectorized Loss: Implemented the SVM loss function without explicit loops, utilizing NumPy broadcasting and advanced indexing for massive performance gains.
- Modules:
  - `linear_classifier.py`: Implements the forward pass (score calculation) and vectorized loss computation.
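A minimal sketch of the loop-free hinge loss described above (illustrative only; the actual `linear_classifier.py` may differ in signatures). Advanced indexing pulls out each sample's correct-class score, and broadcasting forms all margins at once:

```python
import numpy as np

def svm_loss_vectorized(W, X, y, delta=1.0):
    """Multiclass SVM (hinge) loss over a whole batch, no loops.

    W: (D, C) weights, X: (N, D) inputs, y: (N,) integer labels.
    """
    N = X.shape[0]
    scores = X @ W                                  # (N, C)
    correct = scores[np.arange(N), y][:, None]      # advanced indexing, (N, 1)
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(N), y] = 0                    # exclude the true class
    return margins.sum() / N

# Toy check: with all-zero weights every margin equals delta,
# so the loss is (C - 1) * delta.
W = np.zeros((4, 3))
X = np.random.randn(5, 4)
y = np.array([0, 1, 2, 0, 1])
print(svm_loss_vectorized(W, X, y))  # 2.0
```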
Verified the correctness of the vectorized implementation by analyzing the initial loss value with unoptimized random weights.
- Setup: Initialized $W$ from a standard normal distribution (`np.random.randn`) scaled by $0.01$. Input images were scaled to $[0, 1]$.
- Result: Calculated initial loss $\approx 338.4$ (on $N=50,000$).
1. Validation of Vectorization
The calculation for 50,000 images completed almost instantly. The resulting loss value (~338) aligns with the expected range for unnormalized random weights, confirming that the broadcasting and masking logic works correctly across the entire batch.
2. The Need for Optimization Unlike k-NN, where performance is fixed by the dataset, this high loss value serves as the baseline for learning. The quantitative loss metric proves that the current random model is failing to classify correctly, setting the stage for implementing Gradient Descent to minimize this loss.
Implemented the core training logic to minimize the SVM Loss using Gradient Descent.
- Analytic Gradient: Derived and implemented the gradient of the SVM loss function ($\nabla_W L$) using fully vectorized NumPy operations, avoiding inefficient numerical differentiation.
- Stochastic Gradient Descent (SGD): Transformed the training loop from batch gradient descent (using all 50k images) to SGD (mini-batch size: 200), achieving massive speed improvements.
- Hyperparameter Tuning: Experimented with the learning rate and batch size to stabilize training.
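The mini-batch SGD loop itself can be sketched generically (names and defaults here are illustrative assumptions; the real script plugs in the SVM loss and its analytic gradient):

```python
import numpy as np

def sgd_train(X, y, loss_and_grad, lr=1e-3, batch_size=200, iters=1500,
              num_classes=10, seed=0):
    """Generic mini-batch SGD loop (a sketch, not the repo's exact code).

    loss_and_grad(W, X_batch, y_batch) must return (loss, dW).
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = 0.01 * rng.standard_normal((D, num_classes))
    history = []
    for _ in range(iters):
        idx = rng.integers(0, N, size=batch_size)   # sample with replacement
        loss, dW = loss_and_grad(W, X[idx], y[idx])
        W -= lr * dW                                # vanilla SGD update
        history.append(loss)
    return W, history

# Usage with a stand-in quadratic loss, just to exercise the loop:
def quad_loss(W, X_batch, y_batch):
    return float((W ** 2).sum()), 2 * W

W, hist = sgd_train(np.random.randn(100, 8), np.zeros(100, dtype=int),
                    quad_loss, lr=0.1, iters=50, num_classes=3)
```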
- Observation: During the first training attempt, the Loss skyrocketed from 321 to 71,047 within 10 iterations (Divergence).
- Root Cause Analysis:
  - The input data was unscaled ($0 \sim 255$), resulting in large score values.
  - The large scores caused massive gradients; combined with the learning rate, the weights updated too aggressively ("overshooting" the minima).
- Solution:
  - Data Preprocessing: Applied normalization (`X_train /= 255.0`) to scale pixel values to $[0, 1]$.
  - Type Casting: Converted the data to `float32` before division to prevent type mismatch errors.
- Result after Fix:
  - Initial Loss: Dropped to 10.6 (close to the theoretical expected loss for random weights: $\approx 9.0$).
  - Training Dynamics: Loss decreased steadily without divergence.
- SGD Efficiency:
  - Switching to SGD (batch size 200) accelerated the training loop by approx. 250x compared to batch GD.
  - Final Loss: Reached ~7.9 after 1,500 iterations.
  - Fluctuation: Observed the characteristic "noisy" descent of SGD (e.g., loss jumping $8.7 \to 9.3 \to 7.9$), confirming the stochastic nature of sampling.
Successfully evaluated the best SVM model on the test set and visualized the learned templates.
- Hyperparameter Tuning: Searched over multiple `learning_rates` and `reg_strengths`.
  - Best Combination: `lr: 0.001`, `reg: 0.25`
  - Best Validation Accuracy: 36.20%
- Final Test Performance: Achieved 33.66% Accuracy on the CIFAR-10 test set.
- Weight Visualization: Observed that the model learns "spatial templates" for each class (e.g., green blobs for frogs, blue backgrounds for ships).
- The following images represent the learned weights (templates) for each class:
Unlike the linear classifier, the two-layer neural network learns distributed representations. The following image shows the learned hidden-layer weights:
- Observation: The neurons act as various filters for edges, colors, and blobs, which are then combined in the second layer to classify the image.
Implemented a modular Convolutional Neural Network (CNN) from scratch to capture spatial hierarchies in image data, moving beyond the limitations of flat vector inputs used in Linear Classifiers and MLPs.
- Full Modular Architecture: Implemented a `Conv - ReLU - Pool - Affine - ReLU - Affine - Softmax` architecture.
- Manual Backpropagation: Derived and implemented the analytic gradients for the Convolution and Max Pooling layers using the chain rule, handling 4D tensors ($N, C, H, W$) without automatic differentiation.
- Modules:
  - `layers.py`: Contains `forward` and `backward` methods for `Conv_naive`, `MaxPool_naive`, etc.
  - `cnn.py`: Assembles the layers into a `ThreeLayerConvNet` class.
  - `train_overfit.py`: Script for verifying implementation integrity.
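The max-pooling backward pass ("routing gradients to max indices") can be sketched naively as below. This is an illustrative implementation under assumed shapes; the repo's `MaxPool_naive` may differ in details:

```python
import numpy as np

def maxpool_forward_naive(x, pool=2, stride=2):
    """Naive max pooling on a 4D tensor x of shape (N, C, H, W)."""
    N, C, H, W = x.shape
    Ho, Wo = (H - pool) // stride + 1, (W - pool) // stride + 1
    out = np.zeros((N, C, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            window = x[:, :, i*stride:i*stride+pool, j*stride:j*stride+pool]
            out[:, :, i, j] = window.max(axis=(2, 3))
    return out, x

def maxpool_backward_naive(dout, cache, pool=2, stride=2):
    """Routes each upstream gradient to the argmax position of its window."""
    x = cache
    dx = np.zeros_like(x)
    N, C, Ho, Wo = dout.shape
    for n in range(N):
        for c in range(C):
            for i in range(Ho):
                for j in range(Wo):
                    window = x[n, c, i*stride:i*stride+pool, j*stride:j*stride+pool]
                    r, s = np.unravel_index(window.argmax(), window.shape)
                    # Only the max element receives gradient; all others get 0.
                    dx[n, c, i*stride + r, j*stride + s] += dout[n, c, i, j]
    return dx
```

Because the max is a hard selection, the gradient of every non-maximal input in a window is exactly zero, which is why a routing bug here typically makes the sanity-check loss stagnate.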
Before training on the full dataset, it is crucial to verify the correctness of the complex backpropagation logic (specifically the dimensions and gradient flow of 4D tensors).
- Setup: Trained the model on a tiny dataset ($N=5$ images) with a high learning rate ($0.1$) for 20 epochs.
- Hypothesis: If the forward and backward passes are implemented correctly, the model should have enough capacity to perfectly memorize (overfit) the small dataset, driving the loss to near zero.
- Result:
  - Initial Loss: $\approx 2.3$ (random guessing).
  - Final Loss: $\approx 0.02$ (perfect memorization).
1. Validation of Gradient Flow (Chain Rule)
The convergence to near-zero loss confirms that the gradient of the loss function flows correctly back through the Max Pooling layer (routing gradients to the max indices) and the Convolution layer (cross-correlating gradients with filters). If there were any dimension mismatch or mathematical error in `conv_backward`, the loss would have stagnated or exploded.
2. SGD Dynamics: The "Overshooting" Phenomenon
- Observation: During training (epochs 10-12), the loss temporarily spiked ($0.86 \to 1.93$) before settling down.
- Analysis: This illustrates the behavior of stochastic gradient descent with a high learning rate. The optimizer "overshot" the local minimum due to the large step size but successfully corrected its trajectory, confirming that the update rule ($W \leftarrow W - \eta \cdot \nabla_W L$) works robustly even under aggressive hyperparameter settings.
To verify the correctness of the implementation (especially backpropagation), I conducted a "sanity check" by overfitting a small dataset ($N=5$).
- Setup: `learning_rate = 0.01`, `epochs = 20`
- Observation: The loss decreased very slowly ($2.29 \to 2.03$).
- Analysis: The gradient updates were too small to converge within 20 epochs, indicating the need for a more aggressive learning rate on this tiny dataset.
- Setup: `learning_rate = 0.1`, `epochs = 20`
- Observation:
  - Overshooting: A spike in loss occurred at epoch 11 ($1.16$) and epoch 12 ($2.00$), indicating the step size was large enough to temporarily jump over the local minima.
  - Convergence: The optimizer successfully corrected the trajectory, driving the final loss to 0.0225.
- Conclusion: The model has sufficient capacity to memorize the dataset, confirming that the `forward` and `backward` passes are mathematically correct.
Scaled up the experiment from the sanity check to learning actual visual features from the CIFAR-10 dataset using the implemented CNN architecture.
- Data Pipeline: Implemented efficient data loading for a subset ($N=5,000$) to optimize for CPU-based training.
- Preprocessing: Applied mean subtraction and dimension transposition (`HWC` $\to$ `CHW`) to align with the custom `im2col` implementation.
- Training Loop: Implemented the full SGD loop (`cnn_cifar10.py`) with real-time loss logging and visualization logic.
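The preprocessing step above (mean subtraction plus `HWC` to `CHW` transposition) can be sketched in a few lines; the function name and exact statistics here are illustrative assumptions:

```python
import numpy as np

def preprocess(X):
    """X: (N, H, W, C) uint8 images -> (N, C, H, W) float32, zero-mean.

    The axis reorder (NHWC -> NCHW) is what im2col-style conv code expects.
    """
    X = X.astype(np.float32)
    X -= X.mean(axis=0)               # per-pixel mean subtraction
    return X.transpose(0, 3, 1, 2)    # move channels before spatial dims

X = np.random.randint(0, 256, size=(8, 32, 32, 3)).astype(np.uint8)
out = preprocess(X)
print(out.shape)  # (8, 3, 32, 32)
```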
The goal was to verify if the implementation could learn meaningful visual representations (spatially organized features) rather than just memorizing pixel values.
- Setup: Trained on 5,000 CIFAR-10 images using `batch_size=50`.
- Tuning: Initially, with `lr=0.001`, the loss stagnated at 2.302 (random guessing). Increasing the learning rate to `0.01` triggered immediate convergence.
1. Emergence of Visual Patterns (Gabor Filters)
Upon visualization, the first-layer weights revealed interpretable structures:
- Color Blobs: Specific filters (e.g., #17 Green, #16 Pink) learned to activate on dominant background colors.
- Edge Detectors: Other filters (e.g., #5, #14) evolved into "Edge Detectors" capable of recognizing horizontal or diagonal lines.
- Conclusion: This visually proves that the Convolution operation and Backpropagation are correctly extracting low-level features (edges, colors) from raw pixels, which is the fundamental basis of Deep Learning vision models.
2. Loss Dynamics & SGD Fluctuation
- Observation: The loss dropped significantly to ~1.57 but showed high fluctuation ($1.6 \leftrightarrow 2.0$) in later iterations.
- Analysis: This behavior is characteristic of stochastic gradient descent with a small batch size ($50$) and an aggressive learning rate. The model successfully escaped the initial plateau and converged to a meaningful state, proving the robustness of the update rule even with limited data.
This section documents the transition from manual NumPy implementations to efficient deep learning workflows using PyTorch, focusing on hardware acceleration and automatic differentiation for Vision AI research.
Explored the core data structure of PyTorch and configured the environment for high-performance computing on Apple Silicon.
- Tensor Manipulation: Mastered tensor creation, indexing, and broadcasting, which are the building blocks for handling high-dimensional image data.
- Device Management: Implemented logic to detect and utilize MPS (Metal Performance Shaders), enabling GPU acceleration on Mac devices.
- Interoperability: Leveraged the bridge between NumPy and PyTorch to maintain flexibility in data preprocessing.
Deep-dived into the mechanics of Automatic Differentiation, the core technology that replaces manual gradient derivations in Stanford CS231n theory.
- Computational Graphs: Understood how PyTorch dynamically builds graphs to track operations on tensors with `requires_grad=True`.
- The Backward Mechanism: Practiced triggering the chain rule via `.backward()` to automatically compute gradients for complex functions.
- Gradient Accumulation: Identified that PyTorch accumulates gradients by default, necessitating `optimizer.zero_grad()` to prevent interference between training iterations.
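The accumulation behavior described above is easy to demonstrate directly with a scalar tensor:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x ** 2).backward()          # dy/dx = 2x = 6
print(x.grad)                # tensor(6.)

(x ** 2).backward()          # gradients ADD onto the existing buffer
print(x.grad)                # tensor(12.)

x.grad.zero_()               # what optimizer.zero_grad() does per parameter
(x ** 2).backward()
print(x.grad)                # tensor(6.)
```

Without the `zero_()` call, the second `backward()` silently doubles the gradient, which is exactly the "interference between training iterations" the note warns about.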
Integrated the fundamentals into a complete end-to-end training pipeline for a simple regression task.
- Standardized Workflow: Established the 5-step routine: `Forward -> Loss -> Zero_grad -> Backward -> Step`.
- Directory: `pytorch_practice/day3_linear_regression.py`
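The 5-step routine can be sketched on synthetic data (the target function $y = 2x + 1$ plus noise is an assumption for illustration, chosen to match the weight/bias targets discussed below):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(100, 1)
y = 2 * X + 1 + 0.1 * torch.randn(100, 1)   # synthetic linear data + noise

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(1000):
    pred = model(X)              # 1. Forward
    loss = criterion(pred, y)    # 2. Loss
    optimizer.zero_grad()        # 3. Zero_grad (clear accumulated gradients)
    loss.backward()              # 4. Backward (autograd fills .grad)
    optimizer.step()             # 5. Step (apply the update)
```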
Investigated the impact of the learning rate on training stability and convergence.
- Epochs: 200
- Optimizer: SGD
- Loss Function: MSELoss
1. Gradient Explosion & Divergence (The "NaN" Problem)
- Setup: `lr = 0.8`
- Observation: Loss skyrocketed to $10^{33}$ within 10 epochs and eventually hit `inf` and `nan`.
- Analysis: The step size was too aggressive, causing the optimizer to overshoot the global minimum and diverge. This proves that without proper LR scaling, even a convex problem can fail to converge.
2. Slow Convergence (The "Turtle" Problem)
- Setup: `lr = 0.0001`
- Observation: The loss decreased at an imperceptible rate. After 200 epochs, the weight reached only ~1.4 (target: 2.0).
- Analysis: While stable, the updates were too small to reach the optimal solution.
3. Optimal Convergence
- Setup: `lr = 0.01`
- Observation: The model successfully converged to $w \approx 1.96, b \approx 1.03$, effectively filtering out the synthetic noise to find the underlying trend.
| Explosion (lr=0.8) | Slow (lr=0.0001) | Optimal (lr=0.01) |
|---|---|---|
| ![]() | ![]() | ![]() |

(Note: The high-LR run shows no fitted line because the weight/bias tensors contain NaN coordinates.)
Expanded the linear model to handle multiple input features, transitioning from scalar-based logic to Vectorized Matrix Operations using PyTorch.
- High-Dimensional Mapping: Implemented a model to predict a single target value from three independent features ($x_1, x_2, x_3$), following the hypothesis $y = w_1x_1 + w_2x_2 + w_3x_3 + b$.
- Vectorization with `matmul`: Replaced manual summation with `torch.matmul(X, W)` to process 100 samples simultaneously, ensuring high computational efficiency on Apple Silicon (MPS).
- Internal Weight Logic: Analyzed `nn.Linear`'s storage format ($out \times in$) and verified that PyTorch internally transposes the weights ($W^T$) to execute $y = XW^T + b$.
- Modules:
  - `pytorch_practice/day4_multivariable.py`: Full implementation of the multi-variable training loop and weight analysis.
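The $y = XW^T + b$ claim can be verified in a couple of lines: `nn.Linear` stores its weight as `(out_features, in_features)`, so reproducing the forward pass manually requires the transpose.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)          # weight stored as (out=1, in=3)
x = torch.randn(5, 3)

manual = x @ layer.weight.T + layer.bias   # explicit matmul with W^T
assert torch.allclose(layer(x), manual)

print(layer.weight.shape)  # torch.Size([1, 3])
```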
The objective was to verify whether the optimizer could accurately recover multiple hidden weights from noisy observations.
- Hyperparameters: `epochs = 2000`, `learning_rate = 0.01`, `optimizer = SGD`.
- Convergence: Successfully reached a stable state around epoch 400.
1. The "Noise Floor" Phenomenon
- Observation: Despite 2,000 epochs of training, the loss reached a minimum plateau of ~0.2300 and never hit zero.
- Analysis: This represents the theoretical limit of error (the noise floor). Since the noise was added with $\sigma = 0.5$, the minimum achievable MSE is approximately $\sigma^2 = 0.25$. The model effectively learned the underlying signal, with the remaining loss being irreducible noise.
2. Statistical Accuracy in Weights
- Result: Final learned weights were $W \approx [1.997, 3.103, 4.007]$, $b \approx 5.092$.
- Analysis: The slight deviation from the exact integers ($2, 3, 4$) is expected given the stochastic noise and the limited sample size ($N=100$). This shows that the model found the statistically optimal solution rather than memorizing the noisy data points.
3. Vectorized Data Extraction
- Technique: Refined the logic for extracting trained parameters based on their dimensionality.
  - Scalar Bias ($b$): Used `.item()` for direct conversion to a Python float.
  - Weight Vector ($W$): Used `.detach().tolist()` to safely extract the multi-dimensional array while disconnecting it from the autograd graph to prevent memory overhead.
Transitioned from processing entire datasets in memory to a scalable data pipeline using PyTorch's standardized components.
- Modular Design: Subclassed `torch.utils.data.Dataset` to encapsulate data storage (`__init__`), size reporting (`__len__`), and index-based retrieval (`__getitem__`).
- Mini-batch Management: Integrated `DataLoader` to partition 100 samples into batches of size 8, using `shuffle=True` to prevent the model from memorizing the data order.
- Nested Training Loop: Designed a two-tier loop structure (epoch and batch) to enable memory-efficient parameter updates.
- Modules:
  - `pytorch_practice/day5_dataloader.py`: Implementation of the custom Dataset class and mini-batch training loop.
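The Dataset/DataLoader pattern above can be sketched as follows (class name and synthetic data are illustrative assumptions, matching the $W=[2,3,4]$, $b=5$ target discussed below):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RegressionDataset(Dataset):
    def __init__(self, n=100):
        torch.manual_seed(0)
        self.X = torch.randn(n, 3)
        self.y = self.X @ torch.tensor([2.0, 3.0, 4.0]) + 5.0

    def __len__(self):                 # size reporting
        return len(self.X)

    def __getitem__(self, idx):        # index-based retrieval
        return self.X[idx], self.y[idx]

loader = DataLoader(RegressionDataset(), batch_size=8, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([8, 3]) torch.Size([8])
```

With 100 samples and `batch_size=8`, the loader yields 13 batches per epoch (the last one partial, since `drop_last` defaults to `False`).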
Analyzed the convergence behavior of Mini-batch Stochastic Gradient Descent (SGD) on a noisy multi-variable dataset.
- Setup: `batch_size = 8`, `learning_rate = 0.01`, `epochs = 100`.
- Result: Successfully reduced the average loss from ~55.0 to a stable ~0.0102.
1. The "Invisible Gradient" Incident (Parentheses Syntax)
- Observation: The loss initially stagnated at ~55 despite the training loop running.
- Root Cause Analysis: `loss.backward` was called without parentheses `()`, so the function was referenced but never executed. This prevented gradient computation, leaving the weights ($W, b$) static despite calling `optimizer.step()`.
- Solution: Explicitly invoked the function as `loss.backward()` to trigger the autograd engine.
2. Theoretical Loss Floor Verification
- Analysis: Since the synthetic data included Gaussian noise with $\sigma = 0.1$, the theoretical minimum MSE is $\sigma^2 = 0.01$.
- Conclusion: The final loss of 0.0102 indicates that the model has fully recovered the underlying signal ($W=[2, 3, 4]$, $b=5$), with only the irreducible noise remaining.
3. Format Specifiers for Log Readability
- Implementation: Utilized the `:3d` format specifier in f-strings to keep epoch logs vertically aligned regardless of the number of digits.
Overcame the mathematical limitations of linear models by implementing a Multi-Layer Perceptron (MLP) and introducing non-linear activation functions to solve complex decision boundaries.
- `nn.Module` Subclassing: Adopted the standard PyTorch design pattern by subclassing `nn.Module`, ensuring proper parameter registration via `super().__init__()`.
- Breaking Linearity: Introduced `nn.ReLU` and `nn.Sigmoid` between linear layers to prevent "linear stacking," where multiple linear layers mathematically collapse into a single transformation.
- Refactoring with `nn.Sequential`: Compared manual layer linking in `forward()` with the more concise `nn.Sequential` container, improving code readability for feed-forward architectures.
- Modules:
  - `pytorch_practice/day6_mlp_xor.py`: Implementation of the MLP architectures (V1: manual, V2: Sequential) for the XOR problem.
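A minimal sketch of the Sequential (V2) variant on XOR, using the 2-10-1 sizes and `lr=1.0` described below (this is an illustration, not the repo's exact script):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])          # XOR labels

model = nn.Sequential(
    nn.Linear(2, 10), nn.ReLU(),     # the non-linearity prevents collapse
    nn.Linear(10, 1), nn.Sigmoid(),  # probability output for BCELoss
)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

for step in range(10000):
    loss = criterion(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print((model(X) > 0.5).float().flatten())
```

Removing the `nn.ReLU` line reproduces the 50%-accuracy plateau: the two `Linear` layers collapse into a single linear map that cannot separate XOR.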
The XOR problem is the classic benchmark for non-linearity, as its classes cannot be separated by a single linear hyperplane. The goal was to verify if a 2-layer MLP could distort the input space to achieve perfect classification.
- Setup: `input: 2`, `hidden: 10`, `output: 1`. Used `BCELoss` for binary classification and `lr: 1.0` to handle the small dataset.
- Result: Successfully reached 100.0% accuracy. Average loss dropped from 0.6924 (random guessing) to 0.000038 after 10,000 steps.
1. The Chain of Linearity & Activation Functions
- Theory: Without non-linear activations, $y = W_2(W_1x + b_1) + b_2$ simplifies to $y = W_{new}x + b_{new}$.
- Experiment: Verified that removing `nn.ReLU` causes the accuracy to plateau at 50%, proving that "depth" without "non-linearity" is mathematically equivalent to a single-layer linear model.
2. Optimization Dynamics: High Learning Rate ($lr = 1.0$)
- Analysis: For extremely small datasets like XOR ($N=4$), gradient updates are infrequent (once per epoch in full batch).
- Strategy: A high learning rate was essential to escape the flat plateaus (saddle points) of the Sigmoid-based loss landscape. Unlike large-scale datasets, where $lr=1.0$ might cause divergence, here it facilitated rapid convergence toward the global minimum.
3. Binary Cross Entropy (BCE) vs. MSE
- Observation: Transitioned from `MSELoss` to `BCELoss` for the classification task.
- Mechanism: Combined with a `Sigmoid` output, `BCELoss` penalizes wrong predictions increasingly harshly as they approach the opposite class, providing a much stronger gradient signal for binary outcomes than squared error.
4. Advanced Tensor Post-processing
- Technique: Implemented the `.detach().numpy()` chain for result visualization.
- Insight: Tensors must be disconnected from the autograd graph (`.detach()`) before conversion to NumPy for interoperability with standard Python data tools.
Expanded beyond binary classification to implement a Multi-class Classification model for the MNIST dataset (0-9 digits).
- Softmax & Cross Entropy: Utilized the Softmax function to ensure the output probabilities sum to 1, and applied `nn.CrossEntropyLoss` to maximize the predicted probability of the correct class.
- Efficient Pipeline: Used `view(-1, 784)` to flatten the $28 \times 28$ images into 784-dimensional vectors and set `drop_last=True` in the DataLoader to ensure batch consistency.
- Advanced Optimization: Transitioned from basic SGD to the Adam optimizer, which combines Momentum and RMSProp (adaptive learning rates) for superior performance.
| Optimizer | Final Cost | Test Accuracy | Note |
|---|---|---|---|
| SGD (lr=0.1) | 0.0300 | 97.37% | Stable, but convergence is relatively slow. |
| Adam (lr=0.001) | 0.0117 | 97.92% | 2.5x Cost reduction, 0.55%p Accuracy increase. |
1. Numerical Stability of `nn.CrossEntropyLoss`
- PyTorch's `nn.CrossEntropyLoss` internally combines `LogSoftmax` and `NLLLoss`.
- By passing raw logits to the loss function instead of manually applying Softmax at the final layer, the model avoids potential overflow during the exponential calculations, ensuring numerical stability.
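This decomposition can be checked directly: feeding raw logits to `CrossEntropyLoss` gives the same value as `LogSoftmax` followed by `NLLLoss` (the target indices below are arbitrary example labels).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw scores, no Softmax applied
targets = torch.tensor([3, 1, 4, 9])   # arbitrary example class labels

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

assert torch.allclose(ce, nll)         # identical up to float precision
```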
2. Superior Convergence of Adam
- Observation: As shown in the graph above, Adam (blue) reached a lower loss plateau much faster than SGD (gray).
- Analysis: The Adaptive Learning Rate mechanism, which adjusts the step size for each parameter individually, allowed the model to react more sensitively to critical features in the 784-dimensional MNIST input, effectively navigating the complex loss landscape.
3. Inference Mode & Memory Management
- Technique: Utilized `with torch.no_grad():` during the testing phase to disable construction of the computational graph.
- Insight: This minimized memory overhead and accelerated inference. Predicted labels were derived using `.argmax()`, which identifies the index with the highest probability among the 10 classes.
Advanced from simple MLPs to Convolutional Neural Networks (CNNs) to effectively capture spatial hierarchies in image data. Conducted a rigorous benchmark study to optimize model architecture using modern deep learning techniques.
- Architecture Evolution:
  - Basic CNN: A standard 2-layer structure (`Conv - ReLU - MaxPool` $\times$ 2) to establish a baseline.
  - Deep CNN: A robust 3-layer architecture integrated with Batch Normalization and Dropout to improve stability and generalization.
- Initialization Strategy: Transitioned from Xavier initialization (suited to Sigmoid/Tanh) to He (Kaiming) initialization, which is mathematically optimal for ReLU activation functions.
- Modules:
  - `pytorch_practice/day8_cnn_comparison.py`: Full implementation of the comparative experiment, including visualization logic.
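Applying He (Kaiming) initialization across a model can be sketched with `Module.apply` (the architecture below is a simplified stand-in, not the exact Basic/Deep CNNs from the experiment):

```python
import torch
import torch.nn as nn

def init_he(m):
    """He initialization for ReLU networks; zero biases."""
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 14 * 14, 10),
)
model.apply(init_he)   # recursively visits every submodule

out = model(torch.randn(2, 1, 28, 28))
print(out.shape)  # torch.Size([2, 10])
```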
The objective was to break the 99% accuracy barrier on the MNIST dataset by overcoming the limitations of shallow networks.
- Dataset: MNIST (normalized to $[0, 1]$)
- Hyperparameters: `epochs = 15`, `batch_size = 100`, `lr = 0.001` (Adam).
- Comparison:
  - Model A (Basic): Simple feature extractor + classifier.
  - Model B (Deep): Added depth (3rd layer), Batch Normalization (to mitigate internal covariate shift), and Dropout (p=0.5, for regularization).
| Model | Final Cost | Test Accuracy | Error Rate |
|---|---|---|---|
| Basic CNN | 0.0069 | 98.99% | 1.01% |
| Deep CNN | 0.0059 | 99.29% | 0.71% |
1. The "Silent Killer": Data Scaling Mismatch
- Incident: Initially, the Deep CNN yielded a disastrous 64% accuracy despite a low training loss.
- Root Cause Analysis: The training data was normalized via `transforms.ToTensor()` ($0 \sim 1$), but the test data consisted of raw pixel values ($0 \sim 255$).
- Critical Lesson: Batch Normalization layers are extremely sensitive to input statistics ($\mu, \sigma$). Feeding unscaled data ($255\times$ larger in magnitude) during inference completely shattered the learned statistics. Applying `/ 255.0` to the test set immediately restored accuracy to 99.29%.
2. The Power of Batch Normalization
- Observation: As seen in the Training Cost graph (Left), the Deep CNN (Orange) starts at a significantly lower cost and converges faster than the Basic CNN (Blue).
- Analysis: BN standardizes the inputs to each layer ($\mu=0, \sigma=1$), preventing gradients from vanishing or exploding. This allowed the model to focus on learning complex features from the very first epoch without wasting time adapting to shifting distributions.
3. The Weight of 0.3%p (Error Rate Reduction)
- Analysis: While the accuracy difference ($98.99\% \to 99.29\%$) seems small, it represents a ~30% reduction in the error rate ($1.01\% \to 0.71\%$).
- Conclusion: In the high-performance regime (above 98%), each marginal accuracy gain requires substantially better feature extraction. The combination of deeper layers (semantic complexity) and Dropout (ensemble effect) was necessary to correctly classify the most ambiguous edge cases in MNIST.
Transitioned from training networks from scratch to leveraging massive pre-trained architectures (ResNet-18) via Transfer Learning, solving the critical issue of data scarcity in computer vision tasks.
- Feature Extractor Freezing: Froze the pre-trained weights of the ResNet backbone (`requires_grad = False`) to retain the rich hierarchical features learned from ImageNet, preventing catastrophic forgetting during early training.
- Classifier Replacement: Dynamically extracted the number of input features (`num_ftrs`) and replaced the final fully connected (FC) layer to output 2 classes (ants vs. bees) instead of the original 1,000.
- Advanced Data Augmentation: Implemented a robust `transforms` pipeline including `RandomResizedCrop` and `RandomHorizontalFlip` to artificially expand the highly limited training dataset.
- Modules:
  - `day9_transfer_learning.py`: Full implementation of the fine-tuning pipeline, custom training loop with model checkpointing, and visualization logic.
The objective was to evaluate the extreme efficiency of Transfer Learning by fine-tuning a deep CNN on a "micro" dataset that would typically lead to severe overfitting if trained from scratch.
- Dataset: Hymenoptera (Ants vs. Bees)
- Train: 244 images
- Validation: 153 images
- Hyperparameters: `epochs = 5`, `batch_size = 4`, `lr = 0.001` (SGD with momentum 0.9).
- Hardware: Apple Silicon (MPS).
1. Learning Curve & Convergence

| Model | Pre-trained | Train Size | Training Time | Best Val Accuracy |
|---|---|---|---|---|
| ResNet-18 (Fine-tuned) | Yes (ImageNet) | 244 | ~15 secs | 94.77% |
1. The "Reversed" Learning Curve Phenomenon
- Observation: As seen in the Learning Curve graph, the Validation Accuracy (Orange) starts significantly higher than the Training Accuracy (Purple).
- Analysis: This counter-intuitive result is caused by data augmentation and pre-trained knowledge. The training set is heavily distorted (`RandomResizedCrop`, flips), making classification artificially difficult. Conversely, the validation set is cleanly center-cropped, allowing the already-capable ResNet backbone to classify the inputs easily from epoch 1.
2. Apple Silicon (MPS) Architecture Constraints
- Incident: Encountered a `TypeError` during accuracy calculation: "Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64."
- Solution: Apple's Metal Performance Shaders (MPS) currently lack native support for 64-bit precision (`.double()`). Explicitly downcast the correct-predictions tensor to 32-bit precision using `.float()` before the division, successfully enabling GPU acceleration on the Mac environment.
3. State Dictionary & Memory Isolation (`copy.deepcopy`)
- Observation: In Stochastic Gradient Descent with small datasets, validation loss often fluctuates, meaning the final epoch's model is rarely the optimal one.
- Implementation: Designed checkpointing logic inside the training loop to capture the weights (`model.state_dict()`) whenever a new highest validation accuracy is achieved.
- Insight: Crucially used `copy.deepcopy()` to physically isolate the saved weights in memory. A shallow copy (plain assignment) would leave the "best" weights continuously overwritten by subsequent suboptimal updates, because the state dictionary holds references to the live parameter tensors.
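The aliasing hazard is easy to demonstrate: `state_dict()` returns tensors that share storage with the live parameters, so only a deep copy survives later updates.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(2, 1)

shallow = model.state_dict()               # references the live weight storage
best = copy.deepcopy(model.state_dict())   # isolated snapshot

with torch.no_grad():
    model.weight.add_(1.0)                 # simulate a later training update

print(torch.equal(shallow['weight'], model.weight))  # True  (still linked)
print(torch.equal(best['weight'], model.weight))     # False (snapshot intact)
```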
4. Inverse Normalization for Visualization
- Technique: To visualize the tensor predictions as raw images, implemented an un-normalize pipeline.
- Mechanism: Re-applied the ImageNet statistics by multiplying by the standard deviation (`std * img`) and adding the mean (`+ mean`), followed by `np.clip(img, 0, 1)` to handle overflow. This restored the normalized tensors to a valid RGB color space for human-readable output.









