An Experimental Analysis of Under-Sampling Techniques in Semi-Supervised Anomaly Detection with Auto-Encoders
Author: Bruno Guzzo
Matriculation: 242504
Institution: Università della Calabria - Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica (DIMES)
Degree: Master of Science in Computer Engineering
Academic Year: 2024/2025
Supervisor: Prof. Fabrizio Angiulli
Co-supervisor: Prof. Luca Ferragina
AE-SAD Reconstruction Performance: Comparison of original and reconstructed normal and anomalous digits.
This Master's thesis presents a rigorous experimental investigation into the efficacy of under-sampling techniques within the context of deep semi-supervised anomaly detection. The class imbalance problem, wherein anomalous instances are vastly outnumbered by normal instances, poses a significant challenge to the training of deep learning models, often leading to a bias towards the majority class and consequently, suboptimal detection performance. This research leverages the Auto-Encoder for Semi-Supervised Anomaly Detection (AE-SAD) methodology, a novel approach that modifies the traditional auto-encoder training objective to actively utilize the information from a small set of labeled anomalies. The core of this work is to systematically evaluate how various under-sampling strategies—ranging from random selection to sophisticated neighborhood-based cleaning rules—impact the performance, efficiency, and behavior of the AE-SAD model. Through a comprehensive suite of experiments conducted on the MNIST benchmark dataset, this study demonstrates that under-sampling can yield substantial reductions in computational training time. Furthermore, our findings reveal that in complex, multi-class anomaly scenarios (many-vs-many), judicious application of under-sampling can lead to significant improvements in detection accuracy, as measured by the Area Under the ROC Curve (AUC). The research concludes that a carefully selected under-sampling strategy is a potent tool for augmenting deep semi-supervised anomaly detection frameworks, offering a critical balance between computational efficiency and model effectiveness.
The repository is organized into several distinct directories, each serving a specific purpose within the research workflow.
MSc-AI-ML-thesis-anomaly-detection/
│
├── datasets/ # Contains the raw MNIST dataset files.
│
├── latex/
│ ├── presentation/ # LaTeX source for the thesis presentation.
│ └── thesys/ # LaTeX source for the main thesis document.
│
├── torch-AE-SAD/ # Core implementation of the experimental framework.
│ ├── README.md # High-level documentation for this submodule.
│ ├── dataset_loaders/ # Data loading and preparation scripts.
│ ├── model/ # Model architectures, trainers, and analysis scripts.
│ └── utils/ # Shared utility scripts (e.g., logging).
│
├── undersampling/ # Scripts for synthetic data generation and visualization.
│
├── requirements.txt # Python package dependencies.
└── install_requirements.sh # Shell script for installing dependencies.
- `latex/`: Contains all the LaTeX source code for the thesis document and the final presentation.
- `torch-AE-SAD/`: The primary module containing the complete PyTorch implementation of the experimental framework. It is organized into several sub-packages that handle data loading, model definition, training, and analysis. For a detailed overview of this module, see `torch-AE-SAD/README.md`.
  - `dataset_loaders/`: Manages all data loading and preparation logic. It includes parsers for the MNIST IDX format and builds `Dataset` objects for the one-vs-all and many-vs-many experimental setups. The integration of under-sampling techniques is also handled here. For more details, see `torch-AE-SAD/dataset_loaders/README.md`.
  - `model/`: The core of the project, containing all model architectures, training scripts, and post-processing utilities. For a comprehensive guide, see `torch-AE-SAD/model/README.md`.
- `undersampling/`: Includes scripts for generating synthetic 2D datasets and visualizing the behavior of different under-sampling algorithms, which aids in building an intuitive understanding of their mechanics.
The foundational methodology of this research is the AE-SAD framework \cite{angiulli2024reconstructionerrorbasedanomalydetection}. Traditional auto-encoders, when applied to anomaly detection, are trained on normal data to minimize reconstruction error. The core assumption is that anomalous data will yield a higher reconstruction error. However, deep models can often generalize too well, learning to reconstruct anomalies with low error, thus diminishing their detectability.
AE-SAD addresses this limitation in a semi-supervised setting by leveraging a small number of labeled anomalies. It employs a custom loss function that bifurcates the training objective:
- For normal instances ($y_i = 0$), it minimizes the standard reconstruction error, forcing the auto-encoder to learn an accurate representation.
- For anomalous instances ($y_i = 1$), it minimizes the error between the reconstruction and a transformed version of the input, effectively training the network to reconstruct anomalies incorrectly.
The AE-SAD loss function is formally defined as:

$$
\mathcal{L} = \sum_{i=1}^{n} \Big[ (1 - y_i)\,\lVert x_i - \hat{x}_i \rVert^2 \;+\; y_i\,\lVert F(x_i) - \hat{x}_i \rVert^2 \Big]
$$

where $\hat{x}_i$ denotes the reconstruction of the input $x_i$, $y_i \in \{0, 1\}$ is its label, and $F$ is a fixed transformation that anomalies are mapped onto (typically $F(x) = \mathbf{1} - x$ for data normalized to $[0, 1]$).
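A minimal PyTorch sketch of this objective, under the same assumptions (inputs in $[0, 1]$, $F(x) = \mathbf{1} - x$), is shown below; the authoritative implementation is the one in `torch-AE-SAD/model/`.

```python
import torch


def ae_sad_loss(x, x_hat, y):
    """Bifurcated AE-SAD training objective (sketch).

    x     : (batch, d) inputs scaled to [0, 1]
    x_hat : (batch, d) reconstructions produced by the auto-encoder
    y     : (batch,)   labels, 0 = normal, 1 = labeled anomaly
    """
    y = y.float().unsqueeze(1)
    # Normal instances are pulled towards themselves, anomalies towards F(x) = 1 - x.
    target = (1.0 - y) * x + y * (1.0 - x)
    return torch.mean((x_hat - target) ** 2)


# At inference time the anomaly score is the plain reconstruction error
# ||x - x_hat||^2, and the AUC is computed over these scores.
```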
The following charts illustrate how different activation functions in the decoder's final layer affect the latent space learned by the AE-SAD model. The model is trained to map normal instances (class 0, purple) and anomalous instances (class 1, yellow) into distinct, separable clusters in the latent space. This separation is key to the model's ability to distinguish between normal and anomalous data.
Latent space visualization with a ReLU activation function in the final layer of the decoder.
Latent space visualization with a Linear activation function in the final layer of the decoder.
Latent space visualization with a Tanh activation function in the final layer of the decoder.
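The figures above are scatter plots of a 2-dimensional latent space. As a rough sketch (the encoder interface and plotting details here are assumptions, not the exact code used for the thesis figures), such a plot can be produced by encoding a batch and coloring the two latent coordinates by label:

```python
import matplotlib.pyplot as plt
import torch


@torch.no_grad()
def plot_latent_space(encoder, x, y, path="latent_space.png"):
    """Scatter-plot a 2-D latent space, colored by class (0 = normal, 1 = anomaly).

    encoder : module mapping inputs to a 2-dimensional bottleneck
    x       : (batch, d) input tensor
    y       : array-like of labels on CPU
    """
    z = encoder(x).cpu()
    plt.figure(figsize=(5, 5))
    plt.scatter(z[:, 0], z[:, 1], c=y, cmap="viridis", s=5)
    plt.xlabel("$z_1$")
    plt.ylabel("$z_2$")
    plt.colorbar(label="class")
    plt.savefig(path, dpi=150)
    plt.close()
```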
To combat the class imbalance inherent in anomaly detection tasks, this thesis investigates a wide array of under-sampling techniques. These methods reduce the size of the majority (normal) class to create a more balanced training set. The evaluated techniques are listed below, followed by a short usage sketch:
- Random Under-Sampling: Randomly discards instances from the majority class.
- Cluster Centroids: Uses K-Means to find centroids of the majority class, which then represent the entire class.
- Condensed Nearest Neighbor (CNN): Iteratively selects a subset of instances that can still correctly classify the entire dataset.
- Edited Nearest Neighbors (ENN): Removes instances whose class label differs from the majority of their k-nearest neighbors, cleaning the class boundaries.
- Neighborhood Cleaning Rule (NCR): A more aggressive extension of ENN that, in addition to ENN-style editing, removes majority-class neighbors of misclassified minority instances.
- Tomek Links: Removes majority instances from pairs of nearest neighbors that belong to different classes.
- NearMiss (Versions 1, 2, 3): Selects majority instances based on their distance to the nearest or farthest minority class instances.
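The under-sampling integration itself lives in `torch-AE-SAD/dataset_loaders/`. Independently of that code, and purely as an illustrative sketch, all of the methods listed above are available in the imbalanced-learn package behind a common `fit_resample` interface, which also makes it easy to reproduce the kind of synthetic 2D comparisons generated by the `undersampling/` scripts:

```python
from imblearn.under_sampling import (
    RandomUnderSampler, ClusterCentroids, CondensedNearestNeighbour,
    EditedNearestNeighbours, NeighbourhoodCleaningRule, TomekLinks, NearMiss,
)
from sklearn.datasets import make_classification

# Synthetic imbalanced 2-D data: 95% "normal" (class 0), 5% "anomalous" (class 1).
X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.95, 0.05], random_state=0)

samplers = {
    "random":            RandomUnderSampler(sampling_strategy=0.5, random_state=0),
    "cluster_centroids": ClusterCentroids(random_state=0),
    "cnn":               CondensedNearestNeighbour(random_state=0),
    "enn":               EditedNearestNeighbours(),
    "ncr":               NeighbourhoodCleaningRule(),
    "tomek":             TomekLinks(),
    "nearmiss_1":        NearMiss(version=1),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name:18s} kept {len(y_res)} of {len(y)} instances")
```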
All experiments are conducted on the MNIST dataset, a standard benchmark for image-based machine learning tasks. The dataset consists of 70,000 grayscale images of handwritten digits (0 through 9), each 28x28 pixels, split into 60,000 training images and 10,000 test images.
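The repository parses the raw IDX files shipped in `datasets/` with its own loaders (see `torch-AE-SAD/dataset_loaders/`). Purely as an assumed-equivalent shortcut, the same data can be obtained and flattened for a fully-connected model with torchvision:

```python
import torch
from torchvision import datasets, transforms

# Download (or reuse) MNIST and flatten each 28x28 image into a 784-dim vector in [0, 1].
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Lambda(lambda t: t.view(-1))])
train_set = datasets.MNIST(root="datasets", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="datasets", train=False, download=True, transform=transform)

loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
```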
- Framework: The models and experimental pipelines are implemented in Python using the PyTorch deep learning framework.
- Hardware Acceleration: All training and evaluation processes are accelerated using NVIDIA GPUs via the CUDA platform.
- Models: The primary architecture is a deep, fully-connected Auto-Encoder. Variants using different activation functions (ReLU, Tanh, Linear) and architectures (Convolutional, Variational) are also explored; an illustrative sketch follows this list.
- Optimizer: The AdamW optimizer is used for its improved weight decay implementation, which enhances regularization and model generalization.
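The layer sizes and hyperparameters below are illustrative assumptions rather than the thesis configuration; the sketch only shows how a deep fully-connected auto-encoder with a configurable final activation can be paired with AdamW:

```python
import torch
import torch.nn as nn


def build_autoencoder(input_dim=784, latent_dim=32, final_activation="sigmoid"):
    """Deep fully-connected auto-encoder; the decoder's last activation is configurable."""
    final = {"relu": nn.ReLU(), "tanh": nn.Tanh(),
             "sigmoid": nn.Sigmoid(), "linear": nn.Identity()}[final_activation]
    encoder = nn.Sequential(
        nn.Linear(input_dim, 256), nn.ReLU(),
        nn.Linear(256, 64), nn.ReLU(),
        nn.Linear(64, latent_dim),
    )
    decoder = nn.Sequential(
        nn.Linear(latent_dim, 64), nn.ReLU(),
        nn.Linear(64, 256), nn.ReLU(),
        nn.Linear(256, input_dim), final,
    )
    return nn.Sequential(encoder, decoder)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = build_autoencoder().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```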
To replicate the experimental environment, follow these steps:
- Create a virtual environment (optional but recommended):
  `python3 -m venv .venv`
  `source .venv/bin/activate`
- Install the dependencies. The `install_requirements.sh` script handles the installation of PyTorch with the correct CUDA toolkit version and the other required packages:
  `bash install_requirements.sh`
  Alternatively, you can install the packages from `requirements.txt`:
  `pip install -r requirements.txt`
The primary scripts for conducting the experiments are located in the torch-AE-SAD/model/ directory. For a detailed guide on the model architectures and training scripts, please refer to the torch-AE-SAD/model/README.md.
To run a new experiment:
- Navigate to the `torch-AE-SAD/model/` directory.
- Open the desired trainer script (e.g., `deep_ae_sad_multi_class_trainer.py`).
- Modify the `ANOMALY_CONFIG` and `SAMPLER_CONFIG` lists at the bottom of the script to define your experiment (a hypothetical example is shown after these steps).
- Execute the script as a module from the project root directory:
  `python -m torch-AE-SAD.model.deep_ae_sad_multi_class_trainer`
- Experimental results, including performance metrics and training metadata, will be saved as JSON files in the `torch-AE-SAD/model/json_metrics/staged/` directory.
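The exact schema of `ANOMALY_CONFIG` and `SAMPLER_CONFIG` is defined at the bottom of each trainer script; the entries below are purely hypothetical and only meant to convey the kind of information such an experiment definition encodes (which digits count as normal or anomalous, and which sampler to apply):

```python
# Hypothetical example only -- consult the trainer script for the actual schema.
ANOMALY_CONFIG = [
    {"normal_classes": [0, 1, 2, 3, 4], "anomalous_classes": [5, 6, 7, 8, 9]},  # many-vs-many
    {"normal_classes": [0], "anomalous_classes": list(range(1, 10))},            # one-vs-all
]

SAMPLER_CONFIG = [
    {"sampler": "none"},
    {"sampler": "random_under_sampling", "sampling_strategy": 0.5},
    {"sampler": "cluster_centroids", "sampling_strategy": 0.5},
]
```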
The experimental analysis yielded several key insights:
- Efficiency: Under-sampling techniques dramatically reduce the training time of the AE-SAD model, in some cases by over two orders of magnitude, without a catastrophic loss in performance.
- Performance in Complex Scenarios: In many-vs-many configurations, where both normal and anomalous classes are composed of multiple digit types, moderate under-sampling (e.g., reducing the normal class to 50%) significantly improves the AUC score compared to training on the full imbalanced dataset.
- Method-Specific Strengths:
  - Neighborhood Cleaning Rule (NCR) and Edited Nearest Neighbors (ENN) demonstrated the highest peak AUC scores, particularly in the complex many-vs-many scenario, highlighting the efficacy of boundary-cleaning methods.
  - Cluster Centroids offered the best trade-off between performance and efficiency, maintaining a high AUC with substantial time savings.
  - NearMiss variants proved to be the fastest methods, providing acceptable AUC for rapid prototyping.
The full LaTeX source code for the thesis is available in the latex/thesys/ directory. The final compiled PDF document (main.pdf) contains the complete theoretical background, methodological details, experimental results, and conclusions of this research.