An Experimental Analysis of Under-Sampling Techniques in Semi-Supervised Anomaly Detection with Auto-Encoders
Author: Bruno Guzzo
Matriculation: 242504
Institution: Università della Calabria - Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica (DIMES)
Degree: Master of Science in Computer Engineering
Academic Year: 2024/2025
Supervisor: Prof. Fabrizio Angiulli
Co-supervisor: Prof. Luca Ferragina
AE-SAD Reconstruction Performance: Comparison of original and reconstructed normal and anomalous digits.
This Master's thesis presents a rigorous experimental investigation into the efficacy of under-sampling techniques within the context of deep semi-supervised anomaly detection. The class imbalance problem, wherein anomalous instances are vastly outnumbered by normal instances, poses a significant challenge to the training of deep learning models, often leading to a bias towards the majority class and consequently, suboptimal detection performance. This research leverages the Auto-Encoder for Semi-Supervised Anomaly Detection (AE-SAD) methodology, a novel approach that modifies the traditional auto-encoder training objective to actively utilize the information from a small set of labeled anomalies. The core of this work is to systematically evaluate how various under-sampling strategies—ranging from random selection to sophisticated neighborhood-based cleaning rules—impact the performance, efficiency, and behavior of the AE-SAD model. Through a comprehensive suite of experiments conducted on the MNIST benchmark dataset, this study demonstrates that under-sampling can yield substantial reductions in computational training time. Furthermore, our findings reveal that in complex, multi-class anomaly scenarios (many-vs-many), judicious application of under-sampling can lead to significant improvements in detection accuracy, as measured by the Area Under the ROC Curve (AUC). The research concludes that a carefully selected under-sampling strategy is a potent tool for augmenting deep semi-supervised anomaly detection frameworks, offering a critical balance between computational efficiency and model effectiveness.
The repository is organized into several distinct directories, each serving a specific purpose within the research workflow.
MSc-AI-ML-thesis-anomaly-detection/
│
├── datasets/ # Contains the raw MNIST dataset files.
│
├── latex/
│ ├── presentation/ # LaTeX source for the thesis presentation.
│ └── thesys/ # LaTeX source for the main thesis document.
│
├── torch-AE-SAD/ # Core implementation of the experimental framework.
│ ├── README.md # High-level documentation for this submodule.
│ ├── dataset_loaders/ # Data loading and preparation scripts.
│ ├── model/ # Model architectures, trainers, and analysis scripts.
│ └── utils/ # Shared utility scripts (e.g., logging).
│
├── undersampling/ # Scripts for synthetic data generation and visualization.
│
├── requirements.txt # Python package dependencies.
└── install_requirements.sh # Shell script for installing dependencies.
- `latex/`: Contains all the LaTeX source code for the thesis document and the final presentation.
- `torch-AE-SAD/`: The primary module containing the complete PyTorch implementation of the experimental framework. It is organized into several sub-packages that handle data loading, model definition, training, and analysis. For a detailed overview of this module, see `torch-AE-SAD/README.md`.
  - `dataset_loaders/`: Manages all data loading and preparation logic. It includes parsers for the MNIST IDX format and builds `Dataset` objects for the one-vs-all and many-vs-many experimental setups. The integration of under-sampling techniques is also handled here. For more details, see `torch-AE-SAD/dataset_loaders/README.md`.
  - `model/`: The core of the project, containing all model architectures, training scripts, and post-processing utilities. For a comprehensive guide, see `torch-AE-SAD/model/README.md`.
- `undersampling/`: Includes scripts for generating synthetic 2D datasets and visualizing the behavior of different under-sampling algorithms, which aids in building an intuitive understanding of their mechanics.
The foundational methodology of this research is the AE-SAD framework \cite{angiulli2024reconstructionerrorbasedanomalydetection}. Traditional auto-encoders, when applied to anomaly detection, are trained on normal data to minimize reconstruction error. The core assumption is that anomalous data will yield a higher reconstruction error. However, deep models can often generalize too well, learning to reconstruct anomalies with low error, thus diminishing their detectability.
AE-SAD addresses this limitation in a semi-supervised setting by leveraging a small number of labeled anomalies. It employs a custom loss function that bifurcates the training objective:
- For normal instances ($y_i = 0$), it minimizes the standard reconstruction error, forcing the auto-encoder to learn an accurate representation.
- For anomalous instances ($y_i = 1$), it minimizes the error between the reconstruction and a transformed version of the input, effectively training the network to reconstruct anomalies incorrectly.
The AE-SAD loss function is formally defined as:

$$
\mathcal{L} = \sum_{i=1}^{n} \Big[ (1 - y_i)\,\lVert x_i - \hat{x}_i \rVert^2 \;+\; y_i\,\lVert F(x_i) - \hat{x}_i \rVert^2 \Big]
$$

where $\hat{x}_i$ denotes the reconstruction of the input $x_i$, $y_i \in \{0, 1\}$ is its label, and $F$ is a fixed transformation that anomalies are mapped onto (typically $F(x) = \mathbf{1} - x$ for data normalized to $[0, 1]$).
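A minimal PyTorch sketch of this objective, under the same assumptions (inputs in $[0, 1]$, $F(x) = \mathbf{1} - x$), is shown below; the authoritative implementation is the one in `torch-AE-SAD/model/`.

```python
import torch


def ae_sad_loss(x, x_hat, y):
    """Bifurcated AE-SAD training objective (sketch).

    x     : (batch, d) inputs scaled to [0, 1]
    x_hat : (batch, d) reconstructions produced by the auto-encoder
    y     : (batch,)   labels, 0 = normal, 1 = labeled anomaly
    """
    y = y.float().unsqueeze(1)
    # Normal instances are pulled towards themselves, anomalies towards F(x) = 1 - x.
    target = (1.0 - y) * x + y * (1.0 - x)
    return torch.mean((x_hat - target) ** 2)


# At inference time the anomaly score is the plain reconstruction error
# ||x - x_hat||^2, and the AUC is computed over these scores.
```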
The following charts illustrate how different activation functions in the decoder's final layer affect the latent space learned by the AE-SAD model. The model is trained to map normal instances (class 0, purple) and anomalous instances (class 1, yellow) into distinct, separable clusters in the latent space. This separation is key to the model's ability to distinguish between normal and anomalous data.
Latent space visualization with a ReLU activation function in the final layer of the decoder.
Latent space visualization with a Linear activation function in the final layer of the decoder.
Latent space visualization with a Tanh activation function in the final layer of the decoder.
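The figures above are scatter plots of a 2-dimensional latent space. As a rough sketch (the encoder interface and plotting details here are assumptions, not the exact code used for the thesis figures), such a plot can be produced by encoding a batch and coloring the two latent coordinates by label:

```python
import matplotlib.pyplot as plt
import torch


@torch.no_grad()
def plot_latent_space(encoder, x, y, path="latent_space.png"):
    """Scatter-plot a 2-D latent space, colored by class (0 = normal, 1 = anomaly).

    encoder : module mapping inputs to a 2-dimensional bottleneck
    x       : (batch, d) input tensor
    y       : array-like of labels on CPU
    """
    z = encoder(x).cpu()
    plt.figure(figsize=(5, 5))
    plt.scatter(z[:, 0], z[:, 1], c=y, cmap="viridis", s=5)
    plt.xlabel("$z_1$")
    plt.ylabel("$z_2$")
    plt.colorbar(label="class")
    plt.savefig(path, dpi=150)
    plt.close()
```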
To combat the class imbalance inherent in anomaly detection tasks, this thesis investigates a wide array of under-sampling techniques. These methods reduce the size of the majority (normal) class to create a more balanced training set. The evaluated techniques are listed below, followed by a short usage sketch:
- Random Under-Sampling: Randomly discards instances from the majority class.
- Cluster Centroids: Uses K-Means to find centroids of the majority class, which then represent the entire class.
- Condensed Nearest Neighbor (CNN): Iteratively selects a subset of instances that can still correctly classify the entire dataset.
- Edited Nearest Neighbors (ENN): Removes instances whose class label differs from the majority of their k-nearest neighbors, cleaning the class boundaries.
- Neighborhood Cleaning Rule (NCR): A more aggressive extension of ENN that, in addition to ENN-style editing, removes majority-class neighbors of misclassified minority instances.
- Tomek Links: Removes majority instances from pairs of nearest neighbors that belong to different classes.
- NearMiss (Versions 1, 2, 3): Selects majority instances based on their distance to the nearest or farthest minority class instances.
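The under-sampling integration itself lives in `torch-AE-SAD/dataset_loaders/`. Independently of that code, and purely as an illustrative sketch, all of the methods listed above are available in the imbalanced-learn package behind a common `fit_resample` interface, which also makes it easy to reproduce the kind of synthetic 2D comparisons generated by the `undersampling/` scripts:

```python
from imblearn.under_sampling import (
    RandomUnderSampler, ClusterCentroids, CondensedNearestNeighbour,
    EditedNearestNeighbours, NeighbourhoodCleaningRule, TomekLinks, NearMiss,
)
from sklearn.datasets import make_classification

# Synthetic imbalanced 2-D data: 95% "normal" (class 0), 5% "anomalous" (class 1).
X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.95, 0.05], random_state=0)

samplers = {
    "random":            RandomUnderSampler(sampling_strategy=0.5, random_state=0),
    "cluster_centroids": ClusterCentroids(random_state=0),
    "cnn":               CondensedNearestNeighbour(random_state=0),
    "enn":               EditedNearestNeighbours(),
    "ncr":               NeighbourhoodCleaningRule(),
    "tomek":             TomekLinks(),
    "nearmiss_1":        NearMiss(version=1),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name:18s} kept {len(y_res)} of {len(y)} instances")
```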
All experiments are conducted on the MNIST dataset, a standard benchmark for image-based machine learning tasks. The dataset consists of 70,000 grayscale images of handwritten digits (0 through 9), each 28x28 pixels, split into 60,000 training images and 10,000 test images.
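The repository parses the raw IDX files shipped in `datasets/` with its own loaders (see `torch-AE-SAD/dataset_loaders/`). Purely as an assumed-equivalent shortcut, the same data can be obtained and flattened for a fully-connected model with torchvision:

```python
import torch
from torchvision import datasets, transforms

# Download (or reuse) MNIST and flatten each 28x28 image into a 784-dim vector in [0, 1].
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Lambda(lambda t: t.view(-1))])
train_set = datasets.MNIST(root="datasets", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="datasets", train=False, download=True, transform=transform)

loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
```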
- Framework: The models and experimental pipelines are implemented in Python using the PyTorch deep learning framework.
- Hardware Acceleration: All training and evaluation processes are accelerated using NVIDIA GPUs via the CUDA platform.
- Models: The primary architecture is a deep, fully-connected Auto-Encoder. Variants using different activation functions (ReLU, Tanh, Linear) and architectures (Convolutional, Variational) are also explored; an illustrative sketch follows this list.
- Optimizer: The AdamW optimizer is used for its improved weight decay implementation, which enhances regularization and model generalization.
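The layer sizes and hyperparameters below are illustrative assumptions rather than the thesis configuration; the sketch only shows how a deep fully-connected auto-encoder with a configurable final activation can be paired with AdamW:

```python
import torch
import torch.nn as nn


def build_autoencoder(input_dim=784, latent_dim=32, final_activation="sigmoid"):
    """Deep fully-connected auto-encoder; the decoder's last activation is configurable."""
    final = {"relu": nn.ReLU(), "tanh": nn.Tanh(),
             "sigmoid": nn.Sigmoid(), "linear": nn.Identity()}[final_activation]
    encoder = nn.Sequential(
        nn.Linear(input_dim, 256), nn.ReLU(),
        nn.Linear(256, 64), nn.ReLU(),
        nn.Linear(64, latent_dim),
    )
    decoder = nn.Sequential(
        nn.Linear(latent_dim, 64), nn.ReLU(),
        nn.Linear(64, 256), nn.ReLU(),
        nn.Linear(256, input_dim), final,
    )
    return nn.Sequential(encoder, decoder)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = build_autoencoder().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```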
To replicate the experimental environment, follow these steps:
- Create a virtual environment (optional but recommended):
  `python3 -m venv .venv`
  `source .venv/bin/activate`
- Install the dependencies. The `install_requirements.sh` script handles the installation of PyTorch with the correct CUDA toolkit version and the other required packages:
  `bash install_requirements.sh`
  Alternatively, you can install the packages from `requirements.txt`:
  `pip install -r requirements.txt`
The primary scripts for conducting the experiments are located in the torch-AE-SAD/model/ directory. For a detailed guide on the model architectures and training scripts, please refer to the torch-AE-SAD/model/README.md.
To run a new experiment:
- Navigate to the `torch-AE-SAD/model/` directory.
- Open the desired trainer script (e.g., `deep_ae_sad_multi_class_trainer.py`).
- Modify the `ANOMALY_CONFIG` and `SAMPLER_CONFIG` lists at the bottom of the script to define your experiment (a hypothetical example is shown after these steps).
- Execute the script as a module from the project root directory:
  `python -m torch-AE-SAD.model.deep_ae_sad_multi_class_trainer`
- Experimental results, including performance metrics and training metadata, will be saved as JSON files in the `torch-AE-SAD/model/json_metrics/staged/` directory.
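The exact schema of `ANOMALY_CONFIG` and `SAMPLER_CONFIG` is defined at the bottom of each trainer script; the entries below are purely hypothetical and only meant to convey the kind of information such an experiment definition encodes (which digits count as normal or anomalous, and which sampler to apply):

```python
# Hypothetical example only -- consult the trainer script for the actual schema.
ANOMALY_CONFIG = [
    {"normal_classes": [0, 1, 2, 3, 4], "anomalous_classes": [5, 6, 7, 8, 9]},  # many-vs-many
    {"normal_classes": [0], "anomalous_classes": list(range(1, 10))},            # one-vs-all
]

SAMPLER_CONFIG = [
    {"sampler": "none"},
    {"sampler": "random_under_sampling", "sampling_strategy": 0.5},
    {"sampler": "cluster_centroids", "sampling_strategy": 0.5},
]
```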
The experimental analysis yielded several key insights:
- Efficiency: Under-sampling techniques dramatically reduce the training time of the AE-SAD model, in some cases by over two orders of magnitude, without a catastrophic loss in performance.
- Performance in Complex Scenarios: In many-vs-many configurations, where both normal and anomalous classes are composed of multiple digit types, moderate under-sampling (e.g., reducing the normal class to 50%) significantly improves the AUC score compared to training on the full imbalanced dataset.
- Method-Specific Strengths:
  - Neighborhood Cleaning Rule (NCR) and Edited Nearest Neighbors (ENN) demonstrated the highest peak AUC scores, particularly in the complex many-vs-many scenario, highlighting the efficacy of boundary-cleaning methods.
  - Cluster Centroids offered the best trade-off between performance and efficiency, maintaining a high AUC with substantial time savings.
  - NearMiss variants proved to be the fastest methods, providing acceptable AUC for rapid prototyping.
The full LaTeX source code for the thesis is available in the latex/thesys/ directory. The final compiled PDF document (main.pdf) contains the complete theoretical background, methodological details, experimental results, and conclusions of this research.