Code used to generate the results of the Data Mining and Knowledge Discovery (DMKD) journal paper "A Practical Approach to Novel Class Discovery in Tabular Data".
This Python library provides a set of tools for the machine learning problem of Novel Class Discovery (NCD).
In this library, you will find the following tools illustrated through Jupyter Notebooks:
- A hyperparameter optimization procedure tailored to transfer the results from the known classes to the novel classes.
- An estimation of the number of clusters by applying clustering quality metrics in the latent space of NCD methods.
- Two unsupervised clustering algorithms modified to utilize the data available in the NCD setting.
- A novel method called PBN (for Projection-Based NCD).
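To illustrate the idea behind the cluster-number estimation listed above, here is a minimal sketch that uses only the standard library. It is not this library's implementation: the 2-D points stand in for a latent representation, plain k-means stands in for the clustering step, and the silhouette coefficient is only one possible clustering quality metric. The names `kmeans`, `silhouette` and `estimate_k` are hypothetical.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's k-means with farthest-point initialisation; returns one label per point."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest already-chosen center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient (higher means better-separated clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # the metric is undefined for a single cluster
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def estimate_k(points, candidates=range(2, 7)):
    """Return the candidate k whose k-means partition has the highest silhouette."""
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))


# Toy "latent space": three well-separated Gaussian blobs of 30 points each.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
latent = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in blob_centers for _ in range(30)]
print(estimate_k(latent))
```

On real data the latent representation would come from a trained NCD model, and several quality metrics can disagree, so the selected value is best treated as an estimate.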
Option 1 - With Anaconda:
# Create the virtual environment and install the packages with conda
conda env create --file environment.yml --prefix ./venvpracticalncd
# Activate the virtual environment
conda activate .\venvpracticalncd
# Add package missing from conda repositories
pip install iteration-utilities==0.11.0
Option 2 - With venv and pip:
Prerequisite: Python 3.10.9 must be the default Python 3.10 version.
# Create the empty virtual environment
py -3.10 -m venv venvpracticalncd
# Activate the virtual environment
# On Windows:
.\venvpracticalncd\Scripts\activate
# On Linux:
source venvpracticalncd/bin/activate
# Install the needed packages
pip install -r requirements.txt
# And finish by installing pytorch independently
pip install torch==1.12.1 --index-url https://download.pytorch.org/whl/cu113
# Add the virtual environment as a jupyter kernel
ipython kernel install --name "venvpracticalncd" --user
# Check if torch supports GPU (you need CUDA 11 installed)
python -c "import torch; print(torch.cuda.is_available())"
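The same check can be run from inside a notebook to decide where tensors should live. This is a minimal sketch that assumes nothing about how the notebooks select their device. `pick_device` is a hypothetical helper, and it falls back to the CPU when PyTorch cannot be imported.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```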
Three notebooks are available:
- Full_notebook.ipynb lets you train and evaluate the models when the number of clusters k is known in advance.
- Full_notebook_with_k_estimation.ipynb lets you train and evaluate the models when the number of clusters k is not known in advance and must be estimated.
- results_wrt_n_unknown_classes.ipynb is used to evaluate the performance of all the models when the number of novel classes increases. It was used to generate Figure C1 of Appendix C.
The datasets will be automatically downloaded from https://archive.ics.uci.edu/ on the first execution.
If it fails, please try disabling proxies.
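One way to disable proxies for a single download, without touching system settings, is to remove the proxy environment variables temporarily. The sketch below assumes the download code honours these variables, as many Python HTTP clients such as `urllib` and `requests` do by default. How this library performs the download is not stated here, and `without_proxies` is a hypothetical helper.

```python
import os
from contextlib import contextmanager

# Environment variables commonly read by HTTP clients to find a proxy.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY", "http_proxy", "https_proxy", "all_proxy")


@contextmanager
def without_proxies():
    """Temporarily remove proxy environment variables, restoring them on exit."""
    saved = {name: os.environ.pop(name) for name in PROXY_VARS if name in os.environ}
    try:
        yield
    finally:
        os.environ.update(saved)


with without_proxies():
    # Run the dataset download here, for example the notebook cell that loads the data.
    pass
```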
Note that the data splits for some datasets are random, so the results can differ from those reported in the paper.
The most impacted datasets are:
- LetterRecognition
- USCensus1990
- multiple_feature
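When a split is random, fixing the seed makes runs repeatable on one machine, although it will not recover the exact splits used in the paper. The following is a minimal sketch using the standard library. `seeded_split` is a hypothetical helper and not a function of this library, and whether the notebooks expose a seed is not stated here.

```python
import random


def seeded_split(n_samples, test_fraction=0.3, seed=0):
    """Shuffle sample indices with a fixed seed and split them into train/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# Two calls with the same seed produce exactly the same split.
train_a, test_a = seeded_split(100, seed=42)
train_b, test_b = seeded_split(100, seed=42)
print(train_a == train_b and test_a == test_b)
```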
If you find this work useful, please cite it as follows:
@article{tr2024practical,
title = {A Practical Approach to Novel Class Discovery in Tabular Data},
author = {Troisemaine, Colin and Reiffers{-}Masson, Alexandre and Gosselin, St{\'{e}}phane and Lemaire, Vincent and Vaton, Sandrine},
journal = {Data Mining and Knowledge Discovery},
year = {2024},
month = {May},
day = {31},
issn = {1573-756X},
doi = {10.1007/s10618-024-01025-y}
}
Copyright (c) 2023 Orange.
This code is released under the MIT license. See the LICENSE file for more information.