The Spatial Contextualization for Closed Itemset Mining (SCIM) algorithm is a mining procedure that builds a space for the target database in such a way that relevant closed itemsets can be retrieved according to the relative spatial location of their items.
The SCIM algorithm uses Dual Scaling to map the items of the database to a multidimensional metric space called Solution Space. The representation of the database in the Solution Space assists in the interpretation and definition of overlapping clusters of related items. The distances of the items to the centers of the clusters are used as criteria for generating itemsets. Therefore, instead of using a minimum support threshold, SCIM uses a distance threshold defined in terms of the reference and the maximum distances computed per cluster during the mapping procedure.
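To make the idea of a distance-based criterion more concrete, the sketch below shows one possible reading of it. It is not taken from the SCIM source code: the struct `ClusterDistances`, the functions `distance_cutoff` and `select_items`, and the linear interpolation between the reference and maximum distances are all assumptions made for illustration. See the paper for the actual definition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-cluster summary. The names are illustrative and do not
// come from the SCIM implementation.
struct ClusterDistances {
    double reference;  // assumed reference distance of the cluster
    double maximum;    // assumed largest item-to-center distance
};

// Assumption: a ratio in [0, 1] interpolates linearly between the reference
// distance (ratio = 0) and the maximum distance (ratio = 1).
double distance_cutoff(const ClusterDistances& c, double ratio) {
    assert(ratio >= 0.0 && ratio <= 1.0);
    assert(c.maximum >= c.reference);
    return c.reference + ratio * (c.maximum - c.reference);
}

// Returns the indices of the items whose distance to the cluster center
// does not exceed the cutoff computed for the given ratio.
std::vector<std::size_t> select_items(const std::vector<double>& distances,
                                      const ClusterDistances& c,
                                      double ratio) {
    const double cutoff = distance_cutoff(c, ratio);
    std::vector<std::size_t> selected;
    for (std::size_t i = 0; i < distances.size(); ++i) {
        if (distances[i] <= cutoff) {
            selected.push_back(i);
        }
    }
    return selected;
}
```

Under this reading, a larger ratio can only widen the cutoff, so the set of candidate items per cluster never shrinks as the ratio grows.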
The approach was developed by Altobelli B. Mantuan and Leandro A. F. Fernandes. Check out the project's website for details.
This repository includes the C++ implementation of the SCIM algorithm, and a sample application using this implementation.
Please cite our IEEE ICDM 2018 paper if you use this code in your research:
@InProceedings{mantuan_fernandes-icdm-2018,
author = {Mantuan, Altobelli B. and Fernandes, Leandro A. F.},
title = {Spatial contextualization for closed itemset mining},
booktitle = {Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM)},
year = {2018},
pages = {1176--1181},
doi = {https://doi.org/10.1109/ICDM.2018.00155},
url = {http://www.ic.uff.br/~laffernandes/projects/sodm},
}
Do not hesitate to contact Altobelli B. Mantuan (amantuan@ic.uff.br, altobelli.bm@gmail.com) if you encounter any problems.
All code is released under the GNU General Public License, version 3, or (at your option) any later version.
We have compiled and tested the sample application on Linux and Windows using GCC 4.9.1 and Microsoft Visual C++ 2013.
Make sure that you have all the following tools and libraries installed and working before attempting to compile the SCIM implementation.
Required tools:
- GCC 4.9.1 or later (Linux or Windows)
- Microsoft Visual C++ 2013 or later (Windows)
- CMake
Required C++ libraries:
- Eigen
- Boost
Use the git clone command to download the project:
$ git clone https://github.com/Prograf-UFF/SCIM.git SCIM
$ cd SCIM
Make sure you have the environment variables for Eigen (EIGEN3_INCLUDE_DIR) and Boost (BOOST_ROOT) defined in your system.
The basic steps for configuring and building the sample application look like this:
$ mkdir build
$ cd build
$ cmake [-G <generator>] [options] -DCMAKE_BUILD_TYPE=Release ..
Assuming a makefile generator was used:
$ make
To run the sample application, just call:
$ SCIM <database-file-path> <dr-threshold-value> <output-folder-path>
The database file must follow the .num format used by The LUCS-KDD Discretised/normalised ARM and CARM Data Library.
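As an illustration of the expected input, the sketch below reads a .num style database under the assumption that each line holds one transaction written as whitespace-separated positive integer item identifiers. The alias `Transaction` and the function `read_num` are hypothetical names used only for this example. They are not part of the SCIM code base.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One transaction is the list of item identifiers found on one line.
using Transaction = std::vector<int>;

// Reads one transaction per line from the stream. Blank lines are skipped.
// Parsing of a line stops at the first token that is not an integer.
std::vector<Transaction> read_num(std::istream& in) {
    std::vector<Transaction> database;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Transaction transaction;
        int item = 0;
        while (fields >> item) {
            // Assumption of this sketch: identifiers are positive integers.
            assert(item > 0);
            transaction.push_back(item);
        }
        if (!transaction.empty()) {
            database.push_back(transaction);
        }
    }
    return database;
}
```

To read from a file, pass a `std::ifstream` opened on the database path to `read_num`.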
The distance ratio threshold (dr-threshold-value) must be in the [0, 1] range. We believe that 0 (zero) is an excellent initial guess value. The user may increase the parameter value slightly in an exploratory fashion in order to detect more closed itemsets.
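The range check and the exploratory use of the threshold can be sketched as follows. Both `parse_dr_threshold` and `exploratory_thresholds` are hypothetical helpers written for this example. They do not describe how the sample application validates its arguments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Converts the textual threshold to a double and rejects values that are
// not numbers or fall outside the [0, 1] range.
double parse_dr_threshold(const std::string& text) {
    std::size_t used = 0;
    double value = 0.0;
    try {
        value = std::stod(text, &used);
    } catch (const std::exception&) {
        throw std::invalid_argument("dr-threshold-value is not a number: " + text);
    }
    if (used != text.size()) {
        throw std::invalid_argument("dr-threshold-value has trailing characters: " + text);
    }
    // The negated form also rejects NaN.
    if (!(value >= 0.0 && value <= 1.0)) {
        throw std::out_of_range("dr-threshold-value must be in [0, 1]: " + text);
    }
    return value;
}

// Builds a list of thresholds to try in order: 0, step, 2 * step, ...,
// with every value capped at 1.
std::vector<double> exploratory_thresholds(double step, std::size_t count) {
    assert(step > 0.0);
    std::vector<double> values;
    for (std::size_t i = 0; i < count; ++i) {
        values.push_back(std::min(1.0, static_cast<double>(i) * step));
    }
    return values;
}
```

Starting the list at 0 mirrors the suggested initial guess, and a small step keeps each increase modest.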