GSP-Py: A Python-powered library to mine sequential patterns in large datasets, based on the robust Generalized Sequence Pattern (GSP) algorithm. Ideal for market basket analysis, temporal mining, and user journey discovery.
Important
GSP-Py is compatible with Python 3.10 and later versions!
- What is GSP?
- Requirements
- Installation
- Developer Installation
- Usage
- Planned Features
- Contributing
- License
- Citation
The Generalized Sequential Pattern (GSP) algorithm is a sequential pattern mining technique based on Apriori principles. Using support thresholds, GSP identifies frequent sequences of items in transaction datasets.
- Support-based pruning: Only retains sequences that meet the minimum support threshold.
- Candidate generation: Iteratively generates candidate sequences of increasing length.
- General-purpose: Useful in retail, web analytics, social networks, temporal sequence mining, and more.
For example:
- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next."
- In a website clickstream, GSP might find patterns like "Users visit A, then go to B, and later proceed to C."
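To make the support idea concrete, here is a minimal, illustrative sketch of support counting and pruning for single items. This is not GSP-Py's implementation, only the principle the algorithm builds on:

# Illustrative only: count how many transactions contain each item,
# then keep the items whose support meets the minimum threshold.
from collections import Counter

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer", "Eggs"],
    ["Milk", "Diaper", "Beer", "Coke"],
]
min_support = 0.5  # fraction of transactions

counts = Counter(item for tx in transactions for item in set(tx))
frequent = {item: c for item, c in counts.items() if c / len(transactions) >= min_support}
print(frequent)  # e.g., {'Bread': 2, 'Milk': 2, 'Diaper': 2, 'Beer': 2}

GSP then extends the surviving items into longer candidate sequences and repeats the same support check at each length.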
You will need Python installed on your system. On most Linux systems, you can install Python with:
sudo apt install python3
GSP-Py's package dependencies are installed automatically when you install it with pip.
GSP-Py can be easily installed from either the repository or PyPI.
To manually clone the repository and set up the environment:
git clone https://github.com/jacksonpradolima/gsp-py.git
cd gsp-py
Refer to the Developer Installation section and run the setup with uv.
Alternatively, install GSP-Py from PyPI with:
pip install gsppy
This project now uses uv for dependency management and virtual environments.
curl -Ls https://astral.sh/uv/install.sh | bash
Make sure uv is on your PATH (for most Linux setups):
export PATH="$HOME/.local/bin:$PATH"
Create a local virtual environment and install dependencies from uv.lock (single source of truth):
uv venv .venv
uv sync --frozen --extra dev # uses uv.lock
uv pip install -e .
Rust acceleration is optional and provides faster support counting using a PyO3 extension. Python fallback remains available.
Build the extension locally:
make rust-build
Select backend at runtime (auto tries Rust, then falls back to Python):
export GSPPY_BACKEND=rust # or python, or unset for auto
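The backend can also be selected from Python by setting the same environment variable before running a search. A small sketch, assuming GSPPY_BACKEND is read when support counting runs:

import os

# Assumption: GSPPY_BACKEND is consulted when support counting runs,
# so setting it before calling search() is sufficient.
os.environ["GSPPY_BACKEND"] = "rust"  # or "python"; leave unset for auto

from gsppy.gsp import GSP

patterns = GSP([["A", "B"], ["A", "C"], ["A", "B", "C"]]).search(0.5)
print(patterns)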
Run benchmarks (adjust to your machine):
make bench-small
make bench-big # may use significant memory/CPU
# or customize:
GSPPY_BACKEND=auto uv run --python .venv/bin/python --no-project \
python benchmarks/bench_support.py --n_tx 1000000 --tx_len 8 --vocab 50000 --min_support 0.2 --warmup
GPU acceleration is experimental and currently optimizes singleton (k=1) support counting using CuPy. Non-singleton candidates fall back to the Rust/Python backend.
Install the optional extra (choose a CuPy build that matches your CUDA/ROCm setup if needed):
uv run pip install -e .[gpu]
Select the GPU backend at runtime:
export GSPPY_BACKEND=gpu
If GSPPY_BACKEND=gpu is set but no GPU is available, an error is raised. With the default "auto" setting, the CPU backends are used.
After the environment is ready, activate it and run tasks with standard tools:
source .venv/bin/activate
pytest -n auto
ruff check .
pyright
If you prefer, you can also prefix commands with uv without activating:
uv run pytest -n auto
uv run ruff check .
uv run pyright
You can use the Makefile to automate common tasks:
make setup # create .venv with uv and pin Python
make install # sync deps (from uv.lock) + install project (-e .)
make test # pytest -n auto
make lint # ruff check .
make format # ruff --fix
make typecheck # pyright (and mypy if configured)
make pre-commit-install # install the pre-commit hook
make pre-commit-run # run pre-commit on all files
# Rust-specific shortcuts
make rust-setup # install rustup toolchain
make rust-build # build PyO3 extension with maturin
make bench-small # run small benchmark
make bench-big # run large benchmark
Note
Tox in this project uses the "tox-uv" plugin. When running make tox or tox, missing Python interpreters can be provisioned automatically via uv (no need to pre-install all versions). This makes local setup faster.
The library is designed to be easy to use and integrate with your own projects. You can use GSP-Py either programmatically (Python API) or directly from the command line (CLI).
GSP-Py provides a command-line interface (CLI) for running the Generalized Sequential Pattern algorithm on transactional data. This allows you to mine frequent sequential patterns from JSON or CSV files without writing any code.
First, install GSP-Py (if not already installed):
pip install gsppy
This will make the gsppy CLI command available in your environment.
Your input file should be either:
- JSON: A list of transactions, where each transaction is a list of items. Example:
  [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer", "Eggs"],
    ["Milk", "Diaper", "Beer", "Coke"],
    ["Bread", "Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper", "Coke"]
  ]
- CSV: Each row is a transaction, with items separated by commas. Example:
  Bread,Milk
  Bread,Diaper,Beer,Eggs
  Milk,Diaper,Beer,Coke
  Bread,Milk,Diaper,Beer
  Bread,Milk,Diaper,Coke
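If your transactions live in one format and you need the other, a few lines of standard-library Python are enough to convert them; the file names below are placeholders:

import csv
import json

# Convert a CSV transaction file (one transaction per row) into the
# JSON list-of-lists layout shown above. File names are examples only.
with open("transactions.csv", newline="") as f:
    transactions = [row for row in csv.reader(f) if row]

with open("transactions.json", "w") as f:
    json.dump(transactions, f, indent=2)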
Use the following command to run GSPPy on your data:
gsppy --file path/to/transactions.json --min_support 0.3 --backend auto
Or for CSV files:
gsppy --file path/to/transactions.csv --min_support 0.3 --backend rust
- --file: Path to your input file (JSON or CSV). Required.
- --min_support: Minimum support threshold as a fraction (e.g., 0.3 for 30%). Default is 0.2.
- --backend: Backend to use for support counting. One of auto (default), python, rust, or gpu.
- --verbose: (Optional) Enable detailed output for debugging.
Suppose you have a file transactions.json as shown above. To find patterns with at least 30% support:
gsppy --file transactions.json --min_support 0.3
Sample output:
Pre-processing transactions...
Starting GSP algorithm with min_support=0.3...
Run 1: 6 candidates filtered to 5.
Run 2: 20 candidates filtered to 3.
Run 3: 2 candidates filtered to 2.
Run 4: 1 candidates filtered to 0.
GSP algorithm completed.
Frequent Patterns Found:
1-Sequence Patterns:
Pattern: ('Bread',), Support: 4
Pattern: ('Milk',), Support: 4
Pattern: ('Diaper',), Support: 4
Pattern: ('Beer',), Support: 3
Pattern: ('Coke',), Support: 2
2-Sequence Patterns:
Pattern: ('Bread', 'Milk'), Support: 3
Pattern: ('Milk', 'Diaper'), Support: 3
Pattern: ('Diaper', 'Beer'), Support: 3
3-Sequence Patterns:
Pattern: ('Bread', 'Milk', 'Diaper'), Support: 2
Pattern: ('Milk', 'Diaper', 'Beer'), Support: 2
- If the file does not exist or is in an unsupported format, a clear error message will be shown.
- The min_support value must be between 0.0 and 1.0 (exclusive of 0.0, inclusive of 1.0).
To see detailed logs for debugging, add the --verbose flag:
gsppy --file transactions.json --min_support 0.3 --verbose
The following example shows how to use GSP-Py programmatically in Python:
The input to the algorithm is a sequence of transactions, where each transaction contains a sequence of items:
transactions = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke'],
['Bread', 'Milk', 'Diaper', 'Beer'],
['Bread', 'Milk', 'Diaper', 'Coke']
]
Import the GSP class from the gsppy package and call the search method to find frequent patterns with a support threshold (e.g., 0.3):
from gsppy.gsp import GSP
# Example transactions: customer purchases
transactions = [
['Bread', 'Milk'], # Transaction 1
['Bread', 'Diaper', 'Beer', 'Eggs'], # Transaction 2
['Milk', 'Diaper', 'Beer', 'Coke'], # Transaction 3
['Bread', 'Milk', 'Diaper', 'Beer'], # Transaction 4
['Bread', 'Milk', 'Diaper', 'Coke'] # Transaction 5
]
# Set minimum support threshold (30%)
min_support = 0.3
# Find frequent patterns
result = GSP(transactions).search(min_support)
# Output the results
print(result)
The algorithm will return a list of patterns with their corresponding support.
Sample Output:
[
{('Bread',): 4, ('Milk',): 4, ('Diaper',): 4, ('Beer',): 3, ('Coke',): 2},
{('Bread', 'Milk'): 3, ('Milk', 'Diaper'): 3, ('Diaper', 'Beer'): 3},
{('Bread', 'Milk', 'Diaper'): 2, ('Milk', 'Diaper', 'Beer'): 2}
]
- The first dictionary contains single-item sequences with their frequencies (e.g., ('Bread',): 4 means "Bread" appears in 4 transactions).
- The second dictionary contains 2-item sequential patterns (e.g., ('Bread', 'Milk'): 3 means the sequence "Bread → Milk" appears in 3 transactions).
- The third dictionary contains 3-item sequential patterns (e.g., ('Bread', 'Milk', 'Diaper'): 2 means the sequence "Bread → Milk → Diaper" appears in 2 transactions).
Note
The support of a sequence is calculated as the fraction of transactions containing it, e.g., [Bread, Milk] appears in 3 out of 5 transactions → Support = 3 / 5 = 0.6 (60%).
This insight helps identify frequently occurring sequential patterns in datasets, such as shopping trends or user
behavior.
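Continuing from the Python example above, relative support values can be derived from the raw counts in result; a minimal sketch, assuming result keeps the nested structure shown in the sample output:

# Convert absolute counts into support fractions (count / number of transactions).
n = len(transactions)
support_fractions = [
    {pattern: count / n for pattern, count in level.items()}
    for level in result
]
print(support_fractions[1])  # e.g., {('Bread', 'Milk'): 0.6, ...}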
Tip
For more complex examples, see the example scripts in the gsppy/tests folder.
We are actively working to improve GSP-Py. Here are some exciting features planned for future releases:
-
Custom Filters for Candidate Pruning:
- Enable users to define their own pruning logic during the mining process.
-
Support for Preprocessing and Postprocessing:
- Add hooks to allow users to transform datasets before mining and customize the output results.
-
Support for Time-Constrained Pattern Mining:
- Extend GSP-Py to handle temporal datasets by allowing users to define time constraints (e.g., maximum time gaps between events, time windows) during the sequence mining process.
- Enable candidate pruning and support calculations based on these temporal constraints.
Want to contribute or suggest an improvement? Open a discussion or issue!
We welcome contributions from the community! If you'd like to help improve GSP-Py, read our CONTRIBUTING.md guide to get started.
Development dependencies (e.g., testing and linting tools) are handled via uv. To set up and run the main tasks:
uv venv .venv
uv sync --frozen --extra dev
uv pip install -e .
# Run tasks
uv run pytest -n auto
uv run ruff check .
uv run pyright
- Fork the repository.
- Create a feature branch: git checkout -b feature/my-feature
- Commit your changes: git commit -m "Add my feature."
- Push to your branch: git push origin feature/my-feature
- Submit a pull request to the main repository!
Looking for ideas? Check out our Planned Features section.
This project is licensed under the terms of the MIT License. For more details, refer to the LICENSE file.
If GSP-Py contributed to your research or project that led to a publication, we kindly ask that you cite it as follows:
@misc{pradolima_gsppy,
author = {Prado Lima, Jackson Antonio do},
title = {{GSP-Py - Generalized Sequence Pattern algorithm in Python}},
month = Dec,
year = 2025,
doi = {10.5281/zenodo.3333987},
url = {https://doi.org/10.5281/zenodo.3333987}
}