Welcome to SeqLab! This project provides a comprehensive framework for training and evaluating various machine learning models, focusing on multi-feature sequential categorical data.
SeqLab is engineered to facilitate systematic experimentation and benchmarking of machine learning models. Utilizing a configuration-driven approach, researchers and practitioners can specify their experimental setups through a JSON file, ensuring reproducibility and flexibility. The project integrates seamlessly with MLflow, providing robust tools for experiment tracking and model management.
SeqLab is optimized for training models that perform sequence modeling and next-step prediction. For example, consider a sequence of musical chords:
A:min E:min F:maj G:maj A:min C:maj G:maj
SeqLab enables the development of models that learn from such sequences and predict the subsequent chord in the progression. This capability is essential for applications in areas such as music generation and sequence prediction in natural language processing.
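To make the next-step prediction task concrete, here is a minimal sketch of a first-order Markov predictor applied to the chord sequence above. It is illustrative only and is not SeqLab's implementation; the function names `fit_first_order` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_first_order(sequence):
    """Count how often each symbol is followed by each other symbol."""
    transitions = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        transitions[current][following] += 1
    return transitions


def predict_next(transitions, current):
    """Return the most frequent successor of `current`, or None if unseen."""
    followers = transitions.get(current)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


chords = "A:min E:min F:maj G:maj A:min C:maj G:maj".split()
model = fit_first_order(chords)

# In this sequence, G:maj is followed only by A:min.
print(predict_next(model, "G:maj"))  # A:min
```

The neural models listed below address the same task but condition on longer contexts and, in the multi-feature case, on several categorical features per time step.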
- Multiple Model Support
  - Markov
  - Variable-Order Markov
  - LSTM
  - LSTM with Attention
  - Transformer
  - GPT
- Multi-feature Sequential Categorical Data Handling
- Automated Hyperparameter Optimization with Optuna
- Experiment Tracking with MLflow
To get started, set up a virtual environment with Python 3.11 and install the necessary dependencies:
- Set up the virtual environment:
    python3.11 -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install the required dependencies:
    pip install -r requirements.txt
Use the provided example configuration file, example-config.json, which includes all available models and hyperparameters. To customize it, select the configurations that interest you and copy them into your own config.json file.
SeqLab accepts data in two formats:
- TXT Format:
  - Ideal for single feature/dimension data.
  - Each sequence should be represented in a row with space-separated values.
  - Example:
      A B C D E F G H I J
- CSV Format:
  - Supports multiple features/dimensions.
  - Features are tab-separated, with sequences separated by rows containing the `>*` symbol.
  - The first line should contain feature names, with each subsequent row representing an event in time. The rows between the separator rows (`>*`) represent sequences.
  - Example:
      feature1  feature2  feature3
      A         1         x
      B         2         y
      C         3         z
      >*        >*        >*
      D         4         u
      E         5         v
      F         6         w
      >*        >*        >*
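The two formats described above can be read with a few lines of standard-library Python. The sketch below is a hedged illustration of the layout, not SeqLab's own loader; the names `parse_txt`, `parse_multifeature`, and `SEPARATOR` are hypothetical, and it assumes that a separator row carries `>*` in every column, as in the example.

```python
import csv
import io

SEPARATOR = ">*"  # assumed to fill every column of a sequence-separator row


def parse_txt(text):
    """TXT format: one sequence per line, values separated by whitespace."""
    return [line.split() for line in text.splitlines() if line.strip()]


def parse_multifeature(text):
    """CSV format: tab-separated rows with a header; separator rows end a sequence."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    sequences, current = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if all(cell == SEPARATOR for cell in row):
            if current:
                sequences.append(current)
            current = []
        else:
            # Each event becomes a {feature name: value} mapping.
            current.append(dict(zip(header, row)))
    if current:  # keep a trailing sequence that lacks a final separator
        sequences.append(current)
    return header, sequences


txt_sample = "A B C D E\nF G H I J\n"
csv_sample = (
    "feature1\tfeature2\tfeature3\n"
    "A\t1\tx\nB\t2\ty\nC\t3\tz\n"
    ">*\t>*\t>*\n"
    "D\t4\tu\nE\t5\tv\nF\t6\tw\n"
    ">*\t>*\t>*\n"
)

print(parse_txt(txt_sample))
header, sequences = parse_multifeature(csv_sample)
print(header, len(sequences))
```

All values are kept as strings here, which matches the categorical nature of the data.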
After preparing your data, place it in a designated folder (e.g., the `data` folder) and add its path to the list of datasets in the experiment configuration file. Next, configure the following settings:
- Number of splits for k-fold cross-validation: Default is 7.
- Number of trials for model fine-tuning: Default is 20.
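To give a sense of what these two settings imply for runtime, the sketch below splits a dataset into contiguous folds and counts the resulting model fits. It is an assumption-laden illustration: the helper `k_fold_indices` is hypothetical, SeqLab may shuffle or split differently, and the count assumes that every fold runs the full number of trials.

```python
def k_fold_indices(n_items, n_splits=7):
    """Yield (train_indices, test_indices) for contiguous, near-equal folds."""
    if not 2 <= n_splits <= n_items:
        raise ValueError("n_splits must be between 2 and n_items")
    base, extra = divmod(n_items, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional item.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


N_SPLITS = 7   # default number of folds
N_TRIALS = 20  # default number of tuning trials

folds = list(k_fold_indices(n_items=30, n_splits=N_SPLITS))

# Assuming each fold runs all trials, one model on one dataset is fit this many times.
fits_per_model = N_SPLITS * N_TRIALS
print(len(folds), fits_per_model)  # 7 140
```

Lowering either value shortens an experiment proportionally, at the cost of noisier evaluation or a less thorough hyperparameter search.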
To start the experiment, execute the following command:
python run.py
To monitor the experiment process, start the MLflow UI in another terminal:
mlflow ui --port=4000
Then, navigate to 127.0.0.1:4000 in your web browser to access the MLflow tracking UI.
Figure: Visualizing experiment tracking with MLflow in SeqLab. Each experiment set is named after its dimensionality and contains multiple models. Each model is evaluated using different folds of data, with multiple trials per fold to optimize hyperparameters. The MLflow UI stores metrics, evaluation results, and important experiment tags for each run, allowing detailed analysis and comparison of model performance.
For detailed information on using SeqLab, please refer to the project documentation.
If SeqLab contributes to your research, we kindly request that you cite the following publication:
@article{jafari2024striking,
title={Striking a New Chord: Neural Networks in Music Information Dynamics},
author={Jafari, Farshad and Arthur, Claire},
journal={arXiv preprint arXiv:2410.17989},
year={2024}
}