GitHub - B0scos/Financial-Funds-Clustering: A Python pipeline for segmenting financial assets using unsupervised learning models like K-Means and GMM. This project evaluates clustering configurations not only on performance (Silhouette Score) but also on their statistical stability across train, test, and validation sets using the Wasserstein distance.

Financial Asset Clustering & Stability Analysis

This project implements an end-to-end machine learning pipeline for segmenting financial assets using unsupervised clustering algorithms. It goes beyond standard performance metrics by introducing a robust evaluation framework that assesses model stability across different data partitions (training, testing, and validation).

The primary goal is to identify meaningful asset clusters and ensure that the characteristics of these clusters are consistent and reliable, which is critical for real-world financial applications.

Project Conclusion and Key Insights

The analysis demonstrates that selecting a model based solely on a single performance metric, such as the Silhouette Score, can be insufficient. A model that performs well on training data may produce inconsistent or unstable clusters when applied to new, unseen data.

By incorporating the Wasserstein distance as a stability score, this project establishes a more complete evaluation methodology. The key finding is that the optimal model is often a trade-off between high performance and high stability. This dual-metric approach helps confirm that the identified asset clusters (e.g., "high-return, low-risk") are not artifacts of the training set but represent distinct groupings that persist on unseen data. This stability is essential for building trust and utility in quantitative financial strategies derived from the model's output.
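
As a rough illustration of the stability score, the sketch below computes the one-dimensional Wasserstein-1 distance between two empirical samples and averages it over the features of a single cluster. It is a dependency-free sketch, not the project's implementation: the function names, the feature names (`annual_return`, `volatility`), the sample values, and the choice to average over features are all assumptions. `scipy.stats.wasserstein_distance` computes the same one-dimensional distance.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Integrates |F_u(x) - F_v(x)| over x, where F is the empirical CDF.
    """
    if not u or not v:
        raise ValueError("both samples must be non-empty")
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right), so the integral is a sum.
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


def cluster_stability(train_features, test_features):
    """Mean per-feature distance for one cluster (lower = more stable).

    Averaging over features is an assumption made for this sketch.
    """
    distances = [
        wasserstein_1d(train_features[name], test_features[name])
        for name in train_features
    ]
    return sum(distances) / len(distances)


# Hypothetical feature values for one cluster in two data partitions.
train = {"annual_return": [0.08, 0.10, 0.12], "volatility": [0.15, 0.18, 0.20]}
test = {"annual_return": [0.07, 0.11, 0.12], "volatility": [0.16, 0.18, 0.25]}
print(cluster_stability(train, test))
```

A distance of zero means the two partitions have identical empirical distributions for that cluster. Because the distance is expressed in the units of each feature, comparing or averaging across features is only meaningful once the features are on a common scale.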

Key Features

Experimentation Pipeline: A systematic framework for tuning hyperparameters, including the number of clusters, model type (K-Means, GMM), and data preprocessing techniques.
Advanced Preprocessing: Implements multiple strategies such as standard scaling and dimensionality reduction (PCA).
Robust Evaluation:
- Performance: Measured using the Silhouette Score to evaluate cluster density and separation.
- Stability: Measured using the Wasserstein Distance to quantify the statistical similarity of cluster characteristics across train, test, and validation sets.
Structured Logging: All experiment parameters and results are automatically saved to CSV files for comprehensive analysis.
Automated Analysis: An analysis script (summary.py) processes the results to identify the best-performing model and the most stable model, highlighting the trade-offs between the two.
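
To make the performance metric concrete, here is a minimal, dependency-free sketch of the mean silhouette coefficient with Euclidean distance. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the coefficient is `(b - a) / max(a, b)`. The function name and the toy points are assumptions for illustration; in practice `sklearn.metrics.silhouette_score` provides this metric.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        if denominator > 0:
            total += (b - a) / denominator
    return total / len(points)


# Toy data: two tight, well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(mean_silhouette(points, [0, 0, 1, 1]))  # close to 1: good separation
print(mean_silhouette(points, [0, 1, 0, 1]))  # negative: poor assignment
```

Values near 1 indicate dense, well-separated clusters, while values near or below 0 indicate overlapping or misassigned points.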

Project Structure

/src: Contains the core application code, organized into:
- models: Clustering model wrappers (K-Means, GMM).
- pipelines: Training and evaluation pipeline logic.
- process: Data preprocessing functions.
- utils: Utility functions, including data loading.
/main.py: The main script to execute the complete experiment pipeline.
/summary.py: The analysis script to interpret experiment results.
/experiment_results.csv: Output file containing detailed, per-cluster metrics from all runs.

Setup and Installation

Prerequisites:
- Python 3.9+
- Git

Clone the repository:

```
git clone <REPOSITORY_URL>
cd <PROJECT_DIRECTORY>
```

Create and activate a virtual environment:

```
python -m venv .venv
# On Windows
.venv\Scripts\activate
# On macOS/Linux
source .venv/bin/activate
```

Install dependencies:
```
pip install -r requirements.txt
```

Usage

Run the Experiment Pipeline: Execute the main script to run all configured experiments. This will train the models, evaluate them, and save the results to experiment_results.csv.
```
python main.py
```
Analyze the Results: Run the analysis script to identify the best-performing and most stable models.
```
python summary.py
```
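
The kind of comparison the analysis step performs can be sketched as follows. This is not the code in summary.py: the column names (`config`, `cluster`, `silhouette`, `wasserstein`), the per-configuration averaging, and the sample rows are all hypothetical and exist only to show how the best-performing and the most stable configurations can differ.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def summarize(csv_file):
    """Return (best_performing, most_stable) configuration names.

    Assumes hypothetical columns: config, cluster, silhouette, wasserstein.
    """
    silhouettes = defaultdict(list)
    distances = defaultdict(list)
    for row in csv.DictReader(csv_file):
        silhouettes[row["config"]].append(float(row["silhouette"]))
        distances[row["config"]].append(float(row["wasserstein"]))
    # Higher silhouette means better separation; lower distance means more stable.
    best_performing = max(silhouettes, key=lambda c: mean(silhouettes[c]))
    most_stable = min(distances, key=lambda c: mean(distances[c]))
    return best_performing, most_stable


# Illustrative rows only; these are not results from the project.
sample = io.StringIO(
    "config,cluster,silhouette,wasserstein\n"
    "kmeans_k2,0,0.52,0.40\n"
    "kmeans_k2,1,0.52,0.30\n"
    "gmm_k2,0,0.45,0.10\n"
    "gmm_k2,1,0.45,0.12\n"
)
print(summarize(sample))  # ('kmeans_k2', 'gmm_k2')
```

In this made-up example the configuration with the highest Silhouette Score is not the one with the lowest Wasserstein distance, which is the trade-off the analysis is meant to surface.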

Features

Data Ingestion Module: Automatically collects and prepares raw data.
Processing and Cleaning: Pipelines to validate, clean, and transform data.
Feature Engineering: Creation and selection of features to optimize model performance.
Model Training: Support for multiple clustering algorithms, such as K-Means and Gaussian Mixture Models (GMM).
Modular Structure: Code organized into reusable components, facilitating maintenance and expansion.

Project Structure

The project is organized into the following main directories:

/data_ingestion: Module responsible for the initial collection and storage of data. It contains its own logic, CLI, and configurations.
/src: Contains the main application code, including processing pipelines, model training, and utilities.
/notebooks: Jupyter Notebooks for exploratory analysis, testing, and prototyping.
/main.py: Main entry point to orchestrate the project's pipelines.
/requirements.txt: List of the project's Python dependencies.

Getting Started

Follow the instructions below to set up and run the project in your local environment.

Prerequisites

Python 3.9 or higher
Git

Installation

Clone the repository to your local machine:

```
git clone <REPOSITORY_URL>
cd <PROJECT_NAME>
```

Create a virtual environment and activate it:

```
python -m venv .venv
source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
```

Install the required dependencies:
```
pip install -r requirements.txt
```

🛠️ Usage

The project execution is divided into two main steps: data ingestion and pipeline training.

1. Data Ingestion (Brief)

The data_ingestion module is responsible for downloading and processing the raw data. It has its own command-line interface (CLI) to start the process. For more details, refer to the README.md inside the data_ingestion directory.

To run the ingestion, execute the module's main script from the project root:

```
python data_ingestion/main.py <CLI_COMMANDS>
```

2. Training Pipeline

After the ingestion step is complete, the data will be ready to be processed and used for training the models. The main.py script in the project root orchestrates all steps of the main pipeline.

To run the full pipeline (processing, feature selection, and training), execute:

```
python main.py
```

⚙️ Configuration

Project settings, such as file paths, model parameters, and environment configurations, can be found and modified in the following locations:

Data Ingestion: data_ingestion/config/
Main Pipeline: src/config/
