
Self-Supervised Concept Discovery — Core Implementation

This repository contains the code used to complete the REIT4841 thesis, Self-Supervised Concept Discovery.

Repository Structure

dataset.py - CheXpertDataset loader

dataset_investigator.py - Debugging visualisation tools for inspecting samples

evaluate.py - Evaluation metrics and plotting utilities

modules.py - Core CPV components (ViTTokenExtractor, PrototypeConfig, ConceptPrototypeHead)

nlp.py - SciBERT + medSpaCy concept extraction pipeline (also serves as Experiment 1)

saliency.py - Prototype-based saliency and attribution utilities

train.py - Training utilities for CPVs

utils.py - Shared helper functions (IO, metrics, plotting, json)

warmup.py - K-means initialisation of the CPVs

experiment2 - Logic for experiment 2

experiment3 - Logic for experiment 3

experiment4 - Logic for experiment 4

Experiment 1 can be recreated by running nlp.py, then nlp.py --evaluate (generated_csv).

Core Components (modules.py)

ViTTokenExtractor

  Wraps the BiomedCLIP ViT model.

  Returns patch embeddings.

  Supports frozen or trainable encoders.
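The frozen/trainable toggle can be sketched as follows. This is a minimal illustration, and `set_encoder_trainable` is a hypothetical helper name, not the repository's API:

```python
import torch

def set_encoder_trainable(encoder: torch.nn.Module, trainable: bool) -> None:
    """Freeze (or unfreeze) every parameter of a backbone in place."""
    for p in encoder.parameters():
        p.requires_grad = trainable
```

Freezing the encoder this way keeps the pretrained BiomedCLIP features fixed while the prototype head trains on top of them.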

PrototypeConfig

  Stores the label list, the number of positive and negative prototypes,

  the embedding dimension, and the margin and temperature hyperparameters.

ConceptPrototypeHead

  Computes prototype-to-patch similarity.

  Aggregates positive and negative similarities into concept logits.

  Provides matrices for visualisation (patch maps, similarity heatmaps).

Together these form a full CPV model.
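As a rough sketch of how the pieces fit together: cosine similarity between patch embeddings and prototypes, max-pooling over patches and prototypes, and a temperature-scaled positive-minus-negative logit. These specific choices are assumptions for illustration, not the exact implementation in modules.py:

```python
import torch

def concept_logits(patches, pos_protos, neg_protos, temperature=0.07):
    """Hypothetical prototype-to-patch scoring.

    patches:    (B, N, D) patch embeddings from the ViT
    pos_protos: (C, P, D) positive prototypes per concept
    neg_protos: (C, P, D) negative prototypes per concept
    Returns (B, C) concept logits.
    """
    patches = torch.nn.functional.normalize(patches, dim=-1)
    pos = torch.nn.functional.normalize(pos_protos, dim=-1)
    neg = torch.nn.functional.normalize(neg_protos, dim=-1)
    # cosine similarity of every patch to every prototype: (B, N, C, P)
    sim_pos = torch.einsum("bnd,cpd->bncp", patches, pos)
    sim_neg = torch.einsum("bnd,cpd->bncp", patches, neg)
    # best-matching patch/prototype pair per concept
    s_pos = sim_pos.amax(dim=(1, 3))  # (B, C)
    s_neg = sim_neg.amax(dim=(1, 3))  # (B, C)
    return (s_pos - s_neg) / temperature
```

The intermediate (B, N, C, P) similarity tensors are also what the visualisation utilities would draw patch maps and heatmaps from.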

Dataset Handling (dataset.py)

CheXpertDataset

  Loads train/val/test splits from a CSV.

  Joins CheXpert-Plus impression text.

  Supports frontal-only filtering.

  Extracts patient/study metadata automatically.

  Supports image loading or path-only mode.

  Masks uncertain labels (-1 → NaN) if desired.
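The uncertainty masking amounts to mapping CheXpert's uncertain label -1 to NaN, e.g.:

```python
import numpy as np

# CheXpert labels: 1 = positive, 0 = negative, -1 = uncertain
labels = np.array([1.0, 0.0, -1.0, 1.0, -1.0])
masked = np.where(labels == -1.0, np.nan, labels)  # uncertain -> NaN
```

Downstream losses and metrics can then skip NaN entries rather than treating uncertain findings as negatives.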

make_collate_fn()

  Creates a collate function that filters missing items

  and forms dense torch tensors with NaNs preserved.
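A minimal sketch of such a collate function; the field names "image" and "labels" are assumed here, not the repository's exact schema:

```python
import torch

def make_collate_fn():
    """Return a collate_fn that drops failed samples and keeps NaNs."""
    def collate(batch):
        batch = [b for b in batch if b is not None]  # filter missing items
        images = torch.stack([b["image"] for b in batch])
        labels = torch.stack([b["labels"] for b in batch])  # NaNs preserved
        return images, labels
    return collate
```

Returning None from the dataset for unreadable images and filtering here keeps a corrupt file from crashing an entire epoch.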

NLP-based Weak Supervision (optional)

nlp.py defines a SciBERT + medSpaCy pipeline to extract CheXpert concepts

from impression text to produce weak pseudo-labels.

This file is not required for CPV usage, only for generating weak supervision.

An already-processed train_spacy.csv, rewritten using nlp_csv_rewriter, is available; to use it, set TRAINC_CSV="train_spacy.csv" in the constants at the top of the relevant file.
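To illustrate the idea of weak pseudo-labelling, here is a naive keyword match standing in for the SciBERT + medSpaCy pipeline; the concept terms below are invented examples:

```python
CONCEPT_TERMS = {
    "Cardiomegaly": ["cardiomegaly", "enlarged heart"],
    "Pleural Effusion": ["pleural effusion", "effusion"],
}

def weak_labels(impression: str) -> dict:
    """Map an impression string to binary pseudo-labels per concept."""
    text = impression.lower()
    return {concept: int(any(term in text for term in terms))
            for concept, terms in CONCEPT_TERMS.items()}
```

Note that a bare keyword match ignores negation ("no pleural effusion" would still score 1); handling negation and uncertainty context is precisely what the medSpaCy stage of the real pipeline contributes.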

Full Requirements

numpy

pandas

matplotlib

Pillow

torch

torchvision

scikit-learn

open-clip-torch

spacy

medspacy

scispacy

tqdm

https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_scibert-0.5.4.tar.gz <- the en_core_sci_scibert scispacy model

Datasets

This codebase expects:

CheXpertSmall

./train.csv

./valid.csv

train/patient_id/image.jpeg

valid/patient_id/image.jpeg

This can be directly downloaded from

https://www.kaggle.com/datasets/ashery/chexpert

CheXpertPlus

df_chexpert_plus_240401.csv

This contains the section_impression text. Note that only the CSV is required.

Downloaded from

https://aimi.stanford.edu/datasets/chexpert-plus
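The impression join described above can be sketched with pandas; the toy rows and the join key below are assumptions for illustration, except section_impression, which is the CheXpert-Plus column named earlier:

```python
import pandas as pd

# toy stand-ins for train.csv and df_chexpert_plus_240401.csv
chexpert = pd.DataFrame({
    "Path": ["train/patient00001/study1/view1_frontal.jpg"],
    "Cardiomegaly": [1.0],
})
plus = pd.DataFrame({
    "Path": ["train/patient00001/study1/view1_frontal.jpg"],
    "section_impression": ["Stable cardiomegaly."],
})
merged = chexpert.merge(plus, on="Path", how="left")
```

A left join keeps every CheXpert image row even when no impression text is available for it.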

AI Acknowledgements

I acknowledge that ChatGPT assisted with trivial or repetitive functionality, such as metric calculations, dataset investigation, logging, formatting of print statements, sanity checks and exception handling, command-line parameters, and the definition of basic baseline models.

Additionally, both of the files below were generated entirely by ChatGPT:

dataset_investigator.py

utils.py

All conceptual and methodological components described in the thesis methodology remain my sole work.
