This repository contains the code used to complete the REIT4841 Thesis: Self Supervised Concept Discovery.
dataset.py - CheXpertDataset loader
dataset_investigator.py - Debugging visualisation tools for inspecting samples
evaluate.py - Evaluation metrics and plotting utilities
modules.py - Core CPV components (ViTTokenExtractor, PrototypeConfig, ConceptPrototypeHead)
nlp.py - SciBERT + medSpaCy concept extraction pipeline (Experiment 1)
saliency.py - Prototype-based saliency and attribution utilities
train.py - Training utilities for CPVs
utils.py - Shared helper functions (IO, metrics, plotting, json)
warmup.py - K-means initialisation of CPVs
experiment2 - Logic for experiment 2
experiment3 - Logic for experiment 3
experiment4 - Logic for experiment 4
Experiment 1 can be recreated by running nlp.py, followed by nlp.py --evaluate (generated_csv).
ViTTokenExtractor
Wraps the BiomedCLIP CLIP ViT model.
Returns patch-level embeddings.
Supports frozen or trainable encoders.
PrototypeConfig
Stores the label list; the numbers of positive and negative prototypes;
embedding dimensions; and the margin and temperature hyperparameters.
ConceptPrototypeHead
Computes prototype-to-patch similarity.
Aggregates positive and negative similarities into concept logits.
Provides matrices for visualisation (patch maps, similarity heatmaps).
Together these form a full CPV model.
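As a rough illustration of the similarity-and-aggregation step the head performs, the sketch below computes prototype-to-patch cosine similarity, takes the best-matching patch per prototype, and subtracts negative from positive evidence. This is a minimal pure-Python sketch with illustrative names, not the actual torch implementation; the real head also applies the temperature and margin hyperparameters, which are omitted here.

```python
import math

def cosine(u, v):
    # Cosine similarity between two plain-list vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def concept_logit(patches, pos_protos, neg_protos):
    # For each prototype, keep its best-matching patch (max over patches),
    # then aggregate: strongest positive evidence minus strongest negative.
    pos = max(max(cosine(p, q) for q in patches) for p in pos_protos)
    neg = max(max(cosine(p, q) for q in patches) for p in neg_protos)
    return pos - neg
```

A patch that aligns perfectly with a positive prototype and not at all with any negative prototype yields a logit of 1.0 under this scheme.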
CheXpertDataset
Loads train/val/test splits from a CSV.
Joins CheXpert-Plus impression text.
Supports frontal-only filtering.
Extracts patient/study metadata automatically.
Supports image loading or path-only mode.
Masks uncertain labels (-1 → NaN) if desired.
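The uncertainty-masking convention can be sketched as below (a minimal pure-Python sketch assuming a flat list of per-concept labels; the actual dataset works on CSV columns):

```python
import math

def mask_uncertain(labels):
    # CheXpert marks uncertain findings with -1; map them to NaN so that
    # downstream losses/metrics can skip those entries.
    return [float('nan') if v == -1 else float(v) for v in labels]
```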
make_collate_fn()
Creates a collate function that filters out missing items
and builds dense torch tensors with NaNs preserved.
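The filtering behaviour can be sketched as follows (a simplified stand-in: the real collate function stacks torch tensors rather than lists, and the "image"/"labels" keys are illustrative assumptions):

```python
def simple_collate(batch):
    # Drop items the dataset failed to load (returned as None) so a single
    # bad file doesn't kill the whole batch, then gather the rest into
    # parallel lists; NaN labels are passed through untouched.
    batch = [b for b in batch if b is not None]
    return {
        "image": [b["image"] for b in batch],
        "labels": [b["labels"] for b in batch],
    }
```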
nlp.py defines a SciBERT + medSpaCy pipeline to extract CheXpert concepts
from impression text to produce weak pseudo-labels.
This file is not required for CPV usage, only for generating weak supervision.
An already-processed train_spacy.csv (rewritten using nlp_csv_rewriter) is available; simply set TRAINC_CSV="train_spacy.csv" in the constants above for the relevant file.
numpy
pandas
matplotlib
Pillow
torch
torchvision
scikit-learn
open-clip-torch
spacy
medspacy
scispacy
tqdm
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_scibert-0.5.4.tar.gz <- en_core_sci_scibert scispacy model
This codebase expects:
./train.csv
./valid.csv
train/patient_id/image.jpeg
valid/patient_id/image.jpeg
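A quick way to sanity-check the expected layout before training (a hypothetical helper, not part of the repository):

```python
from pathlib import Path

def check_layout(root):
    # Verify the expected CheXpert layout: the split CSVs sitting next to
    # the train/ and valid/ image folders (which hold patient_id subdirs).
    root = Path(root)
    expected = ("train.csv", "valid.csv", "train", "valid")
    return [name for name in expected if not (root / name).exists()]
    # An empty list means the layout looks right.
```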
This can be directly downloaded from
https://www.kaggle.com/datasets/ashery/chexpert
df_chexpert_plus_240401.csv
This contains the section_impression text; only the CSV is required.
Downloaded from
https://aimi.stanford.edu/datasets/chexpert-plus
I acknowledge that ChatGPT assisted with trivial or repetitive functionality, such as metric calculation, dataset investigation, logging, formatted print statements, sanity checks and exception handling, command-line parameters, and the definition of basic baseline models.
Additionally, both of the files below were generated entirely by ChatGPT:
dataset_investigator.py
utils.py
All conceptual and methodological components described in the thesis methodology remain my sole work.