This repository contains the code used to complete the REIT4841 Thesis: Self Supervised Concept Discovery.
dataset.py - CheXpertDataset loader
dataset_investigator.py - Debugging visualisation tools for inspecting samples
evaluate.py - Evaluation metrics and plotting utilities
modules.py - Core CPV components (ViTTokenExtractor, PrototypeConfig, ConceptPrototypeHead)
nlp.py - SciBERT + medSpaCy concept extraction pipeline (Experiment 1)
saliency.py - Prototype-based saliency and attribution utilities
train.py - Training utilities for CPVs
utils.py - Shared helper functions (IO, metrics, plotting, json)
warmup.py - K-means initialisation of CPVs
experiment2 - Logic for experiment 2
experiment3 - Logic for experiment 3
experiment4 - Logic for experiment 4
Experiment 1 can be recreated by running nlp.py, followed by nlp.py --evaluate (generated_csv).
ViTTokenExtractor
Wraps the BiomedCLIP CLIP ViT model.
Returns patch-level embeddings.
Supports frozen or trainable encoders.
PrototypeConfig
Stores the label list; the numbers of positive and negative prototypes;
embedding dimensions; and the margin and temperature hyperparameters.
ConceptPrototypeHead
Computes prototype-to-patch similarity.
Aggregates positive and negative similarities into concept logits.
Provides matrices for visualisation (patch maps, similarity heatmaps).
Together these form a full CPV model.
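As a rough illustration of the similarity-and-aggregation step the head performs, the sketch below computes prototype-to-patch cosine similarity, takes the best-matching patch per prototype, and subtracts negative from positive evidence. This is a minimal pure-Python sketch with illustrative names, not the actual torch implementation; the real head also applies the temperature and margin hyperparameters, which are omitted here.

```python
import math

def cosine(u, v):
    # Cosine similarity between two plain-list vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def concept_logit(patches, pos_protos, neg_protos):
    # For each prototype, keep its best-matching patch (max over patches),
    # then aggregate: strongest positive evidence minus strongest negative.
    pos = max(max(cosine(p, q) for q in patches) for p in pos_protos)
    neg = max(max(cosine(p, q) for q in patches) for p in neg_protos)
    return pos - neg
```

A patch that aligns perfectly with a positive prototype and not at all with any negative prototype yields a logit of 1.0 under this scheme.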
CheXpertDataset
Loads train/val/test splits from a CSV.
Joins CheXpert-Plus impression text.
Supports frontal-only filtering.
Extracts patient/study metadata automatically.
Supports image loading or path-only mode.
Masks uncertain labels (-1 → NaN) if desired.
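The uncertainty-masking convention can be sketched as below (a minimal pure-Python sketch assuming a flat list of per-concept labels; the actual dataset works on CSV columns):

```python
import math

def mask_uncertain(labels):
    # CheXpert marks uncertain findings with -1; map them to NaN so that
    # downstream losses/metrics can skip those entries.
    return [float('nan') if v == -1 else float(v) for v in labels]
```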
make_collate_fn()
Creates a collate function that filters out missing items
and builds dense torch tensors with NaNs preserved.
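The filtering behaviour can be sketched as follows (a simplified stand-in: the real collate function stacks torch tensors rather than lists, and the "image"/"labels" keys are illustrative assumptions):

```python
def simple_collate(batch):
    # Drop items the dataset failed to load (returned as None) so a single
    # bad file doesn't kill the whole batch, then gather the rest into
    # parallel lists; NaN labels are passed through untouched.
    batch = [b for b in batch if b is not None]
    return {
        "image": [b["image"] for b in batch],
        "labels": [b["labels"] for b in batch],
    }
```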
nlp.py defines a SciBERT + medSpaCy pipeline to extract CheXpert concepts
from impression text to produce weak pseudo-labels.
This file is not required for CPV usage, only for generating weak supervision.
An already-processed train_spacy.csv (rewritten using nlp_csv_rewriter) is available; simply set TRAINC_CSV="train_spacy.csv" in the constants above for the relevant file.
numpy
pandas
matplotlib
Pillow
torch
torchvision
scikit-learn
open-clip-torch
spacy
medspacy
scispacy
tqdm
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_scibert-0.5.4.tar.gz <- en_core_sci_scibert scispacy model
This codebase expects:
./train.csv
./valid.csv
train/patient_id/image.jpeg
valid/patient_id/image.jpeg
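A quick way to sanity-check the expected layout before training (a hypothetical helper, not part of the repository):

```python
from pathlib import Path

def check_layout(root):
    # Verify the expected CheXpert layout: the split CSVs sitting next to
    # the train/ and valid/ image folders (which hold patient_id subdirs).
    root = Path(root)
    expected = ("train.csv", "valid.csv", "train", "valid")
    return [name for name in expected if not (root / name).exists()]
    # An empty list means the layout looks right.
```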
This can be directly downloaded from
https://www.kaggle.com/datasets/ashery/chexpert
df_chexpert_plus_240401.csv
This contains the section_impression text; only the CSV is required.
Downloaded from
https://aimi.stanford.edu/datasets/chexpert-plus
I acknowledge that ChatGPT assisted with trivial or repetitive functionality, such as metric calculation, dataset investigation, logging, formatted print statements, sanity checks and exception handling, command-line parameters, and the definition of basic baseline models.
Additionally, both of the files below were generated entirely by ChatGPT:
dataset_investigator.py
utils.py
All conceptual and methodological components described in the thesis methodology remain my sole work.