In this study, we introduce FEET, a standardized protocol designed to guide the development and benchmarking of foundation models. While numerous benchmarks exist for assessing these models, we propose a structured evaluation across three distinct scenarios to obtain a comprehensive understanding of their practical performance. We define three principal use cases: frozen embeddings, few-shot embeddings, and fully fine-tuned embeddings. Each scenario is detailed and exemplified through a case study in the medical domain, illustrating how these evaluations provide an extensive assessment of the effectiveness of foundation models in research applications. We recommend this protocol as a standard for ongoing research on representation learning models.
This repository introduces FEET (Framework for Evaluating Embedding Techniques), a standardized protocol designed to guide the development and benchmarking of foundation models. FEET focuses on the evaluation of embeddings across three distinct use cases:
- Frozen Embeddings
- Few-Shot Embeddings
- Fully Fine-Tuned Embeddings
The goal is to provide a comprehensive and structured evaluation of embedding techniques that ensures consistent and thorough benchmarking for foundation models. This repository contains the code and tools required to replicate our benchmarking approach.
We define and evaluate foundation models based on three primary embedding scenarios:
- Frozen Embeddings: Pre-trained embeddings used as-is, with no weight updates during downstream training.
- Few-Shot Embeddings: Embeddings adapted using only a small number of labeled examples, testing how well the model learns under limited supervision.
- Fully Fine-Tuned Embeddings: Embeddings updated end to end through a full fine-tuning process on task-specific data.
Each use case is benchmarked with performance metrics to assess how well the models adapt to different levels of customization.
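As a minimal sketch of how these three setups differ in practice, the snippet below contrasts a frozen probe, a few-shot probe, and full fine-tuning. The checkpoint name, toy texts, and logistic-regression probe are illustrative assumptions, not the repository's exact training code.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"  # assumed checkpoint; any BERT-style encoder works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

# Toy stand-ins for clinical notes and binary antibiotic labels.
train_texts = ["fever and productive cough", "routine follow-up, no infection"]
train_labels = [1, 0]

def embed(texts):
    """[CLS] embeddings with the encoder weights held fixed (frozen scenario)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :].numpy()

# 1) Frozen: only a lightweight probe is trained on the fixed embeddings.
frozen_probe = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)

# 2) Few-shot (k=2): the same probe, but fit on only k labeled examples.
few_shot_probe = LogisticRegression(max_iter=1000).fit(embed(train_texts[:2]), train_labels[:2])

# 3) Fully fine-tuned: unfreeze the encoder and train end to end (e.g. with
#    AutoModelForSequenceClassification + Trainer) so the embeddings themselves update.
```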
The main dataset used for evaluation is MIMIC-IV, with a focus on predicting antibiotic prescriptions. It contains medical records and features from patients in critical care settings. To access MIMIC-IV, you must complete the necessary approval process, as it contains sensitive medical data. Preprocessing scripts to extract the antibiotics cohort are provided in `preprocessing/`.
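As a rough illustration of how the extracted cohort might be consumed downstream, here is a hedged sketch; the file path, column names, and label are hypothetical and do not reflect the repository's actual schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical output of the preprocessing scripts; the path and columns
# below are assumptions for illustration only.
cohort = pd.read_csv("preprocessing/antibiotics_cohort.csv")
X_text = cohort["note_text"]             # assumed free-text input column
y = cohort["antibiotic_prescribed"]      # assumed binary prescription label

# Stratified split on the binary label to keep class balance comparable.
train_X, test_X, train_y, test_y = train_test_split(
    X_text, y, test_size=0.2, stratify=y, random_state=42
)
```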
The results of our evaluation are displayed in FEET Tables, where we compare different models across the embedding scenarios. The tables report AUROC and AUPRC scores across the different antibiotics tasks. In addition to absolute scores (first table), we report each adaptation strategy's relative change with respect to the frozen-embedding baseline (second table).
| Models | Frozen | Few-shot (2) | Fine-tuned |
|---|---|---|---|
| BioClinicalBERT | 74.99 | 56.73 | 67.59 |
| MedBERT | 74.22 | 55.49 | 69.35 |
| SciBERT | 73.98 | 52.77 | 68.31 |
Relative change with respect to the frozen baseline:

| Models | Frozen | Few-shot (2) | Fine-tuned |
|---|---|---|---|
| BioClinicalBERT | ------ | -21.11% | -7.40% |
| MedBERT | ------ | -19.94% | -4.87% |
| SciBERT | ------ | -19.80% | -5.67% |
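For reference, a minimal sketch of how AUROC/AUPRC and a relative-change entry could be computed is shown below. The predictions are toy values, and the published tables aggregate over multiple antibiotic tasks, so this formula is illustrative rather than the paper's exact aggregation.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy labels and predicted probabilities for one antibiotic task.
y_true = [0, 1, 0, 1, 1, 0]
y_prob = [0.2, 0.7, 0.5, 0.9, 0.4, 0.3]

auroc = roc_auc_score(y_true, y_prob)
auprc = average_precision_score(y_true, y_prob)

def relative_change(score, frozen_score):
    """Signed percentage change relative to the frozen-embedding score."""
    return 100.0 * (score - frozen_score) / frozen_score

print(f"AUROC={auroc:.3f}, AUPRC={auprc:.3f}")
# e.g. a few-shot score of 0.57 against a frozen score of 0.75:
print(f"Few-shot vs. frozen: {relative_change(0.57, 0.75):+.2f}%")
```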
The FEET framework offers a principled and comprehensive way to evaluate foundation models across different embedding scenarios. By reporting on Frozen, Few-shot, and Fully Fine-tuned embeddings, we provide a deeper understanding of model performance, adaptability, and limitations.
We encourage researchers and practitioners to use FEET as a standard protocol for evaluating foundation models in their studies.
@misc{lee2024feetframeworkevaluatingembedding,
title={FEET: A Framework for Evaluating Embedding Techniques},
author={Simon A. Lee and John Lee and Jeffrey N. Chiang},
year={2024},
eprint={2411.01322},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.01322},
}