Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

The official code for Bring Your Own Data! Self-Supervised Evaluation for Large Language Models. If you have any questions, feel free to email (njain17@umd.edu).

About

To complement conventional evaluation, we propose a framework for self-supervised model evaluation. In this framework, metrics are defined as invariances and sensitivities that can be checked in a self-supervised fashion using interventions based only on the model in question rather than external labels. Self-supervised evaluation pipelines are dataset-agnostic, and so they can be utilized over larger corpora of evaluation data than conventional metrics, or even directly in production systems to monitor day-to-day performance. In this work, we develop this framework, discuss desiderata for such metrics, and provide a number of case studies for self-supervised metrics: knownledge capability, toxicity detection, long-range (context), word-order, and tokenization sensitivities. By developing these new metrics, we hope to provide a more comprehensive and nuanced understanding of the strengths and limitations of LLMs.

Installation

You can run pip install byod to directly install our package. Or, install directly from source via pip install git+https://github.com/neelsjain/BYOD/.

Dependencies

transformers==4.28.1
scipy==1.10.1
torch==2.0.0
datasets==2.11.0
nltk==3.8.1
apache_beam==2.48.0

Python 3.8 or higher is recommended

Usage

See run_model.sh for examples on how to evaluate a model. We provide scripts to run all huggingface models against metrics computed on wikipedia data, as an example. These are named run_[metric].py.

Note that only models are huggingface are currently supported.

You can also use the metrics directly, given your own model, tokenizer, and dataset, like so

import BYOD

long_range_sensitivity = BYOD.lrs_metric(model, data, tokenizer)
negation_knowledge = BYOD.negation_metric(model, data, tokenizer)
tokenization_robustness = BYOD.tokenization_metric(model, data, tokenizer)
toxicity_proxy = BYOD.toxicity_metric(model, data, tokenizer)
word_order_sensitivity = BYOD.word_order_metric(model, data, tokenizer)

Suggestions and Pull Requests are welcome!

Everything can be better! If you have suggestions on improving the codebase or the invariance/sensitivity test. Feel free to reach out!

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
BYOD		BYOD
images		images
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
run_lrs.py		run_lrs.py
run_model.sh		run_model.sh
run_negations.py		run_negations.py
run_tokenization_split.py		run_tokenization_split.py
run_toxicity.py		run_toxicity.py
run_word_order.py		run_word_order.py
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

About

Installation

Dependencies

Usage

Suggestions and Pull Requests are welcome!

About

Releases

Packages

Languages

License

neelsjain/BYOD

Folders and files

Latest commit

History

Repository files navigation

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

About

Installation

Dependencies

Usage

Suggestions and Pull Requests are welcome!

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages