# Tutorial 3: Advanced

The first two tutorials covered loading datasets and ranking LMs using default parameters. This one shows how to select a transferability metric and rank layers of a single model.

## Transferability Metrics

Transferability metrics estimate how well a model transfers knowledge from one task to another. For a pre-trained LM, this means assessing how well its embeddings align with a new dataset. In TransformerRanker, datasets are embedded with various LMs, and the embeddings are evaluated against task labels. Three different metrics are available for scoring the embeddings:

- **k-Nearest Neighbors (k-NN):** Uses distance metrics to measure how close embeddings from the same class are. Pairwise distances are calculated, excluding self-distances in the top-k search (a minimal sketch follows this list). See k-NN code.
- **H-Score:** Measures the feature-wise variance between embeddings of different classes. High variance with low feature redundancy results in high transferability. See H-Score code.
- **LogME:** Computes the log marginal likelihood of a linear model on embeddings, optimizing the parameters alpha and beta. See LogME code.
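To make the k-NN metric concrete, here is a minimal, hedged sketch of a leave-one-out k-NN transferability score. The function name `knn_transferability` and its signature are illustrative, not the library's API; see the linked k-NN code for the actual implementation.

```python
import numpy as np

def knn_transferability(embeddings: np.ndarray, labels: np.ndarray, k: int = 3) -> float:
    # Pairwise squared Euclidean distances between all embeddings (n x n)
    dists = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-distances from the top-k search
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors
    # Predict each point's class by majority vote among its neighbors
    # (labels are assumed to be non-negative integer class ids)
    preds = np.array([np.bincount(labels[idx]).argmax() for idx in nearest])
    return float((preds == labels).mean())  # leave-one-out k-NN accuracy
```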

We use two state-of-the-art metrics: LogME and an improved H-Score with shrinkage-based adjustments to the covariance matrix calculation. To use LogME, set the `estimator` parameter when running the ranker:

```python
result = ranker.run(language_models, estimator="logme")
```
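For intuition, here is a hedged sketch of an H-Score with a shrinkage-adjusted covariance estimate. It uses scikit-learn's Ledoit-Wolf estimator as an assumed stand-in for the library's adjustment, and `hscore_shrinkage` is an illustrative function name, not the package's API:

```python
import numpy as np
from sklearn.covariance import LedoitWolf  # shrinkage covariance estimator

def hscore_shrinkage(features: np.ndarray, labels: np.ndarray) -> float:
    features = features - features.mean(axis=0)     # center the embeddings
    # Shrinkage-adjusted feature covariance (more stable than the sample covariance)
    cov_f = LedoitWolf().fit(features).covariance_
    # Replace each embedding with its class mean to obtain the inter-class features
    class_means = {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}
    g = np.stack([class_means[c] for c in labels])
    cov_g = np.cov(g, rowvar=False)
    # H-Score: trace of pinv(cov_f) @ cov_g; higher suggests better transferability
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))
```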

To use embeddings from different layers of a model, set the `layer_aggregator` parameter:

```python
result = ranker.run(language_models, estimator="logme", layer_aggregator="bestlayer")
```

This configuration scores all layers of a language model and selects the one with the highest transferability score. Models are ranked based on their best-performing layers for the dataset.
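Conceptually, the aggregation reduces to scoring each layer's embeddings independently and keeping the maximum. The following is a simplified sketch under that assumption; `best_layer_score` and `score_fn` are hypothetical names, not the library's API:

```python
def best_layer_score(layer_embeddings: dict, labels, score_fn):
    # layer_embeddings maps a layer index (e.g. -1 .. -48) to its embedding matrix;
    # score_fn is any transferability metric, such as H-Score or LogME
    scores = {layer: score_fn(emb, labels) for layer, emb in layer_embeddings.items()}
    best = max(scores, key=scores.get)  # layer with the highest transferability score
    return best, scores[best]
```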

## Layer Ranking

The `bestlayer` option can be used to rank layers of a single LM. Here's an example using the large encoder model `deberta-v2-xxlarge` (1.5 billion parameters and 48 layers) with the CoNLL-03 Named Entity Recognition (NER) dataset:

```python
from datasets import load_dataset
from transformer_ranker import TransformerRanker

# Load the CoNLL-03 dataset
conll = load_dataset('conll2003')

# Use a single language model
language_model = ['microsoft/deberta-v2-xxlarge']

# Initialize the ranker and downsample the dataset
ranker = TransformerRanker(dataset=conll, dataset_downsample=0.2)

# Run it using the 'bestlayer' option
result = ranker.run(language_model, layer_aggregator='bestlayer')

# Print scores for each layer
print(result.layer_scores)
```
**Layer Ranking:** Review the transferability scores for each layer.

The layer with index -41, the eighth from the bottom, gets the highest H-Score.

```
INFO:transformer_ranker.ranker:microsoft/deberta-v2-xxlarge, score: 2.8912 (layer -41)
layer scores: {-1: 2.7377, -2: 2.8024, -3: 2.8312, -4: 2.8270, -5: 2.8293, -6: 2.7952, -7: 2.7894, -8: 2.7777, -9: 2.7490, -10: 2.7020, -11: 2.6537, -12: 2.7227, -13: 2.6930, -14: 2.7187, -15: 2.7494, -16: 2.7002, -17: 2.6834, -18: 2.6210, -19: 2.6126, -20: 2.6459, -21: 2.6693, -22: 2.6730, -23: 2.6475, -24: 2.7037, -25: 2.6768, -26: 2.6912, -27: 2.7300, -28: 2.7525, -29: 2.7691, -30: 2.7436, -31: 2.7702, -32: 2.7866, -33: 2.7737, -34: 2.7550, -35: 2.7269, -36: 2.7723, -37: 2.7586, -38: 2.7969, -39: 2.8551, -40: 2.8692, -41: 2.8912, -42: 2.8530, -43: 2.8646, -44: 2.8655, -45: 2.8210, -46: 2.7836, -47: 2.6945, -48: 2.5153}
```
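To inspect the top layers directly, the returned scores can be sorted. This snippet assumes `result.layer_scores` is the dictionary shown in the log above:

```python
# Sort layers by score, highest first
top_layers = sorted(result.layer_scores.items(), key=lambda kv: kv[1], reverse=True)
print(top_layers[:3])  # [(-41, 2.8912), (-40, 2.8692), (-44, 2.8655)]
```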
**Comparison to Linear Probe:** Review the results from training a linear probe.
| Layer index | LogME | H-Score | Dev F1 (Linear) | Test F1 (Linear) |
|-------------|-------|---------|-----------------|------------------|
| -1 | 0.7320 | 2.7421 | 0.9011 | 0.8674 |
| -2 | 0.7811 | 2.8125 | 0.9035 | 0.8765 |
| -3 | 0.7986 | 2.8460 | 0.9064 | 0.8812 |
| -4 | 0.7993 | 2.8404 | 0.9057 | 0.8786 |
| -5 | 0.7993 | 2.8359 | 0.9063 | 0.8805 |
| -6 | 0.7803 | 2.8073 | 0.9039 | 0.8754 |
| -7 | 0.7749 | 2.7982 | 0.9002 | 0.8739 |
| -8 | 0.7695 | 2.7890 | 0.9017 | 0.8681 |
| -9 | 0.7579 | 2.7614 | 0.8999 | 0.8687 |
| -10 | 0.7415 | 2.7106 | 0.8996 | 0.8688 |
| -11 | 0.7231 | 2.6661 | 0.8979 | 0.8687 |
| -12 | 0.7458 | 2.7311 | 0.8932 | 0.8636 |
| -13 | 0.7303 | 2.7003 | 0.8981 | 0.8634 |
| -14 | 0.7483 | 2.7262 | 0.9011 | 0.8737 |
| -15 | 0.7593 | 2.7564 | 0.9005 | 0.8703 |
| -16 | 0.7300 | 2.7000 | 0.8957 | 0.8688 |
| -17 | 0.7222 | 2.6849 | 0.8944 | 0.8624 |
| -18 | 0.6875 | 2.6224 | 0.8875 | 0.8554 |
| -19 | 0.6816 | 2.6145 | 0.8845 | 0.8582 |
| -20 | 0.6942 | 2.6462 | 0.8890 | 0.8599 |
| -21 | 0.7136 | 2.6780 | 0.8954 | 0.8633 |
| -22 | 0.7275 | 2.6795 | 0.9021 | 0.8721 |
| -23 | 0.7192 | 2.6491 | 0.9043 | 0.8689 |
| -24 | 0.7399 | 2.7007 | 0.9043 | 0.8711 |
| -25 | 0.7306 | 2.6727 | 0.9094 | 0.8754 |
| -26 | 0.7400 | 2.6895 | 0.9090 | 0.8831 |
| -27 | 0.7582 | 2.7315 | 0.9098 | 0.8736 |
| -28 | 0.7642 | 2.7539 | 0.9096 | 0.8777 |
| -29 | 0.7727 | 2.7726 | 0.9154 | 0.8853 |
| -30 | 0.7621 | 2.7496 | 0.9076 | 0.8724 |
| -31 | 0.7746 | 2.7747 | 0.9133 | 0.8844 |
| -32 | 0.7823 | 2.7910 | 0.9115 | 0.8877 |
| -33 | 0.7790 | 2.7797 | 0.9181 | 0.8850 |
| -34 | 0.7746 | 2.7605 | 0.9141 | 0.8832 |
| -35 | 0.7609 | 2.7295 | 0.9135 | 0.8883 |
| -36 | 0.7794 | 2.7719 | 0.9149 | 0.8866 |
| -37 | 0.7695 | 2.7587 | 0.9172 | 0.8875 |
| -38 | 0.7949 | 2.7967 | 0.9176 | 0.8869 |
| -39 | 0.8219 | 2.8569 | 0.9225 | 0.8930 |
| -40 | 0.8276 | 2.8710 | 0.9232 | 0.8960 |
| -41 | 0.8354 | 2.8877 | 0.9274 | 0.8972 |
| -42 | 0.8189 | 2.8541 | 0.9239 | 0.8892 |
| -43 | 0.8267 | 2.8650 | 0.9215 | 0.8887 |
| -44 | 0.8241 | 2.8685 | 0.9163 | 0.8887 |
| -45 | 0.8024 | 2.8297 | 0.9089 | 0.8713 |
| -46 | 0.7792 | 2.7903 | 0.9105 | 0.8656 |
| -47 | 0.7333 | 2.7008 | 0.9006 | 0.8556 |
| -48 | 0.6505 | 2.5113 | 0.8640 | 0.8086 |

**Running Time:** Layer ranking took 1.5 minutes, with 52 seconds for embedding and 33 seconds for scoring. The dataset is embedded once, and each hidden state is scored independently (n estimations for n LM layers). This was performed on a GPU-enabled (A100) Colab Notebook.

## Summary

This tutorial showed how to use the `estimator` and `layer_aggregator` parameters when running the ranker. The library also supports ranking the layers of a single LM.