The first two tutorials covered loading datasets and ranking LMs using default parameters. This one shows how to select a transferability metric and rank layers of a single model.
Transferability metrics estimate how well a model transfers knowledge from one task to another. For a pre-trained LM, this means assessing how well its embeddings align with a new dataset. In TransformerRanker, datasets are embedded with various LMs, and the embeddings are evaluated against task labels. Three different metrics are available for scoring the embeddings:
- k-Nearest Neighbors (k-NN): Uses distance metrics to measure how close embeddings from the same class are. Pairwise distances are calculated, excluding self-distances from the top-k search (a rough sketch follows this list). See k-NN code.
- H-Score: Measures the feature-wise variance between embeddings of different classes. High variance with low feature redundancy results in high transferability. See H-Score code.
- LogME: Computes the log marginal likelihood of a linear model on embeddings, optimizing parameters alpha and beta. See LogME code.
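To make the k-NN idea concrete, here is a rough sketch framed as a leave-one-out k-NN accuracy over embeddings. It is an illustration under simplifying assumptions (Euclidean distance, non-negative integer class labels), not the library's exact implementation:

```python
import numpy as np

def knn_transferability(features: np.ndarray, labels: np.ndarray, k: int = 3) -> float:
    """Leave-one-out k-NN accuracy over embeddings (illustrative sketch)."""
    # Pairwise Euclidean distances between all embeddings
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    # Exclude self-distances from the top-k search
    np.fill_diagonal(dists, np.inf)
    # Indices of the k nearest neighbors of each sample
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote among the neighbors' labels
    predictions = np.array([np.bincount(labels[idx]).argmax() for idx in nearest])
    return float((predictions == labels).mean())
```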
We use two state-of-the-art metrics: LogME and an improved H-Score with shrinkage-based adjustments to the covariance matrix calculation.
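To illustrate the shrinkage adjustment, the sketch below swaps the plain feature covariance for a Ledoit-Wolf shrinkage estimate before computing the H-Score. This is an approximation of the idea, not the library's exact code:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def shrinkage_hscore(features: np.ndarray, labels: np.ndarray) -> float:
    """H-Score with a shrinkage-based feature covariance (illustrative sketch)."""
    f = features - features.mean(axis=0)          # center the embeddings
    cov_f = LedoitWolf().fit(f).covariance_       # shrinkage covariance of features
    # Replace each embedding with its class-conditional mean
    g = np.zeros_like(f)
    for c in np.unique(labels):
        mask = labels == c
        g[mask] = f[mask].mean(axis=0)
    cov_g = np.cov(g, rowvar=False)               # between-class covariance
    # Higher trace => more class-discriminative, less redundant features
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))
```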
To use LogME, set the `estimator` parameter when running the ranker:

```python
result = ranker.run(language_models, estimator="logme")
```
To use embeddings from different layers of a model, set the `layer_aggregator` parameter:

```python
result = ranker.run(language_models, estimator="logme", layer_aggregator="bestlayer")
```
This configuration scores all layers of a language model and selects the one with the highest transferability score. Models are ranked based on their best-performing layers for the dataset.
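Conceptually, best-layer selection amounts to scoring each layer's embeddings with the chosen estimator and keeping the maximum. A minimal sketch of that idea (the names `hidden_states` and `score_fn` are hypothetical, not library internals):

```python
# 'hidden_states' is a list of (num_samples, hidden_dim) arrays, ordered bottom
# to top; 'score_fn' is any transferability estimator, e.g. an H-Score function.
def rank_layers(hidden_states, labels, score_fn):
    scores = {-(i + 1): score_fn(h, labels)        # -1 is the topmost layer
              for i, h in enumerate(reversed(hidden_states))}
    best_layer = max(scores, key=scores.get)       # layer with the highest score
    return best_layer, scores
```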
The `bestlayer` option can also be used to rank the layers of a single LM.
Here's an example using the large encoder model deberta-v2-xxlarge (1.5 billion parameters and 48 layers) with the CoNLL-03 Named Entity Recognition (NER) dataset:
```python
from datasets import load_dataset
from transformer_ranker import TransformerRanker

# Load the CoNLL-03 dataset
conll = load_dataset('conll2003')

# Use a single language model
language_model = ['microsoft/deberta-v2-xxlarge']

# Initialize the ranker and downsample the dataset
ranker = TransformerRanker(dataset=conll, dataset_downsample=0.2)

# Run it using the 'bestlayer' option
result = ranker.run(language_model, layer_aggregator='bestlayer')

# Print the transferability score of each layer
print(result.layer_scores)
```
Layer Ranking: Review the transferability scores for each layer.
The layer at index -41, i.e. the 41st layer counting from the top, achieves the highest H-score.
```
INFO:transformer_ranker.ranker:microsoft/deberta-v2-xxlarge, score: 2.8912 (layer -41)
layer scores: {-1: 2.7377, -2: 2.8024, -3: 2.8312, -4: 2.8270, -5: 2.8293, -6: 2.7952, -7: 2.7894, -8: 2.7777, -9: 2.7490, -10: 2.7020, -11: 2.6537, -12: 2.7227, -13: 2.6930, -14: 2.7187, -15: 2.7494, -16: 2.7002, -17: 2.6834, -18: 2.6210, -19: 2.6126, -20: 2.6459, -21: 2.6693, -22: 2.6730, -23: 2.6475, -24: 2.7037, -25: 2.6768, -26: 2.6912, -27: 2.7300, -28: 2.7525, -29: 2.7691, -30: 2.7436, -31: 2.7702, -32: 2.7866, -33: 2.7737, -34: 2.7550, -35: 2.7269, -36: 2.7723, -37: 2.7586, -38: 2.7969, -39: 2.8551, -40: 2.8692, -41: 2.8912, -42: 2.8530, -43: 2.8646, -44: 2.8655, -45: 2.8210, -46: 2.7836, -47: 2.6945, -48: 2.5153}
```
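Assuming `layer_scores` is the `{layer_index: score}` mapping suggested by the printed output above, the best layer can be recovered directly:

```python
# Assumes result.layer_scores behaves like a {layer_index: score} dict
best_layer = max(result.layer_scores, key=result.layer_scores.get)
print(best_layer)  # -41 in this run
```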
Comparison to Linear Probe: To validate the ranking, compare the transferability scores with the development and test F1 scores obtained by training a linear probe on each layer.
Layer index | LogME | H-score | Dev F1 (Linear) | Test F1 (Linear) |
---|---|---|---|---|
-1 | 0.7320 | 2.7421 | 0.9011 | 0.8674 |
-2 | 0.7811 | 2.8125 | 0.9035 | 0.8765 |
-3 | 0.7986 | 2.8460 | 0.9064 | 0.8812 |
-4 | 0.7993 | 2.8404 | 0.9057 | 0.8786 |
-5 | 0.7993 | 2.8359 | 0.9063 | 0.8805 |
-6 | 0.7803 | 2.8073 | 0.9039 | 0.8754 |
-7 | 0.7749 | 2.7982 | 0.9002 | 0.8739 |
-8 | 0.7695 | 2.7890 | 0.9017 | 0.8681 |
-9 | 0.7579 | 2.7614 | 0.8999 | 0.8687 |
-10 | 0.7415 | 2.7106 | 0.8996 | 0.8688 |
-11 | 0.7231 | 2.6661 | 0.8979 | 0.8687 |
-12 | 0.7458 | 2.7311 | 0.8932 | 0.8636 |
-13 | 0.7303 | 2.7003 | 0.8981 | 0.8634 |
-14 | 0.7483 | 2.7262 | 0.9011 | 0.8737 |
-15 | 0.7593 | 2.7564 | 0.9005 | 0.8703 |
-16 | 0.7300 | 2.7000 | 0.8957 | 0.8688 |
-17 | 0.7222 | 2.6849 | 0.8944 | 0.8624 |
-18 | 0.6875 | 2.6224 | 0.8875 | 0.8554 |
-19 | 0.6816 | 2.6145 | 0.8845 | 0.8582 |
-20 | 0.6942 | 2.6462 | 0.8890 | 0.8599 |
-21 | 0.7136 | 2.6780 | 0.8954 | 0.8633 |
-22 | 0.7275 | 2.6795 | 0.9021 | 0.8721 |
-23 | 0.7192 | 2.6491 | 0.9043 | 0.8689 |
-24 | 0.7399 | 2.7007 | 0.9043 | 0.8711 |
-25 | 0.7306 | 2.6727 | 0.9094 | 0.8754 |
-26 | 0.7400 | 2.6895 | 0.9090 | 0.8831 |
-27 | 0.7582 | 2.7315 | 0.9098 | 0.8736 |
-28 | 0.7642 | 2.7539 | 0.9096 | 0.8777 |
-29 | 0.7727 | 2.7726 | 0.9154 | 0.8853 |
-30 | 0.7621 | 2.7496 | 0.9076 | 0.8724 |
-31 | 0.7746 | 2.7747 | 0.9133 | 0.8844 |
-32 | 0.7823 | 2.7910 | 0.9115 | 0.8877 |
-33 | 0.7790 | 2.7797 | 0.9181 | 0.8850 |
-34 | 0.7746 | 2.7605 | 0.9141 | 0.8832 |
-35 | 0.7609 | 2.7295 | 0.9135 | 0.8883 |
-36 | 0.7794 | 2.7719 | 0.9149 | 0.8866 |
-37 | 0.7695 | 2.7587 | 0.9172 | 0.8875 |
-38 | 0.7949 | 2.7967 | 0.9176 | 0.8869 |
-39 | 0.8219 | 2.8569 | 0.9225 | 0.8930 |
-40 | 0.8276 | 2.8710 | 0.9232 | 0.8960 |
-41 | 0.8354 | 2.8877 | 0.9274 | 0.8972 |
-42 | 0.8189 | 2.8541 | 0.9239 | 0.8892 |
-43 | 0.8267 | 2.8650 | 0.9215 | 0.8887 |
-44 | 0.8241 | 2.8685 | 0.9163 | 0.8887 |
-45 | 0.8024 | 2.8297 | 0.9089 | 0.8713 |
-46 | 0.7792 | 2.7903 | 0.9105 | 0.8656 |
-47 | 0.7333 | 2.7008 | 0.9006 | 0.8556 |
-48 | 0.6505 | 2.5113 | 0.8640 | 0.8086 |
Running Time: Layer ranking took about 1.5 minutes on a GPU-enabled (A100) Colab notebook, with 52 seconds for embedding and 33 seconds for scoring. The dataset is embedded once, and each hidden state is scored independently (n estimations for n LM layers).
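The single embedding pass is possible because encoder models can expose all hidden states in one forward pass. Here is a minimal, library-agnostic sketch using Hugging Face transformers (note the model is ~1.5B parameters, so loading it requires substantial memory):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/deberta-v2-xxlarge"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("TransformerRanker ranks layers.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Tuple of (num_layers + 1) tensors: the embedding output plus one per block
print(len(outputs.hidden_states))  # 49 for deberta-v2-xxlarge
```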
This tutorial showed how to use the `estimator` and `layer_aggregator` parameters when running the ranker. The library also supports ranking the layers of a single LM.