
Test Selection of LLM

This repository contains the source code for the paper AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling.

Repo Structure

├── balanced-kmeans
├── baseline_config
├── evaluation
├── kneed
├── our_method_config
├── state_collection # Core functionalities for collecting and analyzing internal states of LLMs
│   ├── base_store.py
│   ├── disk_store.py
│   ├── __init__.py
│   ├── reshape_activations.py
│   ├── state_collector.py # The StateCollector collects the intermediate states of LLMs.
│   ├── store_activation_hook.py
│   ├── tensor_store.py
│   └── tensor_types.py
├── strategy
│   ├── cluster_algo.py
│   ├── __init__.py
│   ├── k_center
│   ├── merge_result.py
│   ├── mmd_critic
│   ├── partition.py # Core functionalities for the proposed method AcTracer with class EmbedPartition.
│   └── test_strategy.py # Baseline methods.
├── eval_partition_config.yaml # Example configurations for AcTracer
├── eval_partition.py # Evaluation scripts for AcTracer
├── evaluate_test_selection.py # Evaluation scripts for baseline methods
├── sub_sample_ablation.py # Evaluation scripts for ablation study
└── utils.py # Utility functions
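
The state_collection module records intermediate activations of the model under test during inference. As an illustration of the general idea only (this is not the repository's actual API), the sketch below captures hidden states with a PyTorch forward hook; the model name and layer index are assumptions chosen for the example.

```python
# Illustrative sketch only: NOT the repository's API, just the common pattern
# of capturing intermediate activations with PyTorch forward hooks.
# The model name and layer index below are assumptions for the example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Transformer blocks may return tuples; keep only the hidden-state tensor.
        hidden = output[0] if isinstance(output, tuple) else output
        captured[name] = hidden.detach().cpu()
    return hook

# Register a hook on one transformer block (layer index is an assumption).
handle = model.transformer.h[6].register_forward_hook(make_hook("block_6"))

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

print(captured["block_6"].shape)  # (batch, seq_len, hidden_dim)
```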

LLM Evaluation

Reference evaluation package: https://github.com/EleutherAI/lm-evaluation-harness

We use lm-evaluation-harness to evaluate the performance of LLMs on the test sets. Example evaluation configurations are provided in the evaluation_configs folder.

The output of the evaluation has the following format:

*dataset#1*.jsonl # The file containing inference results for each prompt

[
    {
        "doc_id": 0, # Number
        "doc": {}, # Input question
        "target": string, # Ground truth
        "arguments": [
            [string], # The input prompt
            {} # Generation configuration
        ],
        "resps": [[string]], # Response
        "filtered_resps": [string], # Filtered response
        "metric": float # Score
    }
]
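
As a minimal sketch of how such a file can be consumed, the snippet below loads a per-prompt JSONL file and computes the mean score. The file name is hypothetical, and the "metric" field name is taken from the schema above; in practice the score key may be task-specific (e.g. "acc" or "exact_match").

```python
# Minimal sketch, assuming the *.jsonl layout documented above.
# File name and the "metric" field name are assumptions.
import json

records = []
with open("dataset1.jsonl", "r") as f:  # hypothetical file name
    for line in f:
        records.append(json.loads(line))

scores = [r["metric"] for r in records if "metric" in r]
if scores:
    print(f"{len(records)} prompts, mean score = {sum(scores) / len(scores):.4f}")
```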

results.json # Aggregated scores and per-dataset configurations

{
    "results": {
        "dataset#1": {},
        "dataset#2": {},
        ...
    },
    "configs": {
        "dataset#1": {"target_delimiter": string, ...},
        "dataset#2": {"target_delimiter": string, ...},
        ...
    }
}
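
Similarly, a short sketch for reading the aggregated results.json described above; the dataset names and the contents of each per-dataset entry are placeholders.

```python
# Sketch for reading the aggregated results.json described above.
# Dataset names and the metric keys inside "results" are placeholders.
import json

with open("results.json", "r") as f:
    summary = json.load(f)

for dataset, metrics in summary["results"].items():
    delimiter = summary["configs"][dataset].get("target_delimiter", " ")
    print(dataset, metrics, f"(target_delimiter={delimiter!r})")
```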

Experiment Results

We share the experiment results for each method under different sampling rates at: https://drive.google.com/drive/folders/1xcmGgqeQjdNUuKux6iu4JK6YKNJ4uaJa?usp=sharing

Credits

SparseAutoencoder: https://github.com/ai-safety-foundation/sparse_autoencoder

TransformerLens: https://github.com/neelnanda-io/TransformerLens

Kneed: https://github.com/arvkevi/kneed

Balanced-Kmeans: https://github.com/kernelmachine/balanced-kmeans/tree/main

lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness

[FSE'19] Boosting Operational DNN Testing Efficiency through Conditioning

[TOSEM'20] Practical Accuracy Estimation for Efficient Deep Neural Network Testing
