Note: the implementation is currently lacking a retraining step. I welcome any PRs to fix this. See #1.
This is a PyTorch reimplementation of computing Shapley values via Truncated Monte Carlo (TMC) sampling, from "What is your data worth? Equitable Valuation of Data" by Amirata Ghorbani and James Zou. The original implementation (in TensorFlow) can be found here.
This implementation is currently designed for neural networks, and the only available performance metric is model classification accuracy, but contributions to expand the implementation are welcome.
Computing Shapley values helps when you need to rank the importance of your training data, for example when pruning harmful images from a training set, or when compensating multiple sources for the data they provide.
It differs from the leave-one-out (LOO) method because Shapley values satisfy three main properties:
- Null data: if adding a datum to any subset of the training data leaves model performance unchanged, its value is zero.
- Equality: for any data x and y, if x and y contribute equally when added to any subset of the training data, then x and y have the same Shapley value.
- Additivity: if datum x contributes S_x(d_1) and S_x(d_2) to test points d_1 and d_2, respectively, then the value of x for both points combined is S_x(d_1) + S_x(d_2).
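The truncated Monte Carlo idea behind this repository can be sketched as follows. This is a minimal illustration, not this repository's API: the function name, the index-based `utility` callable, and the toy label-coverage utility are all hypothetical stand-ins for training a model and measuring test accuracy.

```python
import numpy as np

def tmc_shapley_iteration(n, utility, full_score, tolerance=0.01, rng=None):
    """One TMC iteration: sample a random permutation of the n training
    points, add them one at a time, and record each point's marginal
    contribution to the utility. Truncate (leaving remaining marginals
    at zero) once the running score is within `tolerance` of the score
    on the full training set."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(n)
    marginals = np.zeros(n)
    subset = []
    prev_score = utility(subset)  # utility of the empty set
    for idx in perm:
        if abs(full_score - prev_score) <= tolerance:
            break  # truncation step
        subset.append(idx)
        score = utility(subset)
        marginals[idx] = score - prev_score
        prev_score = score
    return marginals

# Toy utility (hypothetical): fraction of distinct labels covered by the subset.
labels = [0, 0, 1, 1, 2]

def utility(subset):
    return len({labels[i] for i in subset}) / 3
```

Averaging the returned marginals over many iterations yields the Monte Carlo estimate of each point's Shapley value; note how point 4, the only example of its label, always contributes exactly 1/3.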
- Python 3.6 or later
- PyTorch 1.0 or later
- NumPy 1.12 or later
- pickle (ships with Python)
- tqdm
from tmc import DShap
# Supplied by the user:
model = get_my_model()
train_set, test_set = get_my_datasets()
dshap = DShap(model, train_set, test_set, directory='your_directory')
dshap.run(save_every=100, err=0.1, tolerance=0.01)
This outputs a pickle file containing the sampled Shapley values, which you can convert into a NumPy array of shape (iterations, number of training points).
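Post-processing the samples might look like the sketch below. The filename `samples.pkl` is a placeholder (check the directory you passed to `DShap` for the actual output file), and the two-iteration, two-point data written here is only a stand-in so the snippet runs end to end.

```python
import pickle
import numpy as np

# Stand-in for DShap's output: two iterations of marginals for two points.
with open('samples.pkl', 'wb') as f:
    pickle.dump([[0.1, 0.3], [0.2, 0.1]], f)

# Load the pickled samples and reshape them into (iterations, n_train).
with open('samples.pkl', 'rb') as f:
    values = np.asarray(pickle.load(f))

data_shapley = values.mean(axis=0)  # per-point Shapley value estimates
ranking = np.argsort(data_shapley)  # least to most valuable training point
```

Averaging over the iteration axis gives the per-point value estimates; sorting them produces the ranking you would use for pruning or compensation.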