Docs/historical versions #31


Merged
merged 35 commits into from Aug 29, 2022
Conversation

yifan
Collaborator

@yifan yifan commented Aug 28, 2022

This pull request injects the readthedocs configuration into historical commits and enables readthedocs documentation for historical versions.

Yifan Zhang and others added 30 commits August 28, 2022 14:01
Fixed buggy test where evaluation tensor would have more classes than
training tensor, resulting in an out of bounds failure. This only
happened intermittently as the data was randomly generated.
return_predictions now returns actual predictions instead of the gold
labels.
Loading a huggingface model assumed a tokenizer with the same name was
present. This is not required anymore.
Activation saving code is now part of its own module in data.writer instead
of being part of the extractor. Added tests for writer as well.
While creating tensors for use in downstream probes, the API now allows automatically binarizing the dataset (so that the resulting tensors contain only two labels). Additionally, this commit implements multiclass and binary-class datasets during tensor creation.
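A minimal sketch of such binarization (`binarize_labels` is a hypothetical helper, not the toolkit's actual API): every label other than the chosen positive class collapses into a single negative class.

```python
def binarize_labels(labels, positive_class, negative_label="OTHER"):
    """Collapse a multi-class label list into two classes:
    the chosen positive class vs. everything else."""
    return [label if label == positive_class else negative_label for label in labels]

labels = ["NN", "VB", "NN", "JJ"]
print(binarize_labels(labels, "NN"))  # ['NN', 'OTHER', 'NN', 'OTHER']
```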
Added Contribution guidelines and minor documentation fixes.
This commit implements the Probeless method introduced by the following paper:
 Antverg, Omer and Belinkov, Yonatan. "On the Pitfalls of Analyzing Individual Neurons in Language Models." ICLR'22
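The core idea of the Probeless method is to rank neurons without training any probe, using only the spread of per-class mean activations. A sketch under that assumption (function name and exact aggregation are illustrative, not the toolkit's API):

```python
import numpy as np

def probeless_ranking(activations, labels):
    """Rank neurons without training a probe: for each neuron, sum the
    absolute differences between per-class mean activations. A neuron
    whose mean shifts strongly across classes ranks higher."""
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    means = np.stack([activations[labels == c].mean(axis=0) for c in classes])
    # Pairwise |mean difference| summed over all class pairs, per neuron
    score = np.zeros(activations.shape[1])
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            score += np.abs(means[i] - means[j])
    return np.argsort(-score)  # neuron indices, most informative first
```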
Several options were being ignored when transformers_extractor was used
as a script. This commit fixes this, and a bug which caused activations
of the wrong shape to be saved during decomposition.
This commit adds support for annotating raw text with binary labels depending on presence of words from a given vocab, a regex filter or an arbitrary function that takes a word and returns a boolean label.
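The three annotation modes described above can be unified behind a single dispatch, sketched here with a hypothetical `annotate_words` helper (the toolkit's real entry point may differ):

```python
import re

def annotate_words(words, criterion):
    """Assign a binary label to each word. `criterion` may be a set
    (vocabulary membership), a compiled regex, or any callable that
    maps a word to a boolean."""
    if isinstance(criterion, set):
        test = lambda w: w in criterion
    elif isinstance(criterion, re.Pattern):
        test = lambda w: criterion.search(w) is not None
    else:
        test = criterion
    return [(w, "positive" if test(w) else "negative") for w in words]

words = ["cat", "runs", "dog42"]
print(annotate_words(words, {"cat"}))  # [('cat', 'positive'), ('runs', 'negative'), ('dog42', 'negative')]
print(annotate_words(words, re.compile(r"\d")))      # only 'dog42' is positive
print(annotate_words(words, lambda w: len(w) > 3))   # 'runs' and 'dog42' are positive
```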
Training a linear regression probe resulted in a SyntaxError because of
an incorrect parameter name. This commit fixes this and adds some tests
around the same functionality.
- Moved all scripts to `scripts` folder
- Dependencies are now defined in `setup.cfg` instead of conda/pip
   requirements.txt
- Dev dependencies are now defined in `setup.cfg`
- Updated contribution guidelines and installation instructions
- Switched test runner from `green` to `pytest`
- Updated GitHub actions runner
Data for control tasks can now be created using the functions in `neurox.data.control_task`.

Detailed commit history before squash:
* control task module

* control task example in notebook

* test class for ct prep

* seq labeling case sensitivity

* code formatting

* example description in NB

* typo

* moved ct to data package

* rename method, rm dead code

* adapt asserts in tests

* reorder/rename method params and return value
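The control-task idea (in the style of Hewitt & Liang, 2019) assigns every word *type* a fixed random label, so a probe can only succeed by memorizing vocabulary rather than reading linguistic information from the activations. A minimal sketch, with a hypothetical function name and the case-sensitivity option the commit history mentions:

```python
import random

def create_control_labels(tokens, num_classes, seed=0, case_sensitive=True):
    """Assign each word type a fixed random class; repeated occurrences
    of the same type always receive the same label."""
    rng = random.Random(seed)
    type_label = {}
    labels = []
    for tok in tokens:
        key = tok if case_sensitive else tok.lower()
        if key not in type_label:
            type_label[key] = rng.randrange(num_classes)
        labels.append(type_label[key])
    return labels
```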
`get_top_words` hardcoded the threshold to deem a "word" to be relevant to a neuron. This commit makes the threshold an additional argument the user can specify when using the function.
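The change can be pictured as follows (this is a simplified stand-in, not the actual `get_top_words` signature): the relevance cut-off becomes a keyword argument with the old hard-coded value as its default.

```python
def get_top_words(word_scores, threshold=0.4):
    """Return words whose relevance score for a neuron exceeds
    `threshold` (previously hard-coded, now caller-supplied),
    most relevant first."""
    ranked = sorted(word_scores.items(), key=lambda kv: -kv[1])
    return [w for w, s in ranked if s > threshold]

scores = {"paris": 0.9, "london": 0.6, "and": 0.1}
print(get_top_words(scores))                 # ['paris', 'london']
print(get_top_words(scores, threshold=0.8))  # ['paris']
```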
This commit implements low-precision activation extraction, saving and loading, which helps with storage space as well as saving/loading times.

* optional dtypes

dtype can be explicitly specified in all extraction, writing, loading, probe training and probe evaluation.
Also, probes are trained with mixed precision

* fix default x_dtype in create tensor

- broken test, probably broken during merge

* more efficient dtype assignment during extraction

* rm dtype from write_activations in writer

* rename x_dtype to dtype

* adapt linear_probe

- rm mixed precision
- convert probe to float() in evaluate_probe if necessary. No inplace operation, creates copy of the object

* give dtype to JSON writer even if it is not needed

* no autocast and mixed prec. in training

* clarify method comment

* typo/format

* rm special case for different writers

* always evaluate in float32 regardless
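The storage-precision idea can be sketched with NumPy (hypothetical helper names; the toolkit's own writers use their own formats): store at `float16` to cut disk usage and I/O roughly in half, but always cast back to `float32` before evaluation, mirroring the final commit above.

```python
import numpy as np

def save_activations(path, activations, dtype="float16"):
    """Store activations at reduced precision to save disk space
    and loading time."""
    np.save(path, activations.astype(dtype))

def load_activations(path, eval_dtype="float32"):
    """Load activations and cast to float32 regardless of the
    storage dtype, so downstream evaluation is always full precision."""
    return np.load(path).astype(eval_dtype)
```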
fdalvi and others added 5 commits August 28, 2022 16:16
All toolkit and test code has been formatted with `ufmt` to enforce
consistency in the codebase and future commits.
* Added GitHub action to check code formatting

* Fix action yaml

* Introduce formatting error to test GH action

* Revert "Introduce formatting error to test GH action"

This reverts commit 96c5795.
This commit implements the Intersection over Union method to rank neurons against various target labels introduced by the following paper:
 Mu, J., & Andreas, J. (2020). Compositional explanations of neurons. Advances in Neural Information Processing Systems, 33, 17153-17163.

Detailed commit history

* add iou probe

* add iou probe

* iou probing

* format

* modify testing

* add comments
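The IoU ranking can be sketched as follows (an illustrative implementation, not the toolkit's API): binarize each neuron's activations at a percentile threshold, then measure Intersection-over-Union against a binary label mask, as in Mu & Andreas (2020).

```python
import numpy as np

def iou_scores(activations, labels, percentile=80):
    """Score each neuron by the Intersection-over-Union between its
    binarized activations (above a per-neuron percentile threshold)
    and a binary label mask. High IoU means the neuron fires on
    roughly the same tokens that carry the target label."""
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.percentile(activations, percentile, axis=0)
    fired = activations > thresholds          # (samples, neurons) boolean
    intersection = (fired & labels[:, None]).sum(axis=0)
    union = (fired | labels[:, None]).sum(axis=0)
    return intersection / np.maximum(union, 1)
```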
This commit implements the Gaussian method for probing neurons against various target labels introduced by the following paper:
 Lucas Torroba Hennigen, Adina Williams, and Ryan Cotterell. Intrinsic probing through dimension
 selection. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
 Processing (EMNLP), pp. 197–216, Online, 2020. Association for Computational Linguistics. doi:
 10.18653/v1/2020.emnlp-main.15.

Detailed commit history:
* gaussian probe
* gaussian probe
* format code
* modify
* modify
* modify
* changes on Gaussian
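A minimal sketch of class-conditional Gaussian probing in the spirit of Torroba Hennigen et al. (2020), using a diagonal covariance for simplicity (the paper and the toolkit may use full covariances; names here are illustrative): fit one Gaussian per class over the selected neurons and classify by maximum log-likelihood plus log prior.

```python
import numpy as np

def gaussian_probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit a diagonal Gaussian per class over the selected neurons,
    then classify test points by log-likelihood + log prior and
    report accuracy."""
    train_y = np.asarray(train_y)
    classes = sorted(set(train_y.tolist()))
    params = []
    for c in classes:
        xc = train_x[train_y == c]
        mu, var = xc.mean(axis=0), xc.var(axis=0) + 1e-6  # variance floor
        prior = np.log(len(xc) / len(train_x))
        params.append((mu, var, prior))
    # Log-likelihood of each test point under each class Gaussian
    scores = np.stack([
        prior - 0.5 * (np.log(2 * np.pi * var) + (test_x - mu) ** 2 / var).sum(axis=1)
        for mu, var, prior in params
    ])
    preds = np.array(classes)[scores.argmax(axis=0)]
    return (preds == np.asarray(test_y)).mean()
```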
@fdalvi fdalvi merged commit 291ab81 into fdalvi:docs/historic_versions Aug 29, 2022
@yifan yifan deleted the docs/historical_versions branch August 29, 2022 12:45