Revisiting Cosine Similarity via Normalized ICA-transformed Embeddings
Hiroaki Yamagiwa, Momose Oyama, Hidetoshi Shimodaira
COLING 2025
This repository is intended to be run in a Docker environment. If you do not use Docker, install the packages listed in requirements.txt instead.
Create a Docker image as follows:
$ bash scripts/docker/build.sh
Set the DOCKER_HOME environment variable to specify the path of the directory to be mounted as the home directory inside the Docker container:
$ export DOCKER_HOME="path/to/your/docker_home"
Run the Docker container by passing the GPU ID as an argument:
$ bash scripts/docker/run.sh 0
Instead of recomputing the embeddings, you can access the embeddings used in the paper through the following links. Note that the sign flip that makes the skewness of each axis positive was not applied to the ICA-transformed embeddings.
Original, PCA-transformed and ICA-transformed embeddings (Google Drive):
Place the file as follows:
output
└── embeddings
└── glove_dic_and_emb.pkl
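The exact contents of the pickle file are determined by the scripts in this repository; the filename suggests it bundles a word dictionary with the embedding matrices. The following sketch round-trips a dummy file under that assumed layout (the keys "words" and "emb" are hypothetical, not the repository's actual schema):

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical layout: a dict holding the vocabulary and an embedding matrix.
# The real keys are defined by src/save_glove_pca_ica_embeddings.py.
dummy = {
    "words": ["ultraviolet", "light"],   # assumed key
    "emb": np.random.randn(2, 300),      # assumed key: 300-dim GloVe vectors
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "glove_dic_and_emb.pkl")
    with open(path, "wb") as f:
        pickle.dump(dummy, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)
```

Inspect the real file's keys with `pickle.load` before relying on any particular schema.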
Download GloVe embeddings as follows:
$ mkdir -p data/embeddings
$ wget https://nlp.stanford.edu/data/glove.6B.zip
$ unzip glove.6B.zip -d data/embeddings/glove.6B
For more details, please refer to the original repository: stanfordnlp/GloVe.
To compute the PCA-transformed and ICA-transformed embeddings:
$ python src/save_glove_pca_ica_embeddings.py
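The script above defines the actual procedure; as a rough orientation, the PCA-then-ICA pipeline with the skewness-based sign flip (which, per the note above, the released files skip) can be sketched on toy data with scikit-learn:

```python
import numpy as np
from scipy.stats import skew
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
S = rng.laplace(size=(1000, 10))        # toy non-Gaussian sources
A = rng.standard_normal((10, 10))       # toy mixing matrix
X = S @ A                               # stand-in for raw embeddings

# PCA-transformed embeddings (whitened, as is typical before ICA).
pca = PCA(whiten=True, random_state=0)
X_pca = pca.fit_transform(X)

# ICA-transformed embeddings; the input is already whitened.
ica = FastICA(whiten=False, random_state=0)
X_ica = ica.fit_transform(X_pca)

# Sign flip so that every axis has positive skewness.
flip = np.where(skew(X_ica, axis=0) >= 0, 1.0, -1.0)
X_ica = X_ica * flip
```

This is a minimal sketch; hyperparameters such as the number of components and the FastICA settings in the repository's script may differ.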
Instead of recomputing the embeddings, you can access the embeddings used in the paper through the following links. Note that the sign flip that makes the skewness of each axis positive was not applied to the ICA-transformed embeddings.
Original, PCA-transformed and ICA-transformed embeddings (Google Drive):
Place them as follows:
output
└── embeddings
├── EleutherAI-pythia-160m_dic_and_emb.pkl
├── bert-base-uncased_dic_and_emb.pkl
├── gpt2_dic_and_emb.pkl
└── roberta-base_dic_and_emb.pkl
Download One Billion Word Benchmark [3] from the link https://www.statmt.org/lm-benchmark/.
Place it as follows. For embedding computations, we use news.en-00001-of-00100.
data
└── 1-billion-word-language-modeling-benchmark-r13output
└── training-monolingual.tokenized.shuffled
├── news.en-00001-of-00100
...
We prioritize sentences containing "ultraviolet" for embedding computations:
$ python src/save_contextualized_embeddings.py
To compute the PCA-transformed and ICA-transformed embeddings:
$ python src/save_contextualized_pca_ica_embeddings.py --model_name bert-base-uncased
$ python src/save_contextualized_pca_ica_embeddings.py --model_name roberta-base
$ python src/save_contextualized_pca_ica_embeddings.py --model_name gpt2
$ python src/save_contextualized_pca_ica_embeddings.py --model_name EleutherAI/pythia-160m
$ python src/Fig1_make_heatmap_comparing_ica_with_pca.py
$ python src/Fig2_make_ultraviolet_and_light_bargraphs.py
$ python src/Fig3_make_normalized_values_histograms.py
(Figure panels: (a) Components, (b) Component-wise products, (c) Component-wise products (magnified).)
(Fig. 12 for PCA is also generated.)
$ python src/Fig4_show_products.py
| word | 53 [chemistry] | 68 [biology] | 141 [space] | 194 [spectrum] | 197 [virology] | cossim with ultraviolet |
|---|---|---|---|---|---|---|
| salts | 0.159 | 0.015 | -0.006 | -0.004 | 0.009 | 0.141 |
| proteins | 0.063 | 0.098 | 0.009 | 0.009 | 0.013 | 0.202 |
| spacecraft | 0.011 | 0.007 | 0.155 | 0.073 | -0.001 | 0.260 |
| light | 0.030 | 0.006 | 0.003 | 0.296 | -0.004 | 0.485 |
| virus | 0.001 | 0.031 | 0.006 | 0.022 | 0.086 | 0.191 |
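The values above illustrate the paper's central identity: after normalization, cosine similarity is exactly the sum of the component-wise products, so each axis contributes an interpretable share. A minimal sketch with random vectors (the axis labels above come from the actual ICA axes, not from this toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(300), rng.standard_normal(300)

# Normalize so that cosine similarity becomes a plain dot product.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

products = a_n * b_n       # per-axis contributions ("semantic similarities")
cossim = products.sum()    # equals cos(a, b) exactly

top_axes = np.argsort(-products)[:5]   # axes contributing most to cossim
```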
🚨 If you run experiments with different settings, please rename the axis labels accordingly.
Axis matching:
$ python src/Fig5_calc_axis_matching_for_GloVe_and_contextualzed.py
Heatmap creation:
$ python src/Fig5_make_heatmap_for_GloVe_and_contextualzed.py
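The matching criterion is defined in the script above; one common approach, sketched here on toy data, pairs axes across two embedding spaces by maximizing total absolute correlation with the Hungarian algorithm (this is an illustration, not necessarily the script's exact method):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
# Toy setup: space B is space A with permuted axes plus a little noise,
# with rows playing the role of a shared vocabulary.
A = rng.standard_normal((500, 8))
perm = rng.permutation(8)
B = A[:, perm] + 0.05 * rng.standard_normal((500, 8))

# Cross-correlation between axes of the two spaces.
C = np.corrcoef(A.T, B.T)[:8, 8:]

# Match axes by maximizing the total absolute correlation.
rows, cols = linear_sum_assignment(-np.abs(C))
```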
Because tokenization differs for each model, we first search for valid sentences containing “ultraviolet” and “light”:
$ python src/Fig6_calc_valid_sentences_for_ultraviolet_and_light_bargraphs.py
🚨 This code ensures that the final embeddings for “ultraviolet” and “light” come from the same sentence across four models, but it does not rigorously handle corner cases where a single sentence might contain multiple occurrences of “ultraviolet” and “light.”
Generate the bar graphs:
$ python src/Fig6_make_ultraviolet_and_light_bargraphs_for_contextualized.py
Initializing the Intruder class takes about one hour, so you can use a pre-initialized file. Place it at data/word_intruder_task/intruder_5.pkl.
Then run src/Fig7_make_intruder_bargraph.ipynb (verified on VSCode).
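The word intruder test [5] probes axis interpretability: for each axis, take its top-ranked words and add an intruder that scores low on that axis but high on another. A toy sketch of one common construction (the repository's Intruder class may build its sets differently):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = np.abs(rng.standard_normal((1000, 50)))       # toy component matrix
vocab = [f"word{i}" for i in range(1000)]

def intruder_set(axis, k=5):
    """Top-k words on an axis plus one intruder: a word ranked in the
    bottom half of this axis but with a large value on another axis."""
    order = np.argsort(-emb[:, axis])
    top = [vocab[i] for i in order[:k]]
    other = (axis + 1) % emb.shape[1]               # hypothetical choice
    low = order[emb.shape[0] // 2:]
    intruder = vocab[low[np.argmax(emb[low, other])]]
    return top, intruder
```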
$ python src/Fig8_make_component_comparison_for_ica_and_pca.py
(a) Sorted along embeddings | (b) Sorted along axes |
---|---|
![]() |
![]() |
The analogy task takes significant time; precomputed results are available under output/evaluation.
For the word similarity task, we modified the repository word-embeddings-benchmarks [4]. Install it via:
$ cd src/word-embeddings-benchmarks
$ pip install -e .
$ cd ../..
Evaluate word similarity:
$ python src/Fig9a_eval_wordsim.py
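Word similarity benchmarks score an embedding by the Spearman correlation between predicted cosine similarities and human ratings. A toy sketch with made-up vectors and scores (the actual evaluation runs through word-embeddings-benchmarks [4]):

```python
import numpy as np
from scipy.stats import spearmanr

# Toy embeddings and (word1, word2, human score) triples.
emb = {"ultraviolet": np.array([0.9, 0.1, 0.0]),
       "light":       np.array([0.7, 0.2, 0.1]),
       "salts":       np.array([0.1, 0.9, 0.0]),
       "virus":       np.array([0.0, 0.2, 0.9])}
pairs = [("ultraviolet", "light", 8.1),
         ("ultraviolet", "salts", 2.3),
         ("light", "virus", 1.5)]

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

preds = [cos(emb[w1], emb[w2]) for w1, w2, _ in pairs]
gold = [s for _, _, s in pairs]
rho, _ = spearmanr(preds, gold)
```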
Evaluate analogy:
# These are not optimized for speed and may each take about two days to run.
$ python src/Fig9b_eval_analogy.py --emb_type ica --task_name Google
$ python src/Fig9b_eval_analogy.py --emb_type pca --task_name Google
$ python src/Fig9b_eval_analogy.py --emb_type ica --task_name MSR
$ python src/Fig9b_eval_analogy.py --emb_type pca --task_name MSR
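Analogy tasks are typically scored with 3CosAdd: the answer to "a is to b as c is to ?" is the word whose normalized embedding is closest to b - a + c, excluding the three query words. A toy sketch (Fig9b_eval_analogy.py may differ in details such as normalization or vocabulary filtering):

```python
import numpy as np

# Toy embeddings for a single analogy question.
emb = {"king":  np.array([0.9, 0.8, 0.1]),
       "man":   np.array([0.1, 0.9, 0.1]),
       "woman": np.array([0.1, 0.1, 0.9]),
       "queen": np.array([0.9, 0.0, 0.9]),
       "light": np.array([0.7, 0.2, 0.1])}

def normalize(v):
    return v / np.linalg.norm(v)

def analogy(a, b, c):
    """3CosAdd: argmax over x of cos(x, b - a + c), excluding a, b, c."""
    target = normalize(emb[b]) - normalize(emb[a]) + normalize(emb[c])
    target = normalize(target)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = normalize(v) @ target
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```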
$ python src/Fig9_make_plots_for_wordsim_and_analogy.py
For reproducibility related to Fig. 2 (GloVe) and Fig. 6 (GPT-2, Pythia-160m), the outputs from src/Table1to3_make_ultraviolet_and_light_embeddings_csv.py are available at output/ultraviolet_and_light.
Configure your environment to reproduce Fig. 2 (for GloVe) or Fig. 6 (for GPT-2, Pythia-160m).
We save information about the embeddings for “ultraviolet” and “light” and create CSV files for the statistical tests used in the paper:
$ python src/Table1to3_make_ultraviolet_and_light_embeddings_csv.py
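The actual CSV schema is defined by the script above; as an orientation, per-axis information for a word pair might be laid out as one row per axis, with the component-wise product as a column (the column names here are hypothetical):

```python
import csv
import io

import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the normalized ICA-transformed embeddings of the two words.
u = rng.standard_normal(300); u /= np.linalg.norm(u)
v = rng.standard_normal(300); v /= np.linalg.norm(v)

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["axis", "ultraviolet", "light", "product"])  # assumed header
for i, (ui, vi) in enumerate(zip(u, v)):
    writer.writerow([i, ui, vi, ui * vi])
```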
For the Mathematica scripts used to compute the $p$-values, run:

$ python src/Table1to3_save_ultraviolet_and_light_pvalues.py
[1] Yamagiwa et al. Discovering Universal Geometry in Embeddings with ICA. EMNLP. 2023.
[2] Yamagiwa et al. Axis Tour: Word Tour Determines the Order of Axes in ICA-transformed Embeddings. EMNLP. 2024 Findings.
[3] Chelba et al. One billion word benchmark for measuring progress in statistical language modeling. INTERSPEECH. 2014.
[4] Jastrzebski et al. How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks. arXiv. 2017.
[5] Musil et al. Exploring Interpretability of Independent Components of Word Embeddings with Automated Word Intruder Test. COLING. 2024.
If you find our code or model useful in your research, please cite our paper:
@inproceedings{yamagiwa-etal-2025-revisiting,
title = "Revisiting Cosine Similarity via Normalized {ICA}-transformed Embeddings",
author = "Yamagiwa, Hiroaki and
Oyama, Momose and
Shimodaira, Hidetoshi",
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Eugenio, Barbara Di and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.497/",
pages = "7423--7452",
abstract = "Cosine similarity is widely used to measure the similarity between two embeddings, while interpretations based on angle and correlation coefficient are common. In this study, we focus on the interpretable axes of embeddings transformed by Independent Component Analysis (ICA), and propose a novel interpretation of cosine similarity as the sum of semantic similarities over axes. The normalized ICA-transformed embeddings exhibit sparsity, enhancing the interpretability of each axis, and the semantic similarity defined by the product of the components represents the shared meaning between the two embeddings along each axis. The effectiveness of this approach is demonstrated through intuitive numerical examples and thorough numerical experiments. By deriving the probability distributions that govern each component and the product of components, we propose a method for selecting statistically significant axes."
}
See README.Appendix.md for the experiments in the Appendix.
- Since the URLs of the published embeddings may change, please cite the GitHub repository URL rather than the direct download URL in papers, etc.
- This directory was created by Hiroaki Yamagiwa.
- The code for the word intrusion task was created by Momose Oyama.
- The Mathematica code for $p$-values was created by Hidetoshi Shimodaira.