This is the repository for the paper "One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations".
The directory structure should look like this:
DualCSE
├── figures/                # Paper figures
│   └── *.png
├── scripts/                # Shell scripts
│   └── *.sh
├── src/                    # Our main implementation
│   └── *.py
├── prompt/
│   └── prompt.txt
├── datasets/
│   ├── inli/               # Cloned INLI repo
│   │   ├── INLI Data
│   │   ├── Resources
│   │   └── ...
│   └── impscore/           # Downloaded from ImpScore repo
│       └── all_data.csv
└── ...
Before running the code, please clone the INLI dataset repository into datasets/inli under the project root:
git clone https://github.com/google-deepmind/inli.git datasets/inli
Please download all_data.csv from the ImpScore repo into datasets/impscore.
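Before preprocessing, it can help to confirm that both datasets are where the directory tree above expects them. The following is a minimal sketch, not part of this repository: the function name missing_dataset_paths is hypothetical, and the two paths are taken from the directory structure shown above.

```python
from pathlib import Path

# Paths taken from the directory tree in this README, relative to the project root.
EXPECTED_PATHS = [
    "datasets/inli",
    "datasets/impscore/all_data.csv",
]


def missing_dataset_paths(project_root="."):
    """Return the expected dataset paths that do not exist under project_root.

    Hypothetical helper for illustration; it is not shipped with the repository.
    """
    root = Path(project_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths()
    if missing:
        print("Missing dataset paths:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected dataset paths are present.")
```

Run from the project root, this only reports which of the two locations are absent; it does not validate the contents of the files.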
To standardize the dataset format, please perform preprocessing on the downloaded data:
python -m src.prepare_data
To train the model, please execute the following command:
bash scripts/train.sh
After training, you can test the model with the following commands:
python -m src.test_rte --run_name RUN_NAME
python -m src.test_implicitness_scoring --run_name RUN_NAME
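If you evaluate several runs, the two commands above can be driven from a small script. This is a minimal sketch under stated assumptions: the helper names build_eval_commands and evaluate_run are hypothetical, and only the module names and the --run_name flag shown above are used.

```python
import subprocess
import sys

# Evaluation modules named in this README.
EVAL_MODULES = ["src.test_rte", "src.test_implicitness_scoring"]


def build_eval_commands(run_name):
    """Build one command line per evaluation module for the given run name."""
    return [
        [sys.executable, "-m", module, "--run_name", run_name]
        for module in EVAL_MODULES
    ]


def evaluate_run(run_name):
    """Run both evaluations in order; raises CalledProcessError if one fails."""
    for command in build_eval_commands(run_name):
        subprocess.run(command, check=True)
```

Calling evaluate_run("RUN_NAME") from the project root is intended to be equivalent to running the two commands by hand, with RUN_NAME replaced by the name of your training run.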
To run experiments with external LLM APIs, create a .env file in the root directory to store your API keys:
vi .env
Add the following keys:
OPENAI_API_KEY="<your OpenAI API key>"
DEEPSEEK_API_KEY="<your DeepSeek API key>"
GEMINI_API_KEY="<your Gemini API key>"
CLAUDE_API_KEY="<your Claude API key>"
MISTRAL_API_KEY="<your Mistral API key>"
To run the LLM baseline and evaluate results, use the following commands:
python llm_baseline.py --model_name gpt-4o --n_shot 0 # zero-shot
python llm_baseline.py --model_name gpt-4o --n_shot 8 # eight-shot
To test accuracy based on the .xlsx file:
python test_accuracy_llm.py --model_name gpt-4o --n_shot 0
python test_accuracy_llm.py --model_name gpt-4o --n_shot 8
You can replace --model_name with any supported LLM API name configured in your .env.
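To see which providers are actually configured before launching a baseline, you can inspect the .env file directly. The sketch below is an illustration only: it uses a small standard-library parser rather than whatever loader the repository's scripts use (which this README does not specify), and the function names read_env_file and configured_keys are hypothetical. The key names are the ones listed above.

```python
from pathlib import Path

# Key names copied from the list in this README.
API_KEY_NAMES = [
    "OPENAI_API_KEY",
    "DEEPSEEK_API_KEY",
    "GEMINI_API_KEY",
    "CLAUDE_API_KEY",
    "MISTRAL_API_KEY",
]


def read_env_file(path=".env"):
    """Parse simple KEY="value" lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY="" counts as empty.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_keys(path=".env"):
    """Return the README's key names that have a non-empty value in the file."""
    values = read_env_file(path)
    return [name for name in API_KEY_NAMES if values.get(name)]


if __name__ == "__main__":
    print("Configured keys:", configured_keys())
```

This only checks that a value is present; it does not verify that a key is valid for its provider.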
This project uses the INLI dataset released by Google DeepMind under the following license:
- Source: google-deepmind/inli
- License: CC BY-SA 4.0
The dataset is intended for research and evaluation purposes only.
The prompt.txt file included in this repository is a modified version of the original prompt published in the INLI dataset paper by Google DeepMind, and is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
All other code in this repository is licensed under the MIT License.