DualCSE

This is the repository for the paper "One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations".

🌲 Directory Structure

The directory structure should look like this:

DualCSE
├── figures/            # Paper figures
│   └── *.png
├── scripts/            # Shell scripts
│   └── *.sh
├── src/                # Our main implementation
│   └── *.py
├── prompt/            
│   └── prompt.txt
├── datasets/
│   ├── inli/           # Cloned INLI repo
│   │   ├── INLI Data
│   │   ├── Resources
│   │   └── ...
│   └── impscore/       # Downloaded from ImpScore repo
│       └── all_data.csv
└── ...

📦 Setup

1. Clone the INLI Dataset

Before running the code, please clone the INLI dataset into the project root:

git clone https://github.com/google-deepmind/inli.git datasets/inli

2. Download Wang's ImpScore Dataset

Please download all_data.csv from the ImpScore repo into datasets/impscore.
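After downloading, it can be useful to sanity-check the file before preprocessing. A minimal stdlib sketch (the path follows the directory tree above; the column layout of all_data.csv is not assumed):

```python
import csv

def count_rows(path="datasets/impscore/all_data.csv"):
    """Return (number of columns, number of data rows) of a CSV file.

    The first row is assumed to be a header, as is conventional for
    CSV datasets; this is an illustration, not part of the project code.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        return len(header), sum(1 for _ in reader)
```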

3. Preprocess

To standardize the dataset format, run the preprocessing script on the downloaded data:

python -m src.prepare_data

🏋️ Model Training & Testing

To train the model, please execute the following command:

bash scripts/train.sh

After training, you can test the model with the following commands:

python -m src.test_rte --run_name RUN_NAME
python -m src.test_implicitness_scoring --run_name RUN_NAME

🔐 API Keys

To run experiments with external LLM APIs, create a .env file in the root directory to store your API keys:

vi .env

Add the following keys:

OPENAI_API_KEY="<your OpenAI API key>"
DEEPSEEK_API_KEY="<your DeepSeek API key>"
GEMINI_API_KEY="<your Gemini API key>"
CLAUDE_API_KEY="<your Claude API key>"
MISTRAL_API_KEY="<your Mistral API key>"
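These keys need to be read into the environment at runtime. A minimal stdlib sketch of parsing the KEY="value" lines above (the project itself may rely on a library such as python-dotenv instead; this loader is illustrative):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY="value" lines from a .env file into os.environ.

    Blank lines, comments, and lines without '=' are skipped.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

# Hypothetical usage after load_env() has run:
# api_key = os.getenv("OPENAI_API_KEY")
```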

🚀 Running LLM Baselines & Evaluation

To run the LLM baseline and evaluate results, use the following commands:

python llm_baseline.py --model_name gpt-4o --n_shot 0 # zero-shot
python llm_baseline.py --model_name gpt-4o --n_shot 8 # eight-shot

To compute accuracy from the generated .xlsx file:

python test_accuracy_llm.py --model_name gpt-4o --n_shot 0
python test_accuracy_llm.py --model_name gpt-4o --n_shot 8

You can replace --model_name with any supported model whose API key is configured in your .env.
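Under the hood, accuracy scoring reduces to comparing predicted labels against gold labels. A self-contained sketch of that comparison (the actual script reads predictions from the .xlsx file; the label names below are illustrative, not necessarily the dataset's label set):

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must be the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Illustrative example with made-up labels:
preds = ["entailment", "not_entailment", "entailment", "entailment"]
gold = ["entailment", "not_entailment", "not_entailment", "entailment"]
print(accuracy(preds, gold))  # 0.75
```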


📚 Dataset Attribution

This project uses the INLI dataset released by Google DeepMind under the following license:

The dataset is intended for research and evaluation purposes only.

The prompt.txt file included in this repository is a modified version of the original prompt published with the INLI dataset by Google DeepMind, and is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

All other code in this repository is licensed under the MIT License.
