
Multilingual Vec2Text + Ad-hoc Translation + Masking Defense Mechanism

Text Embedding Inversion Security for Multilingual Language Models

  • Schematic Overview of a Text Embedding Inversion Attack.

Multilingual Vec2Text supports research on text embedding inversion security in language models, extending Jack Morris' Vec2Text with an ad-hoc translation and a masking defense mechanism. We thoroughly investigate multilingual and cross-lingual text inversion attacks as well as defense mechanisms. This repository contains the code for the ACL 2024 long paper Text Embedding Inversion Security for Multilingual Language Models. The poster is online.

All trained inversion models are available on Hugging Face. All models are trained with T5-base as the external encoder-decoder.

| Black-box Encoder | Training Data | Base Model | Corrector Model |
| --- | --- | --- | --- |
| GTR-base | 5M Natural Questions | yiyic/t5_gtr_base_nq_32_inverter | yiyic/t5_gtr_base_nq_32_corrector |
| ME5-base | 5M Natural Questions | yiyic/t5_me5_base_nq_32_inverter | yiyic/t5_me5_base_nq_32_corrector |
| ME5-base | 5M MTG Spanish | yiyic/t5_me5_base_mtg_es_5m_32_inverter | yiyic/t5_me5_base_mtg_es_5m_32_corrector |
| ME5-base | 5M MTG French | yiyic/t5_me5_base_mtg_fr_5m_32_inverter | yiyic/t5_me5_base_mtg_fr_5m_32_corrector |
| ME5-base | 5M MTG German | yiyic/t5_me5_base_mtg_de_5m_32_inverter | yiyic/t5_me5_base_mtg_de_5m_32_corrector |
| ME5-base | 5M MTG English | yiyic/t5_me5_base_mtg_en_5m_32_inverter | yiyic/t5_me5_base_mtg_en_5m_32_corrector |
| ME5-base | 5M MTG Multilingual | yiyic/t5_me5_base_mtg_en_fr_de_es_5m_32_inverter | yiyic/t5_me5_base_mtg_en_fr_de_es_5m_32_corrector |
  • Overview of Multilingual Vec2Text.
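
These checkpoints are the inverter/corrector pairs; the black-box encoder they attack is queried separately. Below is a minimal sketch of that encoder side, assuming ME5-base refers to the public intfloat/multilingual-e5-base checkpoint; the "query: " prefix and mean pooling follow standard E5 usage and are not taken from this repository's code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# assumed public checkpoint behind the "ME5-base" rows above
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
encoder = AutoModel.from_pretrained("intfloat/multilingual-e5-base")
encoder.eval()

def embed(texts, max_length=32):
    # E5 models expect a "query: " prefix on inputs
    batch = tokenizer(
        ["query: " + t for t in texts],
        padding=True, truncation=True, max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    # mean-pool over non-padding tokens, then L2-normalize
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, p=2, dim=1)

print(embed(["in einer stunde erreichen wir kopenhagen."]).shape)  # torch.Size([1, 768])
```

These 768-dimensional embeddings are what an attacker observes; the inverter generates an initial hypothesis from them, and the corrector refines it over recursive steps.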

Tutorials for setting up experiments on supercomputer nodes such as LUMI will be provided in the Wiki pages, and all scripts for running experiments will be provided in this repository, which is still under construction.

Experiments (Inversion attack simulations)

Setup

  1. Download the release from releases and unzip it.
  2. Install the requirements: `pip install -r requirements.txt`
  3. Download the `punkt` package from NLTK:

```python
import nltk
nltk.download("punkt")
```

Text Embedding Examples

Usage in an interactive environment on a server:

```python
from eval_samples import *  # provides analyze_utils, trainer_attributes, evaluate_samples

model_path = "yiyic/t5_me5_base_mtg_en_fr_de_es_5m_32_corrector"

samples = [
    "jack morris is a phd student at cornell tech in new york city",
    "it was the best of times, it was the worst of times, it was the age of wisdom",
    "in einer stunde erreichen wir kopenhagen.",
    "comment puis-je vous aider?",
]

# load the trained corrector experiment and its trainer
experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    model_path, use_less_data=3000
)

# set trainer attributes and get the computation device
trainer, device = trainer_attributes(trainer, experiment)
trainer.num_gen_recursive_steps = 10  # number of recursive correction steps
# optionally set the sequence beam width
# trainer.sequence_beam_width = xx

evaluate_samples(trainer, device, samples)
```

Output:

```
[pred] jack morris is a phd student at cornell tech in new york city
[true] jack morris is a phd student at cornell tech in new york city

[pred] it was the best of times, it was the worst of times, it was the age of wisdom
[true] it was the best of times, it was the worst of times, it was the age of wisdom

[pred] in einer stunde erreichen wir kopenhagen.
[true] in einer stunde erreichen wir kopenhagen.

[pred] comment puis-je vous aider?
[true] comment puis-je vous aider?
```

Ad-Hoc Translation (AdTrans)

The code for the AdTrans evaluation is in adTrans.

  • Translate the inverted text $\hat{x}$ from the training language to the target language and evaluate it; a conceptual sketch of this step follows the commands below.

```bash
python adTrans/translate_test_results.py $results_output_directory$
python adTrans/eval.py $results_output_directory$
python adTrans/eval_sum_up.py $results_output_directory$
```
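
As a conceptual illustration of this step (not the adTrans scripts themselves), the following sketch translates an inverted text $\hat{x}$ from the training language back into the target language and scores it against the gold text; the MarianMT checkpoint and BLEU metric are assumptions for the example.

```python
# Conceptual AdTrans sketch: translate the inverted text from the
# training language (here English) into the target language (here
# German), then compare it against the gold target-language text.
# Helsinki-NLP/opus-mt-en-de and sentence-level BLEU are illustrative
# choices, not the repository's evaluation pipeline.
from transformers import pipeline
from nltk.translate.bleu_score import sentence_bleu

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

x_hat_en = "we will reach copenhagen in an hour."      # inversion output
gold_de = "in einer stunde erreichen wir kopenhagen."  # target-language reference

x_hat_de = translator(x_hat_en)[0]["translation_text"].lower()
print(x_hat_de)
print("BLEU:", sentence_bleu([gold_de.split()], x_hat_de.split()))
```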

Inversion Model Limitations

  • To analyze the impact of multilingual parallel training data, we used the MTG benchmark in English, French, German, and Spanish. The texts in these datasets are provided lower-cased, so our trained ME5-based inversion models also work best on lower-cased text. In addition, the best-performing models invert sentences of at most 32 tokens. We will address these limitations in future work; a preprocessing sketch reflecting these constraints follows below.
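
A minimal sketch of such preprocessing, lower-casing the input and truncating it to 32 tokens before inversion; the t5-base tokenizer is an assumption for this example.

```python
from transformers import AutoTokenizer

# illustrative tokenizer choice; the repository's models use T5-base
# as the external encoder-decoder
tokenizer = AutoTokenizer.from_pretrained("t5-base")

def preprocess(text, max_tokens=32):
    # lower-case, then keep at most max_tokens tokens
    ids = tokenizer(text.lower(), truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

print(preprocess("In einer Stunde erreichen wir Kopenhagen."))
# -> in einer stunde erreichen wir kopenhagen.
```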

Cite our Paper

```bibtex
@inproceedings{chen-etal-2024-text,
    title = "Text Embedding Inversion Security for Multilingual Language Models",
    author = "Chen, Yiyi  and
      Lent, Heather  and
      Bjerva, Johannes",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.422",
    pages = "7808--7827",
    abstract = "Textual data is often represented as real-numbered embeddings in NLP, particularly with the popularity of large language models (LLMs) and Embeddings as a Service (EaaS). However, storing sensitive information as embeddings can be susceptible to security breaches, as research shows that text can be reconstructed from embeddings, even without knowledge of the underlying model. While defence mechanisms have been explored, these are exclusively focused on English, leaving other languages potentially exposed to attacks. This work explores LLM security through multilingual embedding inversion. We define the problem of black-box multilingual and crosslingual inversion attacks, and explore their potential implications. Our findings suggest that multilingual LLMs may be more vulnerable to inversion attacks, in part because English-based defences may be ineffective. To alleviate this, we propose a simple masking defense effective for both monolingual and multilingual models. This study is the first to investigate multilingual inversion attacks, shedding light on the differences in attacks and defenses across monolingual and multilingual settings.",
}
```
