Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models
Codebase for the paper
"Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models"
Hongchuan Zeng, Senyu Han, Lu Chen†, Kai Yu†
(†corresponding authors)
[arXiv:2410.11718v2](https://arxiv.org/abs/2410.11718v2) · [2025.coling-main.707](https://aclanthology.org/2025.coling-main.707/)
This project explores the emergence of language-agnostic semantic spaces—a “Lingua Franca”—within Multilingual Large Language Models (MLLMs). We:
- Identify key linguistic regions, i.e., the neurons crucial to each language's capability.
- Track the evolution of language-specific and semantic activations across layers.
- Introduce two core metrics:
  - LRDS (Linguistic Region Development Score)
  - SADS (Semantic Alignment Development Score)
- Evaluate robustness of MLLMs through neuron-level probing, ablation, and PPL/task-based evaluation.
We validate our findings with models such as BLOOM and LLaMA-2 on datasets such as the Bible, FLORES, and XLSum.
- 🔍 Hook-based activation capture at the neuron level (see the sketch after this list).
- 🧠 Automatic detection of functional language-specific key neurons.
- 🧪 Evaluation of semantic alignment via cosine similarity.
- 🔥 Ablation analysis on neuron sets and their impact on downstream tasks and perplexity.
- 📊 Visualization of cross-lingual similarities and neuron contributions.
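A minimal sketch of the hook-based capture mentioned above, assuming a small BLOOM checkpoint from `transformers`; the module path and variable names are illustrative, not the repo's exact hook code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom-560m"  # small BLOOM variant, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

captured = {}  # layer index -> MLP activations from the last forward pass

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Store per-neuron activations (batch, seq_len, hidden) on CPU.
        captured[layer_idx] = output.detach().cpu()
    return hook

# Register one forward hook on each transformer block's MLP.
handles = [
    block.mlp.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.transformer.h)
]

with torch.no_grad():
    batch = tokenizer("Ceci est une phrase française.", return_tensors="pt")
    model(**batch)

for h in handles:
    h.remove()

print({i: a.shape for i, a in captured.items()})
```

The same pattern applies to the 7B checkpoint; only the model name changes.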
```bash
pip install torch transformers datasets matplotlib seaborn scikit-learn tqdm
```
```bash
python run.py \
  --sample_num 100 \
  --dataset_name "bible" \
  --model "bigscience/bloom-7b1" \
  --deactivate 1 \
  --evaluate_ppl 1 \
  --evaluate_tasks 1
```
- `--sample_num`: Number of examples per language.
- `--dataset_name`: Dataset name (`bible` or `flores`).
- `--model`: HF model name or local checkpoint path.
- `--deactivate`: Whether to deactivate key neurons for ablation testing.
- `--evaluate_ppl`: Evaluate PPL on XLSum with ablation.
- `--evaluate_tasks`: Run zero-shot evaluation (e.g., XStoryCloze).
- `--revision`: Optional model revision tag.
| Metric | Description |
|---|---|
| LRDS | Measures average pairwise similarity of hidden states grouped by language. |
| SADS | Measures average similarity of translations of the same meaning across languages. |
| Z-Score Neuron Ranking | Identifies language-specific neurons contributing most to language information. |
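The exact score definitions are in the paper; the toy sketch below only illustrates the idea behind the two similarity-based metrics, using randomly generated hidden states (`hidden[lang]` holds one sentence representation per row, with rows aligned across languages so row *i* carries the same meaning everywhere; the function names are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy parallel hidden states: the same 50 sentences in 3 languages, 1024-dim each.
rng = np.random.default_rng(0)
hidden = {lang: rng.normal(size=(50, 1024)) for lang in ["en", "fr", "zh"]}

def lrds_like(hidden):
    """Average pairwise cosine similarity of hidden states within each language."""
    scores = []
    for states in hidden.values():
        sim = cosine_similarity(states)
        n = len(states)
        # Mean over off-diagonal pairs only (the diagonal is trivially 1).
        scores.append((sim.sum() - n) / (n * (n - 1)))
    return float(np.mean(scores))

def sads_like(hidden):
    """Average cosine similarity of translations of the same sentence across languages."""
    langs = list(hidden)
    scores = []
    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            # Row k of each language encodes the same meaning.
            pair_sim = np.diag(cosine_similarity(hidden[a], hidden[b]))
            scores.append(pair_sim.mean())
    return float(np.mean(scores))

print("LRDS-like:", lrds_like(hidden))
print("SADS-like:", sads_like(hidden))
```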
- BLOOM-7B1 (`bigscience/bloom-7b1`)
- LLaMA-2 (HF-compatible paths)
- Custom LLaMA/Baichuan variants
You can add support for other models by modifying the hook registration logic in `generate_hidden_states_*` and the hook classes.
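For instance, LLaMA-style checkpoints expose their blocks under `model.model.layers` rather than `transformer.h`, so an architecture-aware registration could look roughly like the sketch below (the helper name and chosen sub-modules are assumptions, not the repo's exact code):

```python
def register_mlp_hooks(model, hook_factory):
    """Attach one forward hook per block MLP, handling BLOOM- and LLaMA-style layouts."""
    if hasattr(model, "transformer"):   # BLOOM: model.transformer.h[i].mlp
        blocks = model.transformer.h
    elif hasattr(model, "model"):       # LLaMA/Baichuan: model.model.layers[i].mlp
        blocks = model.model.layers
    else:
        raise ValueError(f"Unsupported architecture: {type(model).__name__}")
    return [block.mlp.register_forward_hook(hook_factory(i))
            for i, block in enumerate(blocks)]
```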
- Cosine similarity heatmaps (`.png`)
- Key neuron scores per layer/language (`.csv`)
- Perplexity logs per language (`.csv`)
- LM evaluation logs (`.json`)
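The heatmaps are ordinary similarity matrices, so they can be re-rendered from the saved values with seaborn; the snippet below uses an illustrative random matrix and output file name rather than actual run outputs:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

langs = ["en", "fr", "zh", "ar", "hi"]
# Illustrative symmetric similarity matrix; in practice, load the values from a run.
sim = np.clip(np.random.default_rng(1).normal(0.6, 0.15, (5, 5)), 0, 1)
sim = (sim + sim.T) / 2
np.fill_diagonal(sim, 1.0)

ax = sns.heatmap(sim, xticklabels=langs, yticklabels=langs,
                 vmin=0, vmax=1, annot=True, cmap="viridis")
ax.set_title("Cross-lingual cosine similarity (illustrative)")
plt.tight_layout()
plt.savefig("similarity_heatmap.png", dpi=200)
```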
If you find this project useful, please cite:
```bibtex
@inproceedings{zeng-etal-2025-converging,
    title = "Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models",
    author = "Zeng, Hongchuan  and
      Han, Senyu  and
      Chen, Lu  and
      Yu, Kai",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.707/",
    pages = "10602--10617",
}
```
For questions, contact: charlie68@sjtu.edu.cn, chenlusz@sjtu.edu.cn, kai.yu@sjtu.edu.cn