Investigate mapping token embeddings from source to target #481

mshannon-sil · 2024-08-13T20:50:48Z

A recently published paper introduced a strategy called "trans-tokenization", which "focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language." We should investigate whether this approach could improve model performance when adding trained tokens to NLLB.
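
To make the quoted idea concrete, here is a minimal NumPy sketch of initializing each target token embedding as a weighted average of source token embeddings. It is only an illustration under stated assumptions, not the paper's implementation and not SIL-NLP or NLLB code: the function name `init_target_embeddings`, the `alignments` dictionary format, and the mean-embedding fallback for unaligned tokens are all hypothetical, and how the similarity weights are obtained is left open.

```python
import numpy as np


def init_target_embeddings(source_emb, alignments, num_target_tokens):
    """Initialize target embeddings as weighted averages of source embeddings.

    source_emb: array of shape (num_source_tokens, dim).
    alignments: hypothetical format {target_id: {source_id: weight}}.
    num_target_tokens: number of rows in the returned matrix.
    """
    # Assumption: target tokens with no aligned source tokens fall back to
    # the mean source embedding.
    fallback = source_emb.mean(axis=0)
    target_emb = np.tile(fallback, (num_target_tokens, 1))

    for tgt_id, weighted_sources in alignments.items():
        ids = np.array(list(weighted_sources.keys()))
        weights = np.array(list(weighted_sources.values()), dtype=source_emb.dtype)
        total = weights.sum()
        if total <= 0:
            continue  # keep the fallback when the weights are unusable
        # Normalize the weights so they sum to 1, then average the rows.
        target_emb[tgt_id] = (weights / total) @ source_emb[ids]

    return target_emb


# Toy example: 3 source tokens with 2-dimensional embeddings.
source_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Target token 0 maps mostly to source token 0; target token 1 maps to source
# token 2; target token 2 has no alignment and gets the fallback.
alignments = {0: {0: 3.0, 1: 1.0}, 1: {2: 1.0}}
target_emb = init_target_embeddings(source_emb, alignments, num_target_tokens=3)
print(target_emb)
```

In a real experiment the resulting rows would be copied into the model's embedding matrix for the newly added tokens before fine-tuning, and compared against whatever initialization is currently used.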

mshannon-sil added the research Research topics label Aug 13, 2024

mshannon-sil added this to SIL-NLP Research Aug 13, 2024

github-project-automation bot moved this to 🆕 New in SIL-NLP Research Aug 13, 2024

mshannon-sil moved this from 🆕 New to 📋 Backlog in SIL-NLP Research Oct 16, 2024

TaperChipmunk32 self-assigned this Nov 18, 2024

TaperChipmunk32 moved this from 📋 Backlog to 🏗 In progress in SIL-NLP Research Nov 18, 2024

