char-sim

This is the repository for the Trait2Vec model and the Character Similarity dataset. It contains the code used to train and evaluate Trait2Vec (testing and visualizing embeddings), along with a collection of scripts for building, evaluating, and visualizing the Character Similarity dataset created alongside it.

Paper | Model | Data

Trait2Vec is a Sentence Transformer model trained on a new character similarity dataset that encodes the similarity between textual trait descriptions from the Phenoscape knowledgebase.

Model

The Trait2Vec model is a Sentence Transformer pre-trained with the CoSENT objective. The dependencies are listed in train_environment.yaml.
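For readers unfamiliar with CoSENT, the following is a minimal pure-Python sketch of the ranking loss as it is commonly defined. It is not the code in train_model.py: the function name `cosent_loss` is hypothetical, and the `scale` of 20 is an assumed default rather than a value taken from this repository. The idea is that whenever one pair has a higher target similarity than another, its predicted cosine similarity should also be higher.

```python
import math


def cosent_loss(cos_scores, gold_scores, scale=20.0):
    """CoSENT-style ranking loss over a batch of text pairs.

    cos_scores:  predicted cosine similarity for each pair
    gold_scores: target similarity for each pair (e.g. an ontology-based score)
    scale:       assumed temperature-like factor, not taken from this repo
    """
    # The leading 0.0 contributes exp(0) = 1, i.e. the "1 +" inside the log.
    terms = [0.0]
    for i, gold_i in enumerate(gold_scores):
        for j, gold_j in enumerate(gold_scores):
            # Pair i should be ranked above pair j; penalize cos_j > cos_i.
            if gold_i > gold_j:
                terms.append(scale * (cos_scores[j] - cos_scores[i]))
    # Numerically stable log-sum-exp over all collected terms.
    largest = max(terms)
    return largest + math.log(sum(math.exp(t - largest) for t in terms))


if __name__ == "__main__":
    # Predictions that agree with the target ordering give a loss near zero.
    print(cosent_loss([0.9, 0.1], [1.0, 0.0]))
    # Predictions that invert the ordering are penalized heavily.
    print(cosent_loss([0.1, 0.9], [1.0, 0.0]))
```

Because the loss depends only on the ordering of the target scores, it suits targets such as ontology-based similarities, where relative ranking matters more than absolute values.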

To train a Trait2Vec model on the full dataset, please change directory to this repo and run the following (the dataset is streamed from Hugging Face if not already downloaded):

conda env create -f train_environment.yaml
conda activate snakemake_env
python train_model.py

The above will train a Trait2Vec model and save it to the outputs/full_data/model directory. We also provide a pre-trained model on Hugging Face: Trait2Vec.
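Once a model has been saved, it can presumably be loaded like any other Sentence Transformer. The sketch below assumes that outputs/full_data/model is a standard sentence-transformers model directory and that the sentence-transformers package is installed; the helper names `cosine` and `compare_traits` and the two example trait descriptions are hypothetical.

```python
import math
import os

MODEL_DIR = "outputs/full_data/model"  # output path of the training step above


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compare_traits(text_a, text_b, model_dir=MODEL_DIR):
    """Embed two trait descriptions and return their cosine similarity."""
    # Imported lazily so that `cosine` is usable without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_dir)
    emb_a, emb_b = model.encode([text_a, text_b])
    return cosine(emb_a.tolist(), emb_b.tolist())


if __name__ == "__main__":
    # Only runs if a trained model is present at the assumed location.
    if os.path.isdir(MODEL_DIR):
        # Hypothetical trait descriptions, for illustration only.
        print(compare_traits("dorsal fin present", "dorsal fin absent"))
```

To use the pre-trained model instead, pass its Hugging Face model identifier as `model_dir`.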

To train a taxon-specific Trait2Vec model, change the dataset parameter. The following command trains a Trait2Vec model on the characiformes dataset and saves it to the outputs/characiformes/model directory:

python train_model.py --dataset characiformes

Note: If the data was downloaded manually, use the data_path parameter to specify the file path.

python train_model.py --data_path <file_path>
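The relationship between these flags and the output location can be sketched as follows. This is an illustration only, not the actual argument handling in train_model.py: the default value `full_data` is an assumption inferred from the outputs/full_data/model path mentioned above, and `parse_args` and `output_dir` are hypothetical names.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two flags described above (hypothetical re-creation)."""
    parser = argparse.ArgumentParser(description="Sketch of the training CLI")
    # Assumed default: the full dataset, matching outputs/full_data/model.
    parser.add_argument("--dataset", default="full_data",
                        help="dataset name, e.g. characiformes")
    # Optional path to a manually downloaded copy of the data.
    parser.add_argument("--data_path", default=None,
                        help="local file path; if omitted, data is fetched")
    return parser.parse_args(argv)


def output_dir(dataset):
    """Directory in which the trained model is saved for a given dataset."""
    return Path("outputs") / dataset / "model"


if __name__ == "__main__":
    args = parse_args()
    print(f"dataset={args.dataset} data_path={args.data_path}")
    print(f"model will be saved to {output_dir(args.dataset)}")
```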

Data

Trait2Vec was trained on the Character Similarity dataset. The data is a collection of textual trait description pairs and the corresponding Jaccard, maxIC, and SimGIC ontology-based similarities. The ontological representations of the corresponding traits, which induce the similarities, are extracted from the Phenoscape knowledgebase. The pipeline that extracts the data from Phenoscape and processes it is defined in the Snakefile. We recommend downloading the data directly from the Hugging Face repo. Please see the Character Similarity dataset repo for more details on the data.
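To give a sense of what the three scores measure, here is a small sketch using their common textbook definitions over sets of ontology classes. It is an assumption that the dataset follows these definitions exactly (for instance, maxIC may be normalized there); the class names and information content (IC) values below are invented for illustration and do not come from Phenoscape.

```python
def jaccard(classes_a, classes_b):
    """Shared classes divided by all classes of either trait."""
    union = classes_a | classes_b
    return len(classes_a & classes_b) / len(union) if union else 0.0


def sim_gic(classes_a, classes_b, ic):
    """Jaccard weighted by the information content (IC) of each class."""
    total = sum(ic[c] for c in classes_a | classes_b)
    shared = sum(ic[c] for c in classes_a & classes_b)
    return shared / total if total else 0.0


def max_ic(classes_a, classes_b, ic):
    """IC of the most informative class shared by both traits (unnormalized)."""
    return max((ic[c] for c in classes_a & classes_b), default=0.0)


if __name__ == "__main__":
    # Hypothetical IC values: rarer, more specific classes score higher.
    ic = {"anatomical structure": 0.5, "fin": 2.0,
          "dorsal fin": 4.0, "pectoral fin": 4.0}
    # Each trait is represented by the set of classes that subsume it.
    trait_a = {"anatomical structure", "fin", "dorsal fin"}
    trait_b = {"anatomical structure", "fin", "pectoral fin"}
    print(jaccard(trait_a, trait_b))      # 2 shared of 4 total classes
    print(sim_gic(trait_a, trait_b, ic))  # 2.5 shared IC of 10.5 total IC
    print(max_ic(trait_a, trait_b, ic))   # "fin" is the most specific shared class
```

Jaccard treats every class equally, whereas SimGIC and maxIC give more weight to specific, informative classes.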

Paper, Website, and Docs

TBD

Citation

@software{Garcia_Character_Similarity_2025,
  author = {Garcia, Juan J. and Balhoff, James P. and Kar, Soumyashree and Lapp, Hilmar},
  month = nov,
  title = {{char-sim}},
  url = {https://github.com/Imageomics/char-sim},
  version = {1.0.0},
  year = {2025}
}
