char-sim

This repository hosts the Trait2Vec model and the Character Similarity dataset. It contains the code used to train and evaluate Trait2Vec (testing and visualizing embeddings), along with a collection of scripts for building, evaluating, and visualizing the character similarity data created alongside it.

Trait2Vec is a Sentence Transformer model trained on a new character similarity dataset that encodes the similarity between textual trait descriptions of the Phenoscape knowledgebase.

Table of Contents

  1. Model
  2. Data
  3. Paper, Website, and Docs (TBD)
  4. Citation

Model

The Trait2Vec model is a Sentence Transformer pre-trained with the CoSENT objective. Its dependencies are listed in train_environment.yaml.
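
For readers unfamiliar with the CoSENT objective, the sketch below shows the ranking idea behind it in plain Python: whenever one pair of descriptions has a higher target similarity than another, the loss grows if the cosine similarity of their embeddings does not preserve that order. This is a minimal illustration of the published formulation, not the code in train_model.py. The toy vectors, the labels, and the scale of 20 are assumptions made for the example.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosent_loss(pairs, labels, scale=20.0):
    """CoSENT-style ranking loss over a batch of embedding pairs.

    pairs  : list of (u, v) embedding pairs
    labels : target similarity for each pair (e.g. an ontology-based score)
    scale  : multiplier applied to cosine differences (assumed value)
    """
    cos = [cosine(u, v) for u, v in pairs]
    total = 0.0
    for i, j in permutations(range(len(pairs)), 2):
        # Pair i should be more similar than pair j; the term is large when
        # the predicted cosine of pair j is not clearly below that of pair i.
        if labels[i] > labels[j]:
            total += math.exp(scale * (cos[j] - cos[i]))
    return math.log1p(total)


# Toy batch: the first pair is nearly parallel, the second is orthogonal.
toy_pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.0, 1.0])]
loss_consistent = cosent_loss(toy_pairs, [0.9, 0.1])  # labels agree with cosines
loss_reversed = cosent_loss(toy_pairs, [0.1, 0.9])    # labels contradict cosines
print(loss_consistent, loss_reversed)
```

The loss is close to zero when the embedding similarities are ordered like the labels and large when the order is reversed, which is why only the relative ranking of the target similarities matters.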

To train a Trait2Vec model on the full dataset, please change directory to this repo and run the following (the dataset is streamed from Hugging Face if not already downloaded):

conda env create -f train_environment.yaml
conda activate snakemake_env
python train_model.py

The commands above train a Trait2Vec model and save it to the outputs/full_data/model directory. A pre-trained model is also available on Hugging Face: Trait2Vec.
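
Once a model has been saved, it can be used to embed trait descriptions and compare them. The sketch below assumes that the output directory is in the standard Sentence Transformers format, so that it can be loaded with SentenceTransformer(path), and that cosine similarity is the intended comparison. The helper names and the example descriptions in the comments are hypothetical.

```python
import math


def load_encoder(model_dir="outputs/full_data/model"):
    """Load a saved model (requires the sentence-transformers package).

    Assumes the directory was written in the standard Sentence Transformers
    format; the import is done lazily so the helpers below work without it.
    """
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_dir)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices ordered from most to least similar."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Intended use with a trained model (hypothetical descriptions):
#   model = load_encoder()
#   vecs = model.encode(["dorsal fin present", "dorsal fin absent", "teeth conical"])
#   order = rank_by_similarity(vecs[0], vecs[1:])

# The ranking helper can be checked with toy vectors, without a model.
toy_order = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
print(toy_order)
```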

To train a taxon-specific Trait2Vec model, change the dataset parameter. The following command trains a Trait2Vec model on the characiformes dataset and saves it to the outputs/characiformes/model directory:

python train_model.py --dataset characiformes

Note: If the data was downloaded manually, use the data_path parameter to specify the file path:

python train_model.py --data_path <file_path>

Data

Trait2Vec was trained on the Character Similarity dataset. The data is a collection of pairs of textual trait descriptions with their corresponding Jaccard, maxIC, and SimGIC ontology-based similarities. The ontological representations of the traits, which induce the similarities, are extracted from the Phenoscape knowledgebase. The pipeline that extracts the data from Phenoscape and processes it is defined in the Snakefile. We recommend downloading the data directly from the Hugging Face repo. Please see the Character Similarity dataset repo for more details on the data.
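
To make the three similarity scores concrete, the sketch below computes them for a toy pair of traits, each represented as the set of ontology classes that subsume it. It uses the common textbook definitions of these measures; the exact definitions used by the pipeline may differ, so see the dataset repo for the authoritative ones. The class names and information content (IC) values are invented for illustration.

```python
def jaccard(a, b):
    """Shared classes divided by all classes of the two traits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_gic(a, b, ic):
    """Like Jaccard, but each class is weighted by its information content."""
    union_ic = sum(ic[t] for t in a | b)
    return sum(ic[t] for t in a & b) / union_ic if union_ic else 0.0


def max_ic(a, b, ic):
    """Information content of the most informative shared class."""
    return max((ic[t] for t in a & b), default=0.0)


# Hypothetical IC values: rarer (more specific) classes carry more information.
toy_ic = {"shape": 0.5, "fin": 1.0, "dorsal fin": 2.0, "pectoral fin": 2.0}

# Hypothetical subsumer sets for two trait descriptions.
trait_a = {"shape", "fin", "dorsal fin"}
trait_b = {"shape", "fin", "pectoral fin"}

print(jaccard(trait_a, trait_b))
print(sim_gic(trait_a, trait_b, toy_ic))
print(max_ic(trait_a, trait_b, toy_ic))
```

Jaccard treats every shared class equally, SimGIC rewards sharing specific classes more than generic ones, and maxIC looks only at the single most specific class the two traits have in common.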

Paper, Website, and Docs

TBD

Citation

@software{Garcia_Character_Similarity_2025,
  author = {Garcia, Juan J. and Balhoff, James P. and Kar, Soumyashree and Lapp, Hilmar},
  month = nov,
  title = {{char-sim}},
  url = {https://github.com/Imageomics/char-sim},
  version = {1.0.0},
  year = {2025}
}

