
MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science

This repository contains the code for the paper MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science, accepted to Findings of EMNLP 2024. In this project, we expand material-aware entities and use them to continue pre-training PLMs for the materials science domain.

Requirements

  • Python 3
  • Transformers 4.6.1
  • NumPy
  • PyTorch
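
A minimal environment setup might look like the following sketch; apart from Transformers 4.6.1, this README does not pin exact versions, so the package list below is an assumption.

    # install the dependencies listed above (versions other than Transformers are not pinned here)
    pip install transformers==4.6.1 numpy torch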

Pre-processing

Prepare the pre-training corpora (e.g., scientific papers) in the raw_data folder. A sampled pre-training corpus (train_sampled.txt) is provided in the raw_data folder.

  1. Run bash scripts/preprocess.sh to normalize the raw sentences and split them to the maximum length (example invocations are sketched after this list).
  • --train_file: A directory containing raw text examples.
  • --output_train_norm_file: A directory containing pre-processed examples.
  2. Run bash scripts/find_entities.sh to extract the positions of material-aware entities in the pre-processed sentences.
  • --preprocessed_data_path: A directory containing pre-processed examples.
  • --entity_path: A directory containing material-aware entities, which are expanded by ChemDataExtractor and Mat2Vec.
  • --output_folder_path: A directory containing the output datasets.
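
A sketch of the two pre-processing calls is shown below. Only the flag names come from this README; the output paths (data/norm, data/pretrain) are hypothetical, and the sketch assumes the scripts accept these flags on the command line rather than requiring them to be edited inside the scripts.

    # 1. normalize and split the raw corpus (paths are illustrative assumptions)
    bash scripts/preprocess.sh \
        --train_file raw_data/train_sampled.txt \
        --output_train_norm_file data/norm

    # 2. locate material-aware entities in the pre-processed sentences
    bash scripts/find_entities.sh \
        --preprocessed_data_path data/norm \
        --entity_path entities/ \
        --output_folder_path data/pretrain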

Pre-training

To continue pre-training PLMs, run bash scripts/pretrain.sh with the options below (an example invocation is sketched after the list).

  • --masking_strategy: Set the masking strategy. Choose one of: random, material, curriculum.
  • --lr: Set the learning rate.
  • --batch_size: Set the overall batch size processed at once.
  • --step_batch_size: Set the batch size used for each update step (if GPU memory is sufficient, set batch_size and step_batch_size to the same value).
  • --data_path: A directory containing pre-processed examples.
  • --masking_ratio: Set the masking ratio for Material-aware Entity Masking.
  • --curriculum_num: Set the number of curricula for Curriculum-based Entity Learning.
  • --model_save_path: Set the directory for saving the pre-trained models.
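
A sketch of a pre-training run with the options above follows. The numeric values and paths are illustrative assumptions, not the paper's settings, and the sketch assumes pretrain.sh forwards these flags to the training code.

    # continued pre-training with material-aware entity masking (values are assumptions)
    bash scripts/pretrain.sh \
        --masking_strategy curriculum \
        --lr 1e-4 \
        --batch_size 256 \
        --step_batch_size 32 \
        --data_path data/pretrain \
        --masking_ratio 0.15 \
        --curriculum_num 4 \
        --model_save_path checkpoints/melt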

Fine-tuning

Run the following scripts with the pre-trained weights, passed via the --load_weight argument (a sketch follows the list):

  1. MatSciNLP: bash scripts/run_matscinlp.sh

  2. NER (SOFC-NER, SOFC-Filling, MatScholar): bash scripts/run_ner.sh

  3. Classification (Glass Science): bash scripts/run_cls.sh
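
For example, a NER run might look like the sketch below, assuming the checkpoint path from the pre-training sketch above and that run_ner.sh accepts --load_weight on the command line.

    # fine-tune on a NER benchmark from the continued pre-trained checkpoint (path is an assumption)
    bash scripts/run_ner.sh \
        --load_weight checkpoints/melt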

Contact Info

For help or issues using MELT, please submit a GitHub issue.

For personal communication related to MELT, please contact Junho Kim <monocrat@korea.ac.kr>.
