ESM-2 on SageMaker HyperPod #387

Open · wants to merge 8 commits into main
Conversation

awsankur
Contributor

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

awsankur added 4 commits July 3, 2024 18:04
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
@awsankur awsankur requested review from KeitaW and amanshanbhag July 25, 2024 06:32
@KeitaW
Collaborator

KeitaW commented Jul 25, 2024

Do we have any SMHP-specific features in this test case?
If not, we could organize the test case per scheduler:

23.esm
├── kubernetes
└── slurm

see also #381


| Model | device_batch_size | num_nodes | torch.compile | Instance | Throughput |
|:------:|:-----------------:|:---------:|:-------------:| :------------: | :------------: |
| ESM2 | 8 | 2 | No | g5.12xlarge | 160 samples/s |

The setup instructions advise using a 24xl instance, but a 12xl was actually used?

## What is ESM-2?
[ESM-2](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1) is a protein language model (pLM) trained with unsupervised masked language modeling on 250 million protein sequences by researchers at [Facebook AI Research (FAIR)](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1). It is available in several sizes, ranging from 8 million to 15 billion parameters. The smaller models are suitable for various sequence- and token-classification tasks. The FAIR team also adapted the 3-billion-parameter version into the ESMFold protein structure prediction algorithm, and has since used ESMFold to predict the structure of [more than 700 million metagenomic proteins](https://esmatlas.com/about).

ESM-2 is a powerful pLM. We will demonstrate how to fine-tune ESM-2 with QLoRA on g5.24xlarge instances, using it to predict [subcellular localization](https://academic.oup.com/nar/article/50/W1/W228/6576357?login=false). Knowing where a protein appears within the cell can help us understand its role in disease and identify new drug targets.
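
For illustration, here is a minimal sketch of a QLoRA setup for this kind of task, assuming the Hugging Face `transformers`, `peft`, and `bitsandbytes` stack. The checkpoint size, label count, and LoRA hyperparameters below are assumptions for the example, not values taken from this PR:

```python
# A rough QLoRA sketch: 4-bit base model + low-rank adapters.
# Checkpoint, num_labels, and hyperparameters are illustrative only.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)

model_id = "facebook/esm2_t33_650M_UR50D"  # assumption: any ESM-2 size works similarly

# Load the frozen base model in 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=10,  # hypothetical: one class per subcellular compartment
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# Attach small low-rank adapters to the attention projections; only these
# (plus the classification head) are trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "key", "value"],  # ESM attention module names
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

tokenizer = AutoTokenizer.from_pretrained(model_id)
```

From here, the standard `transformers` `Trainer` (or a hand-written loop) can fine-tune the adapters on a labeled localization dataset; the quantized base weights stay frozen throughout.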

Is this test case demonstrating pretraining or fine-tuning? I believe the latter, but the title states the former.

@perifaws
Contributor

@awsankur @KeitaW are we good on this?
