Support NeMo NeVA Model #343

Open
athitten opened this issue May 1, 2024 · 6 comments
Labels
enhancement (New feature or request), high priority, nemo (Issues needed to support NVIDIA NeMo models), neva, operators

Comments

athitten commented May 1, 2024

🚀 Feature

NeMo's NeVA (LLaVA) is a multimodal language model.

Initial thunder.examine report:
Found 49 distinct operations, of which 39 (79.6%) are supported
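
For reference, a rough sketch of how a report like the one above can be produced with thunder's examine utility. The actual script used for these numbers is not attached to this issue; the module and input below are placeholders, not NeVA:

# Sketch only: in practice the module and inputs would be the NeVA model and a
# real multimodal batch from the NeMo pretraining script.
import torch
from thunder.examine import examine

model = torch.nn.Sequential(          # placeholder standing in for NeVA
    torch.nn.Linear(5120, 13824),
    torch.nn.GELU(),
    torch.nn.Linear(13824, 5120),
)
x = torch.randn(2, 5120)              # placeholder input batch

# Reports the distinct operations the model uses and how many thunder supports,
# printing a summary like the "Found N distinct operations, ..." line above.
examine(model, x)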

Work items

Running the model

Required data

First download the freely available data and place it in a data/ directory (the run command below expects it under ./data/multimodal/tiny-neva/).

NeMo installation

Dependencies
python3 -m pip install --no-deps \
  huggingface-hub==0.23.2
NeMo branch

To keep the whole thunder team on the same NeMo revision, and to avoid a long list of "modify this file to call thunder.jit()" instructions, we temporarily maintain our own NeMo branch for thunder. You can grab it by cloning https://github.com/tfogal/NeMo.git. Make sure you have checked out the tfogal/thunder-nemo branch.

To install NeMo, run python3 -m pip install -e . from the root of the checked-out directory.
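
Putting the branch and install steps together, the setup looks roughly like this:

# Clone the thunder-specific NeMo branch and install it in editable mode.
git clone https://github.com/tfogal/NeMo.git
cd NeMo
git checkout tfogal/thunder-nemo
python3 -m pip install -e .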

Running the network

rm -fr foo-neva-train; mkdir -p foo-neva-train
HYDRA_FULL_ERROR=1 \
THUNDER_ANNOTATE_TRACES=1 \
NEMO_THUNDER_NEVA=thunder \
python3 ./examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
    trainer.precision=bf16-mixed \
    model.megatron_amp_O2=True \
    model.mcore_gpt=False \
    trainer.num_nodes=1 \
    trainer.devices=1 \
    trainer.val_check_interval=10 \
    trainer.limit_val_batches=5 \
    trainer.log_every_n_steps=1 \
    ++exp_manager.max_time_per_run=00:00:03:00 \
    trainer.max_steps=20 \
    model.micro_batch_size=2 \
    model.global_batch_size=4 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    exp_manager.create_checkpoint_callback=False \
    model.data.data_path=./data/multimodal/tiny-neva/dummy.json \
    model.data.image_folder=./data/multimodal/tiny-neva/images \
    model.tokenizer.library=sentencepiece \
    model.tokenizer.model=./data/multimodal/tiny-neva/tokenizer_add_special.model \
    model.num_layers=2 \
    model.hidden_size=5120 \
    model.ffn_hidden_size=13824 \
    model.num_attention_heads=40 \
    model.normalization=rmsnorm \
    model.data.num_workers=0 \
    model.data.conv_template=llama_2 \
    model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14 \
    model.mm_cfg.llm.from_pretrained=null \
    model.use_flash_attention=false \
    exp_manager.exp_dir=./foo-neva-train

Note that the latest version of the tfogal/thunder-nemo branch allows running with dynamo+thunder by setting NEMO_THUNDER_NEVA=dynamo.
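
For example, only the environment variable changes; every other argument stays as in the command above:

# dynamo+thunder instead of invoking thunder directly:
NEMO_THUNDER_NEVA=dynamo \
python3 ./examples/multimodal/multimodal_llm/neva/neva_pretrain.py ...  # remaining arguments as above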

cc @apaz-cli @tfogal

athitten added the enhancement (New feature or request) label on May 1, 2024
tfogal added the nemo (Issues needed to support NVIDIA NeMo models) label on May 1, 2024
tfogal changed the title from "Support NeMo NeVa Model" to "Support NeMo NeVA Model" on Jun 12, 2024
IvanYashchuk (Collaborator) commented:

Can you share the script for the examine call?

tfogal (Collaborator) commented Jul 10, 2024

Can you share the script for the examine call?

@athitten when you have a minute

athitten (Author) commented Aug 6, 2024

Adding the updated command, which uses megatron_amp_O2=True and model.mcore_gpt=True (NeMo models will default to using models from Megatron, hence this setting). With megatron_amp_O2=True, precision=bf16 alone should already do mixed-precision training with the main copy of weights in FP32, but just to be safe the command also specifies precision=bf16-mixed.

python3 ./examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
    trainer.precision=bf16-mixed \
    model.megatron_amp_O2=True \
    model.mcore_gpt=True \
    trainer.num_nodes=1 \
    trainer.devices=1 \
    trainer.val_check_interval=10 \
    trainer.limit_val_batches=5 \
    trainer.log_every_n_steps=1 \
    ++exp_manager.max_time_per_run=00:00:03:00 \
    trainer.max_steps=20 \
    model.micro_batch_size=2 \
    model.global_batch_size=4 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    exp_manager.create_checkpoint_callback=False \
    model.data.data_path=./data/multimodal/tiny-neva/dummy.json \
    model.data.image_folder=./data/multimodal/tiny-neva/images \
    model.tokenizer.library=sentencepiece \
    model.tokenizer.model=./data/multimodal/tiny-neva/tokenizer_add_special.model \
    model.num_layers=2 \
    model.hidden_size=5120 \
    model.ffn_hidden_size=13824 \
    model.num_attention_heads=40 \
    model.normalization=rmsnorm \
    model.data.num_workers=0 \
    model.data.conv_template=llama_2 \
    model.mm_cfg.vision_encoder.from_pretrained=openai/clip-vit-large-patch14 \
    model.mm_cfg.llm.from_pretrained=null \
    model.use_flash_attention=false \
    exp_manager.exp_dir=./foo-neva-train

athitten (Author) commented Aug 6, 2024

This might be helpful: the full config, with default values for all parameters, can be found here. Only the parameters we specify in the run command are overwritten with the specified values; all others default to the values given in the config.

tfogal (Collaborator) commented Aug 9, 2024

Adding the updated command

Thanks, @athitten!
I have edited the original issue to mostly reflect the updated command. Unfortunately #753 blocks setting model.mcore_gpt=True, so for now that one's still False... but let's prioritize that one!

athitten (Author) commented Aug 9, 2024

Yes, it's important to prioritize getting thunder working with mcore_gpt=True, as it will be the default for NeMo models once we deprecate the legacy path.
