* feat(spellcheck): ⚡ Add feature: push to Argilla from an HF dataset
* fix(spellcheck): 🐛 Add fixes to T5 script
* fix(spellcheck): 🐛 Add guardrail to prevent computing metrics with empty strings
* feat(spellcheck): 🎨 Training pipeline using Metaflow
* feat(spellcheck): 🎨 LLM QLoRA TRL training script: Mistral-7B-Instruct
* perf(spellcheck): 🧪 Normalize evaluation algorithm: "flavour" -> "flavor", "ï" -> "i", "â" -> "a", "oe"
* feat(spellcheck): 🎨 Implement LLM training with Sagemaker & Metaflow. Scripts are customized to handle training in the cloud using Sagemaker Training Jobs
* feat(spellcheck): ⚡ Mistral 7b instruct v3 trained
* feat(spellcheck): 🎨 Update guidelines: accents
* refactor(spellcheck): ✨ Update logging to consider script and src code when defining the level
* feat(spellcheck): ✨ Dataset processing methods & pipeline created
* build(spellcheck): ✨ Dataset processing (oe, percentage alignment): v3.1
* feat(spellcheck): ♻️ Training Mistral-7B-Instruct: instruction + unidecode normalization + linear scheduler
* Delete previous training llm dag
* feat(spellcheck): ⚡ Add eval normalization: remove "\n"
* refactor(spellcheck): 👷 Foundational LLMs re-evaluated on the benchmark. The prompt was intentionally overfitted on the benchmark to later create the synthetic training dataset; examples from the benchmark are removed from the prompt.
* Modify overfitted prompt
* refactor(spellcheck): 🏷️ Refactor Argilla extraction: modules + unit tests + dag
* refactor(spellcheck): ⬆️ Refactor training job: add parameters to argparse + add CometML logs
* feat(spellcheck): 🎨 Fine-tune Mistral-7b with guidelines + training script improvements
* feat(spellcheck): ✨ Add args for training + train Mistral-7b-base
* fix(spellcheck): 🐛 Correct error in Mistral-7B-Base fine-tuning script
* feat(spellcheck): 🎨 DPO dataset extraction and push
* fix(spellcheck): 🔖 Small fixes
* feat(spellcheck): ⚡ DPO training script
* refactor(spellcheck): 🚧 Refactor training pipeline (WIP)
* Update get_logger for Metaflow logging
* feat(spellcheck): ✨ Double the benchmark size: extraction and push to Argilla pipeline (WIP)
* docs(spellcheck): 📝 Document benchmark generation pipeline
* feat(spellcheck): 🐛 Remove legacy metadata in Argilla
* refactor(spellcheck): 🚧 Refactor training pipeline (WIP)
* refactor(spellcheck): 🚧 Refactor training script (WIP)
* refactor(spellcheck): 🚧 Refactor training pipeline
* chore(spellcheck): ✨ Update Python from 3.9 to 3.10
* refactor(spellcheck): ⚡ LLM training pipeline refactored
* feat(spellcheck): ✨ Pretraining before fine-tuning (WIP)
* feat(spellcheck): 🚑 Pretraining + fine-tuning Mistral-7B
* refactor(spellcheck): ✨ Refactor training
* feat(spellcheck): 🚧 Batch processing (WIP)
* feat(spellcheck): 🚧 Batch job with vLLM and GCP (WIP)
* feat(spellcheck): ⚡ Batch job operational on GCP
* refactor(spellcheck): ✨ Clean code and add logging to batch job
* fix(spellcheck): 📦 Forgot to add batch dependency requirements
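Several entries in this commit message describe the evaluation normalization ("flavour" -> "flavor", accent folding, stripping "\n") applied before metrics are computed. As a rough illustration only (this is a sketch of the idea, not the repository's implementation), such a normalizer applied to both predictions and references could look like:

import unicodedata

def normalize(text: str) -> str:
    """Illustrative normalization applied to predictions and references before metrics."""
    # Newlines should not count as correction differences.
    text = text.replace("\n", " ")
    # Fold accents, e.g. "ï" -> "i" and "â" -> "a" (similar in spirit to unidecode).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Harmonize orthographic variants such as "flavour" -> "flavor".
    text = text.replace("flavour", "flavor")
    return text

print(normalize("flavoured\ncrème"))  # -> "flavored creme"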
1 parent 60b07a2, commit 24adbb2. Showing 50 changed files with 5,070 additions and 1,471 deletions.
@@ -0,0 +1,17 @@
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-devel

ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PYTHONPATH="/app/src"

WORKDIR /app

COPY ./src /app

COPY ./scripts/batch/. /app

RUN pip install --no-cache-dir -r requirements.txt

# Set the entrypoint to the batch job script
ENTRYPOINT ["python", "main.py"]
@@ -0,0 +1,9 @@
python scripts/dags/extract_from_argilla.py run \
    --deploy_to_hf true \
    --local_path data/dataset/deployed_data.parquet \
    --argilla_dataset_name training_dataset \
    --dataset_hf_repo openfoodfacts/spellcheck-dataset \
    --dataset_revision v4 \
    --dataset_test_size 0.1 \
    --dataset_version v4.3 \
    --status submitted
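The --deploy_to_hf, --dataset_hf_repo and --dataset_revision flags indicate that the extracted annotations (filtered on status "submitted") end up on the Hugging Face Hub. A minimal sketch of that final push, assuming a datasets version whose push_to_hub accepts a revision argument (the actual DAG code may differ):

from datasets import Dataset

# Hypothetical post-extraction step: records saved locally are split and pushed
# to the Hub under the requested branch/revision.
dataset = Dataset.from_parquet("data/dataset/deployed_data.parquet")
dataset_dict = dataset.train_test_split(test_size=0.1)        # --dataset_test_size 0.1
dataset_dict.push_to_hub(
    "openfoodfacts/spellcheck-dataset",
    revision="v4",                                             # --dataset_revision v4
    commit_message="Spellcheck dataset v4.3",                  # --dataset_version v4.3
)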
@@ -0,0 +1,6 @@
python scripts/dags/training/training.py run \
    --do_human_eval False \
    --evaluation_data_version v8.0 \
    --training_data_version v5.2 \
    --experiment_tag eval_loss --experiment_tag mistral-7b-v0.3 --experiment_tag eval-normalization --experiment_tag test
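This run command is a Metaflow invocation, so each flag corresponds to a Parameter declared on the flow. A minimal sketch of how such a flow might declare them (parameter names mirror the CLI flags; types and defaults are guesses, and the real scripts/dags/training/training.py does much more):

from metaflow import FlowSpec, Parameter, step

class TrainingFlow(FlowSpec):
    # Each CLI flag of the run command maps to a Metaflow Parameter. The repeated
    # --experiment_tag flag implies a multi-value parameter, not modeled here.
    do_human_eval = Parameter("do_human_eval", default="False")
    evaluation_data_version = Parameter("evaluation_data_version", default="v8.0")
    training_data_version = Parameter("training_data_version", default="v5.2")

    @step
    def start(self):
        print(f"Training data: {self.training_data_version}, eval data: {self.evaluation_data_version}")
        self.next(self.end)

    @step
    def end(self):
        print("Flow finished.")

if __name__ == "__main__":
    TrainingFlow()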
@@ -0,0 +1,50 @@
estimator:
  entry_point: "pretraining_llm.py"        # training script
  source_dir: "scripts/training/llm/"      # directory containing the training script and its requirements
  dependencies:
    - "src/"                               # additional local library
  output_path: "s3://open-food-facts-robotoff/spellcheck/model-training/"    # s3 path to save the artifacts
  code_location: "s3://open-food-facts-robotoff/spellcheck/model-training/"  # s3 path to stage the code during the training job
  base_job_name: "mistral-7b-v03"          # name of the training job
  instance_count: 1                        # number of instances used for training
  instance_type: "ml.g5.2xlarge"           # instance type used for the training job
  transformers_version: "4.36"             # transformers version used in the training job
  pytorch_version: "2.1"                   # pytorch version used in the training job
  py_version: "py310"                      # python version used in the training job
  disable_output_compression: true         # do not compress the output, to save training time and cost
  volume_size: 300                         # size of the EBS volume in GB

hyperparameters:
  # Data
  training_data: "openfoodfacts/spellcheck-corpus"
  train_split: "train"

  # Trainer
  output_dir: "/opt/ml/model"
  pretrained_model_name: "mistralai/Mistral-7B-v0.3"
  num_train_epochs: 1
  per_device_train_batch_size: 4
  learning_rate: 0.0002                    # Paper https://arxiv.org/pdf/2210.11416
  warmup_steps: 0
  warmup_ratio: 0.1
  weight_decay: 0.1
  gradient_checkpointing: true
  seed: 42
  optim: "adamw_torch_fused"               # optimizer to use: adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision or adafactor
  lr_scheduler_type: "cosine"
  gradient_accumulation_steps: 8
  bf16: true
  tf32: true
  fp16: false
  logging_steps: 1
  save_total_limit: 1
  report_to: "none"                        # important to avoid the default Trainer callbacks overlapping with our custom callback
  max_seq_length: 2048
  packing: true
  dataset_text_field: "ingredients_text"
  # add_special_tokens: true               # add the bos token and other special tokens from the tokenizer
  # append_concat_token: true              # if true, appends eos_token_id at the end of each sample being packed

  # Saving
  merge_weights: true
  max_shard_size: "2GB"
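The estimator block above mirrors the constructor arguments of a SageMaker Hugging Face estimator. A sketch of how this YAML could be consumed to launch the training job, assuming the config path shown and the use of sagemaker.huggingface.HuggingFace (the repository's Metaflow pipeline may wire this differently):

import yaml
import sagemaker
from sagemaker.huggingface import HuggingFace

# Hypothetical config path; the repository may store this file elsewhere.
with open("config/training/pretraining_conf.yml") as f:
    conf = yaml.safe_load(f)

estimator = HuggingFace(
    role=sagemaker.get_execution_role(),      # IAM role resolved at run time
    hyperparameters=conf["hyperparameters"],  # forwarded as CLI args to the entry point
    **conf["estimator"],                      # entry_point, instance_type, framework versions, etc.
)
estimator.fit(wait=True)                      # starts the SageMaker training job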
@@ -0,0 +1,75 @@
estimator:
  entry_point: "refactored_llm.py"         # training script
  source_dir: "scripts/training/llm/"      # directory containing the training script and its requirements
  dependencies:
    - "src/"                               # additional local library
  output_path: "s3://open-food-facts-robotoff/spellcheck/model-training/"    # s3 path to save the artifacts
  code_location: "s3://open-food-facts-robotoff/spellcheck/model-training/"  # s3 path to stage the code during the training job
  base_job_name: "mistral-7b-v03"          # name of the training job
  instance_count: 1                        # number of instances used for training
  instance_type: "ml.g5.2xlarge"           # instance type used for the training job
  transformers_version: "4.36"             # transformers version used in the training job
  pytorch_version: "2.1"                   # pytorch version used in the training job
  py_version: "py310"                      # python version used in the training job
  disable_output_compression: true         # do not compress the output, to save training time and cost
  volume_size: 300                         # size of the EBS volume in GB

additional_conf:
  s3_evaluation_uri: "s3://open-food-facts-robotoff/spellcheck/evaluation_output/"

hyperparameters:
  # Data
  training_data: "openfoodfacts/spellcheck-dataset"
  evaluation_data: "openfoodfacts/spellcheck-benchmark"
  train_split: "train+test"
  eval_split: "train"
  train_text_feature: "original"
  train_label_feature: "reference"
  eval_text_feature: "original"
  eval_label_feature: "reference"
  train_data_revision: "v5"
  eval_data_revision: "v8"

  # TrainingArguments
  output_dir: "/opt/ml/model"
  pretrained_model_name: "mistralai/Mistral-7B-v0.3"
  num_train_epochs: 0.01
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 4
  learning_rate: 0.0002                    # Paper https://arxiv.org/pdf/2210.11416
  warmup_steps: 0
  warmup_ratio: 0.1
  weight_decay: 0.1
  gradient_checkpointing: true
  seed: 42
  optim: "adamw_torch_fused"               # optimizer to use: adamw_hf, adamw_torch, adamw_torch_fused, adamw_apex_fused, adamw_anyprecision or adafactor
  lr_scheduler_type: "cosine"
  gradient_accumulation_steps: 4
  bf16: true
  tf32: true
  fp16: false
  logging_steps: 5
  evaluation_strategy: "steps"
  save_strategy: "steps"
  eval_steps: 10                           # careful: save_steps (500 by default) needs to be a multiple of eval_steps
  save_total_limit: 1
  report_to: "none"                        # important to avoid the default Trainer callbacks overlapping with our custom callback

  # SFTConfig
  max_seq_length: 1024
  packing: true
  dataset_text_field: "text"
  # add_special_tokens: true               # add the bos token and other special tokens from the tokenizer
  # append_concat_token: true              # if true, appends eos_token_id at the end of each sample being packed

  # Saving
  merge_weights: true
  max_shard_size: "2GB"

  # Inference
  max_new_token: 1024
  batch_size: 1

  # Data processing
  batched: false
  # instruction_template
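The hyperparameter block maps closely onto transformers TrainingArguments and TRL's SFT settings. A condensed sketch of the QLoRA fine-tuning these values imply, assuming a recent trl release with SFTConfig and standard peft/bitsandbytes APIs (the actual refactored_llm.py may structure this differently, and the LoRA targets below are an assumption):

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_name = "mistralai/Mistral-7B-v0.3"

# 4-bit quantization for QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# train_split and train_data_revision from the config above.
train_dataset = load_dataset(
    "openfoodfacts/spellcheck-dataset", revision="v5", split="train+test"
)

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="/opt/ml/model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.1,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    logging_steps=5,
    save_total_limit=1,
    report_to="none",
    max_seq_length=1024,
    packing=True,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
)
trainer.train()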