Enable pygrain v3 #330
Conversation
Let's talk live to figure out a path forwards here. This adds too much duplication and complexity to MaxText IMO.
@@ -15,12 +15,14 @@
"""

"""Create an Orbax CheckpointManager with specified (Async or not) Checkpointer."""
# pylint: disable=line-too-long
shouldn't do this!
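For context, the creation path this docstring covers looks roughly like the following with the Orbax API of this era. This is a sketch, not the exact MaxText code; the helper name and argument list are assumptions, only the orbax.checkpoint calls themselves are real API.

```python
import orbax.checkpoint
from etils import epath

def create_orbax_checkpoint_manager(checkpoint_dir, use_async, save_interval_steps):
  """Create a CheckpointManager backed by an async or blocking checkpointer."""
  if use_async:
    checkpointer = orbax.checkpoint.AsyncCheckpointer(
        orbax.checkpoint.PyTreeCheckpointHandler())
  else:
    checkpointer = orbax.checkpoint.Checkpointer(
        orbax.checkpoint.PyTreeCheckpointHandler())
  options = orbax.checkpoint.CheckpointManagerOptions(
      create=True, save_interval_steps=save_interval_steps)
  return orbax.checkpoint.CheckpointManager(
      epath.Path(checkpoint_dir), checkpointer, options)
```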
  )
  max_logging.log("Checkpoint manager created!")
  return mngr
def create_orbax_checkpoint_manager_pygrain( |
I'm very worried about this level of duplication making MaxText harder to understand and use. I think we have to hold back on this change until we're ready to always recommend Grain based on just this added mental load for users.
(I do think we can land lots of things separately.)
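A hypothetical refactor sketch of how the two creation paths could collapse into one factory keyed on dataset_type, rather than a parallel `create_orbax_checkpoint_manager_pygrain` copy. Names beyond the public Orbax and Grain APIs are assumptions, not the PR's actual code.

```python
import grain.python as grain
import orbax.checkpoint
from etils import epath

def create_checkpoint_manager(checkpoint_dir, use_async, save_interval_steps,
                              dataset_type="c4"):
  """Single factory for both pipelines instead of a *_pygrain duplicate."""
  ctor = (orbax.checkpoint.AsyncCheckpointer if use_async
          else orbax.checkpoint.Checkpointer)
  checkpointers = {"state": ctor(orbax.checkpoint.PyTreeCheckpointHandler())}
  if dataset_type == "c4-array_record":
    # pygrain iterators are checkpointable, so register a handler for them.
    checkpointers["iter"] = orbax.checkpoint.Checkpointer(
        grain.PyGrainCheckpointHandler())
  options = orbax.checkpoint.CheckpointManagerOptions(
      create=True, save_interval_steps=save_interval_steps)
  return orbax.checkpoint.CheckpointManager(
      epath.Path(checkpoint_dir), checkpointers, options)
```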
    first_checkpoint_path: if there is no checkpoint in the checkpoint manager,
      return the Params from the first_checkpoint_path if they exist. This
      enables loading just the parameters and is intended for finetuning.
    load_parameters_path: This enables loading just the parameters and is intended
Thanks for fixing this name. It was driving me crazy the last time I read the code.
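The fallback the docstring describes might look like this sketch (function and argument names are assumptions): restore the full state when the run directory has a checkpoint, otherwise load only the parameter PyTree from load_parameters_path for finetuning.

```python
import orbax.checkpoint

def load_state(checkpoint_manager, abstract_unboxed_pre_state,
               load_parameters_path=""):
  """Restore full state if a checkpoint exists, else just params for finetuning."""
  latest_step = checkpoint_manager.latest_step()
  if latest_step is not None:
    return checkpoint_manager.restore(latest_step,
                                      items=abstract_unboxed_pre_state)
  if load_parameters_path:
    # No checkpoint in this run's directory: load only the parameter PyTree
    # (no optimizer state, no data iterator) from the given path.
    return orbax.checkpoint.PyTreeCheckpointer().restore(load_parameters_path)
  return None  # caller initializes from scratch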
max_logging.log(f"restoring state from this run's directory latest step \ | ||
{latest_step}") | ||
return checkpoint_manager.restore(latest_step, abstract_unboxed_pre_state, | ||
# Set restore_args based whether to load data iterator |
I'm very stressed by this code and don't feel comfortable pushing this in our "simple" reference for customers.
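Condensed, the branch at issue amounts to the following sketch (names assumed): when the pipeline is pygrain, the data iterator's position is restored alongside the train state; otherwise only the state is restored.

```python
def restore_latest(checkpoint_manager, latest_step,
                   abstract_unboxed_pre_state, data_iterator=None):
  """Restore the train state, plus the Grain iterator when one is passed."""
  items = {"state": abstract_unboxed_pre_state}
  if data_iterator is not None:
    # c4-array_record / pygrain path: the iterator position is part of the
    # checkpoint, so it is restored together with the model state.
    items["iter"] = data_iterator
  return checkpoint_manager.restore(latest_step, items=items)
```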
@@ -118,13 +117,22 @@ dataset_path: ""
vocab_size: 32_768 # powers of 2 for sharding
assets_path: "assets"
vocab_relative_path: "tokenizer" # Assumes we're allowed
dataset_name: 'c4/en:3.0.1'
# When using c4-array_record dataset_type, use subfolder path as dataset_name
# array_record files should be located in <dataset_path>/<dataset_name>/*.array_record*
Is there a script for generating this data?
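No script appears in this diff; a conversion along these lines should work. This is a sketch that assumes the C4 split already exists as TFRecord shards, and the file paths are placeholders; only the ArrayRecordWriter and tf.data calls are real API.

```python
import glob
import tensorflow as tf
from array_record.python.array_record_module import ArrayRecordWriter

for shard in glob.glob("/data/c4/en/3.0.1/c4-train.tfrecord-*"):
  writer = ArrayRecordWriter(shard + ".array_record", "group_size:1")
  for record in tf.data.TFRecordDataset(shard):
    writer.write(record.numpy())  # copy the serialized tf.Example bytes
  writer.close()
```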
@@ -80,6 +85,37 @@ def _normalize_features(features):
                        num_parallel_calls=AUTOTUNE)


def length_trim(ds, max_len):
  """Trim to max length."""
Thank you -- this should merge on a separate CR!
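A minimal sketch of what the trimming step could look like in full (the flat feature-dict layout is an assumption based on the surrounding _normalize_features code):

```python
import tensorflow as tf

def length_trim(ds, max_len):
  """Trim each example to max_len tokens instead of dropping it (#274)."""
  def _trim(features):
    return {k: v[:max_len] for k, v in features.items()}
  return ds.map(_trim, num_parallel_calls=tf.data.AUTOTUNE)
```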
Note that grain uses an optimized packing algorithm to select samples to pack, resulting in denser packing, so we see higher loss overall compared with the original tfds pipeline. In the convergence test with grain, loss started above 11 and went down to ~3: https://pantheon.corp.google.com/kubernetes/service/us-east5/v5e-256-bodaborg/default/aireen-v5e256-0104-1257/logs?e=13802955&mods=allow_workbench_image_override&project=tpu-prod-env-multipod
In the convergence test on the same branch but with the tfds pipeline, loss started at ~10 and went down to ~2.6: https://pantheon.corp.google.com/kubernetes/service/us-east5/v5e-256-bodaborg/default/aireen-v5e256-0104-1236/logs?e=13802955&mods=allow_workbench_image_override&project=tpu-prod-env-multipod
The request to trim long sequences instead of dropping them is also addressed (#274).
Configs for these use cases:
Items I will work on in follow-up PRs: