diff --git a/mixture_of_experts_pretraining/README.md b/mixture_of_experts_pretraining/README.md
index f4b9fc75e..70f555c5a 100644
--- a/mixture_of_experts_pretraining/README.md
+++ b/mixture_of_experts_pretraining/README.md
@@ -210,6 +210,13 @@ python ~/xpk/xpk.py workload create \
 --num-slices= \
 --command="bash script.sh"
 ```
+Note that the dataset paths default as follows in [`dataset/c4_mlperf.yaml`](config/dataset/c4_mlperf.yaml):
+```
+train_dataset_path: gs://mlperf-llm-public2/c4/en_json/3.0.1
+eval_dataset_path: gs://mlperf-llm-public2/c4/en_val_subset_json
+```
+You can override these defaults by appending
+`dataset.train_dataset_path=/path/to/train/dir dataset.eval_dataset_path=/path/to/eval/dir` to the workload command; both local directories and GCS buckets are supported.

 ## Run Experiments in GCE

@@ -326,6 +333,14 @@ EOF
 "
 ```

+Note that the dataset paths default as follows in [`dataset/c4_mlperf.yaml`](config/dataset/c4_mlperf.yaml):
+```
+train_dataset_path: gs://mlperf-llm-public2/c4/en_json/3.0.1
+eval_dataset_path: gs://mlperf-llm-public2/c4/en_val_subset_json
+```
+You can override these defaults by appending
+`dataset.train_dataset_path=/path/to/train/dir dataset.eval_dataset_path=/path/to/eval/dir` to the workload command; both local directories and GCS buckets are supported.
+
 #### Logging

 The workload starts only after all worker SSH connections are established; once it has started, it is safe and recommended to exit manually. Without a manual exit, the provided scripts may exceed the SSH connection timeout, causing unexpected command retries and error messages stating that the TPU devices are currently in use. However, this should not disrupt your existing workload.
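As an illustration of the override mechanism described in the added text, the `key=value` pairs are simply appended to the command string passed to the launcher. This is a minimal sketch; the bucket paths below are placeholders, not values from this repository:

```shell
# Hypothetical sketch: append dataset-path overrides to the workload command.
# The gs://my-bucket/... paths are placeholders; local directories work too.
OVERRIDES="dataset.train_dataset_path=gs://my-bucket/c4/train dataset.eval_dataset_path=gs://my-bucket/c4/eval"
CMD="bash script.sh ${OVERRIDES}"
echo "${CMD}"
```

The resulting string would be passed via `--command=` in the `xpk.py workload create` invocation shown in the README.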