Add Tiktoken as tokenizer (#30)
* minor

* minor

* forking nanotron

* tiktoken
xzyaoi authored Mar 15, 2024
1 parent b98e10e commit 0997e56
Showing 122 changed files with 581 additions and 1,434 deletions.
README.md (85 changes: 8 additions & 77 deletions)
@@ -1,82 +1,13 @@
# ⚡️ Nanotron
# ⚡️ FMEngine

The objective of this library is to provide easy distributed primitives in order to train a variety of models efficiently using 3D parallelism. For more information about the internal design of the library or 3D parallelism in general, please check out [[docs.md]](./docs/docs.md) and [[3d_parallelism.md]](./docs/3d_parallelism.md).
FMEngine is our opinionated take on a foundation model training framework. The first version of FMEngine is built on top of `PyTorch` and `DeepSpeed` and is designed to be a drop-in replacement for `DeepSpeed` with a few additional features. For `v2`, we forked HuggingFace's `nanotron` and added some features to make it easier to use.


# Philosophy

- Make it fast. At least as fast as other open source versions.
- Make it minimal. We don't actually need to support all techniques and all versions of 3D parallelism. What matters is that we can efficiently use the "best" ones.
- Make everything explicit instead of transparent. Transparent behavior works well when it works, but it becomes a horrible debugging experience when one doesn't understand the implications of the techniques used. To mitigate this, we choose to be explicit about what the library does.

# Core Features

We support the following:
- 3D parallelism, including one-forward-one-backward pipeline engine
- ZeRO-1 optimizer
- FP32 gradient accumulation
- Parameter tying/sharding

# Installation

Requirements:
- Python >= 3.10
- PyTorch >= 2.0.0
- Flash-Attention >= 2.5.0

To install (in a new env):
```bash
pip install torch
pip install packaging; pip install "flash-attn>=2.5.0" --no-build-isolation
git clone git@github.com:huggingface/nanotron.git
cd nanotron
pip install -e .
```

It is also nice to have `transformers`, `datasets`, `python-etcd`, and `tensorboardX`: `pip install transformers datasets python-etcd tensorboardX`

We also support a set of flavors that you can install using `pip install -e ".[$FLAVOR]"`:
- `dev`: Use if you are developing in `nanotron`. Among other things, it installs our linting tooling; you also have to run `pre-commit install` afterwards.
- `test`: We use `pytest` to run our testing suite. To run tests in parallel, this flavor installs `pytest-xdist`, which you can leverage by running `pytest -n 12 tests` (12 being the number of parallel workers).


# Quick examples

In the `/examples` directory, you can find a few example configuration files and a script to run them.

You can run a sample training using:
```bash
torchrun --nproc_per_node=8 run_train.py --config-file examples/debug_run_train.yaml
```

And run a sample generation using:
```bash
torchrun --nproc_per_node=8 run_generation.py --ckpt-path checkpoints/text/4
```

# Development guidelines

If you plan on developing on `nanotron`, we suggest you install the `dev` flavor: `pip install -e ".[dev]"`

We use pre-commit to run a set of hooks on each commit, mostly code normalization so that the codebase stays consistent. Please do run `pre-commit install`.

For the linting:
```bash
pre-commit install
pre-commit run --config .pre-commit-config.yaml --all-files
```

Features we would like to add:
- [ ] Support `torch.compile`
- [ ] Support `torch.distributed.rpc`
- [ ] More optimized kernels
- [ ] Support Zero3
- [ ] Other PP schedules (such as Interleaved 1f1b...)
- [ ] Ring attention / Sequence Parallelism
- [ ] 3D Parallel MoEs
- [ ] Supporting more architectures (Mamba..)
- [ ] ...

# Credits

We would like to thank everyone working on LLMs, especially those sharing their work openly from which we took great inspiration: Nvidia for `Megatron-LM/apex`, Microsoft for `DeepSpeed`, HazyResearch for `flash-attn`
We would like to thank everyone working on LLMs, especially those sharing their work openly from which we took great inspiration:

- HuggingFace for `nanotron`,
- Nvidia for `Megatron-LM/apex`,
- Microsoft for `DeepSpeed`,
- HazyResearch for `flash-attn`
File renamed without changes.
@@ -9,7 +9,7 @@ data:
dataset_overwrite_cache: false
dataset_processing_num_proc_per_process: 1
hf_dataset_config_name: default
hf_dataset_or_datasets: cerebras/SlimPajama-627B
hf_dataset_or_datasets: DKYoon/SlimPajama-6B
hf_dataset_splits: train
text_column_name: text
num_loading_workers: 1
@@ -51,7 +51,7 @@ model:
sliding_window: 4096
tie_word_embeddings: true
use_cache: true
vocab_size: 32000
vocab_size: 102000
optimizer:
accumulate_grad_in_fp32: true
adam_beta1: 0.9
@@ -70,16 +70,17 @@ optimizer:
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 2
dp: 1
pp: 1
pp_engine: 1f1b
tp: 1
tp_linear_async_communication: true
tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
tokenizer_type: openai
tokenizer_max_length: null
tokenizer_name_or_path: mistralai/Mistral-7B-v0.1
tokenizer_name_or_path: cl100k_base
tokenizer_revision: null
tokens:
batch_accumulation_per_replica: 1
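The data hunk above swaps the training corpus from `cerebras/SlimPajama-627B` to the much smaller `DKYoon/SlimPajama-6B`. As a rough sketch of what the `hf_dataset_or_datasets`, `hf_dataset_splits`, and `text_column_name` fields correspond to, assuming the standard `datasets` API rather than FMEngine's own dataloader:

```python
from datasets import load_dataset

# The config fields map roughly onto a plain load_dataset call:
#   hf_dataset_or_datasets -> dataset path, hf_dataset_splits -> split,
#   text_column_name -> which column holds the raw text.
ds = load_dataset("DKYoon/SlimPajama-6B", split="train", streaming=True)

sample = next(iter(ds))
print(sample["text"][:200])  # first 200 characters of the first document
```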
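The tokenizer hunk replaces the HuggingFace tokenizer `mistralai/Mistral-7B-v0.1` with tiktoken's `cl100k_base` encoding (hence `tokenizer_type: openai`), and `vocab_size` grows from 32000 to 102000 to cover cl100k_base's roughly 100k tokens. A minimal sketch of what the new setting resolves to, using the public `tiktoken` API (FMEngine's internal tokenizer wrapper may differ):

```python
import tiktoken

# tokenizer_name_or_path: cl100k_base names a tiktoken encoding rather than
# a HuggingFace repo; get_encoding loads its BPE ranks.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("FMEngine now tokenizes with tiktoken.")
print(ids)              # integer token ids
print(enc.decode(ids))  # round-trips back to the original string
print(enc.n_vocab)      # ~100k, which is why vocab_size was raised to 102000
```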
