Lightning Thunder is a deep learning compiler for PyTorch. It makes PyTorch programs faster, both on single accelerators and in distributed settings.
The main goal of Lightning Thunder is to make optimizing user programs as extensible and expressive as possible.
NOTE: Lightning Thunder is alpha and not ready for production runs. Feel free to get involved, but expect a few bumps along the way.
Install the nvFuser nightly, which will also install the matching PyTorch nightly:
```bash
pip install --pre 'nvfuser-cu121[torch]' --extra-index-url https://pypi.nvidia.com
```
Install Thunder:
```bash
pip install git+https://github.com/Lightning-AI/lightning-thunder.git
```
or install from the local repo:
```bash
pip install .
```
Here is a simple example of how Thunder lets you compile and run PyTorch code:
```python
import torch
import thunder

def foo(a, b):
    return a + b

jfoo = thunder.jit(foo)

a = torch.full((2, 2), 1)
b = torch.full((2, 2), 3)

result = jfoo(a, b)
print(result)

# prints
# tensor([[4, 4],
#         [4, 4]])
```
The compiled function `jfoo` takes and returns PyTorch tensors, just like the original function, so modules and functions compiled by Thunder can be used as part of larger PyTorch programs.
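For example, a compiled module slots into an ordinary training step. A minimal sketch, where the small `torch.nn.Linear` model is just a stand-in for a real network and gradients should land on the module's parameters as usual:

```python
import torch
import thunder

model = torch.nn.Linear(4, 4)
jmodel = thunder.jit(model)  # compiled modules behave like regular modules

x = torch.randn(2, 4)
loss = jmodel(x).sum()
loss.backward()  # PyTorch autograd flows through the compiled module

print(model.weight.grad.shape)  # gradients accumulate on the parameters
```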
Thunder is in its early stages and should not be used for production runs yet.
However, it can already deliver outstanding performance on models supported by LitGPT, such as Mistral, Llama2, Gemma, Falcon, and derivatives.
Run the training loop for Llama, single-GPU:

```bash
python examples/lit-gpt/train.py
```

Run the training loop for Llama, multi-GPU, using FSDP:

```bash
python examples/lit-gpt/train_fsdp.py
```
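Under the hood, the multi-GPU script shards the model before compiling it. A rough sketch of the pattern, assuming the `thunder.distributed.fsdp` entry point and a process group launched via `torchrun` (see `examples/lit-gpt/train_fsdp.py` for the authoritative version):

```python
import os

import torch
import torch.distributed
import thunder
import thunder.distributed

# ASSUMPTION: launched with torchrun, which sets LOCAL_RANK and the
# rendezvous environment variables used by init_process_group.
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])

model = torch.nn.Linear(4, 4).to(torch.device("cuda", local_rank))
model = thunder.distributed.fsdp(model)  # shard parameters across ranks
jmodel = thunder.jit(model)              # then compile as usual
```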
See README.md for details on running LitGPT with Thunder.
Given a Python callable or PyTorch module, Thunder can generate an optimized program that:
- Computes its forward and backward passes
- Coalesces operations into efficient fusion regions
- Dispatches computations to optimized kernels
- Distributes computations optimally across machines
To do so, Thunder ships with:
- A JIT for acquiring Python programs targeting PyTorch and custom operations
- A multi-level IR to represent operations as a trace of a reduced op-set
- An extensible set of transformations on the trace, such as `grad`, fusions, distributed transforms (like `ddp`, `fsdp`), and functional transforms (like `vmap`, `vjp`, `jvp`)
- A way to dispatch operations to an extensible collection of executors
Thunder is written entirely in Python. Even its trace is represented as valid Python at all stages of transformation. This allows unprecedented levels of introspection and extensibility.
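You can inspect these traces directly. A minimal sketch using `thunder.last_traces`, which returns the sequence of traces produced for the last call of a compiled function, with the final entry being the trace that was executed:

```python
import torch
import thunder

def foo(a, b):
    return a + b

jfoo = thunder.jit(foo)
jfoo(torch.randn(2), torch.randn(2))

# Every stage of transformation is recorded as a trace of valid Python.
traces = thunder.last_traces(jfoo)
print(traces[-1])  # the final, execution-ready trace
```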
Thunder doesn't generate code for accelerators directly. It acquires and transforms user programs so that it's possible to optimally select or generate device code using fast executors like:
- `torch.compile`
- nvFuser
- cuDNN
- Apex
- TransformerEngine
- PyTorch eager
- custom kernels, including those written with OpenAI Triton
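Which executors are used can be influenced at compile time. A hedged sketch, assuming `thunder.jit` accepts an `executors` keyword listing executors in priority order and that executor names are accepted as strings (the exact spelling and the available executors depend on your installation; check the docs):

```python
import torch
import thunder

def foo(a, b):
    return a + b

# ASSUMPTION: `executors` selects which backends Thunder may dispatch to,
# in priority order; availability depends on what is installed.
jfoo = thunder.jit(foo, executors=("nvfuser", "torch"))
print(jfoo(torch.randn(2), torch.randn(2)))
```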
Modules and functions compiled with Thunder fully interoperate with vanilla PyTorch and support PyTorch's autograd. Thunder also works alongside `torch.compile` to leverage its state-of-the-art optimizations.
Docs are currently not hosted publicly. However, you can build them locally quickly:

```bash
make docs
```

and point your browser to the generated docs at `docs/build/index.html`.
You can set up your environment for developing Thunder by installing the development requirements:
```bash
pip install -r requirements/devel.txt
```
Install Thunder as an editable package (optional):
```bash
pip install -e .
```
Now you can run tests:

```bash
pytest thunder/tests
```
Thunder is very thoroughly tested, so expect this to take a while.
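While iterating on a change, you can narrow the run with standard pytest selection flags (the keyword below is just an example):

```bash
# Run only tests whose names match a keyword, stopping at the first failure.
pytest thunder/tests -k "jit" -x
```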
Lightning Thunder is released under the Apache 2.0 license. See the LICENSE file for details.