Improve OSS documentation (facebookincubator#334)
Summary:
Pull Request resolved: facebookincubator#334

While reading the OSS documentation [here](https://facebookincubator.github.io/AITemplate/index.html), I've made some content improvements.

Reviewed By: chenyang78

Differential Revision: D43609113

fbshipit-source-id: 9698ed86d64e315bb1d1fa33084c7ae17f667a9f
aakhundov authored and facebook-github-bot committed Feb 27, 2023
1 parent bdefae8 commit 24de3cf
Showing 11 changed files with 137 additions and 116 deletions.
30 changes: 20 additions & 10 deletions README.md
@@ -6,11 +6,11 @@
[![Deploy docs to Pages](https://github.com/facebookincubator/AITemplate/actions/workflows/pages.yaml/badge.svg)](https://github.com/facebookincubator/AITemplate/actions/workflows/pages.yaml)



AITemplate (AIT) is a Python framework that transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving. AITemplate highlights include:

- High performance: close to roofline fp16 TensorCore (NVIDIA GPU) / MatrixCore (AMD GPU) performance on major models, including ResNet, MaskRCNN, BERT, VisionTransformer, Stable Diffusion, etc.
-- Unified, open, and flexible. Seamless fp16 deep neural network models for NVIDIA GPU or AMD GPU. Fully open source, Lego-style easy extendable high-performance primitives for new model support. Supports a significantly more comprehensive range of fusions than existing solutions for both GPU platforms.
+- Unified, open, and flexible. Seamless fp16 deep neural network models for NVIDIA GPU or AMD GPU. Fully open source, Lego-style easily extendable high-performance primitives for new model support. Supports a significantly more comprehensive range of fusions than existing solutions for both GPU platforms.
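For readers new to the framework, here is a minimal end-to-end sketch of the AIT workflow. It follows the tutorial-style public API (`compile_model`, `detect_target`, and the `frontend` module); exact details may vary across versions.

```python
# A minimal sketch of the AIT workflow, assuming the public tutorial API.
from aitemplate.compiler import compile_model
from aitemplate.frontend import nn, Tensor
from aitemplate.testing import detect_target

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(64, 32)

    def forward(self, x):
        return self.dense(x)

model = SimpleNet()
model.name_parameter_tensor()  # give the weights stable names for the engine

# Declare a symbolic fp16 input and trace the graph.
x = Tensor(shape=[8, 64], dtype="float16", name="input0", is_input=True)
y = model(x)
y._attrs["name"] = "output0"
y._attrs["is_output"] = True

# Codegen, compile, and load the CUDA/HIP engine.
module = compile_model(y, detect_target(), "./tmp", "simple_net")
```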


## More about AITemplate

@@ -24,49 +24,56 @@ AITemplate provides unique advanced horizontal fusion. AITemplate can fuse paral

### Vertical Fusion

-AITemplate provides strong vertical fusion. AITemplate can fuse a large range of operations into TensorCore/MatrixCore operations, such as elementwise operations, reduction operations, and layout permutation operations. AITemplate also provides back-to-back style TensorCore / MatrixCore operation fusion.
+AITemplate provides strong vertical fusion. AITemplate can fuse a large range of operations into TensorCore/MatrixCore operations, such as elementwise operations, reductions, and layout permutations. AITemplate also provides back-to-back style TensorCore / MatrixCore operation fusion.
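As a concrete illustration, here is a sketch using the graph-level `ops` API (the `FuncEnum` import path is taken from AIT's tests and may differ across versions); the fusion itself is performed by AIT's compiler passes, not by the user:

```python
from aitemplate.compiler import ops
from aitemplate.compiler.ops.common.epilogue import FuncEnum
from aitemplate.frontend import Tensor

x = Tensor(shape=[256, 512], dtype="float16", name="x", is_input=True)
w = Tensor(shape=[1024, 512], dtype="float16", name="w", is_input=True)

# A GEMM followed by an elementwise ReLU: during compilation, the
# vertical-fusion passes can fold the activation into the GEMM epilogue,
# so no separate elementwise kernel is launched.
y = ops.gemm_rcr()(x, w)
y = ops.elementwise(FuncEnum.RELU)(y)
```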

### Memory Fusion

AITemplate provides innovative memory fusions. AITemplate can fuse GEMM, LayerNorm, and other operators, followed by memory operations such as concatenation, split, and slice into a single operator.
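For example, a sketch of the GEMM-plus-concatenate case (whether the copy is actually eliminated depends on the shapes and the pass pipeline):

```python
from aitemplate.compiler import ops
from aitemplate.frontend import Tensor

x = Tensor(shape=[128, 256], dtype="float16", name="x", is_input=True)
w1 = Tensor(shape=[512, 256], dtype="float16", name="w1", is_input=True)
w2 = Tensor(shape=[512, 256], dtype="float16", name="w2", is_input=True)

# Two GEMMs feeding a concatenation: memory fusion lets each GEMM write
# directly into its slice of the concatenated buffer, removing the
# standalone concat kernel.
y1 = ops.gemm_rcr()(x, w1)
y2 = ops.gemm_rcr()(x, w2)
y = ops.concatenate()([y1, y2], dim=1)
```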

### Working w/wo PyTorch

The AITemplate-generated Python runtime can take PyTorch tensors as inputs and outputs without an extra copy. For environments without PyTorch, the AITemplate Python/C++ runtime is self-contained.
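A sketch of the zero-copy path, assuming `module` is an engine produced by `compile_model` as in the earlier example, with input/output names `input0`/`output0`:

```python
import torch

x_pt = torch.randn(8, 64, dtype=torch.float16, device="cuda")
y_pt = torch.empty(8, 32, dtype=torch.float16, device="cuda")

# The runtime reads from / writes to the PyTorch-owned GPU buffers directly.
module.run_with_tensors({"input0": x_pt}, {"output0": y_pt})
```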

### Extensions without suffering

AITemplate provides a straightforward approach for making an extension in codegen. To add a new operator or a new fused kernel into AITemplate, most of the time one only needs to add two Python files: one for a graph node definition and another for the backend codegen. The CUDA/HIP kernel in a text header file can be directly utilized in the codegen.
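A hypothetical sketch of the first of those two files, the graph-node definition. The class and attribute layout below are illustrative, not an existing AIT operator; the base classes are assumed to live in `aitemplate.compiler.base`.

```python
from aitemplate.compiler.base import Operator, Tensor

class my_relu(Operator):
    """Hypothetical graph node for a custom elementwise op."""

    def __init__(self):
        super().__init__()
        self._attrs["op"] = "my_relu"

    def __call__(self, x: Tensor) -> Tensor:
        # Record graph connectivity: this node consumes x and
        # produces a new tensor of the same shape and dtype.
        self._attrs["inputs"] = [x]
        y = Tensor(x.shape(), src_ops={self}, dtype=x.dtype())
        self._attrs["outputs"] = [y]
        return y
```

The second file would register the backend codegen function that emits the CUDA/HIP kernel source (often kept in a text header file) for this node.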


## FX2AIT

FX2AIT is a Python-based tool that converts PyTorch models into AITemplate (AIT) engine for lightning-fast inference serving. Using FX2AIT's built-in AITLowerer, partial AIT acceleration can be achieved for models with unsupported operators in AITemplate.

Key features of FX2AIT include:

* Easy Conversion: FX2AIT requires only a PyTorch model and input for conversion, generating an "AITModule" output for inference serving.
-* Expanded Support: AITemplate does not support all PyTorch operators. FX2AIT's AITLowerer offers a solution for partial AIT conversion for models with unsupported operators. Check the example/03_lowering_split for more information.
+* Expanded Support: AITemplate does not support all PyTorch operators. FX2AIT's AITLowerer offers a solution for partial AIT conversion for models with unsupported operators. Check the `fx2ait/fx2ait/example/03_lowering_split` for more information.

More info can be found at https://github.com/facebookincubator/AITemplate/tree/main/fx2ait.
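A heavily hedged sketch of the lowering flow described above; the module paths and the `AitLowerer`/`LowerSettings` names are assumptions and may differ from the actual fx2ait API:

```python
import torch
from fx2ait.lower.lower import AitLowerer              # assumed module path
from fx2ait.lower.lower_settings import LowerSettings  # assumed module path

model = torch.nn.Sequential(
    torch.nn.Linear(64, 32), torch.nn.ReLU()
).cuda().half()
inputs = [torch.randn(8, 64, device="cuda", dtype=torch.float16)]

# Lower the AIT-supported subgraphs; unsupported ops keep running in PyTorch.
lowerer = AitLowerer.create(LowerSettings(workdir="/tmp", name="demo"))
lowered = lowerer(model, inputs)
print(lowered(*inputs))
```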


## Installation

-**Hardware requirement:**
+**Hardware requirements:**
- **NVIDIA**: AIT is only tested on SM80+ GPUs (Ampere etc). Not all kernels work with old SM75/SM70 (T4/V100) GPUs.
- **AMD**: AIT is only tested on CDNA2 (MI-210/250) GPUs. There may be compiler issues for old CDNA1 (MI-100) GPUs.

-## Clone the code
+### Clone the code

When cloning the code, please use the following command to also clone the submodules:
```
git clone --recursive https://github.com/facebookincubator/AITemplate
```

### Docker Image

We highly recommend using AITemplate with Docker to avoid accidentally using a wrong version of NVCC or HIPCC.
- CUDA: `./docker/build.sh cuda`
- ROCM: `DOCKER_BUILDKIT=1 ./docker/build.sh rocm`

This will build a docker image with tag `ait:latest`.

### From Source

The following command will create a Python wheel for AITemplate. Please ensure you have the correct CUDA/ROCm compiler installed.
- CUDA: CUDA 11.6
- ROCm: We tested on ROCm 5.2.3 with a custom-built HIPCC, using the commands in docker/Dockerfile.rocm#L87-L96
@@ -93,7 +100,8 @@ There are a few tutorials for onboarding:


## Examples & Performance
-AITemplate provides the following model templates & reference performance data on A100/MI-250
+
+AITemplate provides the following model templates & reference performance data on A100/MI-250:

- [01_ResNet-50](examples/01_resnet-50/) with PyTorch Image Models (TIMM)
- [02_MaskRCNN-FPN](examples/02_detectron2/) with Detectron2
@@ -117,21 +125,23 @@ Long-term plan:
- Composable Kernel CPU extension on AVX2/AVX-512 for AMD Epyc CPU.

## Contributing

Check our [contributing guide](CONTRIBUTING.md) to learn about how to contribute to the project.

## The Team

AITemplate is currently maintained by Meta engineers: [Ying Zhang](https://github.com/ipiszy), [Yang Chen](https://github.com/chenyang78), [Terry Chen](https://github.com/terrychenism), [Mu-Chu Lee](https://github.com/muchulee8), [Max Podkorytov](https://github.com/tenpercent), [Adnan Akhundov](https://github.com/aakhundov).

-AITemplate is co-created by Meta engineers: [Bing Xu](https://github.com/antinucleon), [Ying Zhang](https://github.com/ipiszy), [Hao Lu](https://github.com/hlu1), [Yang Chen](https://github.com/chenyang78), and [Terry Chen](https://github.com/terrychenism), with major contributions coming from more talented engineers. A non-exhaustive list to mention is Mike Iovine, Mu-Chu Lee, Scott Wolchok, Oleg Khabinov, Shirong Wu, Huaming Li, Hui Guo, Zhijing Li, Max Podkorytov. We also want to thank the discussions with Andrew Tulloch, Yinghai Lu, Lu Fang.
+AITemplate is co-created by Meta engineers: [Bing Xu](https://github.com/antinucleon), [Ying Zhang](https://github.com/ipiszy), [Hao Lu](https://github.com/hlu1), [Yang Chen](https://github.com/chenyang78), and [Terry Chen](https://github.com/terrychenism), with major contributions coming from more talented engineers. A non-exhaustive list to mention is Mike Iovine, Mu-Chu Lee, Scott Wolchok, Oleg Khabinov, Shirong Wu, Huaming Li, Hui Guo, Zhijing Li, Max Podkorytov. We also want to thank Andrew Tulloch, Yinghai Lu, Lu Fang for the valuable discussions.

FX2AIT and Aten2AIT are co-created and maintained by Meta engineers: [Wei Wei](https://github.com/frank-wei), [Shirong Wu](https://github.com/wushirong) and [Zhijing Li](https://github.com/tissue3).


-## Acknowledgement
+## Acknowledgements

-AITemplate team works deeply with NVIDIA [CUTLASS](https://github.com/NVIDIA/cutlass) Team (Led by Andrew Kerr, Haicheng Wu) and AMD [Composable Kernel](https://github.com/ROCmSoftwarePlatform/composable_kernel) Team (Led by Chao Liu, Jing Zhang). We co-designed many advanced GPU optimizations specialized for each platform, and nothing is possible without our close collaboration.
+AITemplate team works deeply with NVIDIA [CUTLASS](https://github.com/NVIDIA/cutlass) Team (led by Andrew Kerr, Haicheng Wu) and AMD [Composable Kernel](https://github.com/ROCmSoftwarePlatform/composable_kernel) Team (led by Chao Liu, Jing Zhang). We co-designed many advanced GPU optimizations specialized for each platform, and nothing is possible without our close collaboration.


## License

AITemplate is licensed under the [Apache 2.0 License](https://github.com/facebookincubator/AITemplate/blob/main/LICENSE).
13 changes: 9 additions & 4 deletions docs/source/arch/philosophy.rst
@@ -5,12 +5,17 @@ Design Philosophy
KISS (Keep it simple and stupid)
--------------------------------

-AITemplate avoids deep IR lowering stacks to reduce the system's complexity. A highly modularized, multiple backend codegen system written in pure Python directly attacks the pain point in high-performance GPU inference.
+AITemplate avoids deep IR lowering stacks to reduce the system's complexity.
+A highly modularized, multiple backend codegen system written in pure Python directly attacks the pain point in high-performance GPU inference.

Pragmatism
----------

-AITemplate provides a PyTorch-style frontend to enable engineers to manually match the PyTorch model & weights to AITemplate for optimization. Using it is less painful than debugging different lowering IR stacks, especially for complex models such as MaskRCNN.
+AITemplate provides a PyTorch-style frontend to enable engineers to manually match the PyTorch model & weights to AITemplate for optimization.
+Using it is less painful than debugging different lowering IR stacks, especially for complex models such as MaskRCNN.


-We believe most of the neural network workload can be decoupled. For example, most of the network can be decoupled into Encoder, Decoder, and Decoder logics. For encoder and decoder, it is a computation bounded problem. For decoder logic, it may involve more control flows. By using divide and conquer, we left the decoder logic part to C++ or Python rather than build a unified language / IR stack to play as the silver bullet.
+We believe most of the neural network workload can be decoupled.
+For example, most of the network can be decoupled into Encoder, Decoder, and Decoder logics.
+For encoder and decoder, it is a computation-bounded problem.
+For decoder logic, it may involve more control flows.
+By using divide and conquer, we left the decoder logic part to C++ or Python rather than build a unified language / IR stack as a silver bullet.
11 changes: 6 additions & 5 deletions docs/source/debughints.rst
@@ -1,14 +1,15 @@
Debug Hints
===========

-AITemplate is a new project under active development. We have a rich test set to avoid bugs but don't be surprised if there is anything unexpected.
+AITemplate is a new project under active development.
+We have a rich test set to avoid bugs but don't be surprised if there is anything unexpected.

-Here are some helpful tips when we learned during the development AITemplate:
+Here are some helpful tips we learned during the development of AITemplate:

-1. Once the codegen for op which requires profiling is changed, remember to delete old profilers (usually located at workdir), and flush the cache by either deleting ~/.aitemplate or setting environment variable FLUSH_PROFILE_CACHE=1
+1. Once the codegen for an op that requires profiling is changed, remember to delete old profilers (usually located at workdir), and flush the cache by either deleting `~/.aitemplate` or setting the environment variable `FLUSH_PROFILE_CACHE=1`.

-2. Check the pseudo code/visualization generated by each optimization pass if some optimization is harmful.
+2. Check the pseudo code/visualization generated by each optimization pass if some optimization behaves in an unexpected way.

3. Always do the numerical test, from small to large, to make sure the entire model is correct.

-4. Try to make the new fusion subgraph work in a manual way, then try to add an automatic pass to rewrite the graph with the fused subgraph.
+4. Try to make the new fusion subgraph work in a manual way, then try to add an automatic pass to rewrite the graph with the fused subgraph.
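For tip 3, a minimal numerical check against PyTorch might look like the following (a sketch: it assumes an AIT engine ``module`` compiled from the same graph as a reference model ``pt_model``, with I/O names ``input0``/``output0``):

.. code-block:: python

    import torch

    x_pt = torch.randn(8, 64, dtype=torch.float16, device="cuda")
    y_ref = pt_model(x_pt)            # reference PyTorch model (assumed)
    y_ait = torch.empty_like(y_ref)
    module.run_with_tensors({"input0": x_pt}, {"output0": y_ait})
    # Loose fp16 tolerances to start with; tighten as correctness firms up.
    torch.testing.assert_close(y_ait, y_ref, rtol=1e-2, atol=1e-2)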
4 changes: 2 additions & 2 deletions docs/source/index.rst
@@ -1,11 +1,11 @@

AITemplate Documentation
-======================================
+========================

AITemplate (AIT) is a Python framework that transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving. AITemplate highlights include:

* High performance: close to roofline fp16 TensorCore (NVIDIA GPU) / MatrixCore (AMD GPU) performance on major models, including ResNet, MaskRCNN, BERT, VisionTransformer, Stable Diffusion, etc.
-* Unified, open, and flexible. Seamless fp16 deep neural network models for NVIDIA GPU or AMD GPU. Fully open source, Lego-style easy extendable high-performance primitives for new model support. Supports a significantly more comprehensive range of fusions than existing solutions for both GPU platforms.
+* Unified, open, and flexible. Seamless fp16 deep neural network models for NVIDIA GPU or AMD GPU. Fully open source, Lego-style easily extendable high-performance primitives for new model support. Supports a significantly more comprehensive range of fusions than existing solutions for both GPU platforms.


.. toctree::
31 changes: 15 additions & 16 deletions docs/source/install/index.rst
@@ -7,11 +7,11 @@ Using Docker
The easiest way to get started is to use Docker. Using Docker helps avoid performance regressions caused by an incorrect version of NVCC or HIPCC.
To use docker, we provide a bash script to build the docker image.

-- CUDA:
+- CUDA:
.. code-block:: bash
./docker/build.sh cuda
-- ROCM:
+- ROCM:
.. code-block:: bash
DOCKER_BUILDKIT=1 ./docker/build.sh rocm
@@ -31,13 +31,13 @@ To launch the docker container
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ait:latest
-AITemplate will be installed in as a Python package to Python 3.8. There will be also a copy of source code and examples at `/AITemplate`
+AITemplate will be installed as a Python package in Python 3.8. There will also be a copy of the source code and examples at `/AITemplate`.


-Install as standard Python package
-----------------------------------
+Installing as a Standard Python Package
+---------------------------------------

-Before start installing AITemplate, first make sure you have correct hardware and software environment.
+Before installing AITemplate, first make sure you have the correct hardware and software environment.

- Hardware
- NVIDIA: AIT is only tested on SM80+ GPUs (Ampere etc).
@@ -52,24 +52,23 @@ Before start installing AITemplate, first make sure you have correct hardware an
- AMD: ROCm 5.2, with HIPCC 10736 (commit `b0f4678b9058a4ae00200dfb1de0da5f2ea84dcb`)

.. warning::
-- Incorrect compiler version will lead performance regression.
-- Instruction for building HIPCC 10736 can be founded in `docker/Dockerfile.rocm`
+- Incorrect compiler version may lead to performance regression.
+- Instructions for building HIPCC 10736 can be found in `docker/Dockerfile.rocm`.


-When clone the code, please use the following command to clone the submodules:
-```
-git clone --recursive https://github.com/facebookincubator/AITemplate
-```
+When cloning the code, please use the following command to clone the submodules:
+
+.. code-block:: bash
+
+   git clone --recursive https://github.com/facebookincubator/AITemplate
.. warning::
-Please check all submodules are cloned correctly before go to next step.
+Please check that all submodules are cloned correctly before the next step.

-Then build Python wheel package and install.
+Then build the Python wheel package and install it:

.. code-block:: bash
cd python
python setup.py bdist_wheel
pip install dist/aitemplate-0.0.1-py3-none-any.whl
9 changes: 5 additions & 4 deletions docs/source/reference/env.rst
@@ -1,6 +1,7 @@
Environment Variables
=====================
-AITemplate uses environment variables to control the behavior of codegen and profiling. All the environment variables used in AITemplate are listed here.
+AITemplate uses environment variables to control the behavior of codegen and profiling.
+The environment variables used in AITemplate are listed here.

Codegen
-------
@@ -11,7 +12,7 @@ Codegen

**AIT_NDEBUG**: If set to "1", compile with `NDEBUG`, disabling debug assertions. Recommended for production builds. "1" by default.

-**AIT_COMPILER_OPT**: The optimization level for a compiler, which is directly passed to the host compiler command line. AITemplate host code may be very light in certain cases, so there is nothing to optimize for a host compiler. Thus, there is no need to make host compiler to do time costly optimizations. It may be very useful to use "-O0" value for debugging GPU kernels. "-O3" by default.
+**AIT_COMPILER_OPT**: The optimization level for a compiler, which is directly passed to the host compiler command line. AITemplate host code may be very light in certain cases, so there is nothing to optimize for a host compiler. Thus, there is no need to make the host compiler perform time-costly optimizations. The "-O0" value may be very useful for debugging GPU kernels. "-O3" by default.
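For example, to favor debuggability of GPU kernels over host-side optimization, one might set (a sketch; these variables are read at compile time, so set them before building the model):

.. code-block:: python

    import os

    os.environ["AIT_NDEBUG"] = "0"          # keep debug assertions
    os.environ["AIT_COMPILER_OPT"] = "-O0"  # skip host-compiler optimization
    # ... then call compile_model(...) as usual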

Profiling
---------
@@ -31,11 +32,11 @@
OSS CI
------

-**CI_FLAG**: It is set to "CIRCLECI" in OSS CI to indicate we're in OSS CI environment. The behavior of the profiler and codegen is different in CI to speed up testing. Profiling itself for gemm/conv ops is disabled in CI. But we still compiles two random profilers to make sure the profiler codegen is not broken.
+**CI_FLAG**: It is set to "CIRCLECI" in OSS CI to indicate we're in an OSS CI environment. The behavior of the profiler and codegen is different in CI to speed up testing. Profiling itself for gemm/conv ops is disabled in CI. But we still compile two random profilers to make sure the profiler codegen is not broken.

**AIT_BUILD_DOCS**: If set to "1", it will create a fake CUDA target to enable doc building in Github Actions.

Miscellaneous
-------------

-**LOGLEVEL**: It is used to control the logging level in python. It's default to "INFO". "DEBUG" is useful for debugging.
+**LOGLEVEL**: It is used to control the logging level in Python. The default value is "INFO". "DEBUG" is useful for debugging.
