diff --git a/README.md b/README.md index 1184b0312..0d6e1686d 100644 --- a/README.md +++ b/README.md @@ -6,11 +6,11 @@ [![Deploy docs to Pages](https://github.com/facebookincubator/AITemplate/actions/workflows/pages.yaml/badge.svg)](https://github.com/facebookincubator/AITemplate/actions/workflows/pages.yaml) - AITemplate (AIT) is a Python framework that transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving. AITemplate highlights include: - High performance: close to roofline fp16 TensorCore (NVIDIA GPU) / MatrixCore (AMD GPU) performance on major models, including ResNet, MaskRCNN, BERT, VisionTransformer, Stable Diffusion, etc. -- Unified, open, and flexible. Seamless fp16 deep neural network models for NVIDIA GPU or AMD GPU. Fully open source, Lego-style easy extendable high-performance primitives for new model support. Supports a significantly more comprehensive range of fusions than existing solutions for both GPU platforms. +- Unified, open, and flexible. Seamless fp16 deep neural network models for NVIDIA GPU or AMD GPU. Fully open source, Lego-style easily extendable high-performance primitives for new model support. Supports a significantly more comprehensive range of fusions than existing solutions for both GPU platforms. + ## More about AITemplate @@ -24,42 +24,48 @@ AITemplate provides unique advanced horizontal fusion. AITemplate can fuse paral ### Vertical Fusion -AITemplate provides strong vertical fusion. AITemplate can fuse a large range of operations into TensorCore/MatrixCore operations, such as elementwise operations, reduction operations, and layout permutation operations. AITemplate also provides back-to-back style TensorCore / MatrixCore operation fusion. +AITemplate provides strong vertical fusion. AITemplate can fuse a large range of operations into TensorCore/MatrixCore operations, such as elementwise operations, reductions, and layout permutations. AITemplate also provides back-to-back style TensorCore / MatrixCore operation fusion. ### Memory Fusion AITemplate provides innovative memory fusions. AITemplate can fuse GEMM, LayerNorm, and other operators, followed by memory operations such as concatenation, split, and slice into a single operator. ### Working w/wo PyTorch + The AITemplate-generated Python runtime can take PyTorch tensors as inputs and outputs without an extra copy. For environments without PyTorch, the AITemplate Python/C++ runtime is self-contained. ### Extensions without suffering AITemplate provides a straightforward approach for making an extension in codegen. To add a new operator or a new fused kernel into AITemplate, most of the time one only needs to add two Python files: one for a graph node definition and another for the backend codegen. The CUDA/HIP kernel in a text header file can be directly utilized in the codegen. + ## FX2AIT + FX2AIT is a Python-based tool that converts PyTorch models into AITemplate (AIT) engine for lightning-fast inference serving. Using FX2AIT's built-in AITLowerer, partial AIT acceleration can be achieved for models with unsupported operators in AITemplate. Key features of FX2AIT include: * Easy Conversion: FX2AIT requires only a PyTorch model and input for conversion, generating an "AITModule" output for inference serving. -* Expanded Support: AITemplate does not support all PyTorch operators. FX2AIT's AITLowerer offers a solution for partial AIT conversion for models with unsupported operators. 
Check the example/03_lowering_split for more information. +* Expanded Support: AITemplate does not support all PyTorch operators. FX2AIT's AITLowerer offers a solution for partial AIT conversion for models with unsupported operators. Check the `fx2ait/fx2ait/example/03_lowering_split` for more information. More info can be found from https://github.com/facebookincubator/AITemplate/tree/main/fx2ait. + ## Installation -**Hardware requirement:** +**Hardware requirements:** - **NVIDIA**: AIT is only tested on SM80+ GPUs (Ampere etc). Not all kernels work with old SM75/SM70 (T4/V100) GPUs. - **AMD**: AIT is only tested on CDNA2 (MI-210/250) GPUs. There may be compiler issues for old CDNA1 (MI-100) GPUs. -## Clone the code +### Clone the code + When cloning the code, please use the following command to also clone the submodules: ``` git clone --recursive https://github.com/facebookincubator/AITemplate ``` ### Docker Image + We highly recommend using AITemplate with Docker to avoid accidentally using a wrong version of NVCC or HIPCC. - CUDA: `./docker/build.sh cuda` - ROCM: `DOCKER_BUILDKIT=1 ./docker/build.sh rocm` @@ -67,6 +73,7 @@ We highly recommend using AITemplate with Docker to avoid accidentally using a w This will build a docker image with tag `ait:latest`. ### From Source + The following command will create a Python wheel for AITemplate. Please ensure you have correct CUDA/ROCm compiler installed. - CUDA: CUDA 11.6 - ROCm: We tested on ROCm 5.2.3 with a customized build HIPCC with the command in docker/Dockerfile.rocm#L87-L96 @@ -93,7 +100,8 @@ There are a few tutorials for onboarding: ## Examples & Performance -AITemplate provides the following model templates & reference performance data on A100/MI-250 + +AITemplate provides the following model templates & reference performance data on A100/MI-250: - [01_ResNet-50](examples/01_resnet-50/) with PyTorch Image Models (TIMM) - [02_MaskRCNN-FPN](examples/02_detectron2/) with Detectron2 @@ -117,21 +125,23 @@ Long-term plan: - Composable Kernel CPU extension on AVX2/AVX-512 for AMD Epyc CPU. ## Contributing + Check our [contributing guide](CONTRIBUTING.md) to learn about how to contribute to the project. ## The Team AITemplate is currently maintained by Meta engineers: [Ying Zhang](https://github.com/ipiszy), [Yang Chen](https://github.com/chenyang78), [Terry Chen](https://github.com/terrychenism), [Mu-Chu Lee](https://github.com/muchulee8), [Max Podkorytov](https://github.com/tenpercent), [Adnan Akhundov](https://github.com/aakhundov). -AITemplate is co-created by Meta engineers: [Bing Xu](https://github.com/antinucleon), [Ying Zhang](https://github.com/ipiszy), [Hao Lu](https://github.com/hlu1), [Yang Chen](https://github.com/chenyang78), and [Terry Chen](https://github.com/terrychenism), with major contributions coming from more talented engineers. A non-exhaustive list to mention is Mike Iovine, Mu-Chu Lee, Scott Wolchok, Oleg Khabinov, Shirong Wu, Huaming Li, Hui Guo, Zhijing Li, Max Podkorytov. We also want to thank the discussions with Andrew Tulloch, Yinghai Lu, Lu Fang. +AITemplate is co-created by Meta engineers: [Bing Xu](https://github.com/antinucleon), [Ying Zhang](https://github.com/ipiszy), [Hao Lu](https://github.com/hlu1), [Yang Chen](https://github.com/chenyang78), and [Terry Chen](https://github.com/terrychenism), with major contributions coming from more talented engineers. 
A non-exhaustive list to mention is Mike Iovine, Mu-Chu Lee, Scott Wolchok, Oleg Khabinov, Shirong Wu, Huaming Li, Hui Guo, Zhijing Li, Max Podkorytov. We also want to thank Andrew Tulloch, Yinghai Lu, Lu Fang for the valuable discussions. FX2AIT and Aten2AIT are co-created and maintained by Meta engineers: [Wei Wei](https://github.com/frank-wei), [Shirong Wu](https://github.com/wushirong) and [Zhijing Li](https://github.com/tissue3). -## Acknowledgement +## Acknowledgements -AITemplate team works deeply with NVIDIA [CUTLASS](https://github.com/NVIDIA/cutlass) Team (Led by Andrew Kerr, Haicheng Wu) and AMD [Composable Kernel](https://github.com/ROCmSoftwarePlatform/composable_kernel) Team (Led by Chao Liu, Jing Zhang). We co-designed many advanced GPU optimizations specialized for each platform, and nothing is possible without our close collaboration. +AITemplate team works deeply with NVIDIA [CUTLASS](https://github.com/NVIDIA/cutlass) Team (led by Andrew Kerr, Haicheng Wu) and AMD [Composable Kernel](https://github.com/ROCmSoftwarePlatform/composable_kernel) Team (led by Chao Liu, Jing Zhang). We co-designed many advanced GPU optimizations specialized for each platform, and nothing is possible without our close collaboration. ## License + AITemplate is licensed under the [Apache 2.0 License](https://github.com/facebookincubator/AITemplate/blob/main/LICENSE). diff --git a/docs/source/arch/philosophy.rst b/docs/source/arch/philosophy.rst index 2eefb8f5d..d1ac35db4 100644 --- a/docs/source/arch/philosophy.rst +++ b/docs/source/arch/philosophy.rst @@ -5,12 +5,17 @@ Design Philosophy KISS (Keep it simple and stupid) -------------------------------- -AITemplate avoids deep IR lowering stacks to reduce the system's complexity. A highly modularized, multiple backend codegen system written in pure Python directly attacks the pain point in high-performance GPU inference. +AITemplate avoids deep IR lowering stacks to reduce the system's complexity. +A highly modularized, multiple backend codegen system written in pure Python directly attacks the pain point in high-performance GPU inference. Pragmatism ---------- -AITemplate provides a PyTorch-style frontend to enable engineers to manually match the PyTorch model & weights to AITemplate for optimization. Using it is less painful than debugging different lowering IR stacks, especially for complex models such as MaskRCNN. +AITemplate provides a PyTorch-style frontend to enable engineers to manually match the PyTorch model & weights to AITemplate for optimization. +Using it is less painful than debugging different lowering IR stacks, especially for complex models such as MaskRCNN. - -We believe most of the neural network workload can be decoupled. For example, most of the network can be decoupled into Encoder, Decoder, and Decoder logics. For encoder and decoder, it is a computation bounded problem. For decoder logic, it may involve more control flows. By using divide and conquer, we left the decoder logic part to C++ or Python rather than build a unified language / IR stack to play as the silver bullet. \ No newline at end of file +We believe most of the neural network workload can be decoupled. +For example, most of the network can be decoupled into Encoder, Decoder, and Decoder logics. +For encoder and decoder, it is a computation-bounded problem. +For decoder logic, it may involve more control flows. +By using divide and conquer, we left the decoder logic part to C++ or Python rather than build a unified language / IR stack as a silver bullet. 
diff --git a/docs/source/debughints.rst b/docs/source/debughints.rst index 074254a75..0bd07d3c1 100644 --- a/docs/source/debughints.rst +++ b/docs/source/debughints.rst @@ -1,14 +1,15 @@ Debug Hints =========== -AITemplate is a new project under active development. We have a rich test set to avoid bugs but don't be surprised if there is anything unexpected. +AITemplate is a new project under active development. +We have a rich test set to avoid bugs but don't be surprised if there is anything unexpected. -Here are some helpful tips when we learned during the development AITemplate: +Here are some helpful tips we learned during the development of AITemplate: -1. Once the codegen for op which requires profiling is changed, remember to delete old profilers (usually located at workdir), and flush the cache by either deleting ~/.aitemplate or setting environment variable FLUSH_PROFILE_CACHE=1 +1. Once the codegen for an op that requires profiling is changed, remember to delete old profilers (usually located at workdir), and flush the cache by either deleting `~/.aitemplate` or setting the environment variable `FLUSH_PROFILE_CACHE=1`. -2. Check the pseudo code/visualization generated by each optimization pass if some optimization is harmful. +2. Check the pseudo code/visualization generated by each optimization pass if some optimization behaves in an unexpected way. 3. Always do the numerical test, from small to large, to make sure the entire model is correct. -4. Try to make the new fusion subgraph work in a manual way, then try to add an automatic pass to rewrite the graph with the fused subgraph. \ No newline at end of file +4. Try to make the new fusion subgraph work in a manual way, then try to add an automatic pass to rewrite the graph with the fused subgraph. diff --git a/docs/source/index.rst b/docs/source/index.rst index 775d33792..9dbcdcc9a 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,11 +1,11 @@ AITemplate Documentation -====================================== +======================== AITemplate (AIT) is a Python framework that transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving. AITemplate highlights include: * High performance: close to roofline fp16 TensorCore (NVIDIA GPU) / MatrixCore (AMD GPU) performance on major models, including ResNet, MaskRCNN, BERT, VisionTransformer, Stable Diffusion, etc. -* Unified, open, and flexible. Seamless fp16 deep neural network models for NVIDIA GPU or AMD GPU. Fully open source, Lego-style easy extendable high-performance primitives for new model support. Supports a significantly more comprehensive range of fusions than existing solutions for both GPU platforms. +* Unified, open, and flexible. Seamless fp16 deep neural network models for NVIDIA GPU or AMD GPU. Fully open source, Lego-style easily extendable high-performance primitives for new model support. Supports a significantly more comprehensive range of fusions than existing solutions for both GPU platforms. .. toctree:: diff --git a/docs/source/install/index.rst b/docs/source/install/index.rst index 48244cfa7..2528bd036 100644 --- a/docs/source/install/index.rst +++ b/docs/source/install/index.rst @@ -7,11 +7,11 @@ Using Docker The easiest way to get started is to use Docker. Using docker is able to avoid performance regression caused by incorrect version of NVCC and HIPCC. To use docker, we provide a bash script to build the docker image. -- CUDA: +- CUDA: ..
code-block:: bash ./docker/build.sh cuda -- ROCM: +- ROCM: .. code-block:: bash DOCKER_BUILDKIT=1 ./docker/build.sh rocm @@ -31,13 +31,13 @@ To launch the docker container docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ait:latest -AITemplate will be installed in as a Python package to Python 3.8. There will be also a copy of source code and examples at `/AITemplate` +AITemplate will be installed as a Python package in Python 3.8. There will also be a copy of the source code and examples at `/AITemplate`. -Install as standard Python package ----------------------------------- +Installing as a Standard Python Package +--------------------------------------- -Before start installing AITemplate, first make sure you have correct hardware and software environment. +Before installing AITemplate, first make sure you have the correct hardware and software environment. - Hardware - NVIDIA: AIT is only tested on SM80+ GPUs (Ampere etc). @@ -52,24 +52,23 @@ Before start installing AITemplate, first make sure you have correct hardware an - AMD: ROCm 5.2, with HIPCC 10736 (commit `b0f4678b9058a4ae00200dfb1de0da5f2ea84dcb`) .. warning:: - - Incorrect compiler version will lead performance regression. - - Instruction for building HIPCC 10736 can be founded in `docker/Dockerfile.rocm` + - An incorrect compiler version may lead to performance regression. + - Instructions for building HIPCC 10736 can be found in `docker/Dockerfile.rocm`. -When clone the code, please use the following command to clone the submodules: -``` -git clone --recursive https://github.com/facebookincubator/AITemplate -``` +When cloning the code, please use the following command to clone the submodules: + + .. code-block:: bash + + git clone --recursive https://github.com/facebookincubator/AITemplate .. warning:: - Please check all submodules are cloned correctly before go to next step. + Please check that all submodules are cloned correctly before the next step. -Then build Python wheel package and install. +Then build the Python wheel package and install it: .. code-block:: bash cd python python setup.py bdist_wheel pip install dist/aitemplate-0.0.1-py3-none-any.whl - - diff --git a/docs/source/reference/env.rst b/docs/source/reference/env.rst index 50847a6ae..9e9f7769a 100644 --- a/docs/source/reference/env.rst +++ b/docs/source/reference/env.rst @@ -1,6 +1,7 @@ Environment Variables ===================== -AITemplate uses environment variables to control the behavior of codegen and profiling. All the environment variables used in AITemplate are listed here. +AITemplate uses environment variables to control the behavior of codegen and profiling. +The environment variables used in AITemplate are listed here. Codegen ------- @@ -11,7 +12,7 @@ Codegen **AIT_NDEBUG**: If set to "1", compile with `NDEBUG`, disabling debug assertions. Recommended for production builds. "1" by default. -**AIT_COMPILER_OPT**: The optimization level for a compiler, which is directly passed to the host compiler command line. AITemplate host code may be very light in certain cases, so there is nothing to optimize for a host compiler. Thus, there is no need to make host compiler to do time costly optimizations. It may be very useful to use "-O0" value for debugging GPU kernels. "-O3" by default. +**AIT_COMPILER_OPT**: The optimization level for a compiler, which is directly passed to the host compiler command line.
AITemplate host code may be very light in certain cases, so there is nothing to optimize for a host compiler. Thus, there is no need to make host compiler perform time costly optimizations. It may be very useful to use "-O0" value for debugging GPU kernels. "-O3" by default. Profiling --------- @@ -31,11 +32,11 @@ Profiling OSS CI ------ -**CI_FLAG**: It is set to "CIRCLECI" in OSS CI to indicate we're in OSS CI environment. The behavior of the profiler and codegen is different in CI to speed up testing. Profiling itself for gemm/conv ops is disabled in CI. But we still compiles two random profilers to make sure the profiler codegen is not broken. +**CI_FLAG**: It is set to "CIRCLECI" in OSS CI to indicate we're in OSS CI environment. The behavior of the profiler and codegen is different in CI to speed up testing. Profiling itself for gemm/conv ops is disabled in CI. But we still compile two random profilers to make sure the profiler codegen is not broken. **AIT_BUILD_DOCS**: If set to "1", it will create a fake CUDA target to enable doc building in Github Actions. Miscellaneous ------------- -**LOGLEVEL**: It is used to control the logging level in python. It's default to "INFO". "DEBUG" is useful for debugging. +**LOGLEVEL**: It is used to control the logging level in Python. The default value is "INFO". "DEBUG" is useful for debugging. diff --git a/docs/source/runtime/cxx_design.rst b/docs/source/runtime/cxx_design.rst index 5ef18f889..d4608409f 100644 --- a/docs/source/runtime/cxx_design.rst +++ b/docs/source/runtime/cxx_design.rst @@ -1,29 +1,30 @@ -================== +================ C++ Runtime Note -================== +================ `Model` v.s. `ModelContainer` -============================== +============================= -These are the two main classes involved in the C++ runtime implementation. +These are the two main classes involved in the C++ runtime implementation: -* The bulk of the runtime implementation is in `Model`. -* `ModelContainer` stores a set of shared constants and a collection of `Model`s. Almost all functions in `model_interface.h` forward to a method on `ModelContainer`. When `Run` is invoked, `ModelContainer` looks for an available `Model`, or blocks until one is available (see the section on asynchronous predictions). It then forwards the run request to the runtime. +* The bulk of the runtime implementation is in the `Model` class. +* The `ModelContainer` class stores a set of shared constants and a collection of `Model` instances. Almost all functions in `model_interface.h` forward to a method in `ModelContainer`. When `Run` is invoked, `ModelContainer` looks for an available `Model`, or blocks until one becomes available (see the section on asynchronous predictions). It then forwards the run request to the runtime. Code Structure ============== Some important files: -1. `include/model_interface.h`: The interface that we expose in the compiled .so +1. `include/model_interface.h`: The interface that we expose in the compiled `.so`. 2. `include/model_container.h`: The bulk of the `ModelContainer` implementation. Some files are generated at compile time. These include: -* `model-generated.h`: The implementation for `Model`. -* `model_container_base.cu`: A small part of the implementation for `ModelContainer` needs to be codegened. So `ModelContainer` inherits from `ModelContainerBase`, and `ModelContainerBase`'s implementation lives in this file. See `model_container.h` for more details. +* `model-generated.h`: The implementation of the `Model`. 
+* `model_container_base.cu`: A small part of the implementation for `ModelContainer` that needs to be generated. `ModelContainer` inherits from `ModelContainerBase`, and `ModelContainerBase`'s implementation lives in this file. See `model_container.h` for more details. -All codegen templates can be found in `backend/main_templates.py`. The codegen implementation is in `backend/codegen.py`. - -Note that many of the headers in this directory rely on generated code and thus cannot be `#include`d in external projects. The exception is `model_interface.h`. +All codegen templates can be found in `backend/main_templates.py`. +The codegen implementation is in `backend/codegen.py`. +Note that many of the headers in this directory rely on generated code and thus cannot be `#include`-d in external projects. +`model_interface.h` is an exception. diff --git a/docs/source/runtime/py_design.rst b/docs/source/runtime/py_design.rst index c143123de..5c9d630e0 100644 --- a/docs/source/runtime/py_design.rst +++ b/docs/source/runtime/py_design.rst @@ -1,6 +1,6 @@ -===================== +=================== Python Runtime Note -===================== +=================== Python `Model` ============== @@ -16,7 +16,7 @@ This class represents a contiguous blob of memory that AIT will use as a tensor. * `shape: List[int]`: The shape of the tensor. * `dtype: str`: The tensor's dtype; one of `"float32", "float16", "int32", "int64"`. Note that most ops only support float16 at this stage. -If using AITemplate with PyTorch, `AITData`s can be constructed with the `torch_to_ait_data` utility: +When using AITemplate with PyTorch, `AITData` can be constructed with the `torch_to_ait_data` utility: .. code-block:: python @@ -30,7 +30,7 @@ If PyTorch is not available, `Model` provides a set of functions for copying, al `run` ----- -`run` takes a set of inputs and outputs as `AITData`s. Both arguments can be passed as either an ordered list or a dictionary (mapping name to tensor). +`run` takes inputs and outputs as collections of `AITData` instances. Both arguments can be passed as either an ordered list or a dictionary (mapping name to tensor). .. code-block:: python @@ -55,9 +55,9 @@ If PyTorch is not available, `Model` provides a set of functions for copying, al outputs[output_name_to_idx[name]] = ait_outputs[name] module.run(inputs, outputs) - -One important caveat is that the output must be its **maximum** size. This is because of dynamic shapes - the size of the output may vary, but its shape is not inferred until inference time. The maximum shape can be queried with the `get_output_maximum_shape()`: + +One important caveat is that the output must have the **maximum** possible size. This is because of dynamic shapes: the size of the output may vary, but its shape is not inferred until inference time. The maximum shape can be queried with `get_output_maximum_shape()`: .. code-block:: python @@ -67,7 +67,7 @@ One important caveat is that the output must be its **maximum** size. This is be max_shape = module.get_output_maximum_shape("output") -`Model.run` returns a dictionary of output `AITData`s with (possibly dynamic) shapes that the runtime inferred. +`Model.run` returns a dictionary of output `AITData` instances with (possibly dynamic) shapes that the runtime inferred. Nullptr Inputs/Outputs ---------------------- @@ -102,7 +102,7 @@ Constants are read-only and *shared* with all runtimes in the `ModelContainer`.
`run_with_tensors` ------------------ -`run_with_tensors` is a convenience method with the same interface as `run`, except it can take lists of `torch.Tensor`s: +`run_with_tensors` is a convenience method with the same interface as `run`, except it can take lists (or dicts) of `torch.Tensor` instances: .. code-block:: python @@ -115,9 +115,14 @@ Constants are read-only and *shared* with all runtimes in the `ModelContainer`. Streams and Asynchronous Predictions ------------------------------------ -A pointer to a stream can optionally be passed to `run`. If none is given, the prediction happens on the default stream 0. If the `sync` argument is set to `True`, the stream is synchronized before `run()` returns. `sync` is `True` by default. +A pointer to a stream can optionally be passed to `run`. +If none is given, the prediction happens on the default stream 0. +If the `sync` argument is set to `True`, the stream is synchronized before `run()` returns. +`sync` is `True` by default. -Multiple predictions can happen at the same time (on the same or different streams). Under the hood, there is a fixed-size pool of runtime objects. When all the runtimes are used, `run()` blocks until one is available. +Multiple predictions can happen at the same time (on the same or different streams). +Under the hood, there is a fixed-size pool of runtime objects. +When all the runtimes are used, `run()` blocks until one becomes available. The size of this pool can be configured with the `num_runtimes` option in `Model`'s constructor. CUDA Graph diff --git a/docs/source/tutorial/how_to_add_op.rst b/docs/source/tutorial/how_to_add_op.rst index 160745336..988f5375e 100644 --- a/docs/source/tutorial/how_to_add_op.rst +++ b/docs/source/tutorial/how_to_add_op.rst @@ -1,17 +1,17 @@ How to add an operator to the AIT codegen -========================================= +========================================= This tutorial will demonstrate how to add a new operator to the AIT codegen. -Full source code can be founded at `examples/07_how_to_run_pt_model/how_to_run_pt_model.py` +Full source code can be found at `examples/07_how_to_run_pt_model/how_to_run_pt_model.py`. 0. Prerequisites ------------------ +---------------- -We need to import necessary Python modules +We need to import necessary Python modules: .. code-block:: python - + from typing import Any, Dict, List import jinja2 @@ -26,9 +26,9 @@ We need to import necessary Python modules 1. Define the operator graph node ----------------------------------- +--------------------------------- -Graph node is usually defined at `aitemplate/compiler/ops`. +Graph nodes are usually defined at `aitemplate/compiler/ops`. .. code-block:: python @@ -72,15 +72,15 @@ Graph node is usually defined at `aitemplate/compiler/ops`. .. note:: - `_attrs` in Operator is the most important data structure for codegen. - - `_attrs["op"]` is the identity of operator category, which is used to find the corresponding codegen function in backend, must be **unique**. + - `_attrs["op"]` is the identity of operator category, which is used to find the corresponding codegen function in the backend; must be **unique**. 2. Define the necessary templates for Codegen ----------------------------------------------- +--------------------------------------------- In AIT, there are 4 important templates for codegen: - `FUNC_TEMPLATE`: the template for generating the function body of the operator, and invoke GPU kernel in the body. 
-- `FUNC_SIGNATURE_TEMPLATE`: the template for generating the function signature of the operator. The signature defined name, and arguments of the function. +- `FUNC_SIGNATURE_TEMPLATE`: the template for generating the function signature of the operator. The signature defines the name and arguments of the function. - `FUNC_CALL_TEMPLATE`: the template for generating the function call of the operator. The call will be used during inference to invoke the GPU kernel with given arguments. - `FUNC_DECL`: the template for forward declaration of the operator function. This is usually an alias of `FUNC_SIGNATURE_TEMPLATE`. @@ -128,7 +128,7 @@ In AIT, there are 4 important templates for codegen: ) 3. Create the GPU kernels --------------------------- +------------------------- In this example we use a simplest add one kernel. The kernel can be written by hand (as what programmer is expected to do), or generated by other tools. @@ -166,10 +166,10 @@ In this example we use a simplest add one kernel. The kernel can be written by h ) 4. Define the codegen function -------------------------------- +------------------------------ -The codegen function is the function that render the templates we defined into valid C++ code string. -The codegen function will take `func_attrs` from graph node, and fill into the jinja2 template. +The codegen function is the function that renders the templates we defined into valid C++ code string. +The codegen function will take `func_attrs` from the graph node, and fill in the jinja2 template. .. code-block:: python @@ -213,10 +213,10 @@ The codegen function will take `func_attrs` from graph node, and fill into the j ).strip() ) -5.1 Register the codegen function to CUDA backend ---------------------------------------------------- +5.1 Register the codegen function in CUDA backend +------------------------------------------------- -CUDA backend functions is usually defined at `aitemplate/backend/cuda/`. +CUDA backend functions are usually defined at `aitemplate/backend/cuda/`. .. code-block:: python @@ -240,10 +240,9 @@ CUDA backend functions is usually defined at `aitemplate/backend/cuda/`. return gen_function_call(func_attrs, indent, is_cuda=True) 5.2 (Optional) Register the codegen function to ROCm backend --------------------------------------------------------------- - -ROCm backend functions is usually defined at `aitemplate/backend/rocm/`. +------------------------------------------------------------ +ROCm backend functions are usually defined at `aitemplate/backend/rocm/`. .. code-block:: python @@ -269,7 +268,7 @@ ROCm backend functions is usually defined at `aitemplate/backend/rocm/`. 6. Compile and verify the results with PyTorch ------------------------------------------------- +---------------------------------------------- .. code-block:: python @@ -299,4 +298,3 @@ ROCm backend functions is usually defined at `aitemplate/backend/rocm/`. outputs = {"Y": y} module.run_with_tensors(inputs, outputs) print(torch.allclose(y, y_pt, atol=1e-2, rtol=1e-2)) - diff --git a/docs/source/tutorial/how_to_infer_pt.rst b/docs/source/tutorial/how_to_infer_pt.rst index 67891c46a..8aa68c9c7 100644 --- a/docs/source/tutorial/how_to_infer_pt.rst +++ b/docs/source/tutorial/how_to_infer_pt.rst @@ -1,16 +1,16 @@ How to inference a PyTorch model with AIT -========================================== +========================================= This tutorial will demonstrate how to inference a PyTorch model with AIT. 
-Full source code can be founded at `examples/07_how_to_run_pt_model/how_to_run_pt_model.py` +Full source code can be found at `examples/07_how_to_run_pt_model/how_to_run_pt_model.py`. 0. Prerequisites ------------------- +---------------- -We need to import necessary Python modules +We need to import the necessary Python modules: .. code-block:: python - + from collections import OrderedDict import torch @@ -23,9 +23,9 @@ We need to import necessary Python modules 1. Define a PyTorch module ----------------------------- +-------------------------- -Here we define a PyTorch model which is commonly seen in Transformers. +Here we define a PyTorch model which is commonly seen in Transformers: .. code-block:: python @@ -46,7 +46,7 @@ Here we define a PyTorch model which is commonly seen in Transformers. return hidden_states 2. Define an AIT module -------------------------- +----------------------- We can define a similar AIT module as follows: @@ -69,15 +69,16 @@ We can define a similar AIT module as follows: .. warning:: The `nn.Module` API in AIT looks similar to PyTorch, but it is not the same. - The fundamental difference is that AIT module is a container to build graph, while PyTorch module is a container to store parameters for eager. - Which means, each AIT module's `forward` method can be only called once, and the graph is built during the first call. If you want to share parameters, needs to call `compiler.ops` instead. The `compiler.ops` is similar to `functional` in PyTorch. + The fundamental difference is that an AIT module is a container to build a graph, while a PyTorch module is a container to store parameters for eager execution. + This means each AIT module's `forward` method can only be called once, and the graph is built during the first call. + If you want to share parameters, you need to use `compiler.ops` instead. `compiler.ops` is similar to `functional` in PyTorch. + + AITemplate supports automatic fusion of linear followed by other operators. However, in many cases, especially for quick iterations, we use manual `specialization` to specify the fused operator. For example, `specialization="fast_gelu"` will fuse linear with the `fast_gelu` operator. - AITemplate supports automatically fusion on linear followed by other operators. However in many case especially for quick iterations, we use manual `specialization` to specify the fused operator. For example, `specialization="fast_gelu"` will fuse linear with `fast_gelu` operator. - 3. Define a helper function to map PyTorch parameters to AIT parameters --------------------------------------------------------------------------- +----------------------------------------------------------------------- -In AIT, all names must follow C variable naming standard because the name will be used in codegen process. +In AIT, all names must follow the C variable naming standard, because the names will be used in the codegen process. .. code-block:: python @@ -93,12 +94,12 @@ In AIT, all names must follow C variable naming standard because the name will b .. warning:: - - Different to PyTorch, it is required to call ait_model **.name_parameter_tensor()** method to provide each parameter a name with direct map to PyTorch. - - Because all names in AIT must follow C variable naming standard, you can easier replace `.` to `_` or use a regular expression to make sure the name in valid. - - For network with conv + bn subgraph, we currently haven't provide automatic pass to fold it.
Refer our ResNet and Detectron2 examples to see how we handle CNN layout transform and BatchNorm folding. + - Different from PyTorch, it is required to call the ait_model **.name_parameter_tensor()** method to provide each parameter with a name that maps directly to PyTorch. + - Because all names in AIT must follow the C variable naming standard, you can easily replace `.` by `_` or use a regular expression to make sure the name is valid. + - For networks with a conv + bn subgraph, we currently don't provide an automatic pass to fold it. Please refer to our ResNet and Detectron2 examples to see how we handle CNN layout transform and BatchNorm folding. 4. Create PyTorch module, inputs/outputs ------------------------------------------- +---------------------------------------- .. code-block:: python @@ -115,7 +116,7 @@ In AIT, all names must follow C variable naming standard because the name will b y_pt = pt_model(x) 5. Create AIT module, inputs/outputs --------------------------------------- +------------------------------------ .. code-block:: python @@ -139,12 +140,12 @@ In AIT, all names must follow C variable naming standard because the name will b .. warning:: - Similar to MetaTensor, LazyTensor and a lot of other lazy evaluation frameworks, AIT's Tensor records the computation graph, and the graph is built when the Tensor is compiled. - - For input tensor, it is required to set the attribute **is_input=True** - - For output tensor, it is required to set the attribute **Y._attrs["is_output"] = True** - - For input and output tensors, it is better to provide **name** attributes to use in runtime + - For input tensors, it is required to set the attribute **is_input=True**. + - For output tensors, it is required to set the attribute **Y._attrs["is_output"] = True**. + - For input and output tensors, it is better to provide the **name** attributes to use at runtime. -6. Compile AIT module in to runtime, and do verification ----------------------------------------------------------- +6. Compile AIT module into runtime and do verification +------------------------------------------------------ .. code-block:: python @@ -180,9 +181,9 @@ In AIT, all names must follow C variable naming standard because the name will b print(f"PyTorch eager time: {pt_t} ms/iter") -In this example, AIT will automatically fuse GELU and elementwise add into TensorCore/MatrixCore gemm operation. On RTX-3080 for this example, AIT is about 1.15X fast than PyTorch Eager in this example. +In this example, AIT will automatically fuse GELU and elementwise addition into the TensorCore/MatrixCore gemm operation. On RTX-3080, AIT is about 1.15X faster than PyTorch Eager. .. note:: - - In this example, we fold parameters (weights) into AIT runtime, which the final dynamic library will contains parameters. - - If during compile we don't provide parameters, for example the total parameters size is greater than 2GB, we can always call `set_constant` function in runtime. Check runtime API for details. \ No newline at end of file + - In this example, we fold the parameters (`weights`) into the AIT runtime. The final dynamic library will contain them as parameters. + - If we don't provide the parameters at compile time (for example, because the total parameter size is greater than 2GB), we can always call the `set_constant` function at runtime. Please check the runtime API for details.
diff --git a/docs/source/tutorial/how_to_visualize.rst b/docs/source/tutorial/how_to_visualize.rst index 5af7c89a5..b1d646118 100644 --- a/docs/source/tutorial/how_to_visualize.rst +++ b/docs/source/tutorial/how_to_visualize.rst @@ -1,5 +1,5 @@ How to visualize an AIT model -============================== +============================= Visualization is important for understanding the behavior of a model optimization. In AIT, we modify the codegen a little bit, from generating CUDA/HIP C++ code to HTML/Javascript code, @@ -9,7 +9,7 @@ then we can generate a visualization of the model. The following code will generate a visualization of our first example. 1. Define the AIT Model ------------------------- +----------------------- .. code-block:: python @@ -71,7 +71,7 @@ The following code will generate a visualization of our first example. graph = apply_optimizations(output_tensor) 3. Generate visualization --------------------------- +------------------------- .. code-block:: python @@ -82,4 +82,4 @@ The visualization will be generated in the "ait_model.html" file. This file can .. raw:: html - \ No newline at end of file +