Commit bcb3212

hiworldwzj and wangzaijun authored
remove the support for triton==2.0.0 (#395)
Co-authored-by: wangzaijun <wangzaijun@sensetime.com>
1 parent 390ac96 commit bcb3212

15 files changed, +1040 −1516 lines

README.md

+9 −7
@@ -112,17 +112,19 @@ You can use the official Docker container to run the model more easily. To do th
 python setup.py install
 ~~~

-The code has been tested on a range of GPUs including A100, A800, 4090, and H800. If you are running the code on A100, A800, etc., we recommend using triton==2.1.0 or triton==2.0.0.dev20221202. If you are running the code on H800, etc., it is necessary to compile and install the source code of [triton==2.1.0](https://github.com/openai/triton/tree/main) from the GitHub repository. If the code doesn't work on other GPUs, try modifying the triton kernel used in model inference.
-- Install Triton Package

-use triton==2.1.0 (Better performance, but the code is under continuous development and may be unstable.)
+
+- Install Triton Package
+
+The code has been tested on a range of GPUs including V100, A100, A800, 4090, and H800. If you are running the code on A100, A800, etc., we recommend using triton==2.1.0.
+
 ~~~shell
-pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly
+pip install triton==2.1.0 --no-deps
 ~~~
-
-use triton==2.0.0.dev20221202 (This version has a memory leak bug. Refer to the [issue #209](https://github.com/ModelTC/lightllm/issues/209) for the fix method.)
+If you are running the code on H800 or V100, we recommend using triton-nightly. Note that triton-nightly has a significant CPU bottleneck, leading to high decode latency at low concurrency levels; see [this issue](https://github.com/openai/triton/issues/3619) and the [fix PR](https://github.com/openai/triton/pull/3638). You can try modifying and compiling the
+source code yourself to resolve this issue.
 ~~~shell
-pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps
 ~~~

 ### RUN LLaMA

lightllm/models/bloom/layer_infer/transformer_layer_infer.py

+15 −31
@@ -63,37 +63,21 @@ def _context_attention_kernel(
         self, q, kv, infer_state: InferStateInfo, layer_weight: BloomTransformerLayerWeight, out=None
     ) -> torch.Tensor:
         o_tensor = torch.empty_like(q) if out is None else out
-        import triton
-        if triton.__version__ >= "2.1.0":
-            context_attention_fwd(
-                q.view(-1, self.tp_q_head_num_, self.head_dim_),
-                infer_state.mem_manager.kv_buffer[self.layer_num_][:, 0 : self.tp_k_head_num_, :],
-                infer_state.mem_manager.kv_buffer[self.layer_num_][
-                    :, self.tp_k_head_num_ : self.tp_k_head_num_ + self.tp_v_head_num_, :
-                ],
-                o_tensor.view(-1, self.tp_q_head_num_, self.head_dim_),
-                infer_state.b_req_idx,
-                layer_weight.tp_alibi,
-                infer_state.b_start_loc,
-                infer_state.b_seq_len,
-                infer_state.b_ready_cache_len,
-                infer_state.max_len_in_batch,
-                infer_state.req_manager.req_to_token_indexs,
-            )
-        elif triton.__version__ == "2.0.0":
-            context_attention_fwd(
-                q.view(-1, self.tp_q_head_num_, self.head_dim_),
-                kv[:, 0 : self.tp_k_head_num_, :],
-                kv[:, self.tp_k_head_num_ : self.tp_k_head_num_ + self.tp_v_head_num_, :],
-                o_tensor.view(-1, self.tp_q_head_num_, self.head_dim_),
-                layer_weight.tp_alibi,
-                infer_state.b_start_loc,
-                infer_state.b_seq_len,
-                infer_state.max_len_in_batch,
-            )
-        else:
-            assert False
-
+        context_attention_fwd(
+            q.view(-1, self.tp_q_head_num_, self.head_dim_),
+            infer_state.mem_manager.kv_buffer[self.layer_num_][:, 0 : self.tp_k_head_num_, :],
+            infer_state.mem_manager.kv_buffer[self.layer_num_][
+                :, self.tp_k_head_num_ : self.tp_k_head_num_ + self.tp_v_head_num_, :
+            ],
+            o_tensor.view(-1, self.tp_q_head_num_, self.head_dim_),
+            infer_state.b_req_idx,
+            layer_weight.tp_alibi,
+            infer_state.b_start_loc,
+            infer_state.b_seq_len,
+            infer_state.b_ready_cache_len,
+            infer_state.max_len_in_batch,
+            infer_state.req_manager.req_to_token_indexs,
+        )
         return o_tensor

     def _token_attention_kernel(
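For readers scanning the diff: the retained call reads K and V out of a single fused per-layer cache, `infer_state.mem_manager.kv_buffer[self.layer_num_]`, by slicing along the head dimension. The sketch below is illustrative only; the shapes and variable names are hypothetical and stand in for whatever the real buffer holds, but the slicing mirrors what the diff does.

~~~python
# Illustrative sketch of the K/V slicing used in the retained context_attention_fwd call.
# K heads occupy [0 : tp_k_head_num) and V heads occupy
# [tp_k_head_num : tp_k_head_num + tp_v_head_num) along dim 1 of the fused buffer.
import torch

num_tokens, tp_k_head_num, tp_v_head_num, head_dim = 16, 8, 8, 128  # hypothetical sizes
kv_buffer_layer = torch.randn(num_tokens, tp_k_head_num + tp_v_head_num, head_dim)

k = kv_buffer_layer[:, 0:tp_k_head_num, :]                              # K slice
v = kv_buffer_layer[:, tp_k_head_num:tp_k_head_num + tp_v_head_num, :]  # V slice

assert k.shape == (num_tokens, tp_k_head_num, head_dim)
assert v.shape == (num_tokens, tp_v_head_num, head_dim)
~~~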
