When I tried to use the bidirectional LSTM operator, I found some bugs and fixed part of them. I've created gdh1995@9ae1832 to show my patch; the bugs I can't fix are also described in this issue.

Those fixed issues

Below are the detailed problems in https://github.com/VeriSilicon/TIM-VX/blob/f8741b4704ab9caffd49f6fc15f72616962ba1d1/src/tim/vx/internal/src/ops/vsi_nn_op_bidirectional_sequence_lstm.c :
- when the passed inputs[BI_LSTM_BW_INPUT_H_STATE] is empty, it copies the shape from outputs[BI_LSTM_BW_OUTPUT_OUTPUT], but the latter may also be empty when curr_param->merge_outputs is requested
- it doesn't support empty c_state inputs, while the unidirectional LSTM does
- when it creates new c_state outputs while iterating forwards, it takes the dtype from outputs[BI_LSTM_FW_OUTPUT_OUTPUT], but I think a better source would be inputs[BI_LSTM_FW_INPUT_C_STATE]
  - the same happens while iterating backwards
- it sets lstm_ovxlib.internal_dtype, which should have been lstmunit_ovxlib.internal_dtype
- the order of tensors in lstmcell_reshape_output_tensors_bw is D'C'B'A' when the input words are A, B, C, D
  - in PyTorch the order is still A'B'C'D', so a merged output tensor is [ [A, A'], [B, B'], [C, C'], [D, D'] ] (see the sketch after this list)
  - the above judgment was tested with torch==1.9.0+cu111 and Python 3.7 x64 for Windows
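For reference, here is a minimal PyTorch check of that ordering (the sizes and variable names are only an illustration, not my actual test case):

```python
import torch

torch.manual_seed(0)
# Tiny bidirectional LSTM over a sequence of 4 "words" A, B, C, D.
lstm = torch.nn.LSTM(input_size=3, hidden_size=2, bidirectional=True)
x = torch.randn(4, 1, 3)   # (seq_len=4, batch=1, features=3)

out, _ = lstm(x)           # out shape: (4, 1, 4) = (seq_len, batch, 2 * hidden)

# For each time step t, out[t] concatenates the forward output for word t
# with the backward output for the SAME word t, i.e.
# [ [A, A'], [B, B'], [C, C'], [D, D'] ]; the backward half is not stored
# in the reversed order D', C', B', A'.
fw, bw = out[..., :2], out[..., 2:]
print(fw.shape, bw.shape)  # torch.Size([4, 1, 2]) torch.Size([4, 1, 2])
```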
Left bugs
When I tested the operator, I noticed that some kernel functions of lstmunit_activation don't work under certain combinations of I/O data types. As far as I've caught, they are:

- On the main branch of TIM-VX and the A311D-6.4.10.2 package (https://github.com/VeriSilicon/TIM-VX/releases/tag/v1.1.42), the evis kernel (src/tim/vx/internal/src/kernel/evis/lstmunit_activation_evis.c) never works for:
  - GEN_LSTMUNIT_STRUCT_ITEMS(0, 0, 0, 0, 0, F16, F16, F16, SIGMOID, S)
  - GEN_LSTMUNIT_STRUCT_ITEMS(0, 0, 0, 0, 0, F16, U8, F16, SIGMOID, S)
  - for example, after I created a graph with only one bidi-LSTM layer using the cl kernel, ran it, and then re-created the graph using the evis kernel, the new result (for an input tensor of the same shape) was exactly the same as the cl kernel's, which suggests the evis kernel never writes the output tensor at all
  - the x86_64_linux version works well
- The cl kernel (src/tim/vx/internal/src/kernel/cl/lstmunit_activation_cl.c) doesn't support F32toU8_F32_SIGMOID:
  - the failing kernel is GEN_LSTMUNIT_STRUCT_ITEMS(0, 0, 0, 0, 0, F32, U8 , F32, SIGMOID, S)
  - the kernel only outputs very small numbers (all items in the uint8-quantized tensor are either 0 or 1; see the sketch after this list)
  - here's my test case:
    - by default, the unidirectional LSTM layer creates float16 tensors as the outputs of the FullConnect nodes
    - I want to quantize the weights and the LSTM input/output tensors to uint8, so the hidden state is also uint8-quantized
    - so lstmunit_activation receives several float16 inputs and is expected to produce a uint8-quantized tensor
    - both the A311D-6.4.10.2 version and the x86_64_linux one give wrong outputs
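To make the "too small" symptom concrete, here is a small NumPy sketch of how I interpret the quantized output (the scale and zero point are made-up illustration values, not the ones from my graph):

```python
import numpy as np

# Hypothetical asymmetric uint8 quantization parameters for the hidden-state
# tensor; sigmoid outputs lie in (0, 1), so a scale around 1/255 is typical.
scale, zero_point = 1.0 / 255, 0

def dequantize(q):
    return scale * (np.asarray(q, dtype=np.float32) - zero_point)

# What the buggy F32toU8 kernel produces: every element is 0 or 1, so the
# dequantized hidden state collapses to values near 0 instead of spreading
# over the expected range.
buggy_q = np.array([0, 1, 0, 1, 1, 0], dtype=np.uint8)
print(dequantize(buggy_q))  # [0.      0.0039  0.      0.0039  0.0039  0.    ]
```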
Here's a summary of the 2 kernel bugs (uint8+fp16 means only the hidden states are uint8; the activation inputs and c_states are still float16):
|       |            | simulator (x86_64_linux): evis | cl        | cpu | A311D: evis | cl        | cpu |
| ----- | ---------- | ------------------------------ | --------- | --- | ----------- | --------- | --- |
| unidi | fp16       | ok                             | ok        | ok  | not-write   | ok        | ok  |
| unidi | uint8+fp16 | ok                             | too-small | ok  | not-write   | too-small | ok  |
| bidi  | fp16       | ok                             | ok        | ok  | not-write   | ok        | ok  |
| bidi  | uint8+fp16 | ok                             | too-small | ok  | not-write   | too-small | ok  |
- A311D + evis: never writes the tensor at all
- A311D + cl:
  - fp16+fp16 (S_F32toF32_F32_SIGMOID): ok
  - uint8+fp16 (S_F32toU8_F32_SIGMOID): value error: all values are too small
Sorry that it took so long to get back to you with an update. I discussed this with the team, and we think the implementation inside "internal" is not robust, so we don't want to maintain it internally anymore.
For bidirectional_sequence_lstm support, we would like to compose it in TIM-VX instead. If that is OK with you, I can share more details about unidirectional_sequence_lstm on GitHub.
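Roughly, the composition would look like the following sketch (PyTorch is used here only to illustrate the data flow, not the actual TIM-VX API; the sizes are made up):

```python
import torch

# Two independent unidirectional LSTMs stand in for two TIM-VX
# unidirectional_sequence_lstm ops; the sizes are arbitrary examples.
fw = torch.nn.LSTM(input_size=3, hidden_size=2)
bw = torch.nn.LSTM(input_size=3, hidden_size=2)

x = torch.randn(4, 1, 3)                      # (seq_len, batch, features): words A, B, C, D
out_fw, _ = fw(x)                             # forward pass over A, B, C, D
out_bw, _ = bw(torch.flip(x, dims=[0]))       # backward pass over D, C, B, A
out_bw = torch.flip(out_bw, dims=[0])         # restore the A', B', C', D' order
merged = torch.cat([out_fw, out_bw], dim=-1)  # [[A, A'], [B, B'], [C, C'], [D, D']]
print(merged.shape)                           # torch.Size([4, 1, 4])
```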
Sorry, my English is not very good and I couldn't quite follow what you meant.
If vsi_nn_op_bidirectional_sequence_lstm.c will not be updated, that doesn't matter to me - I can maintain a fork of TIM-VX in my private workspace and pull in new changes periodically.
I still hope the buggy behavior of the EVIS and CL kernels can be fixed - judging by their names, they should run faster than the CPU versions.
I know little about LSTM variants, so I guess the info about unidirectional_sequence_lstm may be too hard for me. If you want to add a new version of bidi-LSTM, I'm afraid I can't understand it easily, but I'm glad to try it.