This project implements the optimization techniques proposed in Improving RNN Transducer Modeling for End-to-End Speech Recognition to reduce the memory consumption for computing transducer loss.
HINT: You can find ASR training code using this repo in https://github.com/k2-fsa/icefall. You can also find decoding code in icefall.
It produces the same output as torchaudio for the same input, so optimized_transducer should be equivalent to torchaudio.functional.rnnt_loss().
This project is more memory efficient (see https://github.com/csukuangfj/transducer-loss-benchmarking for benchmark results).
Also, torchaudio accepts only the output of nn.Linear, whereas we also support the output of log-softmax (set the option from_log_softmax to True in this case).
It also supports a modified version of transducer. See below for what modified transducer means.
How does it differ from warp-transducer
It borrows the methods of computing alpha and beta from warp-transducer. Therefore, optimized_transducer produces the same alpha and beta as warp-transducer for the same input.
However, warp-transducer produces different gradients on CPU and CUDA for the same input. See HawkAaron/warp-transducer#93. I also created a colab notebook to reproduce that issue.
This project produces consistent gradients on CPU and CUDA for the same input, just as torchaudio does. (We borrow the gradient computation formula from torchaudio.)
optimized_transducer uses less memory than warp-transducer (see https://github.com/csukuangfj/transducer-loss-benchmarking for benchmark results).
It also supports a modified version of transducer. See below for what modified transducer means.
In modified transducer, we limit the maximum number of symbols per frame to 1. The following figure compares the formula for forward and backward procedures between standard transducer and modified transducer.
Note: Modified transducer was proposed independently by @danpovey. We were later informed that the idea already existed in Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping.
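To make the difference concrete, here is a minimal pure-Python sketch of the forward (alpha) recursion for a single utterance. It is an illustration only, not the code used by optimized_transducer: the function names, the nested-list layout of log_blank and log_sym, and the choice to return only the alpha lattice are assumptions made for this sketch. In the standard case a symbol transition stays on the same frame; with at most one symbol per frame, a symbol transition also advances to the next frame.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Return log(exp(a) + exp(b)) in a numerically stable way."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def forward_alpha(log_blank, log_sym, one_sym_per_frame=False):
    """Fill the forward lattice alpha[t][u] for one utterance.

    log_blank[t][u]: log-prob of emitting blank at lattice node (t, u).
    log_sym[t][u]:   log-prob of emitting the next target symbol at (t, u).
    Both are T x U nested lists (a hypothetical layout for this sketch).
    """
    T, U = len(log_blank), len(log_blank[0])
    alpha = [[NEG_INF] * U for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U):
            if t == 0 and u == 0:
                continue
            # Blank transition: (t-1, u) -> (t, u) in both variants.
            from_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else NEG_INF
            if one_sym_per_frame:
                # Modified: emitting a symbol also consumes a frame,
                # i.e. (t-1, u-1) -> (t, u).
                ok = t > 0 and u > 0
                from_sym = alpha[t - 1][u - 1] + log_sym[t - 1][u - 1] if ok else NEG_INF
            else:
                # Standard: emitting a symbol stays on the same frame,
                # i.e. (t, u-1) -> (t, u).
                from_sym = alpha[t][u - 1] + log_sym[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = log_add(from_blank, from_sym)
    return alpha


# Toy lattice with T=2 frames and U=2 (one target symbol), all probabilities 0.5.
half = math.log(0.5)
log_blank = [[half, half], [half, half]]
log_sym = [[half, half], [half, half]]
std = forward_alpha(log_blank, log_sym, one_sym_per_frame=False)
mod = forward_alpha(log_blank, log_sym, one_sym_per_frame=True)
```

In this toy lattice, mod[0][1] is -inf because a symbol cannot be emitted without moving to the next frame, whereas std[0][1] is finite.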
You can install it via pip:
pip install optimized_transducer
To check that optimized_transducer was installed successfully, run
python3 -c "import optimized_transducer; print(optimized_transducer.__version__)"
which should print the version of the installed optimized_transducer, e.g., 1.2.
It has been tested on Ubuntu 18.04. It should also work on macOS and other Unix-like systems. It may work on Windows, though it has not been tested there.
To display a verbose installation log, use
pip install --verbose optimized_transducer
To run make with parallel jobs during installation, use
export OT_MAKE_ARGS="-j"
pip install --verbose optimized_transducer
It will pass -j to make.
It has been tested on PyTorch >= 1.5.0. It may work on PyTorch < 1.5.0, though that has not been tested.
To build without CUDA support (CPU only), use
export OT_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Release -DOT_WITH_CUDA=OFF"
export OT_MAKE_ARGS="-j"
pip install --verbose optimized_transducer
It will pass -DCMAKE_BUILD_TYPE=Release -DOT_WITH_CUDA=OFF to cmake.
Python >= 3.6 is known to work. It may work for Python 2.7, though it is not tested.
Please file an issue at https://github.com/csukuangfj/optimized_transducer/issues and describe your problem there.
optimized_transducer expects that the output shape of the joint network is NOT (N, T, U, V), but (sum_all_TU, V), which is a concatenation of 2-D tensors: (T_1 * U_1, V), (T_2 * U_2, V), ..., (T_N * U_N, V).
Note: (T_1 * U_1, V) is just the reshape of a 3-D tensor (T_1, U_1, V).
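As a small illustration of this layout, the sketch below computes sum_all_TU and the starting row of each utterance from the length lists. The helper names ragged_layout and row_index are hypothetical and not part of optimized_transducer. It assumes T_i = logit_lengths[i] and U_i = target_lengths[i] + 1, and that each (T_i, U_i, V) block is flattened in row-major order, which matches the slicing and reshape used in the conversion example in this README.

```python
def ragged_layout(logit_lengths, target_lengths):
    """Return (offsets, sum_all_TU) for the concatenated joint-network output.

    Utterance i contributes T_i * U_i rows, with T_i = logit_lengths[i]
    and U_i = target_lengths[i] + 1.
    """
    offsets = []
    total = 0
    for T_i, target_len in zip(logit_lengths, target_lengths):
        offsets.append(total)  # first row belonging to this utterance
        total += T_i * (target_len + 1)
    return offsets, total


def row_index(offsets, target_lengths, i, t, u):
    """Row of the (sum_all_TU, V) tensor for frame t, position u of utterance i."""
    U_i = target_lengths[i] + 1
    return offsets[i] + t * U_i + u


logit_lengths = [4, 3]   # T_1 = 4, T_2 = 3
target_lengths = [2, 1]  # U_1 = 3, U_2 = 2
offsets, sum_all_TU = ragged_layout(logit_lengths, target_lengths)
print(offsets, sum_all_TU)  # [0, 12] 18
print(row_index(offsets, target_lengths, i=1, t=2, u=1))  # 17, the last row
```

Compared with a padded (N, T, U, V) tensor, which would hold 2 * 4 * 3 = 24 rows of size V in this example, the concatenated layout stores only the 18 rows that are actually used.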
Suppose your original joint network looks somewhat like the following:
encoder_out = torch.rand(N, T, D) # from the encoder
decoder_out = torch.rand(N, U, D) # from the decoder, i.e., the prediction network
encoder_out = encoder_out.unsqueeze(2) # Now encoder out is (N, T, 1, D)
decoder_out = decoder_out.unsqueeze(1) # Now decoder out is (N, 1, U, D)
x = encoder_out + decoder_out # x is of shape (N, T, U, D)
activation = torch.tanh(x)
logits = linear(activation) # linear is an instance of `nn.Linear`.
loss = torchaudio.functional.rnnt_loss(
    logits=logits,
    targets=targets,
    logit_lengths=logit_lengths,
    target_lengths=target_lengths,
    blank=blank_id,
    reduction="mean",
)
You need to change it to the following:
encoder_out = torch.rand(N, T, D) # from the encoder
decoder_out = torch.rand(N, U, D) # from the decoder, i.e., the prediction network
encoder_out_list = [encoder_out[i, :logit_lengths[i], :] for i in range(N)]
decoder_out_list = [decoder_out[i, :target_lengths[i]+1, :] for i in range(N)]
x = [e.unsqueeze(1) + d.unsqueeze(0) for e, d in zip(encoder_out_list, decoder_out_list)]
x = [p.reshape(-1, D) for p in x]
x = torch.cat(x)
activation = torch.tanh(x)
logits = linear(activation) # linear is an instance of `nn.Linear`.
loss = optimized_transducer.transducer_loss(
    logits=logits,
    targets=targets,
    logit_lengths=logit_lengths,
    target_lengths=target_lengths,
    blank=blank_id,
    reduction="mean",
    from_log_softmax=False,
)
Caution: We used from_log_softmax=False in the above example since logits is the output of nn.Linear.
Hint: If logits is the output of log-softmax, you should use from_log_softmax=True.
In most cases, you should pass the output of nn.Linear to compute the loss, i.e., use from_log_softmax=False, to save memory.
If you want to do some operations on the output of log-softmax before feeding it to optimized_transducer.transducer_loss(), from_log_softmax=True is helpful. But be aware that this will increase memory usage.
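If you are unsure which kind of tensor you are holding, a quick sanity check can help you pick the right value for from_log_softmax. The pure-Python sketch below is only an illustration: looks_like_log_softmax is a hypothetical helper, not part of optimized_transducer. It relies on the fact that each row of a log-softmax output exponentiates to a probability distribution summing to 1, whereas raw nn.Linear output generally does not.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def looks_like_log_softmax(row, tol=1e-6):
    """Hypothetical check: True if exp(row) sums to 1 within tol."""
    return abs(sum(math.exp(x) for x in row) - 1.0) < tol


raw = [1.0, 2.0, 3.0]  # stands in for one row of nn.Linear output
print(looks_like_log_softmax(raw))               # False
print(looks_like_log_softmax(log_softmax(raw)))  # True
```

The same idea applies to a real tensor by exponentiating and summing over the last dimension.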
To use the modified transducer, pass an additional argument one_sym_per_frame=True to optimized_transducer.transducer_loss().
For more usage examples, please refer to
- https://github.com/csukuangfj/optimized_transducer/blob/master/optimized_transducer/python/optimized_transducer/transducer_loss.py
- https://github.com/csukuangfj/optimized_transducer/blob/master/optimized_transducer/python/tests/test_cuda.py
- https://github.com/csukuangfj/optimized_transducer/blob/master/optimized_transducer/python/tests/test_compute_transducer_loss.py
- https://github.com/csukuangfj/optimized_transducer/blob/master/optimized_transducer/python/tests/test_max_symbol_per_frame.py
As a developer, you don't need to use pip install optimized_transducer.
To make development easier, you can use
git clone https://github.com/csukuangfj/optimized_transducer.git
cd optimized_transducer
mkdir build
cd build
cmake -DOT_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Release ..
make -j
export PYTHONPATH=$PWD/../optimized_transducer/python:$PWD/lib:$PYTHONPATH
I usually create a file path.sh inside the build directory, containing
export PYTHONPATH=$PWD/../optimized_transducer/python:$PWD/lib:$PYTHONPATH
so what you need to do is
cd optimized_transducer/build
source path.sh
# Then you are ready to run Python tests
python3 ../optimized_transducer/python/tests/test_compute_transducer_loss.py
# You can also use "import optimized_transducer" in your Python projects
To run all Python tests, use
cd optimized_transducer/build
ctest --output-on-failure
Alternatively, you can run all available tests with make:
make -j test