Notes from Andrew Ng's deeplearning.ai Deep Learning Specialization Lecture videos on Coursera
Image captioning: use a pretrained CNN (e.g. AlexNet) as the encoder, cut off the softmax, and feed the resulting feature vector into an RNN decoder that generates the caption
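A minimal sketch of that split in PyTorch (my own illustration, not course code): the CNN encoder's feature vector, with the softmax removed, initializes an RNN decoder that emits the caption word by word. CaptionDecoder, its dimensions, and the GRU-cell choice are all assumptions.

```python
# Sketch only: an assumed captioning decoder conditioned on CNN features
# (the encoder would be a pretrained CNN, e.g. AlexNet, with its softmax removed).
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feature_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # image features -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # features: (batch, feature_dim) from the CNN encoder
        # captions: (batch, T) token ids fed with teacher forcing during training
        h = torch.tanh(self.init_h(features))
        logits = []
        for t in range(captions.size(1)):
            h = self.rnn(self.embed(captions[:, t]), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (batch, T, vocab_size) scores over the vocabulary

# Tiny usage example: 4096 is AlexNet's fc feature size, vocab size is made up.
dec = CaptionDecoder(feature_dim=4096, vocab_size=10000)
scores = dec(torch.randn(2, 4096), torch.randint(0, 10000, (2, 4)))
print(scores.shape)  # torch.Size([2, 4, 10000])
```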
Language model: computes the probability of a sentence. A language model (LM) can be thought of as a decoder network. With this in mind, machine translation (MT) is an encoder network followed by an LM-like decoder, i.e. a conditional language model, e.g. P(English translation | French sentence).
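Written out (standard notation rather than anything copied from the slides; x is the source sentence and y^&lt;t&gt; the output words), the conditional language model factorizes as below, and MT searches for the output sequence y that maximizes it:

```latex
P\big(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x\big)
  = \prod_{t=1}^{T_y} P\big(y^{\langle t \rangle} \mid x,\, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle}\big)
```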
If we've started with
"Jane is ..."
then the next most likely single word (the greedy choice) might be "going", and a better overall sentence (e.g. "Jane is visiting Africa in September") gets tossed out by greedy search, which never revisits earlier choices.
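A toy numeric illustration (the probabilities are made up; only the shape of the argument follows the lecture): greedy decoding commits to the locally most likely word and can never recover the globally better sentence.

```python
# Assume two candidate continuations after "Jane is":
#   greedy picks "going" because P(going | "Jane is") > P(visiting | "Jane is"),
#   but the sentence through "visiting" ends up more probable overall.
p = {
    "Jane is going to be visiting Africa":  [0.9, 0.8, 0.5, 0.4, 0.4, 0.4, 0.5],
    "Jane is visiting Africa in September": [0.9, 0.8, 0.3, 0.6, 0.7, 0.7],
}
for sentence, step_probs in p.items():
    total = 1.0
    for q in step_probs:
        total *= q
    print(f"{sentence!r}: P = {total:.4f}")
# Greedy commits to "going" at step 3 (0.5 > 0.3) and never revisits that choice,
# even though the "visiting" sentence has the higher overall probability here.
```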
Beam width B = the number of candidate partial sentences kept at each step
Let's say B=3
Instantiate 3 copies of the network at each step (i.e. one for each candidate).
If B=1, beam search degenerates to greedy search.
Multiplying many probabilities can cause numerical underflow; maximizing the sum of log probabilities is equivalent to maximizing the product, since log is monotonically increasing.
This objective tends to prefer shorter sentences, because fewer factors less than 1 get multiplied together (equivalently, fewer negative log terms get summed).
Fix: use the normalized log-likelihood, i.e. divide the sum of log probabilities by Ty^alpha (with alpha around 0.7 as a heuristic), which reduces the penalty on longer outputs.
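A minimal beam search sketch (my own illustration, not course code). next_probs is an assumed stand-in for the decoder network's softmax output; the final ranking uses the length-normalized log-likelihood with alpha = 0.7, and beam_width=1 reduces it to greedy search.

```python
import math

def beam_search(next_probs, start, beam_width=3, max_len=10, alpha=0.7, eos="<eos>"):
    # next_probs(prefix) -> {token: P(token | x, prefix)}; an assumed stand-in for
    # the decoder network. `start` is the initial prefix, e.g. ["<sos>"].
    beams = [(start, 0.0)]  # (prefix, sum of log probabilities)
    for _ in range(max_len):
        candidates, any_open = [], False
        for prefix, logp in beams:
            if prefix[-1] == eos:                 # finished hypothesis: carry it along unchanged
                candidates.append((prefix, logp))
                continue
            any_open = True
            for token, p in next_probs(prefix).items():
                candidates.append((prefix + [token], logp + math.log(p)))
        if not any_open:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]           # keep only the B most probable partial sentences
    # Rank by the length-normalized log-likelihood to avoid the short-sentence bias.
    return max(beams, key=lambda c: c[1] / (len(c[0]) ** alpha))
```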
How to choose beam width B? It's a trade-off between computational cost and result quality. A beam width of around 10 is typical for production systems; 100-1,000 is more common in research systems. There are diminishing returns to increasing B. Unlike exact search algorithms such as BFS or DFS, beam search is an approximate/heuristic algorithm, so it's faster but not guaranteed to find the maximum-probability output.
Error analysis for beam search vs. the RNN: compare P(y*|x) for the human translation y* against P(ŷ|x) for the beam search output ŷ. If P(y*|x) > P(ŷ|x), beam search is at fault; otherwise the RNN model is at fault. If most of the error is attributed to beam search, it may be warranted to increase B.
Image captioning and MT are evaluated with the BLEU score, a single-number metric based on modified n-gram precision against one or more reference translations.
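For reference, a sketch of the modified (clipped) n-gram precision at the core of BLEU. The "the cat is on the mat" references and the degenerate candidate follow the lecture's unigram example; the function itself is my own illustration, not the full BLEU metric.

```python
from collections import Counter

def modified_ngram_precision(candidate, references, n):
    """Clipped n-gram precision: candidate n-gram counts are capped by the
    maximum count of that n-gram in any single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

# Lecture-style example: the degenerate output "the the the the the the the"
# gets clipped unigram precision 2/7 against the two references.
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_ngram_precision("the the the the the the the".split(), refs, 1))  # 0.2857...
```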
LSTMs are used more commonly than GRUs in these models.
Calculating attention:
The attention weights alpha^<t,t'> come from a softmax over alignment scores e^<t,t'>.
e^<t,t'> is produced by a small neural network whose inputs are s^<t-1>, the decoder hidden state from the previous time step, and a^<t'>, the encoder activation for input word t'.
The context vector fed to the decoder is the weighted sum c^<t> = sum over t' of alpha^<t,t'> * a^<t'>.
Runs in quadratic time (i.e. O(Tx * Ty), quadratic in the number of words in the sentence).
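A numpy sketch of one decoder time step (score_net is my own placeholder for the lecture's small alignment network): the softmax over alignment scores gives the attention weights, and the context vector is their weighted sum over the encoder activations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(a, s_prev, score_net):
    # a:      (Tx, D) encoder activations (e.g. bidirectional RNN outputs)
    # s_prev: decoder hidden state from the previous time step, s^<t-1>
    # score_net(s_prev, a_t) -> scalar alignment score e^<t,t'> (an assumed small network)
    e = np.array([score_net(s_prev, a_t) for a_t in a])  # alignment scores
    alphas = softmax(e)                                  # attention weights, sum to 1
    context = (alphas[:, None] * a).sum(axis=0)          # c^<t> = sum_t' alpha^<t,t'> * a^<t'>
    return context, alphas

# Tiny usage example with random numbers and a (hypothetical) dot-product scorer.
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8))           # Tx=5 encoder positions, 8-dim activations
s_prev = rng.normal(size=8)
context, alphas = attention_context(a, s_prev, lambda s, a_t: float(s @ a_t))
print(alphas.sum())                   # ~1.0: a distribution over the input words
```

Repeating this for every output word is what gives the O(Tx * Ty) quadratic cost noted above.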
Microphone detects small changes in air pressure
Spectrograms are commonly used as the audio input representation.
A hand-designed phoneme representation is not necessary with end-to-end deep learning.
Academic datasets: roughly 300-3,000 hours of transcribed audio. Industrial applications use 10,000+ hours, sometimes 100,000+ hours.
Can use attention models for speech recognition
Connectionist Temporal Classification (CTC) cost for speech recognition: the network emits one output per input time step (so many more outputs than characters), and the basic rule is to collapse repeated characters that are not separated by a "blank" token.
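A sketch of that collapsing rule on the decode side only (not the CTC cost computation itself); the underscore-as-blank convention follows the lecture's example.

```python
def ctc_collapse(chars, blank="_"):
    """Collapse repeated characters not separated by a blank, then drop the blanks."""
    out = []
    prev = None
    for c in chars:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

# Lecture-style example: a long frame-by-frame output collapses to a short transcript.
print(ctc_collapse("ttt_h_eee___ ___qqq__"))  # -> "the q"
```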