# Sequence 2 Sequence Models

Notes from Andrew Ng's deeplearning.ai Deep Learning Specialization lecture videos on Coursera.

## Basic models

*(figure: Sequence to Sequence model)*

Image captioning: use AlexNet as the encoder, cut the softmax, and then use a decoder network to generate the caption. *(figure: Image captioning)*

A language model calculates the probability of a sentence, and you can think of a Language Model (LM) as a decoder network. With this in mind, Machine Translation (MT) is an encoder network feeding an LM-style decoder, i.e. a conditional language model, e.g. P(English translation | French sentence).

*(figure: MT as conditional LM)*
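Written as a formula in the lecture's notation, decoding searches for the most likely output sentence given the input sentence x:

```latex
\hat{y} = \arg\max_{y^{<1>},\dots,y^{<T_y>}} P\left(y^{<1>},\dots,y^{<T_y>} \mid x\right)
```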

*(figure: Most likely translation)*

If we've started with

Jane is ...

then the next most likely greedy selection might be "going", and a better global sentence would be tossed out by greedy search (see the toy example below). *(figure: Why not greedy search)*
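A toy illustration of this failure mode; all probabilities below are made up for the example:

```python
# Made-up probabilities for illustration only.
next_word_probs = {            # P(word | "Jane is")
    "going": 0.40,             # locally most likely next word
    "visiting": 0.30,
}
sentence_probs = {             # P(full sentence | French input)
    "Jane is going to be visiting Africa in September.": 0.008,
    "Jane is visiting Africa in September.": 0.012,
}

greedy_pick = max(next_word_probs, key=next_word_probs.get)
best_sentence = max(sentence_probs, key=sentence_probs.get)
print(greedy_pick)     # "going" -- greedy commits here and never looks back
print(best_sentence)   # "Jane is visiting ..." -- higher overall probability
```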

## Beam Search

Beam width = B: the number of candidates kept under consideration at each step. Let's say B = 3. *(figure: Beam search)*

At each step we instantiate 3 copies of the network (i.e. one for each candidate).

*(figure: Beam search, 3rd step)*

If B=1 then beam search degenerates to greedy search
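A minimal beam-search sketch; the `log_prob_fn` interface (a stand-in for one decoder step of the network) is an assumption for illustration, not the course's code:

```python
from typing import Callable, List, Tuple

def beam_search(
    log_prob_fn: Callable[[List[str]], List[Tuple[str, float]]],
    beam_width: int = 3,
    max_len: int = 20,
    eos: str = "<eos>",
) -> List[str]:
    """log_prob_fn(prefix) -> [(next_word, log P(next_word | prefix)), ...]."""
    # Each beam is (prefix, sum of log-probabilities so far).
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished hypothesis
                continue
            for word, logp in log_prob_fn(prefix):
                candidates.append((prefix + [word], score + logp))
        # Keep only the B best candidates -- in practice, B copies of the network.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams[0][0]
```

With `beam_width=1` this reduces to exactly the greedy search described above.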

Multiplying many probabilities can cause numerical underflow; maximizing the sum of log probabilities is equivalent.

*(figure: Numerical underflow)*

This objective function tends to prefer shorter sentences, since there are fewer multiplications by values less than 1. Length normalization corrects for this (see the normalized objective below). *(figure: Length normalization)*

*(figure: Normalized log-likelihood)*
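The length-normalized objective, with α a softening hyperparameter (α = 1 is full normalization, α = 0 none; the lecture suggests a value around 0.7):

```latex
\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\left(y^{<t>} \mid x, y^{<1>}, \dots, y^{<t-1>}\right)
```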

How to choose beam width B? It trades computational cost against result quality. A beam width of 10 is typical for production systems, while 100-1,000 is more common in research; there are diminishing returns to increasing B. Unlike BFS or DFS, beam search is an approximate/heuristic algorithm: faster, but not guaranteed to find the optimal answer.

## Error analysis in beam search

*(figure: Error analysis in beam search)*

Compare P(y*|x) for the human translation y* against P(ŷ|x) for the algorithm's output ŷ, both under the same RNN. If P(y*|x) > P(ŷ|x), the RNN prefers y* but beam search failed to find it, so beam search is at fault; otherwise the RNN model is at fault. If most of the error is attributed to beam search, it may be warranted to increase B (a sketch of this check follows below). *(figure: Error analysis in beam search 2)*
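A minimal sketch of that attribution rule (the function name is mine, not the course's):

```python
def attribute_error(logp_human: float, logp_model: float) -> str:
    """Compare log P(y* | x) for the human translation y* with
    log P(y_hat | x) for the algorithm's output, under the same RNN."""
    if logp_human > logp_model:
        # The RNN already prefers y*; beam search failed to find it.
        return "beam search at fault (consider increasing B)"
    # The RNN scores the worse translation higher: the model is at fault.
    return "RNN model at fault"
```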

## Bleu score

Bleu score on unigrams. *(figure: Bleu precision)*

Bleu score on bigrams. *(figure: Bleu bigram)*

Bleu score on n-grams. *(figure: Bleu n-gram)*

*(figure: Bleu)*

Both image captioning and MT are evaluated with the Bleu score.
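A compact, illustrative sketch of the Bleu computation (clipped n-gram precision combined with a brevity penalty), not the exact reference implementation:

```python
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram gets credit at most
    as many times as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    total = sum(cand_counts.values())
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / total if total else 0.0

def bleu(candidate, references, max_n=4):
    """Geometric mean of p_1..p_{max_n}, times a brevity penalty that
    punishes candidates shorter than the closest reference."""
    ps = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(ps) == 0.0:
        return 0.0
    c = len(candidate)  # assumes a non-empty candidate
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in ps) / max_n)

# Example on already-tokenized word lists (max_n=2 keeps the toy score non-zero):
cand = "the cat is on the mat".split()
refs = ["the cat sat on the mat".split(), "there is a cat on the mat".split()]
print(round(bleu(cand, refs, max_n=2), 3))
```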

## Attention

*(figure: Long sequences)*

Compute attention weights. *(figure: Attention)*

LSTMs are more commonly used here.

*(figure: Context)*

*(figure: Context for step 2)*

Calculating attention: the weights are a softmax over scores, where s is the decoder hidden state from step t-1. Attention runs in quadratic time (i.e. quadratic in the number of words in the sentence); the formulas are written out below. *(figure: Computing attention)*
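In the lecture's notation, with a^{<t'>} the encoder activation at input step t' and e^{<t,t'>} a score computed by a small network from s^{<t-1>} and a^{<t'>}:

```latex
\alpha^{<t,t'>} = \frac{\exp\left(e^{<t,t'>}\right)}{\sum_{t''=1}^{T_x} \exp\left(e^{<t,t''>}\right)},
\qquad
c^{<t>} = \sum_{t'=1}^{T_x} \alpha^{<t,t'>} \, a^{<t'>}
```

Computing a weight for every (t, t') pair is what makes the cost quadratic.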

*(figure: Attention visualization)*

## Speech recognition

A microphone detects small changes in air pressure.

Audio is commonly preprocessed into spectrograms. Phoneme representations are not necessary with end-to-end deep learning. Academic datasets have 300-3,000 hours of audio; industrial applications use 10,000+ or sometimes 100,000+ hours of transcribed audio.

*(figure: Speech recognition)*

Attention models can be used for speech recognition. *(figure: Attention model for speech recognition)*

Connectionist Temporal Classification (CTC) cost for speech recognition. Basic rule: collapse repeated characters that are not separated by a "blank" token.

*(figure: CTC)*
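A minimal sketch of CTC's output-collapsing rule (the decode-time collapse only, not the CTC loss itself); the example string follows the lecture's "the quick brown fox" illustration:

```python
def ctc_collapse(frames: str, blank: str = "_") -> str:
    """Collapse repeated characters not separated by a blank, then drop blanks."""
    out = []
    prev = None
    for ch in frames:
        if ch != prev:          # merge runs of identical frame outputs
            out.append(ch)
        prev = ch
    return "".join(ch for ch in out if ch != blank)  # remove blank tokens

print(ctc_collapse("ttt_h_eee___ ___qqq__"))  # -> "the q"
```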

## Trigger word detection

In the lecture's setup, the target label is set to 1 for a short stretch right after the trigger word is said, and 0 elsewhere. *(figure: Trigger word)*