# DataLoader for Encoder to Decoder Model

An efficient data loader for text datasets, built with `torch.utils.data.Dataset`, a custom `collate_fn`, and `torch.utils.data.DataLoader`.

An updated Seq2Seq DataLoader, based on yunjey/seq2seq-dataloader.

Seq2Seq model image from the "Seq2Seq model in TensorFlow" post.
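The overall `Dataset` + `collate_fn` + `DataLoader` pattern can be sketched as follows. This is a minimal, hypothetical example (toy data, `PairDataset` is not part of this repository), not the repository's actual code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    """Toy parallel corpus of (source ids, target ids) pairs."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, trg = self.pairs[idx]
        return torch.tensor(src), torch.tensor(trg)

def collate_fn(batch):
    """Pad variable-length sequences in a batch with 0 (<pad>)."""
    srcs, trgs = zip(*batch)
    src_batch = torch.nn.utils.rnn.pad_sequence(srcs, batch_first=True)
    trg_batch = torch.nn.utils.rnn.pad_sequence(trgs, batch_first=True)
    return src_batch, trg_batch

pairs = [([5, 6, 7], [8, 9]), ([5, 6], [8, 9, 10, 11])]
loader = DataLoader(PairDataset(pairs), batch_size=2, collate_fn=collate_fn)
src_batch, trg_batch = next(iter(loader))
# src_batch has shape (2, 3); trg_batch has shape (2, 4)
```

The custom `collate_fn` is what lets each batch be padded only to the longest sequence *in that batch*, rather than to a global maximum.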

  1. Add the `<start>` token to the decoder input and the `<end>` token to the model's target output.

     I am a Student => Je suis etudiant

     encoder input : 'I', 'am', 'a', 'Student'
     decoder input : '<start>', 'Je', 'suis', 'etudiant'
     target        : 'Je', 'suis', 'etudiant', '<end>'

Please see the example below. The sample is a German target sentence from the dataset:

Der weltweit zweitgrößte Anbieter von Besucherattraktionen zielt darauf ab , seinen 30 Millionen Besuchern auf der ganzen Welt durch seine globalen und lokalen Marken sowie das Engagement und die Leidenschaft seiner Führungskräfte und Mitarbeiter ein einzigartiges , unvergessliches und lohnenswertes Erlebnis zu bieten .

(In English: "The world's second-largest provider of visitor attractions aims to offer its 30 million visitors worldwide a unique, unforgettable and rewarding experience through its global and local brands and the commitment and passion of its executives and employees.")
print(trg_seqs[0])
tensor([   1,   49, 2267,    3, 4091,   68,    3, 2651,  152,  419,    8,  331,
         229,  524, 1680,  212,   49,  299, 1235,  156,  944, 3192,   14,  357,
        2454,  117,   23, 4624,   14,   50, 3648, 1819,    3,   14,  317,  171,
           3,    8,    3,   14,    3, 2676,  127, 1207,   28,    0,    0,    0,
           0])

print(target[0])
tensor([  49, 2267,    3, 4091,   68,    3, 2651,  152,  419,    8,  331,  229,
         524, 1680,  212,   49,  299, 1235,  156,  944, 3192,   14,  357, 2454,
         117,   23, 4624,   14,   50, 3648, 1819,    3,   14,  317,  171,    3,
           8,    3,   14,    3, 2676,  127, 1207,   28,    2,    0,    0,    0,
           0])
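Note that `trg_seqs[0]` starts with id 1 (`<start>`) and `target[0]` ends with id 2 (`<end>`) before the 0-padding. The shift described in step 1 can be sketched as follows; `make_pairs` is a hypothetical helper for illustration, not a function from this repository:

```python
SOS, EOS = '<start>', '<end>'

def make_pairs(src_tokens, trg_tokens):
    """Build encoder input, decoder input, and target from a sentence pair."""
    encoder_input = src_tokens
    decoder_input = [SOS] + trg_tokens   # target shifted right, <start> prepended
    target        = trg_tokens + [EOS]   # target shifted left, <end> appended
    return encoder_input, decoder_input, target

enc, dec, trg = make_pairs(['I', 'am', 'a', 'Student'],
                           ['Je', 'suis', 'etudiant'])
```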
  2. Replace out-of-vocabulary (OOV) words with the `<unk>` token.

     sequence.extend([word2id[token] if token in word2id else word2id['<unk>'] for token in tokens])
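A self-contained sketch of this lookup, using a toy `word2id` for illustration (the real dictionary is produced by `build_vocab.py`):

```python
# Hypothetical vocabulary; in practice this comes from build_vocab.py.
word2id = {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3, 'je': 4, 'suis': 5}

tokens = ['je', 'suis', 'etudiant']   # 'etudiant' is out of vocabulary
sequence = []
sequence.extend([word2id[token] if token in word2id else word2id['<unk>']
                 for token in tokens])
# 'etudiant' maps to the <unk> id
```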
  3. Add `src_max` and `trg_max` to avoid running out of CUDA memory.

  • src_max : maximum sequence length in the source domain.
  • trg_max : maximum sequence length in the target domain.

Truncating batches to these maximum lengths prevents out-of-memory errors when an input sequence is extremely long.
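The truncate-then-pad step inside a `collate_fn` can be sketched like this (a minimal illustration with plain Python lists; the actual loader returns padded `torch` tensors, and `pad_and_truncate` is a hypothetical helper):

```python
def pad_and_truncate(seqs, max_len, pad_id=0):
    """Truncate each sequence to max_len, then pad all to a common length."""
    seqs = [s[:max_len] for s in seqs]            # cap length (src_max / trg_max)
    batch_max = max(len(s) for s in seqs)         # longest sequence in this batch
    return [s + [pad_id] * (batch_max - len(s)) for s in seqs]

batch = pad_and_truncate([[1, 2, 3], [4, 5, 6, 7, 8]], max_len=4)
# the 5-token sequence is cut to 4, the 3-token sequence is padded to 4
```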


## Prerequisites


## Usage

1. Clone the repository

$ git clone https://github.com/graykode/enc2dec-dataloader.git
$ cd enc2dec-dataloader

2. Download nltk tokenizer

$ pip install nltk
$ python
>>> import nltk
>>> nltk.download('punkt')

3. Build word2id dictionary

$ python build_vocab.py

4. Check DataLoader

For usage, please see example.ipynb.