# DataLoader for Encoder to Decoder Model

An efficient data loader for text datasets, built with `torch.utils.data.Dataset`, a custom `collate_fn`, and `torch.utils.data.DataLoader`.

An updated Seq2Seq DataLoader, based on yunjey/seq2seq-dataloader.

Seq2Seq model image from the "Seq2Seq model in TensorFlow" post.
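The overall `Dataset` + `collate_fn` + `DataLoader` pattern can be sketched as follows. This is a minimal, hypothetical example (toy data, `PairDataset` is not part of this repository), not the repository's actual code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    """Toy parallel corpus of (source ids, target ids) pairs."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, trg = self.pairs[idx]
        return torch.tensor(src), torch.tensor(trg)

def collate_fn(batch):
    """Pad variable-length sequences in a batch with 0 (<pad>)."""
    srcs, trgs = zip(*batch)
    src_batch = torch.nn.utils.rnn.pad_sequence(srcs, batch_first=True)
    trg_batch = torch.nn.utils.rnn.pad_sequence(trgs, batch_first=True)
    return src_batch, trg_batch

pairs = [([5, 6, 7], [8, 9]), ([5, 6], [8, 9, 10, 11])]
loader = DataLoader(PairDataset(pairs), batch_size=2, collate_fn=collate_fn)
src_batch, trg_batch = next(iter(loader))
# src_batch has shape (2, 3); trg_batch has shape (2, 4)
```

The custom `collate_fn` is what lets each batch be padded only to the longest sequence *in that batch*, rather than to a global maximum.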

  1. Add the `<start>` token to the decoder input and the `<end>` token to the model's target output.

     I am a Student => Je suis etudiant

     encoder input : 'I', 'am', 'a', 'Student'
     decoder input : '<start>', 'Je', 'suis', 'etudiant'
     target        : 'Je', 'suis', 'etudiant', '<end>'

Please see the example below. The sample is a German target sentence from the dataset:

Der weltweit zweitgrößte Anbieter von Besucherattraktionen zielt darauf ab , seinen 30 Millionen Besuchern auf der ganzen Welt durch seine globalen und lokalen Marken sowie das Engagement und die Leidenschaft seiner Führungskräfte und Mitarbeiter ein einzigartiges , unvergessliches und lohnenswertes Erlebnis zu bieten .

(In English: "The world's second-largest provider of visitor attractions aims to offer its 30 million visitors worldwide a unique, unforgettable and rewarding experience through its global and local brands and the commitment and passion of its executives and employees.")
print(trg_seqs[0])
tensor([   1,   49, 2267,    3, 4091,   68,    3, 2651,  152,  419,    8,  331,
         229,  524, 1680,  212,   49,  299, 1235,  156,  944, 3192,   14,  357,
        2454,  117,   23, 4624,   14,   50, 3648, 1819,    3,   14,  317,  171,
           3,    8,    3,   14,    3, 2676,  127, 1207,   28,    0,    0,    0,
           0])

print(target[0])
tensor([  49, 2267,    3, 4091,   68,    3, 2651,  152,  419,    8,  331,  229,
         524, 1680,  212,   49,  299, 1235,  156,  944, 3192,   14,  357, 2454,
         117,   23, 4624,   14,   50, 3648, 1819,    3,   14,  317,  171,    3,
           8,    3,   14,    3, 2676,  127, 1207,   28,    2,    0,    0,    0,
           0])
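Note that `trg_seqs[0]` starts with id 1 (`<start>`) and `target[0]` ends with id 2 (`<end>`) before the 0-padding. The shift described in step 1 can be sketched as follows; `make_pairs` is a hypothetical helper for illustration, not a function from this repository:

```python
SOS, EOS = '<start>', '<end>'

def make_pairs(src_tokens, trg_tokens):
    """Build encoder input, decoder input, and target from a sentence pair."""
    encoder_input = src_tokens
    decoder_input = [SOS] + trg_tokens   # target shifted right, <start> prepended
    target        = trg_tokens + [EOS]   # target shifted left, <end> appended
    return encoder_input, decoder_input, target

enc, dec, trg = make_pairs(['I', 'am', 'a', 'Student'],
                           ['Je', 'suis', 'etudiant'])
```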
  2. Replace out-of-vocabulary (OOV) words with the `<unk>` token.

     sequence.extend([word2id[token] if token in word2id else word2id['<unk>'] for token in tokens])
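A self-contained sketch of this lookup, using a toy `word2id` for illustration (the real dictionary is produced by `build_vocab.py`):

```python
# Hypothetical vocabulary; in practice this comes from build_vocab.py.
word2id = {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3, 'je': 4, 'suis': 5}

tokens = ['je', 'suis', 'etudiant']   # 'etudiant' is out of vocabulary
sequence = []
sequence.extend([word2id[token] if token in word2id else word2id['<unk>']
                 for token in tokens])
# 'etudiant' maps to the <unk> id
```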
  3. Add `src_max` and `trg_max` to avoid running out of CUDA memory.

  • src_max : maximum sequence length in the source domain.
  • trg_max : maximum sequence length in the target domain.

Truncating batches to these maximum lengths prevents out-of-memory errors when an input sequence is extremely long.
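The truncate-then-pad step inside a `collate_fn` can be sketched like this (a minimal illustration with plain Python lists; the actual loader returns padded `torch` tensors, and `pad_and_truncate` is a hypothetical helper):

```python
def pad_and_truncate(seqs, max_len, pad_id=0):
    """Truncate each sequence to max_len, then pad all to a common length."""
    seqs = [s[:max_len] for s in seqs]            # cap length (src_max / trg_max)
    batch_max = max(len(s) for s in seqs)         # longest sequence in this batch
    return [s + [pad_id] * (batch_max - len(s)) for s in seqs]

batch = pad_and_truncate([[1, 2, 3], [4, 5, 6, 7, 8]], max_len=4)
# the 5-token sequence is cut to 4, the 3-token sequence is padded to 4
```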


## Prerequisites


## Usage

1. Clone the repository

$ git clone https://github.com/graykode/enc2dec-dataloader.git
$ cd enc2dec-dataloader

2. Download nltk tokenizer

$ pip install nltk
$ python
>>> import nltk
>>> nltk.download('punkt')

3. Build word2id dictionary

$ python build_vocab.py

4. Check DataLoader

For usage, please see example.ipynb.