GitHub - habibrahmanbd/Data-Augmentation-in-NMT: CMPUT 566 Group Project for MOTH

Data Augmentation in Neural Machine Translation: A Case-study for English to Portuguese Translation

In this project, we tested several data augmentation techniques to improve training and, in turn, translation quality. The task is English-to-Portuguese translation: the input is a single English sentence, and the output is a single Portuguese sentence that translates it.

Graduate Students

Dependencies

tensorflow
keras
GPU
Linux Machine

# Quick Installation steps
pip3 install -r requirements.txt
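
After installing, it can help to confirm that the listed packages are importable before starting a long training run. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository, and the check does not verify GPU availability.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Package names taken from the Dependencies list above.
    missing = missing_packages(["tensorflow", "keras"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```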

Directory Structure

.
├── baselines.py                                                                 # Baseline Result Calculation
├── bleu_score.py                                                                # Script to Calculate BLEU Score
├── data_fn.py                                                                   # Script to Convert Pred. Text in Gold Format
├── datasets                                                                     # Contains the Project Dataset and Result
│   ├── baseline_datasets                                                        # Dataset for Baseline models
│   │   ├── amazon.txt                                                           # Baseline Data of Amazon
│   │   └── worst.txt                                                            # Baseline Data of Worst
│   ├── F1_Score                                                                 # Folder for Printed F1 Scores
│   │   └── results_RNN.txt
│   ├── gold_rnn                                                                 # Gold Format Prediction of RNN
│   │   ├── dataset1_m.txt
│   │   ├── dataset1_trial1_h.txt
│   ├── gold_transformer                                                         # Gold Format Prediction of Transformer
│   │   ├── dataset1_trial1.txt
│   │   ├── dataset3_trial3.txt
│   │   └── test.txt                                                             # Test for Transformer
│   ├── modified_datasets                                                        # Modified Dataset
│   │   ├── dataset_1.txt                                                        # Data Augmentation 1
│   │   ├── dataset_2.txt                                                        # Data Augmentation 2
│   │   └── dataset_3.txt                                                        # Data Augmentation 3
│   ├── RNN_Result                                                               # RNN Results
│   │   ├── dev_best
│   │   ├── predict.habib1.txt
│   │   ├── predict.habib1.updated.txt
│   │   ├── predict.maisha3.gold_format.txt
│   │   └── ...
│   ├── staple-2020                                                              # Staple 2020 Original Dataset for en_pt
│   │   ├── en_pt
│   │   │   ├── dev.en_pt.2020-02-20.gold.txt
│   │   │   └── ...
│   │   └── README.txt
│   ├── testing_datasets                                                         # Dataset for Testing and Validation
│   │   ├── dev_best.txt
│   │   ├── dev.txt
│   │   └── test.txt
│   └── Transformer_Result                                                       # Transformer Result
│       ├── result_dataset_1_trial1.csv
│       └── ...
├── images                                                                       # Images Used in This Project
│   ├── Figure_1.png
├── RNN.sh                                                                       # Run RNN to Generate Final Result
├── text_processing.py                                                           # Text Processing, Model Training, Output Prediction, etc
├── tokenized                                                                    # Dumped Tokenized Data and Results
│   ├── dump
│   │   ├── test_eng_enc_seq.pickle
│   │   └── test_port_enc_seq.pickle
│   ├── English
│   │   ├── eng_tok1.pickle
│   └── Portuguese
│       ├── 1.pickle
│       ├── ...
├── tokenizer.py
├── transformer                                                                  # Folder for Transformer
│   ├── CMPUT566_Eng_Por_Translation_Transformer_Model_dataset123.ipynb
│   └── Restore_checkpoints_dataset123.ipynb

Instructions

Model with RNN

Step 1: Create Modified Datasets

python3 create_modified_datasets.py

Step 2: Run the Script to Compute the RNN BLEU Score

./RNN.sh
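
For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU, assuming whitespace tokenization, one reference per hypothesis, n-grams up to length 4, and no smoothing. The function names are hypothetical, and the repository's `bleu_score.py` may tokenize or smooth differently, so scores from this sketch are not expected to match the script's output exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with a single reference per hypothesis and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref_tok, n)
            hyp_counts = ngrams(hyp_tok, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Without smoothing, the score is 0 whenever any n-gram order has no match, which is common for very short sentences.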

Step 3: Run the Commands for the Weighted Macro F1 of the RNN

git clone https://github.com/duolingo/duolingo-sharedtask-2020.git
cd duolingo-sharedtask-2020

Weighted Score for Dataset 1

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/staple-2020/en_pt/test.en_pt.2020-02-20.gold.txt  --predfile ../CMPUT566-MOTH/datasets/gold_rnn/dataset1.txt

Weighted Score for Dataset 2

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/staple-2020/en_pt/test.en_pt.2020-02-20.gold.txt  --predfile ../CMPUT566-MOTH/datasets/gold_rnn/dataset2.txt

Weighted Score for Dataset 3

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/staple-2020/en_pt/test.en_pt.2020-02-20.gold.txt  --predfile ../CMPUT566-MOTH/datasets/gold_rnn/dataset3.txt
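
To make the metric reported by the commands above more concrete, here is a minimal sketch of a weighted macro F1. It assumes that, for each prompt, precision is unweighted, recall is weighted by the gold translation weights, and the per-prompt F1 values are averaged over all gold prompts. That reading of the metric, the in-memory dictionary layout, and the function names are assumptions for illustration; `staple_2020_scorer.py` remains the authoritative implementation.

```python
def weighted_f1(gold_weights, predicted):
    """F1 for one prompt: unweighted precision, weight-based recall.

    gold_weights: dict mapping each accepted translation to its weight.
    predicted: iterable of predicted translations for the same prompt.
    """
    predicted = set(predicted)
    if not predicted or not gold_weights:
        return 0.0
    true_pos = {t for t in predicted if t in gold_weights}
    precision = len(true_pos) / len(predicted)
    total_weight = sum(gold_weights.values())
    recall = sum(gold_weights[t] for t in true_pos) / total_weight if total_weight else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_macro_f1(gold, pred):
    """Average the per-prompt weighted F1 over all gold prompts.

    gold: dict of prompt id -> {translation: weight}
    pred: dict of prompt id -> list of predicted translations
    """
    scores = [weighted_f1(weights, pred.get(pid, [])) for pid, weights in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, a prompt with no predictions contributes 0 to the average, so missing prompts lower the overall score.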

Model with Transformer

Step 1: Create Modified Datasets

python3 create_modified_datasets.py

Step 2: Create Training Checkpoints

python3 cmput566_eng_por_translation_transformer_model_dataset123.py

Step 3: Create the Dataset for BLEU and F1 Scores

python3 restore_checkpoints_dataset123.py

Step 4: Run the Commands for the Weighted Macro F1 of the Transformer

git clone https://github.com/duolingo/duolingo-sharedtask-2020.git
cd duolingo-sharedtask-2020

Weighted Score for Dataset 1

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/gold_transformer/test.txt  --predfile ../CMPUT566-MOTH/datasets/gold_transformer/dataset1_trial1.txt

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/gold_transformer/test.txt  --predfile ../CMPUT566-MOTH/datasets/gold_transformer/dataset1_trial2.txt

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/gold_transformer/test.txt  --predfile ../CMPUT566-MOTH/datasets/gold_transformer/dataset1_trial3.txt

Weighted Score for Dataset 2

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/gold_transformer/test.txt  --predfile ../CMPUT566-MOTH/datasets/gold_transformer/dataset2_trial1.txt

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/gold_transformer/test.txt  --predfile ../CMPUT566-MOTH/datasets/gold_transformer/dataset2_trial2.txt

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/gold_transformer/test.txt  --predfile ../CMPUT566-MOTH/datasets/gold_transformer/dataset2_trial3.txt

Weighted Score for Dataset 3

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/gold_transformer/test.txt  --predfile ../CMPUT566-MOTH/datasets/gold_transformer/dataset3_trial1.txt

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/gold_transformer/test.txt  --predfile ../CMPUT566-MOTH/datasets/gold_transformer/dataset3_trial2.txt

python3 staple_2020_scorer.py --goldfile ../CMPUT566-MOTH/datasets/gold_transformer/test.txt  --predfile ../CMPUT566-MOTH/datasets/gold_transformer/dataset3_trial3.txt
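
The nine commands above differ only in the dataset and trial numbers, so they can be generated in a loop. The sketch below builds the same argument lists and prints them; the helper name `scorer_commands` and the `GOLD_DIR` constant are hypothetical, and the paths assume the same relative layout used in the commands above.

```python
import shlex
from itertools import product

# Relative path taken from the commands above; adjust if the repository is cloned elsewhere.
GOLD_DIR = "../CMPUT566-MOTH/datasets/gold_transformer"


def scorer_commands(datasets=(1, 2, 3), trials=(1, 2, 3)):
    """Build one scorer command (as an argument list) per dataset/trial pair."""
    commands = []
    for dataset, trial in product(datasets, trials):
        commands.append([
            "python3", "staple_2020_scorer.py",
            "--goldfile", f"{GOLD_DIR}/test.txt",
            "--predfile", f"{GOLD_DIR}/dataset{dataset}_trial{trial}.txt",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for inspection instead of executing them.
    for command in scorer_commands():
        print(shlex.join(command))
```

To execute the commands rather than print them, each argument list could be passed to `subprocess.run(command, check=True)` from inside the `duolingo-sharedtask-2020` directory.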

The Transformer code is also available to run in Google Colab: https://drive.google.com/drive/folders/1nv0kY3KEnn3eh_SJok21bZE9CLZF9C_E?usp=sharing
