Today, many studies focus on applying neural networks to software engineering tasks such as comment generation, code search, and clone detection. Among them, the program translation task requires a model to translate source code into a target language without changing its functionality, which means the model must understand the semantics of the source code and generate code that follows the conventions of the target programming language.
This repository investigates the Transformer baseline for program translation. The CodeTrans dataset is available at CodeXGLUE/CodeTrans.
In addition, our implementation offers several features:
- simple modification of parameters
- gradient accumulation
- `tf.function` acceleration
- multi-GPU training
- mixed precision (float16 and float32)
It should be noted that the gradient accumulation function is copied from OpenNMT-tf.
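For illustration, here is a minimal sketch of gradient accumulation combined with `tf.function` in TensorFlow 2. This is not the repository's actual code (which follows OpenNMT-tf); the toy model, loss, optimizer, and `accum_steps` below are stand-ins.

```python
import tensorflow as tf

# Optional mixed precision (float16 compute, float32 variables); a real
# float16 setup also needs loss scaling, which this sketch omits.
# tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Stand-ins for the repository's Transformer, loss, and optimizer.
model = tf.keras.Sequential([tf.keras.Input(shape=(8,)),
                             tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(5e-5)

accum_steps = 4  # effective batch size = micro-batch size * accum_steps
# One zero-initialized buffer per trainable variable.
buffers = [tf.Variable(tf.zeros_like(v), trainable=False)
           for v in model.trainable_variables]

@tf.function  # traced to a graph; this is the "acceleration" listed above
def accumulate(x, y):
    # Compute gradients for one micro-batch and add them to the buffers.
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for buf, g in zip(buffers, grads):
        buf.assign_add(g)
    return loss

@tf.function
def apply_and_reset():
    # Average the accumulated gradients, apply them, then clear the buffers.
    optimizer.apply_gradients(
        [(buf / accum_steps, v)
         for buf, v in zip(buffers, model.trainable_variables)])
    for buf in buffers:
        buf.assign(tf.zeros_like(buf))
```

Multi-GPU training would additionally wrap the model and optimizer creation in a `tf.distribute.MirroredStrategy` scope.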
The implementation depends on:
- tensorflow 2
- tokenizers
- numpy
- tree-sitter
In addition, PyCharm is required for evaluating the output (see step 4 below). (To be honest, my programming skills are limited.)
The `./data` folder stores datasets, vocabulary, references, model checkpoints, and predicted code. The `./evaluator` folder holds the evaluation metrics, which come from CodeTrans. The `./network` and `./util` folders store the model and preprocessing files.
The `config` dict can be found in `train.py`; to change a setting, just edit the value of the corresponding key in `config`.
Note that the "swap datasets by dictionary order": False
refers to translate the name of a programming language with a small dictionary order to another programming language.
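As a rough illustration (the actual keys and values live in `train.py` and may differ), the `config` dict looks something like this:

```python
# Hypothetical excerpt of the config dict in train.py; all keys except
# "swap datasets by dictionary order" are illustrative assumptions.
config = {
    "num layers": 12,
    "hidden size": 768,
    "learning rate": 5e-5,
    # False: translate from the language whose name comes first in
    # dictionary order to the other; True: reverse the direction.
    "swap datasets by dictionary order": False,
}
```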
1. Save the dataset as files named `keyword.file_name.language` in `./data/dataset_name/source/`, where `keyword` is one of `[train, valid, test]` and `language` is a programming language that tree-sitter can parse.
2. Run `prepare_data.py` to preprocess the dataset.
3. Run `train.py` to create the Transformer model and generate output.
4. Run `metric_eval.py` to evaluate the output in terms of the BLEU, EM, and CodeBLEU metrics.
Step 4 needs to be run in PyCharm: select the `evaluator/CodeBLEU` folder and mark the directory as sources root.
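For context on the `language` requirement in step 1, here is a minimal sketch of parsing a source file with py-tree-sitter. It uses the pre-0.22 API; the grammar paths and the Java example are assumptions, and `prepare_data.py` may do this differently.

```python
from tree_sitter import Language, Parser

# Build a shared library from a cloned grammar repo (paths are assumptions).
Language.build_library(
    "build/languages.so",         # output shared library
    ["vendor/tree-sitter-java"],  # e.g. github.com/tree-sitter/tree-sitter-java
)
JAVA = Language("build/languages.so", "java")

parser = Parser()
parser.set_language(JAVA)

code = b"class A { int f() { return 1; } }"
tree = parser.parse(code)

def leaves(node):
    # Collect the leaf tokens of the parse tree.
    if node.child_count == 0:
        yield code[node.start_byte:node.end_byte].decode("utf8")
    for child in node.children:
        yield from leaves(child)

print(list(leaves(tree.root_node)))
```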
Note that I did not set up learning-rate warmup, because with only a few training steps it would leave the learning rate high.
Java to C#
| model | layers | hidden size | learning rate | BLEU | Exact Match | CodeBLEU |
| --- | --- | --- | --- | --- | --- | --- |
| Transformer-baseline | 12 | 768 | - | 55.84 | 33.0 | 63.74 |
| Transformer | 12 | 768 | 1e-4 | 50.64 | 31.3 | 58.24 |
| Transformer | 12 | 768 | 5e-5 | 53.01 | 35.2 | 60.98 |
C# to Java
| model | layers | hidden size | learning rate | BLEU | Exact Match | CodeBLEU |
| --- | --- | --- | --- | --- | --- | --- |
| Transformer-baseline | 12 | 768 | - | 50.47 | 37.9 | 61.59 |
| Transformer | 12 | 768 | 1e-4 | 45.01 | 31.4 | 53.06 |
| Transformer | 12 | 768 | 5e-5 | 45.91 | 33.0 | 53.89 |
My research is in program translation, and I hope I can graduate successfully.