POET

This is the official repo for the paper Reasoning Like Program Executors.

Pre-training Corpus

You can find the pre-training SQL corpus here and the pre-training Math corpus here.

The pre-training SQL corpus can be synthesized following the same procedure as in TAPEX, with numbers expanded by the expand_numbers_in_text function below:

import re

def expand_numbers_in_text(text, delim=" ", ignore_chars=[","], reverse_num=False):
    # Matches integers, decimals, comma-grouped numbers, scientific notation, and percentages.
    number_pattern = r"[-+]?[.]?[\d]+(,\d+)*[\.]?\d*(?:[eE][-+]?\d+)?%?"
    num_char_spans = [(m.start(0), m.end(0)) for m in re.finditer(number_pattern, text)]
    if len(num_char_spans) == 0:
        return text
    out_text = ""
    last_e = -1
    for i, (s, e) in enumerate(num_char_spans):
        # Copy the text between the previous number and the current one.
        out_text += text[:s] if i == 0 else text[last_e:s]
        # Split the number into individual characters, dropping ignored ones (e.g. thousands separators).
        num_str = delim.join([c for c in list(text[s:e]) if c not in ignore_chars])
        # Optionally reverse the character order of the number.
        out_text += num_str if not reverse_num else num_str[::-1]
        last_e = e
    out_text += text[last_e:]  # append the rest of the text
    return out_text
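
For example, the function splits every matched number into delimiter-separated characters while leaving the surrounding text untouched:

>>> expand_numbers_in_text("Revenue grew 12.5% in 2019")
'Revenue grew 1 2 . 5 % in 2 0 1 9'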

The pre-training Math corpus can be synthesized with the script synthesize_math_corpus.py, and the pre-training Logic corpus with the script synthesize_logic_corpus.py.
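
As a rough illustration of what such a corpus looks like (a minimal sketch, not the actual script: the function name, parameters, and expression format below are assumptions, so please refer to synthesize_math_corpus.py and the paper for the exact format), each Math example pairs a randomly generated arithmetic expression with its evaluated result:

import random

def synthesize_math_example(num_operands=3, max_value=100):
    # Hypothetical sketch: build a random arithmetic expression as the source
    # sequence and pair it with its evaluated result as the target sequence.
    operators = ["+", "-", "*"]
    expression = str(random.randint(0, max_value))
    for _ in range(num_operands - 1):
        expression += f" {random.choice(operators)} {random.randint(0, max_value)}"
    return expression, str(eval(expression))

src, tgt = synthesize_math_example()  # e.g. src = "12 + 7 * 3", tgt = "33"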

For all BART-based experiments, we use the fairseq implementation, so you can prepare the dataset in the following format:

|- dataset
    |- train.src
    |- train.tgt
    |- valid.src
    |- valid.tgt

After the necessary preprocessing (you can follow the official fairseq guide for the machine translation task; a typical preprocessing call is sketched below), you can train the model with the script that follows.
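
A binarization call might look like the following (a sketch: the dictionary path BART-large/dict.txt is an assumption chosen to match the --restore-file path in the training command, and the text is assumed to have been BPE-encoded beforehand as in the official BART fine-tuning examples):

fairseq-preprocess \
    --source-lang src \
    --target-lang tgt \
    --trainpref dataset/train \
    --validpref dataset/valid \
    --destdir dataset/bin/ \
    --srcdict BART-large/dict.txt \
    --tgtdict BART-large/dict.txt \
    --workers 20

With the binarized data in dataset/bin/, train the model: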

fairseq-train dataset/bin/ \
    --save-dir models \
    --tensorboard-logdir tensorboard_logs \
    --restore-file BART-large/model.pt \
    --arch bart_large \
    --task translation \
    --maximize-best-checkpoint-metric \
    --criterion label_smoothed_cross_entropy  \
    --source-lang src  \
    --target-lang tgt  \
    --label-smoothing 0.1  \
    --max-tokens 1536 \
    --validate-interval 50 \
    --save-interval 50 \
    --save-interval-updates 3001 \
    --validate-interval-updates 3001 \
    --keep-interval-updates 5 \
    --update-freq 16 \
    --warmup-updates 500  \
    --max-update 20000  \
    --total-num-update 20000  \
    --required-batch-size-multiple 1  \
    --dropout 0.1  \
    --attention-dropout 0.1  \
    --relu-dropout 0.0  \
    --weight-decay 0.01  \
    --optimizer adam  \
    --adam-betas "(0.9, 0.999)"  \
    --adam-eps 1e-08  \
    --clip-norm 0.1  \
    --lr-scheduler polynomial_decay  \
    --lr 3e-5  \
    --ddp-backend no_c10d  \
    --num-workers 1  \
    --reset-meters  \
    --reset-optimizer \
    --reset-dataloader \
    --share-all-embeddings \
    --layernorm-embedding \
    --share-decoder-input-output-embed  \
    --skip-invalid-size-inputs-valid-test  \
    --log-format json  \
    --log-interval 10  \
    --patience 10 \
    --keep-best-checkpoints 1 \
    --report-accuracy \
    --no-epoch-checkpoints \
    --no-last-checkpoints \
    --no-save-optimizer-state
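
For reference, the effective batch size here is --max-tokens 1536 × --update-freq 16 ≈ 24.5K tokens per GPU per update (fairseq counts --max-tokens per GPU). If you train on a different number of GPUs, you may want to adjust --update-freq to keep the effective batch size comparable.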

Pre-trained Model Weights

You can find all the available POET model weights on the Hugging Face Hub. You can fine-tune these models exactly as you would the corresponding vanilla models. They are pre-trained on the following format of sentence and natural context:

[sentence] col : [natural context]

where [sentence] is usually the question of the task and [natural context] is usually its passage. Please refer to our paper for more details.
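
As an illustration of this input format (a sketch, assuming a BART-style seq2seq checkpoint; the checkpoint name below is a placeholder for the weight you download from the Hub, and the question and passage are made-up examples):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint name; replace with an actual POET weight from the Hugging Face Hub.
model_name = "poet-bart-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "How many field goals were kicked in the first half?"
passage = "The Bears kicked two field goals in the first half and one more after the break."
# Build the input in the "[sentence] col : [natural context]" format used during pre-training.
inputs = tokenizer(f"{question} col : {passage}", return_tensors="pt")
outputs = model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))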