Skip to content

HPC-Fortran2CPP/beyond-code-pairs-translation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

Introduction

Beyond traditional source–target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran→C++ and C++→CUDA, the pipeline yields 11.7k and 3.93k dialogues, respectively. Fine-tuning open-weight LLMs on both code-pair and dialogue data improves translation performance. We further investigate different strategies for fine-tuning data construction and analyze their impact on translation quality, highlighting the effectiveness and flexibility of our generated dialogue data.

Content

  • src/ folder contains the source code that is used to run the multi-agent dialogue generation.
  • data/ folder contains the data that is used to do the model fine-tuning.
  • sft/ folder contains the configs that is used to fine-tune the models.
  • analysis/ folder contains the analysis results of the F2C and CPP2Cuda translation tasks.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages