Skip to content

Latest commit

 

History

History
55 lines (32 loc) · 1.91 KB

Report.md

File metadata and controls

55 lines (32 loc) · 1.91 KB

Codesim

Codesim is a reproduction of Needle: Detecting Code Plagiarism on Student Submissions. It supports performing a similarity check on two single-file C++ programs.

Algorithm

Steps

  1. Compile source file with clang++;

  2. Use nm command to list symbols in the object file;

  3. Use objdump command to read all the function body in the object file;

  4. Use Needle algorithm to diff these two sets of functions.

Compile

Codesim will compile input program with the following command.

clang++ --std=c++17 -pedantic -O2 {{filename}}

And then list symbols in the object file.

nm --demangle --defined-only -P {{object}}

Notice that we will extract all the symbols in the text (code) section, and then filter all the functions whose names start with std:: and some other special functions like _start.

Finally, dump all the compiled functions defined by the user.

objdump -d {{object}}

Needle

Needle treats a function as a list of instructions (ignore the operands), and define a similarity function between $f_1, f_2$:

$$ \sigma(f_1, f_2) = \max_{k \in { 1, 2, \dots, |f_2| } } LCS(f_1, f_2[k : k + \omega]) $$

This means calculate the max LCS between function $f_1$ and a window of $f_2$ with the size of $\omega$.

Then, create a weighted flow network graph like this

flow

The maximum cost divided by the total size of the first program is the final result. You can see more details in the original paper.

The differences from the original paper are that we extract the full instructions not ignoring the operands. We use the final the maximum cost divided by the total size multiplied by $1 + \exp(-\alpha + \beta)$ (you can enable this by passing argument --norm).