Codesim is a reproduction of Needle: Detecting Code Plagiarism on Student Submissions. It checks the similarity of two single-file C++ programs.
- Compile the source file with clang++;
- Use the nm command to list the symbols in the object file;
- Use the objdump command to read all the function bodies in the object file;
- Use the Needle algorithm to diff the two sets of functions.
Codesim compiles each input program with the following command.
clang++ --std=c++17 -pedantic -O2 {{filename}}
It then lists the symbols in the object file.
nm --demangle --defined-only -P {{object}}
Note that Codesim extracts all the symbols in the text (code) section, and then filters out every function whose name starts with std:: as well as some other special functions such as _start.
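The symbol filtering described above can be sketched in Python as follows. This is a minimal illustration, not Codesim's actual code: the function name user_functions, the skip lists, and the choice to keep only symbols of type T or t are assumptions, and the sketch assumes nm prints hexadecimal values in its -P format.

```python
import re

# One line of `nm -P` output: name, type letter, value, optional size.
# Demangled names may contain spaces, so the name is matched non-greedily
# and the remaining fields are anchored at the end of the line.
NM_LINE = re.compile(
    r"^(?P<name>.+?) (?P<type>[A-Za-z]) (?P<value>[0-9a-fA-F]+)"
    r"(?: (?P<size>[0-9a-fA-F]+))?\s*$"
)

# Assumption: illustrative skip lists, not the ones Codesim really uses.
SKIPPED_PREFIXES = ("std::",)
SKIPPED_NAMES = {"_start", "_init", "_fini"}


def user_functions(nm_output):
    """Return {name: (address, size)} for user-defined text symbols."""
    functions = {}
    for line in nm_output.splitlines():
        match = NM_LINE.match(line)
        if match is None:
            continue
        if match["type"] not in ("T", "t"):  # keep text-section symbols only
            continue
        name = match["name"]
        if name.startswith(SKIPPED_PREFIXES) or name in SKIPPED_NAMES:
            continue
        size = int(match["size"], 16) if match["size"] else 0
        functions[name] = (int(match["value"], 16), size)
    return functions


# Hand-written sample in the shape of `nm --demangle --defined-only -P`.
sample = """\
main T 0000000000001140 0000000000000025
add(int, int) T 0000000000001130 000000000000000a
std::vector<int, std::allocator<int> >::size() const W 0000000000001170 000000000000001c
_start T 0000000000001040 000000000000002f
counter B 0000000000004010 0000000000000004
"""
result = user_functions(sample)
print(sorted(result))  # ['add(int, int)', 'main']
```

Only main and add(int, int) survive here: the std:: symbol and _start are filtered out, and counter is dropped because it is not in the text section.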
Finally, it dumps all the compiled functions defined by the user.
objdump -d {{object}}
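To turn the disassembly into per-function instruction lists, a parser along these lines could be used. It is a sketch under assumptions: parse_disassembly is a hypothetical name, the sample is hand-written in the usual objdump -d layout (address, raw bytes, and instruction separated by tabs), and functions are keyed by whatever label objdump prints, which is the mangled name unless demangling is requested.

```python
import re

# Header of a disassembled function, e.g. "0000000000001130 <_Z3addii>:".
HEADER = re.compile(r"^[0-9a-fA-F]+ <(?P<label>.+)>:\s*$")


def parse_disassembly(objdump_output):
    """Map each function label to its list of instruction strings."""
    functions = {}
    current = None
    for line in objdump_output.splitlines():
        header = HEADER.match(line)
        if header:
            current = functions.setdefault(header["label"], [])
            continue
        if current is None:
            continue
        # Instruction lines look like "  addr:\tbytes\tmnemonic operands".
        fields = line.split("\t")
        if len(fields) < 3 or not fields[0].strip().endswith(":"):
            continue  # blank lines, section banners, byte-only continuations
        # Collapse the column padding between mnemonic and operands.
        current.append(" ".join(fields[2].split()))
    return functions


# Hand-written sample in the shape of `objdump -d` output.
sample = (
    "Disassembly of section .text:\n"
    "\n"
    "0000000000001130 <_Z3addii>:\n"
    "    1130:\t8d 04 37             \tlea    (%rdi,%rsi,1),%eax\n"
    "    1133:\tc3                   \tret\n"
    "\n"
    "0000000000001140 <main>:\n"
    "    1140:\t31 c0                \txor    %eax,%eax\n"
    "    1142:\tc3                   \tret\n"
)
parsed = parse_disassembly(sample)
print(parsed["main"])  # ['xor %eax,%eax', 'ret']
```

The sketch keeps the full instruction text, operands included; keeping only the first word of each instruction would give the operand-free form.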
Needle treats a function as a list of instructions (ignoring the operands) and defines a similarity function between two functions.
This means calculating the maximum LCS (longest common subsequence) between the functions.
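The LCS building block can be sketched with the standard dynamic-programming recurrence. This shows only a plain LCS length over two instruction lists; how Needle combines LCS values into its similarity function is defined in the paper and is not reproduced here. The name lcs_length and the sample instruction lists are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    previous = [0] * (len(b) + 1)  # DP row for the prefix of `a` seen so far
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


# Two functions as instruction lists with the operands dropped.
f = ["push", "mov", "add", "pop", "ret"]
g = ["push", "add", "sub", "ret"]
print(lcs_length(f, g))  # 3 (push, add, ret)
```

The sketch keeps only two rows of the table, so memory grows with the length of the shorter inner list rather than with the product of both lengths.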
Then it creates a weighted flow network graph from the functions of the two programs.
The maximum cost divided by the total size of the first program is the final result. You can see more details in the original paper.
Codesim differs from the original paper in two ways. First, it extracts the full instructions instead of ignoring the operands. Second, the final result is the maximum cost divided by the total size multiplied by a normalization value (set with --norm).