Plagiarism is a serious problem in research. In this project, we will implement a very simple plagiarism detector. The input is a corpus of existing documents and a potentially plagiarized document. The output is the set of URLs from which the document was plagiarized, together with statistics showing the proportion of the document that was plagiarized.
The objective of our project is to detect plagiarism in general text as well as in source code written in any programming language, which should help agencies such as patent organisations flag copied data and code. It should also help prevent the copying of research ideas and topics.
- Rabin-Karp Pattern Matching Algorithm:
The Rabin–Karp algorithm (also called the Karp–Rabin algorithm) is a string-searching algorithm that uses hashing to find any one of a set of pattern strings in a text. For a text of length n and p patterns of combined length m, its average-case and best-case running time is O(n+m) using O(p) space, but its worst-case running time is O(nm).
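The rolling-hash idea behind Rabin–Karp can be illustrated with a minimal sketch. It covers only the single-pattern case, not the multi-pattern variant or the project's actual detector. The function name `rabin_karp` and the `base` and `mod` parameters are assumptions chosen for illustration.

```python
def rabin_karp(text: str, pattern: str, base: int = 256,
               mod: int = 1_000_000_007) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Weight of the leading character of a window: base^(m-1) mod mod.
    high = pow(base, m - 1, mod)

    # Hash the pattern and the first window of the text.
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod

    matches = []
    for start in range(n - m + 1):
        # Compare characters only when the hashes agree, to rule out collisions.
        if w_hash == p_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m].
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return matches


if __name__ == "__main__":
    print(rabin_karp("abracadabra", "abra"))
```

Each window hash is derived from the previous one in constant time, which is where the O(n+m) average case comes from. The O(nm) worst case arises when many windows share the pattern's hash and each one has to be verified character by character.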
- Natural Language Processing Algorithms