Arxiv Plagiarism Checker LLM

title	emoji	colorFrom	colorTo	sdk	app_port	pinned
Arxiv Plagiarism Checker LLM	🚀	pink	pink	docker	7860	true

Demo - Link

Dataset - Link

Check an arXiv author's papers for plagiarism just by entering the author's name.

Docs & Working

INPUT - Author's name
OUTPUT - Plagiarism check results

You can get a list of MIT authors from here - Link

Dataset & Embeddings

We used the arXiv dataset for 2023 and 2024 and generated embeddings for the documents with OpenAI embeddings.

Install gsutil - Link

# All files for a single year (2019 here); -r is required to copy directories
gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/19*/ ./papers_from_2019/

# Single file
gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2310/2310.00001v1.pdf .
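The paths in the commands above follow a YYMM folder layout. A minimal sketch of building those paths in Python is shown here; the bucket root is copied from the commands above, while the function names `gcs_pdf_path` and `yymm_prefixes` are hypothetical helpers introduced only for illustration and are not part of this repository.

```python
import re

# Bucket root copied from the gsutil examples above.
GCS_PDF_ROOT = "gs://arxiv-dataset/arxiv/arxiv/pdf"

# New-style arXiv IDs look like "2310.00001" (YYMM.number).
_NEW_STYLE_ID = re.compile(r"^(\d{4})\.(\d{4,5})$")


def gcs_pdf_path(arxiv_id: str, version: int = 1) -> str:
    """Build the gs:// path of one PDF from a new-style arXiv ID (hypothetical helper)."""
    match = _NEW_STYLE_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    if version < 1:
        raise ValueError("version must be >= 1")
    yymm = match.group(1)
    return f"{GCS_PDF_ROOT}/{yymm}/{arxiv_id}v{version}.pdf"


def yymm_prefixes(year: int) -> list[str]:
    """Return the twelve monthly folder names for a year, e.g. 2023 -> '2301'..'2312'."""
    return [f"{year % 100:02d}{month:02d}" for month in range(1, 13)]


if __name__ == "__main__":
    print(gcs_pdf_path("2310.00001"))
    for prefix in yymm_prefixes(2023) + yymm_prefixes(2024):
        print(f"gsutil cp -r {GCS_PDF_ROOT}/{prefix}/ ./papers/")
```

Printing the commands rather than running them keeps the sketch free of side effects; the output can be reviewed before anything is downloaded.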

Tech Stack

Gradio
ChromaDB
SERP API
OpenAI GPT Embeddings & LLM Models

We collected the data for 2023 and 2024 from the arXiv bucket on Google Cloud and then used text-embedding-3-large to generate embeddings for the documents. This amounts to about 10GB.
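As a rough illustration of the embedding step, the sketch below sends texts to text-embedding-3-large in batches. It assumes the `openai` Python package (version 1.x) and an `OPENAI_API_KEY` environment variable; the names `openai_embed` and `embed_in_batches` and the batch size are assumptions made for this example, not the code in embeddings.py.

```python
from typing import Callable, List, Sequence

# An embedding function maps a batch of texts to one vector per text.
EmbedFn = Callable[[Sequence[str]], List[List[float]]]


def openai_embed(texts: Sequence[str], model: str = "text-embedding-3-large") -> List[List[float]]:
    """Embed one batch with the OpenAI API (assumes openai>=1.0 is installed)."""
    from openai import OpenAI  # imported lazily so the rest of the sketch runs without it

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=list(texts))
    return [item.embedding for item in response.data]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: EmbedFn = openai_embed,
    batch_size: int = 64,  # assumed value, tune to your rate limits
) -> List[List[float]]:
    """Embed texts batch by batch, keeping the output order aligned with the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors
```

Passing the embedding function as a parameter makes it possible to swap in a stub for local testing without calling the API.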
Document text extraction is done in 2 formats, with metadata:

Document Level
Paragraph Level
MetaData
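The two extraction levels can be sketched as follows. This is only an illustration under stated assumptions: paragraphs are taken to be separated by blank lines, and the `Chunk` class, `split_paragraphs`, `make_chunks`, and the `min_chars` threshold are hypothetical names and values, not the repository's actual extraction code.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    level: str  # "document" or "paragraph"
    metadata: dict = field(default_factory=dict)


def split_paragraphs(text: str, min_chars: int = 40) -> list[str]:
    """Split on blank lines, collapse whitespace, and drop very short fragments."""
    parts = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(part.split()) for part in parts]
    return [part for part in cleaned if len(part) >= min_chars]


def make_chunks(text: str, metadata: dict, min_chars: int = 40) -> list[Chunk]:
    """Return one document-level chunk followed by the paragraph-level chunks."""
    document = Chunk(" ".join(text.split()), "document", dict(metadata))
    paragraphs = [
        Chunk(paragraph, "paragraph", {**metadata, "paragraph_index": index})
        for index, paragraph in enumerate(split_paragraphs(text, min_chars))
    ]
    return [document, *paragraphs]
```

Each paragraph chunk carries a copy of the document metadata plus its position, so a matching paragraph can be traced back to its source paper.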

Metadata example

{
  "id": "2106.09680",
  "title": "Accuracy, Interpretability, and Differential Privacy via Explainable Boosting",
  "summary": "We show that adding differential privacy to Explainable Boosting Machines\n(EBMs), a recent method for training interpretable ML models, yields\nstate-of-the-art accuracy while protecting privacy. Our experiments on multiple\nclassification and regression datasets show that DP-EBM models suffer\nsurprisingly little accuracy loss even with strong differential privacy\nguarantees. In addition to high accuracy, two other benefits of applying DP to\nEBMs are: a) trained models provide exact global and local interpretability,\nwhich is often important in settings where differential privacy is needed; and\nb) the models can be edited after training without loss of privacy to correct\nerrors which DP noise may have introduced.",
  "source": "http://arxiv.org/pdf/2106.09680",
  "authors": "Harsha Nori Rich Caruana Zhiqi Bu Judy Hanwen Shen Janardhan Kulkarni",
  "references": ""
}
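A record like the one above could be loaded and lightly normalized as in this sketch. The field names come from the example; the `PaperMetadata` class and `load_metadata` function are hypothetical. Note that `authors` is stored as a single space-separated string in the example, so the sketch keeps it as-is rather than guessing where one name ends and the next begins.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("id", "title", "summary", "source", "authors")


@dataclass(frozen=True)
class PaperMetadata:
    id: str
    title: str
    summary: str
    source: str
    authors: str
    references: str = ""


def load_metadata(raw: str) -> PaperMetadata:
    """Parse one JSON metadata record and collapse hard line breaks in the summary."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    return PaperMetadata(
        id=data["id"],
        title=data["title"],
        summary=" ".join(data["summary"].split()),  # "\n" inside the abstract becomes a space
        source=data["source"],
        authors=data["authors"],
        references=data.get("references", ""),
    )
```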

Embeddings are generated for the documents and paragraphs using OpenAI models.
The author is searched through the Google SERP API, and each of the top 10 documents returned is compared individually against the document embeddings.
For the retrieved documents and the top 3 similar papers on the same topic from the Google SERP API, the metadata and text are extracted.
Once extracted, unique lines and paragraphs are identified and compared using an LLM (GPT-4 Preview, 128K context).
Unique lines are compared with the document embeddings, and paragraphs are compared with the paragraph embeddings.
The top 3 similar texts and their respective documents are returned to the user as plagiarised content.
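The final comparison and ranking steps above can be sketched in plain Python. This is an illustration only: the project lists ChromaDB for vector storage, whereas this sketch swaps in a brute-force cosine similarity over an in-memory dictionary so that it runs without dependencies. The function names `unique_lines`, `cosine_similarity`, and `top_k_similar` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def unique_lines(text: str) -> List[str]:
    """Return non-empty lines with duplicates removed, preserving first-seen order."""
    stripped = (line.strip() for line in text.splitlines())
    return list(dict.fromkeys(line for line in stripped if line))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank stored embeddings against the query and keep the k best matches."""
    scored = [(doc_id, cosine_similarity(query, vector)) for doc_id, vector in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With `k=3` this mirrors the "top 3" step; a vector database performs the same ranking with an index instead of a full scan.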

Research Points

Miro RoadMap Link
Notion Link

Top Plagiarism Checkers API

ProWritingAid API V2 - Free Plan
Unicheck - Request Demo
Copyleaks - Request Demo
EDEN AI - Free Plan

Requirements

Python 3.9+
Gradio
OpenAI API keys (GPT)

Installation

pip install -r requirements.txt

Usage

We use a Gradio app to run the plagiarism checker.

python app.py
# or
gradio app.py
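For orientation, a Gradio app with one text input and one text output can be wired up roughly as below. This is a placeholder sketch, not the repository's app.py: `check_author` returns a stub message instead of running the real pipeline, and both function names are assumptions made for this example.

```python
def check_author(author_name: str) -> str:
    """Placeholder for the real pipeline: normalizes the name and returns a stub message."""
    name = " ".join(author_name.split())
    if not name:
        return "Please enter an author name."
    # The real app would search for the author and compare embeddings here.
    return f"No plagiarism check has been run for '{name}' (placeholder result)."


def build_app():
    """Create the Gradio interface (assumes the gradio package is installed)."""
    import gradio as gr  # imported lazily so check_author can be used without Gradio

    return gr.Interface(
        fn=check_author,
        inputs=gr.Textbox(label="Author name"),
        outputs=gr.Textbox(label="Plagiarism check results"),
        title="Arxiv Plagiarism Checker LLM",
    )
```

To serve it on the port given in the metadata at the top of this README, one could call `build_app().launch(server_name="0.0.0.0", server_port=7860)`. When starting with the `gradio` command instead, the interface is usually assigned to a module-level variable named `demo`.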

gamingflexer/arxiv-plagiarism-checker-llm
