title | emoji | colorFrom | colorTo | sdk | app_port | pinned |
---|---|---|---|---|---|---|
Arxiv Plagiarism Checker LLM | 🚀 | pink | pink | docker | 7860 | true |
Demo - Link
Dataset - Link
Check an arXiv author's papers for plagiarism just by entering the author's name.
Input: author's name. Output: plagiarism check results.
You can get a list of MIT authors here - Link
We used the arXiv dataset for 2023 and 2024 and generated embeddings for the documents with OpenAI embeddings.
- Install gsutil - Link
# All files for a single year (-r is needed to copy directories recursively)
gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/19*/ ./papers_from_2019/
# Single file
gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2310/2310.00001v1.pdf .
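The single-file command above suggests that PDFs are stored under a `YYMM/` folder named after the first four digits of the arXiv identifier. As a minimal sketch under that assumption, the helper below (`gcs_pdf_path` is a hypothetical name, not part of this repository) builds the bucket path for a versioned new-style identifier. Old-style identifiers such as `hep-th/9901001` are not handled.

```python
import re

# Bucket prefix taken from the gsutil commands above.
BUCKET_PREFIX = "gs://arxiv-dataset/arxiv/arxiv/pdf"

# New-style arXiv identifier with a version suffix, e.g. 2310.00001v1.
_ARXIV_ID = re.compile(r"^(\d{4})\.(\d{4,5})v(\d+)$")


def gcs_pdf_path(versioned_id: str) -> str:
    """Return the assumed bucket path for a versioned arXiv identifier."""
    match = _ARXIV_ID.match(versioned_id)
    if match is None:
        raise ValueError(f"not a versioned new-style arXiv id: {versioned_id!r}")
    yymm = match.group(1)  # folder name, e.g. "2310"
    return f"{BUCKET_PREFIX}/{yymm}/{versioned_id}.pdf"


if __name__ == "__main__":
    # Prints the path used in the single-file gsutil example above.
    print(gcs_pdf_path("2310.00001v1"))
```

The returned string can be passed to `gsutil cp` as the source argument.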
- Gradio
- ChromaDB
- SERP API
- OpenAI GPT Embeddings & LLM Models
- We collected the data from the arXiv GCP bucket for 2023 and 2024 and used text-embedding-3-large to generate embeddings for the documents. This amounts to about 10GB.
- Document text extraction is done in 2 formats, with metadata
  - Document Level
  - Paragraph Level
  - Metadata
Metadata example
{
"id": "2106.09680",
"title": "Accuracy, Interpretability, and Differential Privacy via Explainable Boosting",
"summary": "We show that adding differential privacy to Explainable Boosting Machines\n(EBMs), a recent method for training interpretable ML models, yields\nstate-of-the-art accuracy while protecting privacy. Our experiments on multiple\nclassification and regression datasets show that DP-EBM models suffer\nsurprisingly little accuracy loss even with strong differential privacy\nguarantees. In addition to high accuracy, two other benefits of applying DP to\nEBMs are: a) trained models provide exact global and local interpretability,\nwhich is often important in settings where differential privacy is needed; and\nb) the models can be edited after training without loss of privacy to correct\nerrors which DP noise may have introduced.",
"source": "http://arxiv.org/pdf/2106.09680",
"authors": "Harsha Nori Rich Caruana Zhiqi Bu Judy Hanwen Shen Janardhan Kulkarni",
"references": ""
}
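To illustrate the document-level and paragraph-level formats, here is a minimal sketch that attaches the same metadata to one whole-document record and to one record per paragraph. The `Record` class, the `#p<index>` id suffix, and the blank-line paragraph rule are assumptions made for this example, not the project's actual extraction code.

```python
import re
from dataclasses import dataclass


@dataclass
class Record:
    record_id: str   # hypothetical id scheme: "<arxiv id>" or "<arxiv id>#p<index>"
    text: str
    metadata: dict


def split_paragraphs(text: str) -> list:
    """Split on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def build_records(text: str, metadata: dict):
    """Return one document-level record and a list of paragraph-level records."""
    paragraphs = split_paragraphs(text)
    document = Record(metadata["id"], " ".join(paragraphs), metadata)
    paragraph_records = [
        Record(f"{metadata['id']}#p{index}", paragraph, metadata)
        for index, paragraph in enumerate(paragraphs)
    ]
    return document, paragraph_records


if __name__ == "__main__":
    sample = "First paragraph,\nwrapped line.\n\nSecond paragraph."
    doc, paras = build_records(sample, {"id": "2106.09680"})
    print(doc.record_id, len(paras))
```

Keeping the metadata on every paragraph record means a matching paragraph can be traced back to its source paper without a second lookup.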
- Embeddings are generated for the documents and paragraphs using OpenAI models.
- Authors are then searched with the Google SERP API, and each of the top 10 documents is compared individually with the document embeddings.
- Retrieved documents and the top 3 similar papers from the Google SERP API on the topic
  - Metadata and text are extracted
- After extraction, unique lines and paragraphs are pulled out and then compared using an LLM (GPT-4 Preview model, 128K).
- Unique lines are then compared with the document embeddings, and paragraphs with the paragraph embeddings.
- The top 3 similar texts and their respective documents are then returned to the user as plagiarised content.
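The comparison and top-3 selection in the steps above can be sketched in plain Python. In the project the nearest-neighbour search would go through ChromaDB; the version below only illustrates the idea. Cosine similarity is an assumption (the similarity metric is not stated here), and `top_k_similar` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, corpus: dict, k: int = 3) -> list:
    """Return the k (id, score) pairs from corpus most similar to query, best first."""
    scored = [(doc_id, cosine_similarity(query, vector)) for doc_id, vector in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real embeddings.
    corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [-1.0, 0.0]}
    for doc_id, score in top_k_similar([1.0, 0.0], corpus):
        print(doc_id, round(score, 3))
```

The same function works for both comparisons described above: pass line embeddings with the document-level corpus, or paragraph embeddings with the paragraph-level corpus.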
- ProWritingAid API V2 - Free Plan
- Unicheck - Request Demo
- Copyleaks - Request Demo
- EDEN AI - Free Plan
- Python 3.9+
- Gradio
- GPT Keys
pip install -r requirements.txt
We use a Gradio app to implement the plagiarism checker.
python app.py or gradio app.py