This demo helps spot plagiarism attempts in texts and images. It uses the Vector Similarity Search (VSS) feature of Redis Stack. See the Redis documentation to learn more about VSS.
First, start a Redis Stack instance. You can use a Docker container. Start it as follows:
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
Now, clone the repository, create a virtual environment and install the dependencies in the environment.
git clone https://github.com/mortensi/oaps.git
python3 -m venv oapsvenv
source oapsvenv/bin/activate
cd oaps
pip install -e .
Now you can execute the demo.
python3 demo.py
In the file oaps.py you can check the relevant methods in Python:
init() checks whether the index exists and creates it if it does not.
index_document(pk, text) indexes the text sentence by sentence.
check_document(text, epsilon) executes a Vector Similarity Search and retrieves the most similar document(s) within a tolerance. This tolerance-based search, also known as a VSS range query, depends on epsilon, a coefficient that filters the results by distance from the query vector.
index_image(pk, imagepath) indexes the picture stored at the indicated path.
check_image(imagepath, epsilon) executes a Vector Similarity Search and retrieves the most similar image(s) within a tolerance.
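To make the role of epsilon concrete, here is a minimal pure-Python sketch of a tolerance-based search. It is not the code in oaps.py: the demo delegates this work to Redis, which searches an HNSW index. The function names and the toy three-dimensional vectors are hypothetical, chosen only to show that a result is kept when its cosine distance from the query vector does not exceed epsilon.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def range_search(query, stored, epsilon):
    """Return (key, distance) pairs within epsilon of the query, closest first."""
    hits = [(key, cosine_distance(query, vec)) for key, vec in stored.items()]
    return sorted((h for h in hits if h[1] <= epsilon), key=lambda h: h[1])

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
stored = {
    "doc:a": [1.0, 0.0, 0.0],
    "doc:b": [0.9, 0.1, 0.0],
    "doc:c": [0.0, 1.0, 0.0],
}
matches = range_search([1.0, 0.0, 0.0], stored, epsilon=0.1)
print([key for key, _ in matches])
```

A smaller epsilon makes the check stricter: only near-identical content is reported. A larger epsilon also catches paraphrases, at the cost of more false positives.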
The demo creates an index over JSON documents whose keys are prefixed with oaps:seq: and uses the vectors stored at the $.embedding path in those documents. I have also added an inverted index on the text itself, at $.sentence, to enable full-text search.
FT.CREATE oaps_txt_idx
ON JSON
PREFIX 1 oaps:seq:
SCHEMA $.sentence AS sentence TEXT
$.embedding AS embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE
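Against an index like this one, a range query could be assembled as sketched below. This is an illustration, not the code from oaps.py: the helper name, the score alias and the radius value are assumptions. The function only builds the FT.SEARCH arguments, so it runs without a Redis server. The query vector is packed as a raw FLOAT32 blob, assuming little-endian byte order.

```python
import struct

def build_range_query(index, vector, epsilon, field="embedding"):
    """Build FT.SEARCH arguments for a VSS range query (not executed here)."""
    # The query vector is sent as a raw FLOAT32 blob (little-endian assumed).
    blob = struct.pack(f"<{len(vector)}f", *vector)
    # Keep documents whose distance from $vec is within $radius,
    # and expose that distance under the (hypothetical) alias "score".
    query = f"@{field}:[VECTOR_RANGE $radius $vec]=>{{$YIELD_DISTANCE_AS: score}}"
    return [
        "FT.SEARCH", index, query,
        "PARAMS", 4, "radius", epsilon, "vec", blob,
        "SORTBY", "score", "ASC",
        "RETURN", 2, "sentence", "score",
        "DIALECT", 2,
    ]

# A zero vector of the right dimension stands in for a real embedding.
args = build_range_query("oaps_txt_idx", [0.0] * 384, epsilon=0.3)
print(args[2])
```

With redis-py installed and a Redis Stack instance running, the arguments could be sent with redis.Redis().execute_command(*args).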
The demo imports the dataset demo/mortensi.csv, which stores a collection of sample articles from my blog.
Embedding models accept input text of limited length. As an example, the model all-MiniLM-L12-v1 truncates input text longer than 128 word pieces (tokens).
Because of this, and also to increase the precision of our anti-plagiarism application, we split and index texts by sentence. We use a simple regular expression that splits the text on the separators ., ! and ?.
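The splitting step can be sketched as follows. The exact pattern used in oaps.py may differ. The function name is hypothetical, and stripping whitespace and dropping empty fragments are choices made here for readability.

```python
import re

def split_sentences(text):
    """Split text on '.', '!' and '?' and drop empty fragments."""
    parts = re.split(r"[.!?]", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("Redis is fast. Is it simple? Yes! ")
print(sentences)
```

A pattern this simple also splits on abbreviations and decimal numbers such as 3.14, which is acceptable for a demo but would need a proper sentence tokenizer on real-world text.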
To test this functionality, we pass an arbitrary text that includes a sentence copied from the dataset.
The output correctly reports the most similar sentence:
Indexed 20 elements
oaps:seq:16gpy:1
[' We are a bunch of people convinced that you have to pass through difficult, or better, impossible challenges to see an idea reach the production stage and possibly provide benefits']
The index for images is created with the following syntax:
FT.CREATE oaps_pic_idx
ON JSON
PREFIX 1 oaps:pic:
SCHEMA $.file AS file TAG SEPARATOR ,
$.embedding AS embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 512 DISTANCE_METRIC COSINE
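The two index definitions differ only in name, key prefix, first field and vector dimension. The sketch below builds both commands from one hypothetical helper, which is not part of oaps.py. It also shows where the 6 after HNSW comes from: it is the number of attribute tokens that follow it.

```python
def build_create_index(index, prefix, first_field, dim):
    """Build FT.CREATE arguments for a JSON index with one HNSW vector field."""
    vector_attrs = ["TYPE", "FLOAT32", "DIM", dim, "DISTANCE_METRIC", "COSINE"]
    return [
        "FT.CREATE", index, "ON", "JSON", "PREFIX", 1, prefix, "SCHEMA",
        *first_field,
        "$.embedding", "AS", "embedding", "VECTOR", "HNSW",
        len(vector_attrs),  # count of attribute tokens after HNSW (6 here)
        *vector_attrs,
    ]

txt = build_create_index(
    "oaps_txt_idx", "oaps:seq:", ["$.sentence", "AS", "sentence", "TEXT"], 384
)
pic = build_create_index(
    "oaps_pic_idx", "oaps:pic:", ["$.file", "AS", "file", "TAG", "SEPARATOR", ","], 512
)
print(" ".join(str(a) for a in pic))
```

With redis-py and a running Redis Stack instance, either list could be sent with redis.Redis().execute_command(*pic).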
For the test, three sample images (a spoon, a cup, and a glass) are vectorized and stored.
We then submit an image of a different glass, and the demo identifies it as a glass.