A webapp for discovering similar SCP Wiki articles and stories based on cosine similarity of their embeddings, calculated with Sentence Transformers (Sentence-BERT).
Try it out: https://cxcorp.github.io/scp-wiki-semantic-similarity-search/
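As a rough illustration of the cosine similarity ranking described above, here is a minimal Python sketch. The article names and 3-dimensional vectors are made up for the example; the actual project computes its embeddings with Sentence Transformers and may rank and store matches differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, top_k=3):
    """Rank every other article by similarity to `query`, highest first."""
    scores = [
        (name, cosine_similarity(embeddings[query], vector))
        for name, vector in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical toy "embeddings"; real sentence embeddings are much
# higher-dimensional.
toy_embeddings = {
    "article-a": [1.0, 0.0, 0.0],
    "article-b": [0.9, 0.1, 0.0],
    "article-c": [0.0, 1.0, 0.0],
}
ranking = most_similar("article-a", toy_embeddings)
```

Here `ranking` lists `article-b` before `article-c`, because its vector points in nearly the same direction as that of `article-a`.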
- webapp: The web app
- recommender: The Python project that calculates the embeddings. Needs CUDA 12.1.
- clone repo
- pull the submodules:
git submodule update --init --recursive
- Install CUDA 12.1.x and make sure it works (nvidia-smi can be run)
- Install pyenv (https://github.com/pyenv/pyenv)
cd recommender
- Install the Python version specified by .python-version (pyenv install, or possibly pyenv local)
- Make sure it works in your PATH (python --version says 3.12)
- Make a virtualenv
python -m venv .venv
- Activate the virtualenv
source .venv/bin/activate
- Install deps
pip install -r requirements.txt
- run preprocess pipeline to test that it works
python -m recommender.preprocess
You should now have a wiki.sqlite3 file.
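To confirm that the preprocess step produced a usable database, something like the following sketch could be used. It assumes only that wiki.sqlite3 is a regular SQLite file; the helper name `list_tables` is hypothetical, and nothing is assumed about the table names, which it simply lists.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at `db_path`."""
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} does not exist; did the preprocess step run?")
    # Open read-only so that the check can never modify the database.
    uri = path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("wiki.sqlite3")` from the recommender directory should return a non-empty list if preprocessing worked.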
You can now run python -m recommender.gen_recommend followed by python -m recommender.gen_hubs_sqlite to generate the embeddings, calculate the article similarities, and write out the required data files into webapp/public/.
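The three module invocations above can also be chained from one small script. This is only a sketch: the module names come from the steps above, while the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes that each step can simply be re-run in order from the recommender directory with the virtualenv active.

```python
import subprocess
import sys

# Module names taken from the steps above, in the order they should run.
PIPELINE_MODULES = [
    "recommender.preprocess",
    "recommender.gen_recommend",
    "recommender.gen_hubs_sqlite",
]


def build_commands(python=sys.executable, modules=PIPELINE_MODULES):
    """Build one `python -m <module>` command per pipeline step."""
    return [[python, "-m", module] for module in modules]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for command in commands:
        print("running:", " ".join(command))
        runner(command, check=True)
```

Running `run_pipeline(build_commands())` executes the steps with the current interpreter, so a failure in an earlier step prevents the later ones from running on stale data.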
- Install Node 20 (https://nodejs.org/en/download/).
cd webapp
- Build the customized sql.js with emscripten
- install Docker if you don't already have it
cd webapp
docker build -t sqljs-builder -f sqljs-builder.Dockerfile .
docker run --rm -it -v "$(pwd)/vendor/sql.js:/src" sqljs-builder npm run build
- Now, back in the webapp directory:
npm i
npm run dev
cd webapp
npm run build
- Clone this repo again into a different directory and check out the gh-pages branch
- Delete everything in that directory except .git, then copy everything from inside the webapp/dist directory in this repo into that repo's directory
- Add everything, commit, push
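The delete-and-copy step above is easy to get wrong by hand, so here is a minimal Python sketch of it. The function name `replace_site_contents` is hypothetical, and it assumes the second clone is already checked out on gh-pages; committing and pushing are still done manually.

```python
import shutil
from pathlib import Path


def replace_site_contents(dist_dir, pages_dir):
    """Replace everything in `pages_dir` except .git with the contents of `dist_dir`."""
    dist, pages = Path(dist_dir), Path(pages_dir)
    if not dist.is_dir():
        raise FileNotFoundError(f"{dist} not found; run the build first")
    if not (pages / ".git").exists():
        raise ValueError(f"{pages} does not look like a git checkout")
    # Remove the old site, keeping the repository metadata.
    for entry in pages.iterdir():
        if entry.name == ".git":
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    # Copy the fresh build into the now-empty checkout (Python 3.8+).
    shutil.copytree(dist, pages, dirs_exist_ok=True)
```

The check for .git is a guard against pointing the function at the wrong directory, since everything else in the target is deleted.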
The content of this repository is licensed under the GPLv3 license (see LICENSE-GPLv3 for full license terms), with some exceptions. The git submodules have their own licensing information. The data files corpus.txt, hubs.sqlite and matches.bin, which are content relating to or derived from content relating to the SCP Foundation, have a different license. See next section.
scp-wiki-semantic-similarity-search
Copyright (C) 2024 Joona Heikkilä
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
The data files corpus.txt, hubs.sqlite and matches.bin whose contents are presented in the web app are content relating to or derived from content relating to the SCP Foundation. Content relating to the SCP Foundation, including the SCP Foundation logo, is licensed under Creative Commons Attribution-Sharealike 3.0 (CC BY-SA 3.0) and all concepts originate from https://scpwiki.com/ and its authors. The aforementioned data files, being derived from this content, are hereby also released under Creative Commons Attribution-Sharealike 3.0 (CC BY-SA 3.0).