A simple semantic search engine for scientific papers. Check out our demo here.
This repository requires Python 3.7 or later.
Before installing, you should create and activate a Python virtual environment. See here for detailed instructions.
If you don't plan on modifying the source code, install from git using pip:
pip install git+https://github.com/PathwayCommons/semantic-search.git
Otherwise, clone the repository locally and then install it in editable mode:
git clone https://github.com/PathwayCommons/semantic-search.git
cd semantic-search
pip install --editable .
Finally, if you would like to take advantage of a CUDA-enabled GPU, you must also install PyTorch with CUDA support by following the instructions for your system here.
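To confirm that the installation steps above took effect, a quick check like the following may help. It is only a sketch: `check_install` is a hypothetical helper (not part of this repository), and the package name `semantic_search` is taken from the `uvicorn` command used to start the server.

```python
import importlib.util
import sys


def check_install(package="semantic_search", min_python=(3, 7)):
    """Report whether the interpreter, the package and CUDA look usable.

    Hypothetical helper for illustration; not shipped with the repository.
    """
    report = {
        # The README asks for Python 3.7 or later.
        "python_ok": sys.version_info >= min_python,
        # True if the package can be found on the import path.
        "package_installed": importlib.util.find_spec(package) is not None,
        # Stays False unless PyTorch is installed and can see a CUDA device.
        "cuda_available": False,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    return report
```

Calling `check_install()` inside the activated virtual environment returns a small dictionary; if `cuda_available` is `False` on a machine with a GPU, the installed PyTorch build probably lacks CUDA support.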
To start up the server:
uvicorn semantic_search.main:app
If you are developing, you can pass the --reload flag to make the server reload on changes.
To provide arguments to the server, pass them as environment variables, e.g.:
CUDA_DEVICE=0 MAX_LENGTH=384 uvicorn semantic_search.main:app
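If you would rather start the server from a Python script than from the shell, the same environment variables can be passed through `subprocess`. This is a minimal sketch: `build_server_env`, `build_server_command` and `launch` are hypothetical helper names, and only the `CUDA_DEVICE` and `MAX_LENGTH` variables and the `--reload` flag come from this README.

```python
import os
import subprocess
import sys


def build_server_env(overrides, base_env=None):
    """Copy the current environment and add the server settings as strings."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def build_server_command(reload=False):
    """Build the uvicorn command line, optionally with --reload."""
    command = [sys.executable, "-m", "uvicorn", "semantic_search.main:app"]
    if reload:
        command.append("--reload")
    return command


def launch(overrides, reload=False):
    """Run the server in the foreground; blocks until the server exits."""
    env = build_server_env(overrides)
    return subprocess.run(build_server_command(reload), env=env, check=True)
```

For example, `launch({"CUDA_DEVICE": 0, "MAX_LENGTH": 384})` is intended to be equivalent to the shell command above.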
Once the server is running, you can make a POST request to the /search endpoint with a JSON body, e.g.:
{
  "query": {
    "uid": "9887103",
    "text": "The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development."
  },
  "documents": [
    {
      "uid": "10320478",
      "text": "Drosophila dSmad2 and Atr-I transmit activin/TGFbeta signals. "
    },
    {
      "uid": "22563507",
      "text": "R-Smad competition controls activin receptor output in Drosophila. "
    },
    {
      "uid": "18820452",
      "text": "Distinct signaling of Drosophila Activin/TGF-beta family members. "
    },
    {
      "uid": "10357889"
    },
    {
      "uid": "31270814"
    }
  ],
  "top_k": 3
}
The return value is a JSON representation of the top_k most similar documents (defaults to 10):
[
  {
    "uid": "10320478",
    "score": 0.6997108459472656
  },
  {
    "uid": "22563507",
    "score": 0.6877762675285339
  },
  {
    "uid": "18820452",
    "score": 0.6436074376106262
  }
]
If "text" is not provided, we assume "uid"s are valid PMIDs and fetch the title and abstract text before embedding, indexing and searching.
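The request and response formats described here can be exercised from Python with only the standard library. The sketch below assumes the server is listening on uvicorn's default address, `http://127.0.0.1:8000`; `build_request` and `search` are hypothetical helper names, not part of this repository.

```python
import json
import urllib.request

# Assumption: uvicorn's default host and port.
SEARCH_URL = "http://127.0.0.1:8000/search"


def build_request(payload, url=SEARCH_URL):
    """Build a POST request carrying `payload` as a JSON body."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(payload, url=SEARCH_URL, timeout=60):
    """Send the request and return the decoded list of uid/score results."""
    request = build_request(payload, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Documents without "text" are expected to be looked up by PMID on the server.
example_payload = {
    "query": {"uid": "9887103"},
    "documents": [{"uid": "10320478"}, {"uid": "22563507"}],
    "top_k": 3,
}
```

With the server running, `for hit in search(example_payload): print(hit["uid"], hit["score"])` would print one line per returned document.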
- Notes on optional parameters
  - top_k: A positive integer (default is 10) that limits the search results to this many of the most similar neighbours (articles).
  - docs_only: A boolean (default is False) that instructs the service to return scores for the provided documents. If true, top_k is disregarded.
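One way to handle the optional parameters on the client side is to include them in the request body only when they are set, so that the server's defaults apply otherwise. The helper below is a hypothetical sketch (`make_search_body` is not part of this repository); its validation mirrors the description of top_k as a positive integer.

```python
def make_search_body(query, documents, top_k=None, docs_only=None):
    """Assemble a /search request body, omitting unset optional parameters."""
    if top_k is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
            raise ValueError("top_k must be a positive integer")
    body = {"query": query, "documents": documents}
    if top_k is not None:
        body["top_k"] = top_k
    if docs_only is not None:
        body["docs_only"] = bool(docs_only)
    return body
```

For example, `make_search_body({"uid": "9887103"}, [{"uid": "10320478"}], docs_only=True)` asks for scores for the listed documents and leaves top_k out, since it would be disregarded anyway.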
If you intend to use a CUDA-enabled GPU, you must also install the NVIDIA Container Toolkit on the host, following the instructions for your system here.
For Ubuntu 18.04:
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list |\
sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install nvidia-container-runtime
Restart Docker:
sudo systemctl stop docker
sudo systemctl start docker
Check your install:
docker run --gpus all nvidia/cuda:10.2-cudnn7-devel nvidia-smi
First, build the Docker image:
docker build -t semantic-search .
Then, run it:
docker run -it -p <PORT>:8000 semantic-search
For a CUDA-enabled GPU:
docker run --gpus all -dt --rm --name semantic_container -p 8000:8000 --env CUDA_DEVICE=0 --env MAX_LENGTH=384 semantic-search:latest
With the web server running, open http://127.0.0.1:8000/redoc in your browser for the API documentation.
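When the container is started in detached mode, the server may take a moment before it accepts connections. A small polling loop against the documentation URL is one way to wait for it. This is a sketch under the assumption that the server is published on port 8000 of the local machine; `wait_for_server` is a hypothetical helper.

```python
import time
import urllib.request


def wait_for_server(url="http://127.0.0.1:8000/redoc", timeout=30.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass.

    Returns True if the server responded in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts and HTTP error statuses.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A script could call `wait_for_server()` right after `docker run` and only send search requests once it returns `True`.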
For contributing guidelines, see CONTRIBUTING.md.