Full-text search engine from scratch in Go ʕ◔ϖ◔ʔ (just a toy)
- Omochi is an inverted-index-based search engine written in Go.
- If indexed correctly, any document can be searched.
- You can search documents through a RESTful API.
- Supported languages: English and Japanese.
Create a Docker network (omochi_network) with:
$ docker network create omochi_network
Omochi stores inverted indexes and documents in MariaDB and uses Ent as its ORM.
To migrate the database, open a shell in the Docker container:
$ docker-compose run api bash
Then run the database migration:
$ go run ./cmd/migrate/migrate.go
To try the search engine, this project provides two sample datasets in TSV format.
The English dataset is a movie title dataset, and the Japanese dataset is a Doraemon comic title dataset.
First, open a shell in the Docker container:
$ docker-compose run api bash
Then seed the data:
$ go run {path to seed.go}
For the Japanese dataset, {path to seed.go} is ./cmd/seeds/ja/seed.go; for the English dataset, it is ./cmd/seeds/eng/seed.go.
After completing the setup, start the application by running:
$ docker-compose up
This starts a RESTful API that listens on port 8081.
After seeding the data, you can search documents by sending a GET request to /v1/document/search.
Query parameters are as follows:
- "keywords": Keywords to search. Separate multiple search terms with commas, like "hoge,fuga,piyo".
- "mode": Search mode. Either "And" or "Or".
- Doraemon comic title dataset
After seeding the Doraemon comic title dataset, you can search for documents that include "ドラえもん" with:
$ curl "http://localhost:8081/v1/document/search?keywords=ドラえもん" | jq .
{
"documents": [
{
"id": 12054,
"content": "ドラえもんの歌",
"tokenized_content": [
"ドラえもん",
"歌"
],
"created_at": "2022-07-08T12:59:49+09:00",
"updated_at": "2022-07-08T12:59:49+09:00"
},
{
"id": 11992,
"content": "恋するドラえもん",
"tokenized_content": [
"恋する",
"ドラえもん"
],
"created_at": "2022-07-08T12:59:48+09:00",
"updated_at": "2022-07-08T12:59:48+09:00"
},
{
"id": 11230,
"content": "ドラえもん登場!",
"tokenized_content": [
"ドラえもん",
"登場"
],
"created_at": "2022-07-08T12:59:44+09:00",
"updated_at": "2022-07-08T12:59:44+09:00"
},
...
- Movie title dataset
After seeding the movie title dataset, you can search for documents that include both "toy" and "story" with:
$ curl "http://localhost:8081/v1/document/search?keywords=toy,story&mode=And" | jq .
{
"documents": [
{
"id": 1,
"content": "Toy Story",
"tokenized_content": [
"toy",
"story"
],
"created_at": "2022-07-08T13:49:24+09:00",
"updated_at": "2022-07-08T13:49:24+09:00"
},
{
"id": 39,
"content": "Toy Story of Terror!",
"tokenized_content": [
"toy",
"story",
"terror"
],
"created_at": "2022-07-08T13:49:34+09:00",
"updated_at": "2022-07-08T13:49:34+09:00"
},
{
"id": 83,
"content": "Toy Story That Time Forgot",
"tokenized_content": [
"toy",
"story",
"time",
"forgot"
],
"created_at": "2022-07-08T13:49:53+09:00",
"updated_at": "2022-07-08T13:49:53+09:00"
},
{
"id": 213,
"content": "Toy Story 2",
"tokenized_content": [
"toy",
"story"
],
"created_at": "2022-07-08T13:50:35+09:00",
"updated_at": "2022-07-08T13:50:35+09:00"
},
{
"id": 352,
"content": "Toy Story 3",
"tokenized_content": [
"toy",
"story"
],
"created_at": "2022-07-08T13:51:23+09:00",
"updated_at": "2022-07-08T13:51:23+09:00"
}
]
}
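The same request can also be prepared and its response decoded from Go. The following is a rough client-side sketch rather than code from this repository: the `Document` and `SearchResponse` structs are hypothetical types whose fields simply mirror the JSON shown above, and `buildSearchURL` and `decodeResponse` are illustrative helper names. To stay runnable without a server, it decodes a sample payload copied from the output above instead of sending a real request.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
	"time"
)

// Document mirrors one element of the "documents" array in the response.
// (Hypothetical client-side type; field names follow the JSON shown above.)
type Document struct {
	ID               int       `json:"id"`
	Content          string    `json:"content"`
	TokenizedContent []string  `json:"tokenized_content"`
	CreatedAt        time.Time `json:"created_at"`
	UpdatedAt        time.Time `json:"updated_at"`
}

// SearchResponse mirrors the top-level response object.
type SearchResponse struct {
	Documents []Document `json:"documents"`
}

// buildSearchURL joins the keywords with commas and percent-encodes the query.
// The mode parameter is omitted when mode is empty.
func buildSearchURL(base string, keywords []string, mode string) string {
	q := url.Values{}
	q.Set("keywords", strings.Join(keywords, ","))
	if mode != "" {
		q.Set("mode", mode)
	}
	return base + "/v1/document/search?" + q.Encode()
}

// decodeResponse parses a response body into a SearchResponse.
func decodeResponse(body []byte) (SearchResponse, error) {
	var resp SearchResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

func main() {
	fmt.Println(buildSearchURL("http://localhost:8081", []string{"toy", "story"}, "And"))

	// Sample payload copied from the example output above.
	sample := []byte(`{"documents":[{"id":1,"content":"Toy Story",
		"tokenized_content":["toy","story"],
		"created_at":"2022-07-08T13:49:24+09:00",
		"updated_at":"2022-07-08T13:49:24+09:00"}]}`)
	resp, err := decodeResponse(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, d := range resp.Documents {
		fmt.Println(d.ID, d.Content, d.TokenizedContent)
	}
}
```

Note that `url.Values.Encode` percent-encodes the comma between keywords (as `%2C`) and any non-ASCII keyword such as "ドラえもん"; this sketch assumes the server decodes query parameters in the standard way.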
- Fujiko F. Fujio, Doraemon (Tentomushi Comics) 1~45, Shogakukan, 1974~1996
- Rounak Banik, "The Movies Dataset", Kaggle, https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset. Accessed on 07/08