Commit

Lang mappings
* Improved search for documents with metadata in English, Spanish, French, Italian, German and Portuguese: results now match grammatical gender and singular/plural variants of words, among other improvements. NOTE: a reindex is required; it runs automatically when Coreander is launched.
* Show a message on the home page when the user has no highlights.
svera authored Jan 21, 2024
1 parent 9f174c3 commit bee6c6a
Showing 16 changed files with 292 additions and 199 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/go.yml
@@ -16,7 +16,7 @@ jobs:
- name: Set up Go
uses: actions/setup-go@v2
with:
go-version: 1.18
go-version: 1.21

- name: Build
run: go build .
1 change: 1 addition & 0 deletions README.md
@@ -8,6 +8,7 @@ A personal documents server, Coreander indexes the documents (EPUBs and PDFs wit
* Single binary with all dependencies included.
* Fast search engine powered by [Bleve](https://github.com/blevesearch/bleve), with support for documents in multiple languages.
* Search by author, title and even document series ([Calibre's](https://calibre-ebook.com/) `series` meta supported)
* Improved search for documents with metadata in english, spanish, french, italian, german and portuguese, including genre and singular/plural forms of words in the results among others.
* Estimated reading time calculation.
* High-performance web server powered by [Fiber](https://github.com/gofiber/fiber).
* Lightweight, responsive web interface based on [Bootstrap](https://getbootstrap.com/).
24 changes: 12 additions & 12 deletions go.mod
@@ -3,8 +3,7 @@ module github.com/svera/coreander/v3
go 1.21

require (
github.com/blevesearch/bleve v1.0.14
github.com/blevesearch/bleve/v2 v2.3.8
github.com/blevesearch/bleve/v2 v2.3.11-0.20240110164916-5f1f45a5c32a
github.com/bmatcuk/doublestar/v4 v4.6.0
github.com/disintegration/imaging v1.6.2
github.com/glebarez/sqlite v1.8.0
@@ -38,6 +37,7 @@ require (
github.com/pkg/errors v0.9.1 // indirect
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec // indirect
github.com/sirupsen/logrus v1.9.3 // indirect
github.com/stretchr/testify v1.8.4 // indirect
github.com/tinylib/msgp v1.1.8 // indirect
gopkg.in/alexcesaro/quotedprintable.v3 v3.0.0-20150716171945-2caba252f4dc // indirect
modernc.org/libc v1.22.6 // indirect
@@ -53,21 +53,21 @@ require (
github.com/andybalholm/brotli v1.0.5 // indirect
github.com/aymerick/douceur v0.2.0 // indirect
github.com/bits-and-blooms/bitset v1.7.0 // indirect
github.com/blevesearch/bleve_index_api v1.0.5 // indirect
github.com/blevesearch/geo v0.1.17 // indirect
github.com/blevesearch/bleve_index_api v1.0.6 // indirect
github.com/blevesearch/geo v0.1.18 // indirect
github.com/blevesearch/go-porterstemmer v1.0.3 // indirect
github.com/blevesearch/gtreap v0.1.1 // indirect
github.com/blevesearch/mmap-go v1.0.4 // indirect
github.com/blevesearch/scorch_segment_api/v2 v2.1.5 // indirect
github.com/blevesearch/scorch_segment_api/v2 v2.1.6 // indirect
github.com/blevesearch/segment v0.9.1 // indirect
github.com/blevesearch/snowballstem v0.9.0 // indirect
github.com/blevesearch/upsidedown_store_api v1.0.2 // indirect
github.com/blevesearch/vellum v1.0.9 // indirect
github.com/blevesearch/zapx/v11 v11.3.8 // indirect
github.com/blevesearch/zapx/v12 v12.3.8 // indirect
github.com/blevesearch/zapx/v13 v13.3.8 // indirect
github.com/blevesearch/zapx/v14 v14.3.8 // indirect
github.com/blevesearch/zapx/v15 v15.3.11 // indirect
github.com/blevesearch/vellum v1.0.10 // indirect
github.com/blevesearch/zapx/v11 v11.3.10 // indirect
github.com/blevesearch/zapx/v12 v12.3.10 // indirect
github.com/blevesearch/zapx/v13 v13.3.10 // indirect
github.com/blevesearch/zapx/v14 v14.3.10 // indirect
github.com/blevesearch/zapx/v15 v15.3.13 // indirect
github.com/flotzilla/pdf_parser v0.1.96
github.com/golang/geo v0.0.0-20230421003525-6adc56603217 // indirect
github.com/golang/protobuf v1.5.3 // indirect
@@ -92,7 +92,7 @@ require (
golang.org/x/image v0.10.0 // indirect
golang.org/x/net v0.17.0 // indirect
golang.org/x/sys v0.13.0 // indirect
google.golang.org/protobuf v1.30.0 // indirect
google.golang.org/protobuf v1.31.0 // indirect
gopkg.in/gomail.v2 v2.0.0-20160411212932-81ebce5c23df
gopkg.in/yaml.v3 v3.0.1 // indirect
olympos.io/encoding/edn v0.0.0-20201019073823-d3554ca0b0a3 // indirect
117 changes: 24 additions & 93 deletions go.sum

Large diffs are not rendered by default.

91 changes: 72 additions & 19 deletions internal/index/bleve.go
@@ -5,15 +5,37 @@ import (
"path/filepath"
"strings"

"github.com/blevesearch/bleve/analysis/token/lowercase"
"github.com/blevesearch/bleve/analysis/tokenizer/unicode"
"github.com/blevesearch/bleve/v2"
"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
"github.com/blevesearch/bleve/v2/analysis/char/asciifolding"
"github.com/blevesearch/bleve/v2/analysis/lang/de"
"github.com/blevesearch/bleve/v2/analysis/lang/en"
"github.com/blevesearch/bleve/v2/analysis/lang/es"
"github.com/blevesearch/bleve/v2/analysis/lang/fr"
"github.com/blevesearch/bleve/v2/analysis/lang/it"
"github.com/blevesearch/bleve/v2/analysis/lang/pt"
"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
"github.com/blevesearch/bleve/v2/analysis/token/porter"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
"github.com/blevesearch/bleve/v2/mapping"
"github.com/svera/coreander/v3/internal/metadata"
)

// Version identifies the mapping used for indexing. Any change to the mapping requires incrementing
// the version, to signal that a new index needs to be created.
const Version = "v1"

var noStopWordsFilters = map[string][]string{
es.AnalyzerName: {es.NormalizeName, lowercase.Name, es.LightStemmerName},
en.AnalyzerName: {en.PossessiveName, lowercase.Name, porter.Name},
de.AnalyzerName: {de.NormalizeName, lowercase.Name, de.LightStemmerName},
fr.AnalyzerName: {fr.ElisionName, lowercase.Name, fr.LightStemmerName},
it.AnalyzerName: {it.ElisionName, lowercase.Name, it.LightStemmerName},
pt.AnalyzerName: {lowercase.Name, pt.LightStemmerName},
}

const defaultAnalyzer = "default_analyzer"

type BleveIndexer struct {
idx bleve.Index
libraryPath string
@@ -29,10 +51,10 @@ func NewBleve(index bleve.Index, libraryPath string, read map[string]metadata.Re
}
}

func Mapping() *mapping.IndexMappingImpl {
func Mapping() mapping.IndexMapping {
indexMapping := bleve.NewIndexMapping()

err := indexMapping.AddCustomAnalyzer("document",
err := indexMapping.AddCustomAnalyzer(defaultAnalyzer,
map[string]interface{}{
"type": custom.Name,
"char_filters": []string{
@@ -46,21 +68,52 @@ func Mapping() *mapping.IndexMappingImpl {
if err != nil {
log.Fatal(err)
}
indexMapping.DefaultAnalyzer = "document"
languageFieldMapping := bleve.NewTextFieldMapping()
languageFieldMapping.Index = false
indexMapping.DefaultMapping.AddFieldMappingsAt("Language", languageFieldMapping)
yearFieldMapping := bleve.NewTextFieldMapping()
yearFieldMapping.Index = false
indexMapping.DefaultMapping.AddFieldMappingsAt("Year", yearFieldMapping)
slugFieldMapping := bleve.NewKeywordFieldMapping()
indexMapping.DefaultMapping.AddFieldMappingsAt("Slug", slugFieldMapping)
seriesEqFieldMapping := bleve.NewKeywordFieldMapping()
indexMapping.DefaultMapping.AddFieldMappingsAt("SeriesEq", seriesEqFieldMapping)
authorsEqFieldMapping := bleve.NewKeywordFieldMapping()
indexMapping.DefaultMapping.AddFieldMappingsAt("AuthorsEq", authorsEqFieldMapping)
subjectsEqFieldMapping := bleve.NewKeywordFieldMapping()
indexMapping.DefaultMapping.AddFieldMappingsAt("SubjectsEq", subjectsEqFieldMapping)

keywordFieldMapping := bleve.NewKeywordFieldMapping()
keywordFieldMappingNotIndexable := bleve.NewKeywordFieldMapping()
keywordFieldMappingNotIndexable.Index = false

simpleTextFieldMapping := bleve.NewTextFieldMapping()
simpleTextFieldMapping.Analyzer = defaultAnalyzer

for lang := range noStopWordsFilters {
textFieldMapping := bleve.NewTextFieldMapping()
textFieldMapping.Analyzer = lang

err := addNoStopWordsAnalyzer(lang, indexMapping)
if err != nil {
log.Fatal(err)
}
noStopWordsTextFieldMapping := bleve.NewTextFieldMapping()
noStopWordsTextFieldMapping.Analyzer = lang + "_no_stop_words"

indexMapping.AddDocumentMapping(lang, bleve.NewDocumentMapping())
indexMapping.TypeMapping[lang].DefaultAnalyzer = lang
indexMapping.TypeMapping[lang].AddFieldMappingsAt("Title", noStopWordsTextFieldMapping)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("Authors", simpleTextFieldMapping)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("Description", textFieldMapping)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("Subjects", textFieldMapping)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("Series", noStopWordsTextFieldMapping)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("Slug", keywordFieldMapping)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("SeriesEq", keywordFieldMapping)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("AuthorsEq", keywordFieldMapping)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("SubjectsEq", keywordFieldMapping)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("Language", keywordFieldMappingNotIndexable)
indexMapping.TypeMapping[lang].AddFieldMappingsAt("Year", keywordFieldMappingNotIndexable)
}

indexMapping.DefaultMapping.DefaultAnalyzer = defaultAnalyzer
indexMapping.DefaultMapping.AddFieldMappingsAt("Title", simpleTextFieldMapping)
indexMapping.DefaultMapping.AddFieldMappingsAt("Authors", simpleTextFieldMapping)
indexMapping.DefaultMapping.AddFieldMappingsAt("Description", simpleTextFieldMapping)
indexMapping.DefaultMapping.AddFieldMappingsAt("Subjects", simpleTextFieldMapping)
indexMapping.DefaultMapping.AddFieldMappingsAt("Series", simpleTextFieldMapping)
indexMapping.DefaultMapping.AddFieldMappingsAt("Slug", keywordFieldMapping)
indexMapping.DefaultMapping.AddFieldMappingsAt("SeriesEq", keywordFieldMapping)
indexMapping.DefaultMapping.AddFieldMappingsAt("AuthorsEq", keywordFieldMapping)
indexMapping.DefaultMapping.AddFieldMappingsAt("SubjectsEq", keywordFieldMapping)
indexMapping.DefaultMapping.AddFieldMappingsAt("Language", keywordFieldMappingNotIndexable)
indexMapping.DefaultMapping.AddFieldMappingsAt("Year", keywordFieldMappingNotIndexable)

return indexMapping
}
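
The mapping above registers one document type per supported language and keeps a default mapping for everything else. As a rough, stdlib-only sketch of which analyzer each field ends up with (no bleve calls; `analyzerFor` and the `languages` table are hypothetical names introduced here, and `"keyword"` is assumed to be the analyzer name behind `bleve.NewKeywordFieldMapping`):

```go
package main

import "fmt"

const defaultAnalyzer = "default_analyzer"

// languages mirrors the keys of noStopWordsFilters in the diff.
var languages = map[string]bool{
	"es": true, "en": true, "de": true, "fr": true, "it": true, "pt": true,
}

// analyzerFor reports which analyzer name the mapping in the diff would
// assign to a field of a document with the given type. Illustration only.
func analyzerFor(docType, field string) string {
	switch field {
	case "Slug", "SeriesEq", "AuthorsEq", "SubjectsEq":
		return "keyword" // exact-match fields
	case "Language", "Year":
		return "" // mapped with Index = false, so not searchable
	}
	// Unknown or empty types use the default mapping; Authors always
	// uses the default analyzer, whatever the language.
	if !languages[docType] || field == "Authors" {
		return defaultAnalyzer
	}
	if field == "Title" || field == "Series" {
		return docType + "_no_stop_words"
	}
	// Description, Subjects and any other field of a language type.
	return docType
}

func main() {
	for _, field := range []string{"Title", "Authors", "Description", "Slug"} {
		fmt.Printf("es/%s -> %q\n", field, analyzerFor("es", field))
	}
	fmt.Printf("(no language)/Title -> %q\n", analyzerFor("", "Title"))
}
```

The point of the sketch is the fallback: a document whose type is not one of the six registered languages is handled entirely by the default analyzer.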
58 changes: 27 additions & 31 deletions internal/index/bleve_read.go
@@ -43,49 +43,44 @@ func (b *BleveIndexer) Search(keywords string, page, resultsPerPage int) (result
return b.runPaginatedQuery(qb, page, resultsPerPage)
}

splitted := strings.Split(strings.TrimSpace(keywords), " ")

var (
authorQueries []query.Query
titleQueries []query.Query
descriptionQueries []query.Query
seriesQueries []query.Query
subjectQueries []query.Query
)
compound := composeQuery(keywords)
return b.runPaginatedQuery(compound, page, resultsPerPage)
}

for _, keyword := range splitted {
if keyword == "" {
continue
}
qa := bleve.NewMatchQuery(keyword)
qa.SetField("Authors")
authorQueries = append(authorQueries, qa)
func composeQuery(keywords string) *query.DisjunctionQuery {
langCompoundQuery := bleve.NewDisjunctionQuery()

qt := bleve.NewMatchQuery(keyword)
for lang := range noStopWordsFilters {
qt := bleve.NewMatchPhraseQuery(keywords)
qt.Analyzer = lang + "_no_stop_words"
qt.SetField("Title")
titleQueries = append(titleQueries, qt)
langCompoundQuery.AddQuery(qt)

qs := bleve.NewMatchQuery(keyword)
qs := bleve.NewMatchQuery(keywords)
qs.Analyzer = lang + "_no_stop_words"
qs.SetField("Series")
seriesQueries = append(seriesQueries, qs)
qs.Operator = query.MatchQueryOperatorAnd
langCompoundQuery.AddQuery(qs)

qu := bleve.NewMatchQuery(keyword)
qu := bleve.NewMatchQuery(keywords)
qu.Analyzer = lang
qu.SetField("Subjects")
subjectQueries = append(subjectQueries, qt)
qu.Operator = query.MatchQueryOperatorAnd
langCompoundQuery.AddQuery(qu)

qd := bleve.NewMatchQuery(keyword)
qd := bleve.NewMatchQuery(keywords)
qd.Analyzer = lang
qd.SetField("Description")
descriptionQueries = append(descriptionQueries, qd)
qd.Operator = query.MatchQueryOperatorAnd
langCompoundQuery.AddQuery(qd)
}

authorCompoundQuery := bleve.NewConjunctionQuery(authorQueries...)
titleCompoundQuery := bleve.NewConjunctionQuery(titleQueries...)
seriesCompoundQuery := bleve.NewConjunctionQuery(seriesQueries...)
descriptionCompoundQuery := bleve.NewConjunctionQuery(descriptionQueries...)
subjectCompoundQuery := bleve.NewConjunctionQuery(subjectQueries...)
qa := bleve.NewMatchQuery(keywords)
qa.SetField("Authors")
qa.Operator = query.MatchQueryOperatorAnd
qa.Analyzer = defaultAnalyzer

compound := bleve.NewDisjunctionQuery(authorCompoundQuery, titleCompoundQuery, seriesCompoundQuery, descriptionCompoundQuery, subjectCompoundQuery)
return b.runPaginatedQuery(compound, page, resultsPerPage)
return bleve.NewDisjunctionQuery(qa, langCompoundQuery)
}

func (b *BleveIndexer) runQuery(query query.Query, results int) ([]Document, error) {
@@ -110,6 +105,7 @@ func (b *BleveIndexer) runPaginatedQuery(query query.Query, page, resultsPerPage
if err != nil {
return result.Paginated[[]Document]{}, err
}

if searchResult.Total == 0 {
return res, nil
}
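
In the new `composeQuery`, the title is searched with a match-phrase query, while series, subjects, description and authors use match queries with the AND operator. The difference between the two can be sketched without bleve as follows; `tokenize`, `matchAll` and `matchPhrase` are hypothetical helpers, and whitespace splitting stands in for a real analyzer:

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize lowercases and splits on whitespace: a stand-in for an analyzer.
func tokenize(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

// matchAll mimics a match query with the AND operator: every query token
// must appear somewhere in the field, in any order.
func matchAll(query, field string) bool {
	present := make(map[string]bool)
	for _, tok := range tokenize(field) {
		present[tok] = true
	}
	queryTokens := tokenize(query)
	for _, tok := range queryTokens {
		if !present[tok] {
			return false
		}
	}
	return len(queryTokens) > 0
}

// matchPhrase mimics a phrase query: the query tokens must appear
// consecutively and in the same order.
func matchPhrase(query, field string) bool {
	q, f := tokenize(query), tokenize(field)
	if len(q) == 0 || len(q) > len(f) {
		return false
	}
	for i := 0; i+len(q) <= len(f); i++ {
		found := true
		for j := range q {
			if f[i+j] != q[j] {
				found = false
				break
			}
		}
		if found {
			return true
		}
	}
	return false
}

func main() {
	title := "The Name of the Rose"
	fmt.Println(matchPhrase("name of the rose", title)) // true
	fmt.Println(matchPhrase("rose name", title))        // false
	fmt.Println(matchAll("rose name", title))           // true
}
```

Under this reading, a title only matches when the words appear in the order typed, whereas the other fields match as long as all the words are present.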
58 changes: 58 additions & 0 deletions internal/index/bleve_test.go
@@ -203,5 +203,63 @@ func testCases() []testCase {
},
),
},
{
"Test genre spanish stemmer",
"lib/book6.epub",
metadata.Metadata{
Title: "La Guerrera",
Authors: []string{"Anónimo"},
Description: "Just test metadata",
Language: "es",
Subjects: []string{"History", "Middle age"},
},
"guerrero",
result.NewPaginated[[]index.Document](
model.ResultsPerPage,
1,
1,
[]index.Document{
{
ID: "book6.epub",
Slug: "anonimo-la-guerrera",
Metadata: metadata.Metadata{
Title: "La Guerrera",
Authors: []string{"Anónimo"},
Description: "Just test metadata",
Subjects: []string{"History", "Middle age"},
},
},
},
),
},
{
"Test plural italian stemmer",
"lib/book7.epub",
metadata.Metadata{
Title: "Fratelli",
Authors: []string{"Anónimo"},
Description: "Just test metadata",
Language: "it",
Subjects: []string{"History", "Middle age"},
},
"fratello",
result.NewPaginated[[]index.Document](
model.ResultsPerPage,
1,
1,
[]index.Document{
{
ID: "book7.epub",
Slug: "anonimo-fratelli",
Metadata: metadata.Metadata{
Title: "Fratelli",
Authors: []string{"Anónimo"},
Description: "Just test metadata",
Subjects: []string{"History", "Middle age"},
},
},
},
),
},
}
}
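
The two new test cases expect "guerrero" to find "La Guerrera" and "fratello" to find "Fratelli". That works because a stemmer reduces both the indexed word and the query word to the same stem. The toy function below illustrates the idea only; `toyStem` is a hypothetical helper and deliberately naive, and it is not the algorithm used by bleve's light stemmers:

```go
package main

import (
	"fmt"
	"strings"
)

// toyStem strips a handful of gender and number endings. It exists only to
// show why two inflected forms can share a stem; real stemmers are far more
// careful about which endings they remove and when.
func toyStem(word string) string {
	w := strings.ToLower(word)
	for _, suffix := range []string{"as", "os", "es", "a", "o", "e", "i"} {
		// Keep at least three characters so short words are left alone.
		if len(w) > len(suffix)+2 && strings.HasSuffix(w, suffix) {
			return strings.TrimSuffix(w, suffix)
		}
	}
	return w
}

func main() {
	fmt.Println(toyStem("Guerrera"), toyStem("guerrero")) // guerrer guerrer
	fmt.Println(toyStem("Fratelli"), toyStem("fratello")) // fratell fratell
}
```

Because indexing and querying go through the same reduction, the stems compare equal even though the surface forms differ.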
1 change: 0 additions & 1 deletion internal/index/bleve_write.go
@@ -58,7 +58,6 @@ func (b *BleveIndexer) AddLibrary(fs afero.Fs, batchSize int) error {
}

document := b.createDocument(meta, fullPath, batchSlugs)

batchSlugs[document.Slug] = struct{}{}

err = batch.Index(document.ID, document)
9 changes: 9 additions & 0 deletions internal/index/document.go
@@ -17,3 +17,12 @@ type DocumentWrite struct {
SeriesEq string
SubjectsEq []string
}

// BleveType is the hook bleve uses to classify a document: its return value selects the document
// mapping, and therefore the analyzers, applied to the document. Language values shorter than two
// characters fall back to the default mapping instead of panicking on the slice below.
func (d DocumentWrite) BleveType() string {
if len(d.Language) < 2 {
return ""
}
return d.Language[:2]
}
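
`BleveType` derives the document type from the first two characters of the language tag. A more defensive variant, shown here purely as a sketch and not as the author's method, could normalize case and region subtags and return the empty type for anything unsupported; `docType` and `supported` are hypothetical names:

```go
package main

import (
	"fmt"
	"strings"
)

// supported mirrors the languages registered in the mapping.
var supported = map[string]bool{
	"es": true, "en": true, "de": true, "fr": true, "it": true, "pt": true,
}

// docType maps a language tag such as "es", "ES" or "pt-BR" to a document
// type. Anything it does not recognise yields "", which selects the default
// mapping. In this sketch three-letter codes such as "eng" are not recognised.
func docType(language string) string {
	primary := strings.ToLower(strings.TrimSpace(language))
	if i := strings.IndexAny(primary, "-_"); i >= 0 {
		primary = primary[:i]
	}
	if !supported[primary] {
		return ""
	}
	return primary
}

func main() {
	for _, tag := range []string{"es", "pt-BR", "EN", "e", ""} {
		fmt.Printf("%q -> %q\n", tag, docType(tag))
	}
}
```

Whether this extra normalization is worth it depends on how clean the language metadata in the library is.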
22 changes: 22 additions & 0 deletions internal/index/no_stop_words.go
@@ -0,0 +1,22 @@
package index

import (
"fmt"

"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
"github.com/blevesearch/bleve/v2/mapping"
)

func addNoStopWordsAnalyzer(lang string, indexMapping *mapping.IndexMappingImpl) error {
if _, ok := noStopWordsFilters[lang]; !ok {
return fmt.Errorf("no stemmer defined for %s", lang)
}

return indexMapping.AddCustomAnalyzer(lang+"_no_stop_words",
map[string]interface{}{
"type": custom.Name,
"tokenizer": unicode.Name,
"token_filters": noStopWordsFilters[lang],
})
}
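
`addNoStopWordsAnalyzer` builds an analyzer from a tokenizer and the per-language filters, with no stop-word filter in the chain. The commit does not state why; one plausible reason (an assumption) is that titles and series made up of common words would otherwise be reduced to nothing. A stdlib-only sketch of that effect, using a tiny illustrative stop list that is not bleve's, and a hypothetical `analyze` helper:

```go
package main

import (
	"fmt"
	"strings"
)

// englishStopWords is a small illustrative subset, not bleve's actual list.
var englishStopWords = map[string]bool{
	"the": true, "it": true, "a": true, "of": true, "and": true,
}

// analyze lowercases and tokenizes the text, optionally dropping stop words.
func analyze(text string, dropStopWords bool) []string {
	var tokens []string
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		if dropStopWords && englishStopWords[tok] {
			continue
		}
		tokens = append(tokens, tok)
	}
	return tokens
}

func main() {
	fmt.Println(analyze("It", true))                   // []
	fmt.Println(analyze("It", false))                  // [it]
	fmt.Println(analyze("The Name of the Rose", true)) // [name rose]
}
```

With stop words removed, a one-word title like "It" leaves no tokens to index or to search for, which is the case a no-stop-words analyzer avoids.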