Commit 85784be

Add support for the FTS5 trigram tokenizer
1 parent 84d25df commit 85784be

7 files changed

+380
-12
lines changed


Documentation/FTS5Tokenizers.md

Lines changed: 2 additions & 0 deletions
@@ -39,6 +39,8 @@ All SQLite [built-in tokenizers](https://www.sqlite.org/fts5.html#tokenizers) to

 - The [porter](https://www.sqlite.org/fts5.html#porter_tokenizer) tokenizer turns English words into their root: "database engine" gives the "databas" and "engin" tokens. The query "database engines" will match, because it produces the same tokens.

+- The [trigram](https://sqlite.org/fts5.html#the_trigram_tokenizer) tokenizer treats each contiguous sequence of three characters as a token to allow general substring matching. "Sequence" gives "seq", "equ", "que", "uen", "enc" and "nce". The queries "SEQUENCE", "SEQUEN", "QUENC" and "QUE" all match as they decompose into a subset of the same trigrams.
+
 However, built-in tokenizers don't match "first" with "1st", because they produce the different "first" and "1st" tokens.

 Nor do they match "Grossmann" with "Großmann", because they produce the different "grossmann" and "großmann" tokens.
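
The trigram decomposition described in the added paragraph can be sketched in plain Swift. This is only an illustration, not the SQLite implementation: the function names `trigrams(of:)` and `trigramMatch(content:query:)` are hypothetical, the sketch works on Swift `Character` values rather than the Unicode code points SQLite sees, and it assumes a query matches when its trigrams appear contiguously and in order in the content.

```swift
/// Hypothetical helper: splits text into overlapping three-character tokens,
/// lowercased so that matching is case-insensitive.
func trigrams(of text: String) -> [String] {
    let characters = Array(text.lowercased())
    guard characters.count >= 3 else { return [] }
    return (0...(characters.count - 3)).map { start in
        String(characters[start..<(start + 3)])
    }
}

/// Hypothetical helper: true when the query's trigrams appear, in order and
/// contiguously, among the content's trigrams.
func trigramMatch(content: String, query: String) -> Bool {
    let contentTokens = trigrams(of: content)
    let queryTokens = trigrams(of: query)
    guard !queryTokens.isEmpty, queryTokens.count <= contentTokens.count else {
        return false
    }
    return (0...(contentTokens.count - queryTokens.count)).contains { start in
        Array(contentTokens[start..<(start + queryTokens.count)]) == queryTokens
    }
}

print(trigrams(of: "Sequence"))
// prints ["seq", "equ", "que", "uen", "enc", "nce"]
print(trigramMatch(content: "Sequence", query: "QUENC"))      // prints true
print(trigramMatch(content: "Database", query: "Databases"))  // prints false
```

Queries shorter than three characters produce no trigrams in this sketch, so they never match.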

Documentation/FullTextSearch.md

Lines changed: 31 additions & 10 deletions
@@ -386,7 +386,7 @@ See [SQLite documentation](https://www.sqlite.org/fts5.html) for more informatio

 **A tokenizer defines what "matching" means.** Depending on the tokenizer you choose, full-text searches won't return the same results.

-SQLite ships with three built-in FTS5 tokenizers: `ascii`, `porter` and `unicode61` that use different algorithms to match queries with indexed content.
+SQLite ships with four built-in FTS5 tokenizers: `ascii`, `porter`, `unicode61` and `trigram` that use different algorithms to match queries with indexed content.

 ```swift
 try db.create(virtualTable: "book", using: FTS5()) { t in
@@ -395,20 +395,23 @@ try db.create(virtualTable: "book", using: FTS5()) { t in
     t.tokenizer = .unicode61(...)
     t.tokenizer = .ascii
     t.tokenizer = .porter(...)
+    t.tokenizer = .trigram(...)
 }
 ```

 See below some examples of matches:

-| content     | query      | ascii  | unicode61 | porter on ascii | porter on unicode61 |
-| ----------- | ---------- | :----: | :-------: | :-------------: | :-----------------: |
-| Foo         | Foo        |   X    |     X     |        X        |          X          |
-| Foo         | FOO        |   X    |     X     |        X        |          X          |
-| Jérôme      | Jérôme     |  X ¹   |    X ¹    |       X ¹       |         X ¹         |
-| Jérôme      | JÉRÔME     |        |    X ¹    |                 |         X ¹         |
-| Jérôme      | Jerome     |        |    X ¹    |                 |         X ¹         |
-| Database    | Databases  |        |           |        X        |          X          |
-| Frustration | Frustrated |        |           |        X        |          X          |
+| content     | query      | ascii  | unicode61 | porter on ascii | porter on unicode61 | trigram |
+| ----------- | ---------- | :----: | :-------: | :-------------: | :-----------------: | :-----: |
+| Foo         | Foo        |   X    |     X     |        X        |          X          |    X    |
+| Foo         | FOO        |   X    |     X     |        X        |          X          |    X    |
+| Jérôme      | Jérôme     |  X ¹   |    X ¹    |       X ¹       |         X ¹         |   X ¹   |
+| Jérôme      | JÉRÔME     |        |    X ¹    |                 |         X ¹         |   X ¹   |
+| Jérôme      | Jerome     |        |    X ¹    |                 |         X ¹         |   X ¹   |
+| Database    | Databases  |        |           |        X        |          X          |         |
+| Frustration | Frustrated |        |           |        X        |          X          |         |
+| Sequence    | quenc      |        |           |                 |                     |    X    |
+

 ¹ Don't miss [Unicode Full-Text Gotchas](#unicode-full-text-gotchas)

@@ -455,6 +458,24 @@ See below some examples of matches:

     It strips diacritics from latin script characters if it wraps unicode61, and does not if it wraps ascii (see the example above).

+- **trigram**
+
+    ```swift
+    try db.create(virtualTable: "book", using: FTS5()) { t in
+        t.tokenizer = .trigram()
+        t.tokenizer = .trigram(matching: .caseInsensitiveRemovingDiacritics)
+        t.tokenizer = .trigram(matching: .caseSensitive)
+    }
+    ```
+
+    The "trigram" tokenizer is case-insensitive for Unicode characters by default. It matches "Jérôme" with "JÉRÔME".
+
+    Diacritics stripping can be enabled so it matches "jérôme" with "jerome". Case-sensitive matching can also be enabled but is mutually exclusive with diacritics stripping.
+
+    Unlike the other tokenizers, it provides general substring matching: it matches "Sequence" with "que" by splitting character sequences into overlapping 3-character tokens (trigrams).
+
+    It can also act as an index for GLOB and LIKE queries, depending on the configuration (per the SQLite documentation, a case-sensitive trigram tokenizer can index GLOB queries only, not LIKE).
+
 See [SQLite tokenizers](https://www.sqlite.org/fts5.html#tokenizers) for more information, and [custom FTS5 tokenizers](FTS5Tokenizers.md) in order to add your own tokenizers.

GRDB/FTS/FTS5.swift

Lines changed: 42 additions & 0 deletions
@@ -74,6 +74,48 @@ public struct FTS5 {
         #endif
     }

+    #if GRDBCUSTOMSQLITE || GRDBCIPHER
+    /// Options for trigram tokenizer character matching. Matches the raw
+    /// "case_sensitive" and "remove_diacritics" tokenizer arguments.
+    ///
+    /// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
+    public enum TrigramTokenizerMatching: Sendable {
+        /// Case insensitive matching without removing diacritics. This
+        /// option matches the raw "case_sensitive=0 remove_diacritics=0"
+        /// tokenizer argument.
+        case caseInsensitive
+        /// Case insensitive matching that removes diacritics before
+        /// matching. This option matches the raw
+        /// "case_sensitive=0 remove_diacritics=1" tokenizer argument.
+        case caseInsensitiveRemovingDiacritics
+        /// Case sensitive matching. Diacritics are not removed when
+        /// performing case sensitive matching. This option matches the raw
+        /// "case_sensitive=1 remove_diacritics=0" tokenizer argument.
+        case caseSensitive
+    }
+    #else
+    /// Options for trigram tokenizer character matching. Matches the raw
+    /// "case_sensitive" and "remove_diacritics" tokenizer arguments.
+    ///
+    /// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
+    @available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) // SQLite 3.35.0+ (3.34 actually)
+    public enum TrigramTokenizerMatching: Sendable {
+        /// Case insensitive matching without removing diacritics. This
+        /// option matches the raw "case_sensitive=0 remove_diacritics=0"
+        /// tokenizer argument.
+        case caseInsensitive
+        /// Case insensitive matching that removes diacritics before
+        /// matching. This option matches the raw
+        /// "case_sensitive=0 remove_diacritics=1" tokenizer argument.
+        @available(*, unavailable, message: "Requires a future OS release that includes SQLite >=3.45")
+        case caseInsensitiveRemovingDiacritics
+        /// Case sensitive matching. Diacritics are not removed when
+        /// performing case sensitive matching. This option matches the raw
+        /// "case_sensitive=1 remove_diacritics=0" tokenizer argument.
+        case caseSensitive
+    }
+    #endif
+
     /// Creates an FTS5 module.
     ///
     /// For example:

GRDB/FTS/FTS5Tokenizer.swift

Lines changed: 2 additions & 2 deletions
@@ -148,11 +148,11 @@ extension FTS5Tokenizer {
     private func tokenize(_ string: String, for tokenization: FTS5Tokenization)
         throws -> [(token: String, flags: FTS5TokenFlags)]
     {
-        try ContiguousArray(string.utf8).withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in
+        try string.utf8CString.withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in
             guard let addr = buffer.baseAddress else {
                 return []
             }
-            let pText = UnsafeMutableRawPointer(mutating: addr).assumingMemoryBound(to: CChar.self)
+            let pText = addr
             let nText = CInt(buffer.count)

             var context = TokenizeContext()

GRDB/FTS/FTS5TokenizerDescriptor.swift

Lines changed: 65 additions & 0 deletions
@@ -210,5 +210,70 @@ public struct FTS5TokenizerDescriptor: Sendable {
         }
         return FTS5TokenizerDescriptor(components: components)
     }
+
+    #if GRDBCUSTOMSQLITE || GRDBCIPHER
+    /// The "trigram" tokenizer.
+    ///
+    /// For example:
+    ///
+    /// ```swift
+    /// try db.create(virtualTable: "book", using: FTS5()) { t in
+    ///     t.tokenizer = .trigram()
+    /// }
+    /// ```
+    ///
+    /// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
+    ///
+    /// - parameters:
+    ///     - matching: By default SQLite will perform case insensitive
+    ///       matching and not remove diacritics before matching.
+    public static func trigram(
+        matching: FTS5.TrigramTokenizerMatching = .caseInsensitive
+    ) -> FTS5TokenizerDescriptor {
+        var components = ["trigram"]
+        switch matching {
+        case .caseInsensitive:
+            break
+        case .caseInsensitiveRemovingDiacritics:
+            components.append(contentsOf: ["remove_diacritics", "1"])
+        case .caseSensitive:
+            components.append(contentsOf: ["case_sensitive", "1"])
+        }
+
+        return FTS5TokenizerDescriptor(components: components)
+    }
+    #else
+    /// The "trigram" tokenizer.
+    ///
+    /// For example:
+    ///
+    /// ```swift
+    /// try db.create(virtualTable: "book", using: FTS5()) { t in
+    ///     t.tokenizer = .trigram()
+    /// }
+    /// ```
+    ///
+    /// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
+    ///
+    /// - parameters:
+    ///     - matching: By default SQLite will perform case insensitive
+    ///       matching and not remove diacritics before matching.
+    @available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) // SQLite 3.35.0+ (3.34 actually)
+    public static func trigram(
+        matching: FTS5.TrigramTokenizerMatching = .caseInsensitive
+    ) -> FTS5TokenizerDescriptor {
+        var components = ["trigram"]
+        switch matching {
+        case .caseInsensitive:
+            break
+        case .caseInsensitiveRemovingDiacritics:
+            components.append(contentsOf: ["remove_diacritics", "1"])
+        case .caseSensitive:
+            components.append(contentsOf: ["case_sensitive", "1"])
+        }
+
+        return FTS5TokenizerDescriptor(components: components)
+    }
+    #endif
 }
 #endif
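
The tests in this commit assert SQL such as `tokenize='''trigram'' ''case_sensitive'' ''1'''`, which suggests how the descriptor's components end up in the `CREATE VIRTUAL TABLE` statement. The sketch below reproduces that quoting under an assumption: each component is quoted as an SQL string literal, the results are joined with spaces, and the whole list is quoted once more. The helpers `sqlQuoted(_:)` and `tokenizeArgument(_:)` are hypothetical names for illustration, not GRDB API.

```swift
/// Hypothetical helper: quotes a string as an SQL string literal,
/// doubling any embedded single quote.
func sqlQuoted(_ string: String) -> String {
    "'" + string.map { $0 == "'" ? "''" : String($0) }.joined() + "'"
}

/// Hypothetical helper: builds the `tokenize=...` argument from tokenizer
/// components by quoting each component, then quoting the joined list.
func tokenizeArgument(_ components: [String]) -> String {
    let inner = components.map(sqlQuoted).joined(separator: " ")
    return "tokenize=" + sqlQuoted(inner)
}

print(tokenizeArgument(["trigram"]))
// prints tokenize='''trigram'''
print(tokenizeArgument(["trigram", "case_sensitive", "1"]))
// prints tokenize='''trigram'' ''case_sensitive'' ''1'''
print(tokenizeArgument(["trigram", "remove_diacritics", "1"]))
// prints tokenize='''trigram'' ''remove_diacritics'' ''1'''
```

The double quoting is why a single tokenizer name shows up surrounded by three single quotes on each side in the asserted SQL.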

Tests/GRDBTests/FTS5TableBuilderTests.swift

Lines changed: 82 additions & 0 deletions
@@ -166,7 +166,89 @@ class FTS5TableBuilderTests: GRDBTestCase {
             assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''unicode61'' ''tokenchars'' ''-.''')")
         }
     }
+
+    func testTrigramTokenizer() throws {
+        #if GRDBCUSTOMSQLITE || GRDBCIPHER
+        guard sqlite3_libversion_number() >= 3034000 else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #else
+        guard #available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #endif
+
+        let dbQueue = try makeDatabaseQueue()
+        try dbQueue.inDatabase { db in
+            try db.create(virtualTable: "documents", using: FTS5()) { t in
+                t.tokenizer = .trigram()
+                t.column("content")
+            }
+            assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram''')")
+        }
+    }
+
+    func testTrigramTokenizerCaseInsensitive() throws {
+        #if GRDBCUSTOMSQLITE || GRDBCIPHER
+        guard sqlite3_libversion_number() >= 3034000 else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #else
+        guard #available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #endif
+
+        let dbQueue = try makeDatabaseQueue()
+        try dbQueue.inDatabase { db in
+            try db.create(virtualTable: "documents", using: FTS5()) { t in
+                t.tokenizer = .trigram(matching: .caseInsensitive)
+                t.column("content")
+            }
+            assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram''')")
+        }
+    }

+    func testTrigramTokenizerCaseSensitive() throws {
+        #if GRDBCUSTOMSQLITE || GRDBCIPHER
+        guard sqlite3_libversion_number() >= 3034000 else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #else
+        guard #available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #endif
+
+        let dbQueue = try makeDatabaseQueue()
+        try dbQueue.inDatabase { db in
+            try db.create(virtualTable: "documents", using: FTS5()) { t in
+                t.tokenizer = .trigram(matching: .caseSensitive)
+                t.column("content")
+            }
+            assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram'' ''case_sensitive'' ''1''')")
+        }
+    }
+
+    func testTrigramTokenizerCaseInsensitiveRemovingDiacritics() throws {
+        #if GRDBCUSTOMSQLITE || GRDBCIPHER
+        guard sqlite3_libversion_number() >= 3045000 else {
+            throw XCTSkip("FTS5 trigram tokenizer remove_diacritics is not available")
+        }
+
+        let dbQueue = try makeDatabaseQueue()
+        try dbQueue.inDatabase { db in
+            try db.create(virtualTable: "documents", using: FTS5()) { t in
+                t.tokenizer = .trigram(matching: .caseInsensitiveRemovingDiacritics)
+                t.column("content")
+            }
+            assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram'' ''remove_diacritics'' ''1''')")
+        }
+        #else
+        throw XCTSkip("FTS5 trigram tokenizer remove_diacritics is not available")
+        #endif
+    }
+
     func testColumns() throws {
         let dbQueue = try makeDatabaseQueue()
         try dbQueue.inDatabase { db in
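
The guards in these tests compare `sqlite3_libversion_number()` against `3034000` and `3045000`. SQLite encodes a version X.Y.Z as X * 1000000 + Y * 1000 + Z, so those constants stand for 3.34.0 and 3.45.0. A minimal sketch of that encoding follows; `sqliteVersionNumber(major:minor:patch:)` is a hypothetical helper written for illustration, not part of GRDB or SQLite.

```swift
/// Hypothetical helper: encodes an SQLite version X.Y.Z the way
/// SQLITE_VERSION_NUMBER does, as X * 1_000_000 + Y * 1_000 + Z.
func sqliteVersionNumber(major: Int, minor: Int, patch: Int) -> Int {
    major * 1_000_000 + minor * 1_000 + patch
}

// Minimum version the tests require for the trigram tokenizer.
let trigramMinimum = sqliteVersionNumber(major: 3, minor: 34, patch: 0)
// Minimum version the tests require for remove_diacritics with trigram.
let removeDiacriticsMinimum = sqliteVersionNumber(major: 3, minor: 45, patch: 0)

print(trigramMinimum, removeDiacriticsMinimum)
// prints 3034000 3045000
```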
