Add support for the FTS5 trigram tokenizer

Jnosh · Jnosh · commit 85784be084ea · 2024-10-14T01:40:02.000+02:00
diff --git a/Documentation/FTS5Tokenizers.md b/Documentation/FTS5Tokenizers.md
@@ -39,6 +39,8 @@ All SQLite [built-in tokenizers](https://www.sqlite.org/fts5.html#tokenizers) to
 
 - The [porter](https://www.sqlite.org/fts5.html#porter_tokenizer) tokenizer turns English words into their root: "database engine" gives the "databas" and "engin" tokens. The query "database engines" will match, because it produces the same tokens.
 
+- The [trigram](https://sqlite.org/fts5.html#the_trigram_tokenizer) tokenizer treats each contiguous sequence of three characters as a token to allow general substring matching. "Sequence" gives "seq", "equ", "que", "uen", "enc" and "nce". The queries "SEQUENCE", "SEQUEN", "QUENC" and "QUE" all match as they decompose into a subset of the same trigrams.
+
 However, built-in tokenizers don't match "first" with "1st", because they produce the different "first" and "1st" tokens.
 
 Nor do they match "Grossmann" with "Großmann", because they produce the different "grossmann" and "großmann" tokens.
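
The trigram decomposition described in this hunk can be sketched in plain Swift, without GRDB or SQLite. The names `trigrams` and `mayMatch` are hypothetical and exist only for illustration. The check is a necessary condition rather than SQLite's actual matching algorithm, which also requires the trigrams to appear consecutively.

```swift
/// Splits `text` into overlapping three-character tokens, lowercased,
/// roughly mirroring the default case-insensitive trigram behavior.
func trigrams(_ text: String) -> [String] {
    let chars = Array(text.lowercased())
    guard chars.count >= 3 else { return [] }
    return (0...(chars.count - 3)).map { String(chars[$0..<($0 + 3)]) }
}

/// Returns true when every trigram of `query` also occurs in `content`.
/// Queries shorter than three characters produce no trigrams and never match.
func mayMatch(content: String, query: String) -> Bool {
    let queryTrigrams = trigrams(query)
    guard !queryTrigrams.isEmpty else { return false }
    return Set(queryTrigrams).isSubset(of: Set(trigrams(content)))
}

print(trigrams("Sequence"))
// ["seq", "equ", "que", "uen", "enc", "nce"]
print(mayMatch(content: "Sequence", query: "QUENC"))     // true
print(mayMatch(content: "Database", query: "Databases")) // false
```

"Databases" fails because its last trigram, "ses", does not occur in "Database", which is consistent with the empty trigram cells in the matching table of FullTextSearch.md.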
diff --git a/Documentation/FullTextSearch.md b/Documentation/FullTextSearch.md
@@ -386,7 +386,7 @@ See [SQLite documentation](https://www.sqlite.org/fts5.html) for more informatio
 
 **A tokenizer defines what "matching" means.** Depending on the tokenizer you choose, full-text searches won't return the same results.
 
-SQLite ships with three built-in FTS5 tokenizers: `ascii`, `porter` and `unicode61` that use different algorithms to match queries with indexed content.
+SQLite ships with four built-in FTS5 tokenizers, `ascii`, `porter`, `unicode61` and `trigram`, which use different algorithms to match queries with indexed content.
 
 ```swift
 try db.create(virtualTable: "book", using: FTS5()) { t in
@@ -395,20 +395,23 @@ try db.create(virtualTable: "book", using: FTS5()) { t in
     t.tokenizer = .unicode61(...)
     t.tokenizer = .ascii
     t.tokenizer = .porter(...)
+    t.tokenizer = .trigram(...)
 }
 ```
 
 See below some examples of matches:
 
-| content     | query      | ascii  | unicode61 | porter on ascii | porter on unicode61 |
-| ----------- | ---------- | :----: | :-------: | :-------------: | :-----------------: |
-| Foo         | Foo        |   X    |     X     |        X        |          X          |
-| Foo         | FOO        |   X    |     X     |        X        |          X          |
-| Jérôme      | Jérôme     |   X ¹  |     X ¹   |        X ¹      |          X ¹        |
-| Jérôme      | JÉRÔME     |        |     X ¹   |                 |          X ¹        |
-| Jérôme      | Jerome     |        |     X ¹   |                 |          X ¹        |
-| Database    | Databases  |        |           |        X        |          X          |
-| Frustration | Frustrated |        |           |        X        |          X          |
+| content     | query      | ascii  | unicode61 | porter on ascii | porter on unicode61 | trigram |
+| ----------- | ---------- | :----: | :-------: | :-------------: | :-----------------: | :-----: |
+| Foo         | Foo        |   X    |     X     |        X        |          X          |    X    |
+| Foo         | FOO        |   X    |     X     |        X        |          X          |    X    |
+| Jérôme      | Jérôme     |   X ¹  |     X ¹   |        X ¹      |          X ¹        |    X ¹  |
+| Jérôme      | JÉRÔME     |        |     X ¹   |                 |          X ¹        |    X ¹  |
+| Jérôme      | Jerome     |        |     X ¹   |                 |          X ¹        |    X ¹  |
+| Database    | Databases  |        |           |        X        |          X          |         |
+| Frustration | Frustrated |        |           |        X        |          X          |         |
+| Sequence    | quenc      |        |           |                 |                     |    X    |
+
 
 ¹ Don't miss [Unicode Full-Text Gotchas](#unicode-full-text-gotchas)
 
@@ -455,6 +458,24 @@ See below some examples of matches:
     
     It strips diacritics from latin script characters if it wraps unicode61, and does not if it wraps ascii (see the example above).
 
+- **trigram**
+    
+    ```swift
+    try db.create(virtualTable: "book", using: FTS5()) { t in
+        t.tokenizer = .trigram()
+        t.tokenizer = .trigram(matching: .caseInsensitiveRemovingDiacritics)
+        t.tokenizer = .trigram(matching: .caseSensitive)
+    }
+    ```
+    
+    The "trigram" tokenizer is case-insensitive for Unicode characters by default. It matches "Jérôme" with "JÉRÔME".
+    
+    Diacritics stripping can be enabled so that it matches "jérôme" with "jerome". Case-sensitive matching can also be enabled, but it is mutually exclusive with diacritics stripping.
+    
+    Unlike the other tokenizers, it provides general substring matching: it matches "Sequence" with "que" by splitting character sequences into overlapping three-character tokens (trigrams).
+    
+    It can also act as an index for GLOB and LIKE queries. According to the SQLite documentation, a case-sensitive trigram tokenizer can only index GLOB queries, not LIKE queries.
+
 See [SQLite tokenizers](https://www.sqlite.org/fts5.html#tokenizers) for more information, and [custom FTS5 tokenizers](FTS5Tokenizers.md) in order to add your own tokenizers.
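
The three matching modes described in the trigram entry above can be approximated in plain Swift. This is only a sketch: `MatchingMode`, `normalize` and `sameAfterNormalizing` are hypothetical names, and Foundation's folding stands in for SQLite's own case and diacritics rules, which may differ in detail.

```swift
import Foundation

/// Hypothetical stand-in for the three documented matching modes.
enum MatchingMode {
    case caseInsensitive
    case caseInsensitiveRemovingDiacritics
    case caseSensitive
}

/// Normalizes text before comparison, depending on the mode.
func normalize(_ text: String, mode: MatchingMode) -> String {
    switch mode {
    case .caseInsensitive:
        // Fold case only; diacritics are preserved.
        return text.lowercased()
    case .caseInsensitiveRemovingDiacritics:
        // Fold case and strip diacritics.
        return text.folding(options: [.caseInsensitive, .diacriticInsensitive], locale: nil)
    case .caseSensitive:
        // Compare the text as written.
        return text
    }
}

func sameAfterNormalizing(_ a: String, _ b: String, mode: MatchingMode) -> Bool {
    normalize(a, mode: mode) == normalize(b, mode: mode)
}

print(sameAfterNormalizing("Jérôme", "JÉRÔME", mode: .caseInsensitive))                   // true
print(sameAfterNormalizing("Jérôme", "Jerome", mode: .caseInsensitive))                   // false
print(sameAfterNormalizing("Jérôme", "Jerome", mode: .caseInsensitiveRemovingDiacritics)) // true
print(sameAfterNormalizing("Jérôme", "JÉRÔME", mode: .caseSensitive))                     // false
```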
 
 
diff --git a/GRDB/FTS/FTS5.swift b/GRDB/FTS/FTS5.swift
@@ -74,6 +74,48 @@ public struct FTS5 {
         #endif
     }
     
+    #if GRDBCUSTOMSQLITE || GRDBCIPHER
+    /// Options for trigram tokenizer character matching. Matches the raw
+    /// "case_sensitive" and "remove_diacritics" tokenizer arguments.
+    ///
+    /// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
+    public enum TrigramTokenizerMatching: Sendable {
+        /// Case insensitive matching without removing diacritics. This
+        /// option matches the raw "case_sensitive=0 remove_diacritics=0"
+        /// tokenizer argument.
+        case caseInsensitive
+        /// Case insensitive matching that removes diacritics before
+        /// matching. This option matches the raw
+        /// "case_sensitive=0 remove_diacritics=1" tokenizer argument.
+        case caseInsensitiveRemovingDiacritics
+        /// Case sensitive matching. Diacritics are not removed when
+        /// performing case sensitive matching. This option matches the raw
+        /// "case_sensitive=1 remove_diacritics=0" tokenizer argument.
+        case caseSensitive
+    }
+    #else
+    /// Options for trigram tokenizer character matching. Matches the raw
+    /// "case_sensitive" and "remove_diacritics" tokenizer arguments.
+    ///
+    /// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
+    @available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) // SQLite 3.35.0+ (the trigram tokenizer itself needs 3.34)
+    public enum TrigramTokenizerMatching: Sendable {
+        /// Case insensitive matching without removing diacritics. This
+        /// option matches the raw "case_sensitive=0 remove_diacritics=0"
+        /// tokenizer argument.
+        case caseInsensitive
+        /// Case insensitive matching that removes diacritics before
+        /// matching. This option matches the raw
+        /// "case_sensitive=0 remove_diacritics=1" tokenizer argument.
+        @available(*, unavailable, message: "Requires a future OS release that includes SQLite >=3.45")
+        case caseInsensitiveRemovingDiacritics
+        /// Case sensitive matching. Diacritics are not removed when
+        /// performing case sensitive matching. This option matches the raw
+        /// "case_sensitive=1 remove_diacritics=0" tokenizer argument.
+        case caseSensitive
+    }
+    #endif
+    
     /// Creates an FTS5 module.
     ///
     /// For example:
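
The availability comments above mention SQLite 3.34 and 3.45, and the tests in this commit compare `sqlite3_libversion_number()` against `3034000` and `3045000`. SQLite encodes version X.Y.Z as the integer X*1000000 + Y*1000 + Z. The sketch below reproduces that arithmetic; the helper and constant names are hypothetical and not part of GRDB.

```swift
/// Encodes a SQLite version in the same integer form that
/// `sqlite3_libversion_number()` returns.
func sqliteVersionNumber(major: Int, minor: Int, patch: Int) -> Int {
    major * 1_000_000 + minor * 1_000 + patch
}

// Thresholds used by the tests in this commit.
let trigramMinimum = sqliteVersionNumber(major: 3, minor: 34, patch: 0)
let removeDiacriticsMinimum = sqliteVersionNumber(major: 3, minor: 45, patch: 0)

/// Hypothetical helper: reports which trigram features a library version allows.
func trigramSupport(libVersion: Int) -> (trigram: Bool, removeDiacritics: Bool) {
    (libVersion >= trigramMinimum, libVersion >= removeDiacriticsMinimum)
}

print(trigramMinimum)          // 3034000
print(removeDiacriticsMinimum) // 3045000
print(trigramSupport(libVersion: 3_036_000)) // (trigram: true, removeDiacritics: false)
```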
diff --git a/GRDB/FTS/FTS5Tokenizer.swift b/GRDB/FTS/FTS5Tokenizer.swift
@@ -148,11 +148,11 @@ extension FTS5Tokenizer {
     private func tokenize(_ string: String, for tokenization: FTS5Tokenization)
     throws -> [(token: String, flags: FTS5TokenFlags)]
     {
-        try ContiguousArray(string.utf8).withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in
+        try string.utf8CString.withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in
             guard let addr = buffer.baseAddress else {
                 return []
             }
-            let pText = UnsafeMutableRawPointer(mutating: addr).assumingMemoryBound(to: CChar.self)
+            let pText = addr
             let nText = CInt(buffer.count)
             
             var context = TokenizeContext()
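
The hunk above swaps `ContiguousArray(string.utf8)` for `string.utf8CString`. The standalone sketch below, which does not touch GRDB, shows the two differences that matter: the elements are already `CChar`, so no pointer rebinding is needed, and the buffer ends with a NUL byte, so its `count` is one more than the number of UTF-8 bytes. The sample string is an assumption chosen for illustration.

```swift
// "Jérôme", written with escapes so the byte count is unambiguous.
let text = "J\u{E9}r\u{F4}me"

// `utf8` exposes the UTF-8 code units without a terminator.
let byteCount = text.utf8.count // "é" and "ô" take two bytes each

// `utf8CString` is a ContiguousArray<CChar> that ends with a NUL byte.
let bufferInfo = text.utf8CString.withUnsafeBufferPointer { buffer in
    (count: buffer.count, last: buffer.last)
}

print(byteCount)        // 8
print(bufferInfo.count) // 9
print(bufferInfo.last == 0) // true
```

An API that expects a byte length excluding the terminator would therefore need `buffer.count - 1` rather than `buffer.count`.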
diff --git a/GRDB/FTS/FTS5TokenizerDescriptor.swift b/GRDB/FTS/FTS5TokenizerDescriptor.swift
@@ -210,5 +210,70 @@ public struct FTS5TokenizerDescriptor: Sendable {
         }
         return FTS5TokenizerDescriptor(components: components)
     }
+
+    #if GRDBCUSTOMSQLITE || GRDBCIPHER
+    /// The "trigram" tokenizer.
+    ///
+    /// For example:
+    ///
+    /// ```swift
+    /// try db.create(virtualTable: "book", using: FTS5()) { t in
+    ///     t.tokenizer = .trigram()
+    /// }
+    /// ```
+    ///
+    /// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
+    ///
+    /// - parameters:
+    ///     - matching: By default SQLite will perform case insensitive
+    ///     matching and not remove diacritics before matching.
+    public static func trigram(
+        matching: FTS5.TrigramTokenizerMatching = .caseInsensitive
+    ) -> FTS5TokenizerDescriptor {
+        var components = ["trigram"]
+        switch matching {
+        case .caseInsensitive:
+            break
+        case .caseInsensitiveRemovingDiacritics:
+            components.append(contentsOf: ["remove_diacritics", "1"])
+        case .caseSensitive:
+            components.append(contentsOf: ["case_sensitive", "1"])
+        }
+        
+        return FTS5TokenizerDescriptor(components: components)
+    }
+    #else
+    /// The "trigram" tokenizer.
+    ///
+    /// For example:
+    ///
+    /// ```swift
+    /// try db.create(virtualTable: "book", using: FTS5()) { t in
+    ///     t.tokenizer = .trigram()
+    /// }
+    /// ```
+    ///
+    /// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
+    ///
+    /// - parameters:
+    ///     - matching: By default SQLite will perform case insensitive
+    ///     matching and not remove diacritics before matching.
+    @available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) // SQLite 3.35.0+ (the trigram tokenizer itself needs 3.34)
+    public static func trigram(
+        matching: FTS5.TrigramTokenizerMatching = .caseInsensitive
+    ) -> FTS5TokenizerDescriptor {
+        var components = ["trigram"]
+        switch matching {
+        case .caseInsensitive:
+            break
+        case .caseInsensitiveRemovingDiacritics:
+            components.append(contentsOf: ["remove_diacritics", "1"])
+        case .caseSensitive:
+            components.append(contentsOf: ["case_sensitive", "1"])
+        }
+
+        return FTS5TokenizerDescriptor(components: components)
+    }
+    #endif
 }
 #endif
diff --git a/Tests/GRDBTests/FTS5TableBuilderTests.swift b/Tests/GRDBTests/FTS5TableBuilderTests.swift
@@ -166,7 +166,89 @@ class FTS5TableBuilderTests: GRDBTestCase {
             assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''unicode61'' ''tokenchars'' ''-.''')")
         }
     }
+    
+    func testTrigramTokenizer() throws {
+        #if GRDBCUSTOMSQLITE || GRDBCIPHER
+        guard sqlite3_libversion_number() >= 3034000 else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #else
+        guard #available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #endif
+        
+        let dbQueue = try makeDatabaseQueue()
+        try dbQueue.inDatabase { db in
+            try db.create(virtualTable: "documents", using: FTS5()) { t in
+                t.tokenizer = .trigram()
+                t.column("content")
+            }
+            assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram''')")
+        }
+    }
+    
+    func testTrigramTokenizerCaseInsensitive() throws {
+        #if GRDBCUSTOMSQLITE || GRDBCIPHER
+        guard sqlite3_libversion_number() >= 3034000 else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #else
+        guard #available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #endif
+        
+        let dbQueue = try makeDatabaseQueue()
+        try dbQueue.inDatabase { db in
+            try db.create(virtualTable: "documents", using: FTS5()) { t in
+                t.tokenizer = .trigram(matching: .caseInsensitive)
+                t.column("content")
+            }
+            assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram''')")
+        }
+    }
 
+    func testTrigramTokenizerCaseSensitive() throws {
+        #if GRDBCUSTOMSQLITE || GRDBCIPHER
+        guard sqlite3_libversion_number() >= 3034000 else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #else
+        guard #available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) else {
+            throw XCTSkip("FTS5 trigram tokenizer is not available")
+        }
+        #endif
+        
+        let dbQueue = try makeDatabaseQueue()
+        try dbQueue.inDatabase { db in
+            try db.create(virtualTable: "documents", using: FTS5()) { t in
+                t.tokenizer = .trigram(matching: .caseSensitive)
+                t.column("content")
+            }
+            assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram'' ''case_sensitive'' ''1''')")
+        }
+    }
+    
+    func testTrigramTokenizerCaseInsensitiveRemovingDiacritics() throws {
+        #if GRDBCUSTOMSQLITE || GRDBCIPHER
+        guard sqlite3_libversion_number() >= 3045000 else {
+            throw XCTSkip("FTS5 trigram tokenizer remove_diacritics is not available")
+        }
+                
+        let dbQueue = try makeDatabaseQueue()
+        try dbQueue.inDatabase { db in
+            try db.create(virtualTable: "documents", using: FTS5()) { t in
+                t.tokenizer = .trigram(matching: .caseInsensitiveRemovingDiacritics)
+                t.column("content")
+            }
+            assertDidExecute(sql: "CREATE VIRTUAL TABLE \"documents\" USING fts5(content, tokenize='''trigram'' ''remove_diacritics'' ''1''')")
+        }
+        #else
+        throw XCTSkip("FTS5 trigram tokenizer remove_diacritics is not available")
+        #endif
+    }
+    
     func testColumns() throws {
         let dbQueue = try makeDatabaseQueue()
         try dbQueue.inDatabase { db in
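
The SQL expected by the tests above, such as `tokenize='''trigram'' ''case_sensitive'' ''1'''`, looks dense because the tokenizer components are quoted twice: once individually, and once more as a whole. The sketch below reproduces that output from a list of components. `sqlLiteral` and `tokenizeArgument` are hypothetical helpers, and this is an assumption about how the string could be built, not a copy of GRDB's implementation.

```swift
import Foundation

/// Quotes `text` as a SQL string literal by doubling embedded single quotes.
func sqlLiteral(_ text: String) -> String {
    "'" + text.replacingOccurrences(of: "'", with: "''") + "'"
}

/// Builds the `tokenize=...` argument from tokenizer components.
func tokenizeArgument(components: [String]) -> String {
    // First pass: quote each component, then join them with spaces.
    let inner = components.map(sqlLiteral).joined(separator: " ")
    // Second pass: quote the joined string as a single literal.
    return "tokenize=" + sqlLiteral(inner)
}

print(tokenizeArgument(components: ["trigram"]))
// tokenize='''trigram'''
print(tokenizeArgument(components: ["trigram", "case_sensitive", "1"]))
// tokenize='''trigram'' ''case_sensitive'' ''1'''
```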
diff --git a/Tests/GRDBTests/FTS5TokenizerTests.swift b/Tests/GRDBTests/FTS5TokenizerTests.swift
