@@ -137,7 +137,7 @@ __Who is this for?__
137
137
<span style="color:#ABABAB;">arm:</span> <b>0.02</b> GB/s
138
138
</td>
139
139
<td align="center">
140
- <code>sz_find_charset</code><br/>
140
+ <code>sz_find_byteset</code><br/>
141
141
<span style="color:#ABABAB;">x86:</span> <b>4.08</b> ·
142
142
<span style="color:#ABABAB;">arm:</span> <b>3.22</b> GB/s
143
143
</td>
@@ -155,7 +155,7 @@ __Who is this for?__
155
155
</td>
156
156
<td align="center">⚪</td>
157
157
<td align="center">
158
- <code>sz_rfind_charset</code><br/>
158
+ <code>sz_rfind_byteset</code><br/>
159
159
<span style="color:#ABABAB;">x86:</span> <b>0.43</b> ·
160
160
<span style="color:#ABABAB;">arm:</span> <b>0.23</b> GB/s
161
161
</td>
@@ -181,7 +181,7 @@ __Who is this for?__
181
181
<span style="color:#ABABAB;">arm:</span> <b>5.9</b> MB/s
182
182
</td>
183
183
<td align="center">
184
- <code>sz_generate</code><br/>
184
+ <code>sz_fill_random</code><br/>
185
185
<span style="color:#ABABAB;">x86:</span> <b>56.2</b> ·
186
186
<span style="color:#ABABAB;">arm:</span> <b>25.8</b> MB/s
187
187
</td>
@@ -203,7 +203,7 @@ __Who is this for?__
203
203
<span style="color:#ABABAB;">arm:</span> <b>140.0</b> MB/s
204
204
</td>
205
205
<td align="center">
206
- <code>sz_look_up_transform</code><br/>
206
+ <code>sz_lookup</code><br/>
207
207
<span style="color:#ABABAB;">x86:</span> <b>21.2</b> ·
208
208
<span style="color:#ABABAB;">arm:</span> <b>8.5</b> GB/s
209
209
</td>
@@ -247,7 +247,7 @@ __Who is this for?__
247
247
<span style="color:#ABABAB;">arm:</span> <b>2,220</b> ns
248
248
</td>
249
249
<td align="center">
250
- <code>sz_edit_distance</code><br/>
250
+ <code>sz_levenshtein_distance</code><br/>
251
251
<span style="color:#ABABAB;">x86:</span> <b>99</b> ·
252
252
<span style="color:#ABABAB;">arm:</span> <b>180</b> ns
253
253
</td>
@@ -265,7 +265,7 @@ __Who is this for?__
265
265
<span style="color:#ABABAB;">arm:</span> <b>367</b> ms
266
266
</td>
267
267
<td align="center">
268
- <code>sz_alignment_score</code><br/>
268
+ <code>sz_needleman_wunsch_score</code><br/>
269
269
<span style="color:#ABABAB;">x86:</span> <b>73</b> ·
270
270
<span style="color:#ABABAB;">arm:</span> <b>177</b> ms
271
271
</td>
@@ -396,8 +396,8 @@ x: int = text.find_first_of('chars', start=0, end=sys.maxsize)
396
396
x: int = text.find_last_of('chars', start=0, end=sys.maxsize)
397
397
x: int = text.find_first_not_of('chars', start=0, end=sys.maxsize)
398
398
x: int = text.find_last_not_of('chars', start=0, end=sys.maxsize)
399
- x: Strs = text.split_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
400
- x: Strs = text.rsplit_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
399
+ x: Strs = text.split_byteset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
400
+ x: Strs = text.rsplit_byteset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
401
401
```
402
402
403
403
You can also transform the string using Look-Up Tables (LUTs), mapping it to a different character set.
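As a rough illustration of what a 256-entry look-up table does, the sketch below uses only Python's built-in `bytes.translate`, not StringZilla's own API. The helper names `make_uppercase_lut` and `apply_lut` are hypothetical and exist only for this example.

```python
def make_uppercase_lut() -> bytes:
    """Build a 256-byte table mapping ASCII lowercase letters to uppercase."""
    lut = bytearray(range(256))  # identity mapping: every byte maps to itself
    for code in range(ord("a"), ord("z") + 1):
        lut[code] = code - 32  # 'a' (97) -> 'A' (65), and so on
    return bytes(lut)


def apply_lut(data: bytes, lut: bytes) -> bytes:
    """Replace each byte of `data` with the table entry at that byte's value."""
    return data.translate(lut)


print(apply_lut(b"hello, world!", make_uppercase_lut()))  # b'HELLO, WORLD!'
```

Every byte indexes the table directly, so the cost per byte is one load regardless of how complicated the mapping is.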
@@ -453,8 +453,8 @@ StringZilla saves a lot of memory by viewing existing memory regions as substrin
453
453
``` py
454
454
x: SplitIterator[Str] = text.split_iter(separator=' ', keepseparator=False)
455
455
x: SplitIterator[Str] = text.rsplit_iter(separator=' ', keepseparator=False)
456
- x: SplitIterator[Str] = text.split_charset_iter(separator='chars', keepseparator=False)
457
- x: SplitIterator[Str] = text.rsplit_charset_iter(separator='chars', keepseparator=False)
456
+ x: SplitIterator[Str] = text.split_byteset_iter(separator='chars', keepseparator=False)
457
+ x: SplitIterator[Str] = text.rsplit_byteset_iter(separator='chars', keepseparator=False)
458
458
```
459
459
460
460
StringZilla can easily be 10x more memory efficient than native Python classes for tokenization.
@@ -654,7 +654,7 @@ By design, StringZilla has a couple of notable differences from LibC:
654
654
655
655
That way `sz_find` and `sz_rfind` are similar to `strstr` and `strrstr` in LibC.
656
656
Similarly, `sz_find_byte` and `sz_rfind_byte` replace `memchr` and `memrchr`.
657
- The `sz_find_charset` maps to `strspn` and `strcspn`, while `sz_rfind_charset` has no sibling in LibC.
657
+ The `sz_find_byteset` maps to `strspn` and `strcspn`, while `sz_rfind_byteset` has no sibling in LibC.
658
658
659
659
<table>
660
660
<tr>
@@ -679,11 +679,11 @@ The `sz_find_charset` maps to `strspn` and `strcspn`, while `sz_rfind_charset` h
679
679
</tr>
680
680
<tr>
681
681
<td><code>strcspn(haystack, needles)</code></td>
682
- <td><code>sz_rfind_charset(haystack, haystack_length, needles_bitset)</code></td>
682
+ <td><code>sz_rfind_byteset(haystack, haystack_length, needles_bitset)</code></td>
683
683
</tr>
684
684
<tr>
685
685
<td><code>strspn(haystack, needles)</code></td>
686
- <td><code>sz_find_charset(haystack, haystack_length, needles_bitset)</code></td>
686
+ <td><code>sz_find_byteset(haystack, haystack_length, needles_bitset)</code></td>
687
687
</tr>
688
688
<tr>
689
689
<td><code>memmem(haystack, haystack_length, needle, needle_length)</code>, <code>strstr</code></td>
@@ -923,7 +923,7 @@ StringZilla provides a convenient `partition` function, which returns a tuple of
923
923
``` cpp
924
924
auto parts = haystack.partition(':'); // Matching a character
925
925
auto [before, match, after] = haystack.partition(':'); // Structure unpacking
926
- auto [before, match, after] = haystack.partition(sz::char_set(":;")); // Character-set argument
926
+ auto [before, match, after] = haystack.partition(sz::byteset(":;")); // Character-set argument
927
927
auto [before, match, after] = haystack.partition(" : "); // String argument
928
928
auto [before, match, after] = haystack.rpartition(sz::whitespaces_set()); // Split around the last whitespace
929
929
```
@@ -951,8 +951,8 @@ Here is a sneak peek of the most useful ones.
951
951
``` cpp
952
952
text.hash(); // -> 64 bit unsigned integer
953
953
text.ssize(); // -> 64 bit signed length to avoid `static_cast<std::ssize_t>(text.size())`
954
- text.contains_only(" \w\t"); // == text.find_first_not_of(sz::char_set(" \w\t")) == npos;
955
- text.contains(sz::whitespaces_set()); // == text.find(sz::char_set(sz::whitespaces_set())) != npos;
954
+ text.contains_only(" \w\t"); // == text.find_first_not_of(sz::byteset(" \w\t")) == npos;
955
+ text.contains(sz::whitespaces_set()); // == text.find(sz::byteset(sz::whitespaces_set())) != npos;
956
956
957
957
// Simpler slicing than `substr`
958
958
text.front(10); // -> sz::string_view
@@ -997,7 +997,7 @@ To avoid those, StringZilla provides lazily-evaluated ranges, compatible with th
997
997
998
998
``` cpp
999
999
for (auto line : haystack.split("\r\n"))
1000
-     for (auto word : line.split(sz::char_set(" \w\t.,;:!?")))
1000
+     for (auto word : line.split(sz::byteset(" \w\t.,;:!?")))
1001
1001
        std::cout << word << std::endl;
1002
1002
```
1003
1003
@@ -1006,9 +1006,9 @@ It also allows interleaving matches, if you want both inclusions of `xx` in `xxx
1006
1006
Debugging pointer offsets is not a pleasant exercise, so keep the following functions in mind.
1007
1007
1008
1008
- `haystack.[r]find_all(needle, interleaving)`
1009
- - `haystack.[r]find_all(sz::char_set(""))`
1009
+ - `haystack.[r]find_all(sz::byteset(""))`
1010
1010
- `haystack.[r]split(needle)`
1011
- - `haystack.[r]split(sz::char_set(""))`
1011
+ - `haystack.[r]split(sz::byteset(""))`
1012
1012
1013
1013
For $N$ matches the split functions will report $N+1$ matches, potentially including empty strings.
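The $N+1$ rule can be checked with a small sketch in plain Python. This is not StringZilla's range implementation; `split_all` is a hypothetical helper that splits on every non-overlapping match and keeps empty parts.

```python
def split_all(haystack: str, needle: str) -> list[str]:
    """Split on every non-overlapping match of `needle`, keeping empty parts."""
    if not needle:
        raise ValueError("needle must not be empty")
    parts = []
    start = 0
    while True:
        index = haystack.find(needle, start)
        if index == -1:
            parts.append(haystack[start:])  # the tail after the last match
            return parts
        parts.append(haystack[start:index])  # may be empty for adjacent matches
        start = index + len(needle)


text = "a,b,,c,"
parts = split_all(text, ",")
print(text.count(","), len(parts), parts)  # 4 5 ['a', 'b', '', 'c', '']
```

Adjacent separators and a trailing separator both produce empty strings, which is why the part count is always one more than the match count.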
1014
1014
Ranges have a few convenience methods as well:
@@ -1065,7 +1065,7 @@ sz::string random_string(std::size_t length, char const *alphabet, std::size_t c
1065
1065
```
1066
1066
1067
1067
Mouthful and slow.
1068
- StringZilla provides a C native method - `sz_generate` and a convenient C++ wrapper - `sz::generate`.
1068
+ StringZilla provides a C native method - `sz_fill_random` and a convenient C++ wrapper - `sz::generate`.
1069
1069
Similar to Python it also defines the commonly used character sets.
1070
1070
1071
1071
```cpp
@@ -1085,9 +1085,9 @@ In text processing, it's often necessary to replace all occurrences of a specifi
1085
1085
Standard library functions may not offer the most efficient or convenient methods for performing bulk replacements, especially when dealing with large strings or performance-critical applications.
1086
1086
1087
1087
- `haystack.replace_all(needle_string, replacement_string)`
1088
- - `haystack.replace_all(sz::char_set(""), replacement_string)`
1088
+ - `haystack.replace_all(sz::byteset(""), replacement_string)`
1089
1089
- `haystack.try_replace_all(needle_string, replacement_string)`
1090
- - `haystack.try_replace_all(sz::char_set(""), replacement_string)`
1090
+ - `haystack.try_replace_all(sz::byteset(""), replacement_string)`
1091
1091
- `haystack.transform(sz::look_up_table::identity())`
1092
1092
- `haystack.transform(sz::look_up_table::identity(), haystack.data())`
1093
1093
@@ -1250,8 +1250,8 @@ sz::find("Hello, world!", "world") // 7
1250
1250
sz::rfind("Hello, world!", "world") // 7
1251
1251
1252
1252
// Generalizations of `memchr::memrchr[123]`
1253
- sz::find_char_from("Hello, world!", "world") // 2
1254
- sz::rfind_char_from("Hello, world!", "world") // 11
1253
+ sz::find_byte_from("Hello, world!", "world") // 2
1254
+ sz::rfind_byte_from("Hello, world!", "world") // 11
1255
1255
```
1256
1256
1257
1257
Unlike `memchr`, the throughput of `stringzilla` is [high in both normal and reverse-order searches][memchr-benchmarks].
@@ -1268,10 +1268,10 @@ let my_cow_str = Cow::from(&my_string);
1268
1268
// Use the generic function with a String
1269
1269
assert_eq!(my_string.sz_find("world"), Some(7));
1270
1270
assert_eq!(my_string.sz_rfind("world"), Some(7));
1271
- assert_eq!(my_string.sz_find_char_from("world"), Some(2));
1272
- assert_eq!(my_string.sz_rfind_char_from("world"), Some(11));
1273
- assert_eq!(my_string.sz_find_char_not_from("world"), Some(0));
1274
- assert_eq!(my_string.sz_rfind_char_not_from("world"), Some(12));
1271
+ assert_eq!(my_string.sz_find_byte_from("world"), Some(2));
1272
+ assert_eq!(my_string.sz_rfind_byte_from("world"), Some(11));
1273
+ assert_eq!(my_string.sz_find_byte_not_from("world"), Some(0));
1274
+ assert_eq!(my_string.sz_rfind_byte_not_from("world"), Some(12));
1275
1275
1276
1276
// Same works for &str and Cow<'_, str>
1277
1277
assert_eq!(my_str.sz_find("world"), Some(7));
@@ -1315,7 +1315,7 @@ s[s.findLast(substring: "o")!...] // "o StringZilla. 👋")
1315
1315
s[s.findFirst(characterFrom: "aeiou")!...] // "ello, world! Welcome to StringZilla. 👋")
1316
1316
s[s.findLast(characterFrom: "aeiou")!...] // "a. 👋")
1317
1317
s[s.findFirst(characterNotFrom: "aeiou")!...] // "Hello, world! Welcome to StringZilla. 👋"
1318
- s.editDistance(from: "Hello, world!")! // 29
1318
+ s.levenshteinDistance(from: "Hello, world!")! // 29
1319
1319
```
1320
1320
1321
1321
## Algorithms & Design Decisions 📚
@@ -1561,7 +1561,7 @@ Most StringZilla operations are byte-level, so they work well with ASCII and UTF
1561
1561
In some cases, like edit-distance computation, the result of byte-level evaluation and character-level evaluation may differ.
1562
1562
So StringZilla provides following functions to work with Unicode:
1563
1563
1564
- - `sz_edit_distance_utf8` - computes the Levenshtein distance between two UTF-8 strings.
1564
+ - `sz_levenshtein_distance_utf8` - computes the Levenshtein distance between two UTF-8 strings.
1565
1565
- `sz_hamming_distance_utf8` - computes the Hamming distance between two UTF-8 strings.
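
To see why byte-level and character-level edit distances can disagree, here is a minimal sketch in plain Python. It is a textbook dynamic-programming Levenshtein, not StringZilla's implementation, and the function name `levenshtein` is chosen only for this example.

```python
def levenshtein(a, b) -> int:
    """Classic dynamic-programming Levenshtein distance over any two sequences."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


first, second = "año", "ano"
by_characters = levenshtein(first, second)
by_bytes = levenshtein(first.encode("utf-8"), second.encode("utf-8"))
print(by_characters, by_bytes)  # 1 2
```

The letter `ñ` occupies two bytes in UTF-8, so replacing it with `n` is one substitution at the character level but a substitution plus a deletion at the byte level.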
1566
1566
1567
1567
Java, JavaScript, Python 2, C#, and Objective-C, however, use wide characters (`wchar`) - two byte long codes, instead of the more reasonable fixed-length UTF32 or variable-length UTF8.