implementing hashtags per UAX31 standard #40

farooqkz · 2023-08-09T12:29:28Z

closes #8, closes #32

Old

Hello. This is a draft. I am stuck with Rust stuff :)

I need some hints: how to pass the sets to hashtag, and how to implement From<PropertiesError> for CustomError. I've also upgraded the Rust toolchain to 1.71.0 because icu_properties requires it.
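For the From<PropertiesError> question, a minimal sketch of the usual pattern follows. Both types here are local stand-ins: the real PropertiesError comes from icu_properties and the real CustomError lives in this crate, so the variant name `UnicodeProperties` and the helper functions are assumptions made only for illustration.

```rust
// Stand-in for the library error type; the real `PropertiesError`
// lives in icu_properties and is NOT reproduced here.
#[derive(Debug)]
struct PropertiesError(String);

// Stand-in for the parser's own error type (variant name is hypothetical).
#[derive(Debug, PartialEq)]
enum CustomError {
    UnicodeProperties(String),
}

// With this impl in place, `?` can convert the library error automatically.
impl From<PropertiesError> for CustomError {
    fn from(err: PropertiesError) -> Self {
        CustomError::UnicodeProperties(err.0)
    }
}

// Hypothetical loader that may fail with the library error.
fn load_set_size(fail: bool) -> Result<usize, PropertiesError> {
    if fail {
        Err(PropertiesError("missing data".to_string()))
    } else {
        Ok(3)
    }
}

// `?` turns a PropertiesError into a CustomError through the From impl above.
fn prepare_hashtag_sets(fail: bool) -> Result<usize, CustomError> {
    let size = load_set_size(fail)?;
    Ok(size)
}
```

Calling `prepare_hashtag_sets(true)` returns the converted error without any explicit `map_err`.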

src/parser/parse_from_text/text_elements.rs

Simon-Laux · 2023-08-09T16:56:21Z

icu_properties seems big, I would prefer if we could live without it. Like there is a difference between an emoji regex that checks for each emoji and one that checks for unicode ranges instead. Like if code point is in this range instead of comparing with all code points of that range - if possible that would be faster, less code and less binary size.
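A minimal sketch of the range-based check described above, assuming a small hand-written table. The ranges below are illustrative only and are not the UAX31 data; the names `ILLUSTRATIVE_RANGES` and `in_any_range` are hypothetical.

```rust
use std::ops::RangeInclusive;

// Illustrative code point ranges only (ASCII letters plus part of Latin-1).
// A real table would be derived from the Unicode Character Database.
const ILLUSTRATIVE_RANGES: [RangeInclusive<u32>; 5] = [
    0x41..=0x5A, // A-Z
    0x61..=0x7A, // a-z
    0xC0..=0xD6,
    0xD8..=0xF6,
    0xF8..=0xFF,
];

// One comparison pair per range instead of one comparison per code point.
fn in_any_range(c: char) -> bool {
    let cp = c as u32;
    ILLUSTRATIVE_RANGES.iter().any(|r| r.contains(&cp))
}
```

The table stores two numbers per range, so its size grows with the number of ranges rather than with the number of code points covered.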

farooqkz · 2023-08-10T11:48:17Z

> icu_properties seems big, I would prefer if we could live without it. Like there is a difference between an emoji regex that checks for each emoji and one that checks for unicode ranges instead. Like if code point is in this range instead of comparing with all code points of that range - if possible that would be faster, less code and less binary size.

It turns out we have to extract from here:

https://www.unicode.org/Public/14.0.0/ucdxml/

Simon-Laux · 2023-08-10T17:26:26Z

Allowing only a subset and expanding later is better than allowing more and restricting later. It does not have to be 100% perfect on the first try.

farooqkz · 2023-08-13T11:40:03Z

> icu_properties seems big, I would prefer if we could live without it. Like there is a difference between an emoji regex that checks for each emoji and one that checks for unicode ranges instead. Like if code point is in this range instead of comparing with all code points of that range - if possible that would be faster, less code and less binary size.

Are you sure it's big? The binary size is only about 100 KB larger.

farooqkz · 2023-08-13T13:18:54Z

Okay we've got two ways:

1. Create our own hashtag content character detector crate, which I guess will add about 20-30 KB of binary size, and use it in this crate.
2. Use icu_properties and make it optional. With this option, hashtag detection is done exactly as UAX31 specifies. Without it, we consider any non-whitespace character a content character, but the binary is ~100 KB smaller.
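The second option could be wired up with conditional compilation. This is only a sketch: the Cargo feature name `uax31` is hypothetical, and the strict branch uses a placeholder predicate instead of real icu_properties calls.

```rust
// Strict path: compiled only when the (hypothetical) `uax31` feature is enabled.
#[cfg(feature = "uax31")]
fn is_hashtag_content_char(c: char) -> bool {
    // Placeholder for a real Unicode property lookup; not the UAX31 rules.
    c.is_alphanumeric() || c == '_'
}

// Fallback path: any non-whitespace character counts as hashtag content.
#[cfg(not(feature = "uax31"))]
fn is_hashtag_content_char(c: char) -> bool {
    !c.is_whitespace()
}

// Returns the text after '#' up to the first non-content character, if any.
fn hashtag_content(input: &str) -> Option<&str> {
    let rest = input.strip_prefix('#')?;
    let end = rest
        .find(|c: char| !is_hashtag_content_char(c))
        .unwrap_or(rest.len());
    if end == 0 {
        None
    } else {
        Some(&rest[..end])
    }
}
```

Callers only see `is_hashtag_content_char`, so the choice between the two behaviours stays a build-time decision.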

farooqkz · 2023-08-14T09:24:42Z

Okay. My new estimate is that if we implement our own solution, it'll be only about 10KB additional binary size.

farooqkz · 2023-08-14T10:42:07Z

> Okay. My new estimate is that if we implement our own solution, it'll be only about 10KB additional binary size.

Okay I was wrong. See my latest commit. It's about 36KB :)

Also @Simon-Laux, for some reason I can't tell, the tests are not passing :/

Simon-Laux · 2023-08-14T18:36:13Z

> Also @Simon-Laux, for some reason I can't tell, the tests are not passing :/

If you read the clippy error, it complains about being run from a branch on your fork. Also, litemap seems to use an unstable feature.

src/parser/parse_from_text/hashtag_content_char_ranges.rs

src/parser/parse_from_text/text_elements.rs

farooqkz · 2023-08-15T10:45:47Z

Okay. In these final commits:

- I've moved the function for finding the range to its own file, alongside the ranges themselves.
- I've used RangeInclusive and its contains method instead of [u32; 2].
- A new enum has been defined as the result type of the find range function.
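The three changes listed above could look roughly like the sketch below. It assumes the table is sorted and non-overlapping; the enum `RangeLookup`, the function `find_range`, and the sample ranges are hypothetical and do not reproduce the code in this PR.

```rust
use std::cmp::Ordering;
use std::ops::RangeInclusive;

// Hypothetical result type for the lookup.
#[derive(Debug, PartialEq)]
enum RangeLookup {
    Found(usize), // index of the range that contains the code point
    NotFound,
}

// Illustrative table; must stay sorted by start and non-overlapping.
const SAMPLE_RANGES: [RangeInclusive<u32>; 4] =
    [0x30..=0x39, 0x41..=0x5A, 0x61..=0x7A, 0xC0..=0xFF];

// Binary search: each probe compares the code point with one range's bounds.
fn find_range(ranges: &[RangeInclusive<u32>], cp: u32) -> RangeLookup {
    let probe = |r: &RangeInclusive<u32>| {
        if cp < *r.start() {
            Ordering::Greater // this range lies after the code point
        } else if cp > *r.end() {
            Ordering::Less // this range lies before the code point
        } else {
            Ordering::Equal // r.contains(&cp)
        }
    };
    match ranges.binary_search_by(probe) {
        Ok(index) => RangeLookup::Found(index),
        Err(_) => RangeLookup::NotFound,
    }
}
```

A binary search keeps the number of comparisons logarithmic in the number of ranges, which may matter once the table holds several hundred entries.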

But the tests for German umlauts still fail, even though the left and right sides of the assertions seem to be equal:

---- text_to_ast::text_only::german_umlaut_hashtag stdout ----
thread 'text_to_ast::text_only::german_umlaut_hashtag' panicked at 'assertion failed: `(left != right)`
  left: `[Tag("#bücher"), Text(" "), Tag("#Ängste")]`,
 right: `[Tag("#bücher"), Text(" "), Tag("#Ängste")]`', tests/text_to_ast/text_only.rs:114:5
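One detail in the output above may be worth checking: the message reads `(left != right)`, which is what Rust's `assert_ne!` reports when both sides turn out to be equal, whereas `assert_eq!` reports `(left == right)`. The sketch below only demonstrates that difference with a hypothetical `tokens` helper; it is a reading of the panic message, not a confirmed diagnosis of this test.

```rust
// Hypothetical stand-in for the parser output in the failing test.
fn tokens() -> Vec<String> {
    vec!["#bücher".to_string(), " ".to_string(), "#Ängste".to_string()]
}

// Passes: both sides are equal, and assert_eq! requires equality.
fn check_with_assert_eq() {
    assert_eq!(tokens(), tokens());
}

// Panics: both sides are equal, but assert_ne! requires them to differ.
fn check_with_assert_ne() {
    assert_ne!(tokens(), tokens());
}
```

If the test at tests/text_to_ast/text_only.rs:114 uses `assert_ne!`, switching it to `assert_eq!` would be the first thing to try.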

Simon-Laux

I still want a unit test that checks all ranges, and more example tests with different "real" use cases. But other than that the code looks good I'd say (didn't test nor benchmark it yet)

farooqkz · 2023-08-15T18:11:27Z

> I still want a unit test that checks all ranges, and more example tests with different "real" use cases. But other than that the code looks good I'd say (didn't test nor benchmark it yet)

So you want test cases for the range function separately? I will add more real use case tests. But I also need a hint from you on why these tests fail :/

Simon-Laux · 2023-08-16T12:41:26Z

> So you want test cases for the range function separately? I will add more real use case tests. But I also need a hint from you on why these tests fail :/

Yes, you can also test for the German umlauts there (ö, ä, ü) and the sharp s (ß).
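A unit test along these lines might look like the sketch below. The table and the names `TEST_RANGES`, `is_content_char`, and `all_range_bounds_accepted` are hypothetical stand-ins, not the ranges or functions from this PR.

```rust
use std::ops::RangeInclusive;

// Illustrative table only; the real ranges come from the UCD extraction.
const TEST_RANGES: [RangeInclusive<u32>; 4] =
    [0x30..=0x39, 0x41..=0x5A, 0x61..=0x7A, 0xC0..=0xFF];

fn is_content_char(c: char) -> bool {
    let cp = c as u32;
    TEST_RANGES.iter().any(|r| r.contains(&cp))
}

// Checks every range in the table: both endpoints must be accepted.
fn all_range_bounds_accepted() -> bool {
    TEST_RANGES.iter().all(|r| {
        [*r.start(), *r.end()]
            .iter()
            .copied()
            .filter_map(char::from_u32)
            .all(is_content_char)
    })
}

#[test]
fn german_letters_are_content_chars() {
    // The umlauts and the sharp s mentioned in the review.
    assert!("öäüß".chars().all(is_content_char));
}

#[test]
fn every_range_accepts_its_bounds() {
    assert!(all_range_bounds_accepted());
}
```

Testing the endpoints of every range catches off-by-one mistakes in the table without enumerating every code point.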

farooqkz · 2023-08-23T08:40:45Z

@Simon-Laux Ready to merge :D


Simon-Laux reviewed Aug 9, 2023

src/parser/parse_from_text/text_elements.rs (outdated)

Simon-Laux reviewed Aug 9, 2023

src/parser/parse_from_text/text_elements.rs (outdated)

farooqkz marked this pull request as ready for review August 14, 2023 10:51

farooqkz added 12 commits August 14, 2023 14:25

implementing hashtags per UAX31 standard

6fe3f97

upgrade rust-toolchain

2363fe6

upgrade rust toolchain to 1.71.0

4b5c25c

no alternative hashtag characters

7aba508

fixes

3208d29

using lazy_static

63ce1e4

new method of hashtag detection

dc261d5

no lazy_static

161417f

good formatting

e117b66

more rusty

f55ba99

revert required rust toolchain

7156b02

move character ranges to their own file

4eef3a5

farooqkz force-pushed the i8 branch from 7ebf1c0 to 4eef3a5 Compare August 14, 2023 11:00

Simon-Laux requested changes Aug 14, 2023

src/parser/parse_from_text/hashtag_content_char_ranges.rs (outdated)

src/parser/parse_from_text/text_elements.rs (outdated)

src/parser/parse_from_text/text_elements.rs (outdated)

farooqkz added 4 commits August 15, 2023 13:44

remove unused deps

c75bb63

moving find_range func to its own file, refactor

885a409

another refactor to make it more idiomatic

535f3b6

fix formatting

8326dd3

farooqkz requested a review from Simon-Laux August 15, 2023 10:31

Simon-Laux requested changes Aug 15, 2023


add new hashtag testcases with Persian letters

090f160

farooqkz added 10 commits August 18, 2023 09:18

fix formatting

63f5298

using const instead of static

446d2f5

make clippy happy

2894d8a

correct tests

7b3bda2

making many modules and funcs pub

8888c0f

fix lifetime thingie

edf5fba

roper formatting

40b20e5

update CHANGELOG

64c9c18

correct texts

b142c78

let's make clippy happy!

d785d85

Simon-Laux and others added 2 commits August 23, 2023 15:37

move unit tests into source code

96f4201

add 2 more unit tests add space to early exit

add script to extract hashtag content char ranges and a commment referencing to UCD and the script

bc1464a

Simon-Laux approved these changes Aug 24, 2023


cargo fmt

9d81945

Simon-Laux merged commit c049958 into deltachat:master Aug 24, 2023
3 of 5 checks passed
