AggressiveTokenizer uses \W which allows `"_"` underscore character. However ALL other tokenizers REMOVE the underscore characters #523

niftylettuce · 2020-04-28T01:03:19Z

e.g. I'm left with tokens like "____________________________"

Pretty sure this affects all locales of stemming/tokenization as well

niftylettuce · 2020-04-28T05:07:18Z

AggressiveTokenizer uses \W which allows "_" underscore character. However SOME other tokenizers, e.g. AggressiveTokenizerEs, AggressiveTokenizerFr, etc. will REMOVE the underscore characters, as they are not included in their .split tokenizer regular expression. There is no consistency here among _ underscore and - hyphen characters as far as I can tell, but I feel there should be.

natural/lib/natural/tokenizers/aggressive_tokenizer_es.js

Line 35 in 73acfeb

return this.trim(text.split(/[^a-zA-Zá-úÁ-ÚñÑüÜ]+/));

(removes)

natural/lib/natural/tokenizers/aggressive_tokenizer_fa.js

Line 41 in 73acfeb

return text.replace(new RegExp('\.\:\+\-\=\(\)\"\'\!\?\،\,\؛\;', 'g'), ' ');

(does NOT remove)

natural/lib/natural/tokenizers/aggressive_tokenizer_fr.js

Line 35 in 73acfeb

return this.trim(text.split(/[^a-z0-9äâàéèëêïîöôùüûœç]+/i));

(removes)

natural/lib/natural/tokenizers/aggressive_tokenizer_id.js

Line 36 in 73acfeb

result = text.replace(/[^a-z0-9 -]/g, ' ').replace(/( +)/g, ' ');

(removes)

natural/lib/natural/tokenizers/aggressive_tokenizer_it.js

Line 35 in 73acfeb

return this.trim(text.split(/\W+/));

(does NOT remove, since `\W` treats "_" as a word character)

natural/lib/natural/tokenizers/aggressive_tokenizer_nl.js

Line 35 in 73acfeb

return this.trim(text.split(/[^a-zA-Z0-9_']+/));

(does NOT remove)

natural/lib/natural/tokenizers/aggressive_tokenizer_no.js

Line 38 in 73acfeb

return this.trim(text.split(/[^A-Za-z0-9_æøåÆØÅäÄöÖüÜ]+/));

(does NOT remove)

natural/lib/natural/tokenizers/aggressive_tokenizer_pl.js

Line 44 in 73acfeb

return this.withoutEmpty(this.clearText(text).split(' '));

(removes)

natural/lib/natural/tokenizers/aggressive_tokenizer_pt.js

Line 39 in 73acfeb

return this.withoutEmpty(this.trim(text.split(/[^a-zA-Zà-úÀ-Ú]/)));

(removes)

natural/lib/natural/tokenizers/aggressive_tokenizer_ru.js

Line 39 in 73acfeb

return text.replace(/[^a-zа-яё0-9]/gi, ' ').replace(/[\s\n]+/g, ' ').trim();

(removes)

natural/lib/natural/tokenizers/aggressive_tokenizer_sv.js

Line 39 in 73acfeb

return this.trim(text.split(/[^A-Za-z0-9_åÅäÄöÖüÜ]+/));

(does NOT remove)

natural/lib/natural/tokenizers/aggressive_tokenizer_vi.js

Line 34 in 73acfeb

    
return this.trim(text.split(/[^a-z0-9àáảãạăắằẳẵặâấầẩẫậéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵđ]+/i));

(removes)

natural/lib/natural/tokenizers/tokenizer_ja.js

Line 165 in 73acfeb

    
return token.replace(/[＿－・，、；：！？．。（）［］｛｝｢｣＠＊＼／＆＃％｀＾＋＜＝＞｜～≪≫─＄＂_\-･,､;:!?.｡()[\]{}「」@*\/&#%`^+<=>|~«»$"\s]+/g, '');

(removes)

P.S. I also see that "-" hyphen character is/is not preserved in some tokenizers as well.
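To make the inconsistency concrete, here is a minimal sketch that runs a few of the split patterns quoted above against one sample string. The regular expressions are copied from the quoted lines; the names `patterns`, `sample`, and `splitWith` are made up for illustration, and `.filter(Boolean)` is only a stand-in for the library's own trimming helpers, not a copy of them.

```javascript
// Split patterns copied from the tokenizer sources quoted in this thread.
const patterns = {
  it: /\W+/, // the same \W approach the issue title describes
  es: /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
  fr: /[^a-z0-9äâàéèëêïîöôùüûœç]+/i,
  nl: /[^a-zA-Z0-9_']+/,
  sv: /[^A-Za-z0-9_åÅäÄöÖüÜ]+/
};

// Sample with an underscore inside a word, a hyphen, and a run of underscores.
const sample = 'foo_bar baz-qux ____';

// Hypothetical helper: split, then drop empty strings left at the edges.
function splitWith(pattern, text) {
  return text.split(pattern).filter(Boolean);
}

const results = {};
for (const [locale, pattern] of Object.entries(patterns)) {
  results[locale] = splitWith(pattern, sample);
  console.log(locale, results[locale]);
}
// it, nl, sv keep underscores: [ 'foo_bar', 'baz', 'qux', '____' ]
// es, fr drop them:            [ 'foo', 'bar', 'baz', 'qux' ]
```

All five patterns split on the hyphen in this sample; only the underscore handling differs.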

Hugo-ter-Doest · 2020-05-03T17:43:50Z

I added _ as a non-word character in the aggressive tokenizer for English. I agree that tokenizers should handle _ consistently, but I will leave that for another time. The handling of - (dash) is a different story, as in some languages the dash binds words together into new words.

See #531
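For readers who want to see the idea in isolation, the following is a rough sketch of treating "_" as a separator alongside `\W`. It is not the code from #531; the function name `tokenizeWithoutUnderscores` and the sample string are hypothetical.

```javascript
// Hypothetical helper, not the code from the natural library.
function tokenizeWithoutUnderscores(text) {
  // "_" counts as a word character, so \W alone never matches it.
  // Adding it to the character class makes it a separator too.
  return text.split(/[\W_]+/).filter(token => token.length > 0);
}

const input = 'snake_case __init__ a_b';

const before = input.split(/\W+/);
const after = tokenizeWithoutUnderscores(input);

console.log(before); // [ 'snake_case', '__init__', 'a_b' ]
console.log(after);  // [ 'snake', 'case', 'init', 'a', 'b' ]
```

The hyphen is already covered by `\W`, so a language where the dash joins words would need a different character class.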

niftylettuce · 2020-05-03T22:11:57Z

Thanks @Hugo-ter-Doest. I also might suggest using a modified version of this regex https://github.com/regexhq/punctuation-regex/blob/master/index.js#L12

niftylettuce · 2020-05-04T05:10:52Z

@Hugo-ter-Doest if you can handle this I would gladly tip you - just email me at niftylettuce@gmail.com with your PayPal address - I'm writing a very advanced classifier that handles every edge case and I need tokenization to adhere strictly to each language's use of punctuation

niftylettuce · 2020-05-04T05:43:11Z

Also, I'm on the latest version, v2.1.2, and I still see underscores when using the English AggressiveTokenizer.

Hugo-ter-Doest · 2020-05-04T06:15:52Z

Can you give me some examples? I don't see underscores anymore. Also, the test I added to the spec is working fine:

it('should remove underscores', function() {
    expect(tokenizer.tokenize('_ hi_this_is_a_test_case_ for__removing___underscores_')).toEqual(['hi', 'this', 'is', 'a', 'test', 'case', 'for', 'removing', 'underscores']);
});

niftylettuce · 2020-05-04T06:40:24Z

It doesn't handle line-breaks

niftylettuce · 2020-05-04T06:40:57Z

For example:

const str = fs.readFileSync(path.join(__dirname, 'test.txt'));

test.txt:

On Wed Aug  number_rponnmzvyi   number_rponnmzvyi  at  number_rponnmzvyi : number_rponnmzvyi  Ulises Ponce wrote:

> Hi!
>
> Is there a command to insert the signature using a combination of keys and not
> to have sent the mail to insert it then?

I simply put it (them) into my (nmh) component files (components,
replcomps, forwcomps and so on).  That way you get them when you are
editing your message.  Also, by using comps files for specific
folders you can alter your .sig per folder (and other tricks).  See
the docs for (n)mh for all the details.

There might (must?) also be a way to get sedit to do it, but I've
been using gvim as my exmh message editor for a long time now.  I
load it with a command that loads some email-specific settings, eg,
to "syntax" colour-highlight the headers and quoted parts of an
email)... it would be possible to map some (vim) keys that would add
a sig (or even give a selection of sigs to choose from).

And there are all sorts of ways to have randomly-chosen sigs...
somewhere at  url_clbkqiipty .. ok, here we go:
 url_clbkqiipty
(Warning... it's old, May  number_rponnmzvyi ).

> Regards,
> Ulises

Hope this helps.

Cheers
Tony

_______________________________________________
Exmh-users mailing list
 email_qmonrtrsxz
 url_clbkqiipty

niftylettuce · 2020-05-04T06:42:26Z

In the meantime I've implemented https://github.com/yoshuawuyts/newline-remove

Hugo-ter-Doest · 2020-05-04T06:57:22Z

Did you try first segmenting the text in sentences and then tokenize in words?
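As a rough illustration of that two-step approach, the sketch below segments text first and then tokenizes each segment into words. It deliberately does not use natural's API: `splitSentences` and `tokenizeWords` are hypothetical helpers, the segmentation rule is naive (blank lines, or whitespace after `.`, `!`, or `?`), and the sample text is a shortened stand-in for a mail footer.

```javascript
// Hypothetical helpers for illustration only.
function splitSentences(text) {
  // Naive segmentation: break on blank lines, or on whitespace that
  // follows sentence-ending punctuation (lookbehind keeps the punctuation).
  return text
    .split(/\n\s*\n|(?<=[.!?])\s+/)
    .map(segment => segment.trim())
    .filter(segment => segment.length > 0);
}

function tokenizeWords(sentence) {
  // Treat non-word characters and "_" as separators, including newlines.
  return sentence.split(/[\W_]+/).filter(token => token.length > 0);
}

const text =
  'Hi! Hope this helps.\n\nCheers\nTony\n\n_______________\nExmh-users mailing list';

const tokenized = splitSentences(text).map(tokenizeWords);
console.log(tokenized);
// [ [ 'Hi' ],
//   [ 'Hope', 'this', 'helps' ],
//   [ 'Cheers', 'Tony' ],
//   [ 'Exmh', 'users', 'mailing', 'list' ] ]
```

Because newlines are non-word characters, the word-level split already absorbs single line breaks and the underscore rule; the segmentation step mainly helps when downstream code needs per-sentence token lists.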

niftylettuce · 2020-05-04T06:58:47Z

See my comment on the other issue - the v2.1.2 published version does not match up to what's on GitHub

niftylettuce · 2020-05-04T06:59:11Z

#532 (comment)

Hugo-ter-Doest · 2020-05-04T07:51:23Z

> See my comment on the other issue - the v2.1.2 published version does not match up to what's on GitHub

You're right. I fixed it now with a new patch.

niftylettuce · 2020-05-04T08:29:31Z

Thank you!

niftylettuce changed the title ~~Why aren't underscores removed in tokenizationAndStem?~~ AggressiveTokenizer uses \W which allows "_" underscore character. However ALL other tokenizers REMOVE the underscore characters Apr 28, 2020

Hugo-ter-Doest self-assigned this May 4, 2020

Hugo-ter-Doest added the Improvement label May 4, 2020
