AggressiveTokenizer uses \W which allows "_" underscore character. However ALL other tokenizers REMOVE the underscore characters #523

Open
niftylettuce opened this issue Apr 28, 2020 · 14 comments

@niftylettuce
Contributor

cc @Hugo-ter-Doest

e.g. I'm left with tokens like "____________________________"

Pretty sure this affects all locales of stemming/tokenization as well

@niftylettuce niftylettuce changed the title Why aren't underscores removed in tokenizationAndStem? AggressiveTokenizer uses \W which allows "_" underscore character. However ALL other tokenizers REMOVE the underscore characters Apr 28, 2020
@niftylettuce
Contributor Author

niftylettuce commented Apr 28, 2020

AggressiveTokenizer splits on \W, which does not match the "_" underscore character, so underscores stay in the tokens. However, SOME other tokenizers, e.g. AggressiveTokenizerEs, AggressiveTokenizerFr, etc., REMOVE underscores, because "_" is not among the characters their .split regular expression keeps. As far as I can tell, the tokenizers are not consistent in how they handle the _ underscore and - hyphen characters, but I feel they should be.

return this.trim(text.split(/[^a-zA-Zá-úÁ-ÚñÑüÜ]+/));
(removes)

return text.replace(new RegExp('\.\:\+\-\=\(\)\"\'\!\?\،\,\؛\;', 'g'), ' ');
(does NOT remove)

return this.trim(text.split(/[^a-z0-9äâàéèëêïîöôùüûœç]+/i));
(removes)

result = text.replace(/[^a-z0-9 -]/g, ' ').replace(/( +)/g, ' ');
(removes)

return this.trim(text.split(/\W+/));
(does NOT remove)

return this.trim(text.split(/[^a-zA-Z0-9_']+/));
(does NOT remove)

return this.trim(text.split(/[^A-Za-z0-9_æøåÆØÅäÄöÖüÜ]+/));
(does NOT remove)

return this.withoutEmpty(this.clearText(text).split(' '));
(removes)

return this.withoutEmpty(this.trim(text.split(/[^a-zA-Zà-úÀ-Ú]/)));
(removes)

return text.replace(/[^a-zа-яё0-9]/gi, ' ').replace(/[\s\n]+/g, ' ').trim();
(removes)

return this.trim(text.split(/[^A-Za-z0-9_åÅäÄöÖüÜ]+/));
(does NOT remove)

return this.trim(text.split(/[^a-z0-9àáảãạăắằẳẵặâấầẩẫậéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵđ]+/i));
(removes)

return token.replace(/[_-・,、;:!?.。()[]{}「」@*\/&#%`^+<=>|~≪≫─$"_\-・,、;:!?.。()[\]{}「」@*\/&#%`^+<=>|~«»$"\s]+/g, '');
(removes)

P.S. I also see that "-" hyphen character is/is not preserved in some tokenizers as well.
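To make the difference concrete, here is a minimal sketch in plain JavaScript. It does not call the library; `splitOn` is a hypothetical helper that only approximates what the tokenizers' trim step is assumed to do (dropping empty strings), and the sample string is made up.

```javascript
// Made-up sample input with underscores and a hyphen.
const sample = 'snake_case well-known ____ done';

// Hypothetical helper: split on a pattern and drop empty strings.
function splitOn(pattern, text) {
  return text.split(pattern).filter((token) => token.length > 0);
}

// \W is the complement of [A-Za-z0-9_], so "_" never acts as a separator.
const keepsUnderscore = splitOn(/\W+/, sample);

// A class that lists only letters and digits treats "_" as a separator.
const dropsUnderscore = splitOn(/[^a-zA-Z0-9]+/, sample);

console.log(keepsUnderscore); // [ 'snake_case', 'well', 'known', '____', 'done' ]
console.log(dropsUnderscore); // [ 'snake', 'case', 'well', 'known', 'done' ]
```

Both patterns split on the hyphen; only the second one also splits on the underscore.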

@Hugo-ter-Doest
Collaborator

Hugo-ter-Doest commented May 3, 2020

I added _ as a non-word character in the aggressive tokenizer for English. I agree that tokenizers should handle _ consistently, but I will leave that for another time. The handling of - (dash) is a different story, since in some languages the dash binds words together into new words.

See #531
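For readers following along, one way to treat _ as a non-word character is to add it to the split class. This is only a sketch of the idea, not the actual change in #531, which may look different. The function names and the sample string are hypothetical.

```javascript
// Sketch only: the real patch may differ from this.
// Splits on any run of non-word characters OR underscores.
function tokenizeWithoutUnderscores(text) {
  return text.split(/[\W_]+/).filter((token) => token.length > 0);
}

// Variant that keeps "-" inside tokens, for languages where the dash
// binds words together. ASCII letters and digits only, for illustration.
function tokenizeKeepingHyphens(text) {
  return text.split(/[^A-Za-z0-9-]+/).filter((token) => token.length > 0);
}

const input = 'hello_world __init__ foo-bar';

const withoutUnderscores = tokenizeWithoutUnderscores(input);
const keepingHyphens = tokenizeKeepingHyphens(input);

console.log(withoutUnderscores); // [ 'hello', 'world', 'init', 'foo', 'bar' ]
console.log(keepingHyphens);     // [ 'hello', 'world', 'init', 'foo-bar' ]
```

The second variant shows why the dash needs a per-language decision: the same input yields `foo-bar` as one token instead of two.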

@niftylettuce
Contributor Author

Thanks @Hugo-ter-Doest. I also might suggest using a modified version of this regex https://github.com/regexhq/punctuation-regex/blob/master/index.js#L12
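As a rough illustration of a punctuation-driven split, the sketch below uses Unicode property escapes instead of the linked regex (it is not a copy or a modified version of it). It assumes a runtime that supports the `u` flag with `\p{...}`, and the function name and sample text are made up.

```javascript
// Sketch: split on any run of Unicode punctuation (\p{P}), symbols (\p{S})
// or whitespace. "_" is connector punctuation and "-" is dash punctuation,
// so both act as separators here.
function splitOnPunctuation(text) {
  return text.split(/[\p{P}\p{S}\s]+/u).filter((token) => token.length > 0);
}

const punctuationTokens = splitOnPunctuation('wait, what?! snake_case (n)mh');

console.log(punctuationTokens); // [ 'wait', 'what', 'snake', 'case', 'n', 'mh' ]
```

A per-language tokenizer would likely need to carve exceptions out of this class (for example the apostrophe or the dash), which this sketch does not attempt.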

@niftylettuce
Contributor Author

@Hugo-ter-Doest if you can handle this I would gladly tip you - just email me at niftylettuce@gmail.com with your PayPal address - I'm writing a very advanced classifier that handles every edge case, and I need tokenization to strictly follow each language's use of punctuation

@niftylettuce
Contributor Author

Also, I'm on the latest version, v2.1.2, and I still see underscores when using the English AggressiveTokenizer.

@Hugo-ter-Doest
Collaborator

Can you give me some examples? I don't see underscores anymore. Also, the test I added to the spec is working fine:

it('should remove underscores', function() {
    expect(tokenizer.tokenize('_ hi_this_is_a_test_case_ for__removing___underscores_')).toEqual(['hi', 'this', 'is', 'a', 'test', 'case', 'for', 'removing', 'underscores']);
});

@niftylettuce
Contributor Author

It doesn't handle line-breaks

@niftylettuce
Contributor Author

For example:

const str = fs.readFileSync(path.join(__dirname, 'test.txt'), 'utf8'); // 'utf8' returns a string, not a Buffer

test.txt:

On Wed Aug  number_rponnmzvyi   number_rponnmzvyi  at  number_rponnmzvyi : number_rponnmzvyi  Ulises Ponce wrote:

> Hi!
>
> Is there a command to insert the signature using a combination of keys and not
> to have sent the mail to insert it then?

I simply put it (them) into my (nmh) component files (components,
replcomps, forwcomps and so on).  That way you get them when you are
editing your message.  Also, by using comps files for specific
folders you can alter your .sig per folder (and other tricks).  See
the docs for (n)mh for all the details.

There might (must?) also be a way to get sedit to do it, but I've
been using gvim as my exmh message editor for a long time now.  I
load it with a command that loads some email-specific settings, eg,
to "syntax" colour-highlight the headers and quoted parts of an
email)... it would be possible to map some (vim) keys that would add
a sig (or even give a selection of sigs to choose from).

And there are all sorts of ways to have randomly-chosen sigs...
somewhere at  url_clbkqiipty .. ok, here we go:
 url_clbkqiipty
(Warning... it's old, May  number_rponnmzvyi ).

> Regards,
> Ulises

Hope this helps.

Cheers
Tony



_______________________________________________
Exmh-users mailing list
 email_qmonrtrsxz
 url_clbkqiipty

@niftylettuce
Contributor Author

In the meantime I've implemented https://github.com/yoshuawuyts/newline-remove
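A workaround along these lines can be sketched without any dependency. The code below does not use the newline-remove package or its API; `collapseNewlines` and `tokenize` are hypothetical helpers, and the input is a shortened stand-in for the end of test.txt.

```javascript
// Hypothetical helper: replace each run of line breaks with a single space.
function collapseNewlines(text) {
  return text.replace(/(?:\r?\n)+/g, ' ');
}

// Hypothetical tokenizer: split on non-word characters and underscores.
function tokenize(text) {
  return text.split(/[\W_]+/).filter((token) => token.length > 0);
}

// Shortened stand-in for the tail of test.txt, including the "____" ruler.
const mail = 'Cheers\nTony\n\n_______________\nExmh-users mailing list';

const flattened = collapseNewlines(mail);
const mailTokens = tokenize(flattened);

console.log(flattened);  // Cheers Tony _______________ Exmh-users mailing list
console.log(mailTokens); // [ 'Cheers', 'Tony', 'Exmh', 'users', 'mailing', 'list' ]
```

Note that `\n` is itself matched by `\W`, so with a split class like `[\W_]+` the collapse step does not change the resulting tokens in this example.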

@Hugo-ter-Doest
Collaborator

Did you try first segmenting the text in sentences and then tokenize in words?
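A two-stage approach like that could look roughly as follows. This is a deliberately naive sketch in plain JavaScript and does not use the library's own sentence or word tokenizers. Both helper names are hypothetical, and the sentence splitter only looks for ., ! and ?.

```javascript
// Naive sentence splitter (assumption: sentences end in ".", "!" or "?").
function splitSentences(text) {
  const parts = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((part) => part.trim()).filter((part) => part.length > 0);
}

// Word tokenizer that treats "_" as a separator, as discussed above.
function splitWords(sentence) {
  return sentence.split(/[\W_]+/).filter((token) => token.length > 0);
}

const text = 'Hope this helps. Cheers_Tony! See the docs.';

// First segment into sentences, then tokenize each sentence into words.
const sentences = splitSentences(text);
const tokensPerSentence = sentences.map(splitWords);

console.log(sentences);
// [ 'Hope this helps.', 'Cheers_Tony!', 'See the docs.' ]
console.log(tokensPerSentence);
// [ [ 'Hope', 'this', 'helps' ], [ 'Cheers', 'Tony' ], [ 'See', 'the', 'docs' ] ]
```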

@niftylettuce
Contributor Author

See my comment on the other issue - the v2.1.2 published version does not match up to what's on GitHub

@niftylettuce
Contributor Author

#532 (comment)

@Hugo-ter-Doest
Collaborator

> See my comment on the other issue - the v2.1.2 published version does not match up to what's on GitHub

You're right. I fixed it now with a new patch.

@niftylettuce
Contributor Author

Thank you!
