[RFC]: Improvements to @stdlib/nlp-expand-contractions #496

Open
3 tasks done
titanism opened this issue Jun 13, 2022 · 8 comments
Labels
Feature Issue or pull request for adding a new feature. RFC Request for comments. Feature requests and proposed changes.

Comments

@titanism

Description

We're writing because we found your library to be the best-tested and fastest option for expanding contractions. For context, we're working on https://spamscanner.net, where we expand contractions before passing text to tokenizers for spam classification.

To clarify, this concerns the generated codebase https://github.com/stdlib-js/nlp-expand-contractions, built from the source at https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/nlp/expand-contractions.

We noticed that your library is missing quite a few English contractions, and it could also benefit from contractions in other languages (perhaps behind an option).

While we can open a PR, we wanted to check what your thoughts are on this and what you would want the PR to look like (integration-wise; e.g., new options?).

Here is our current compiled list of research and findings:

Related Issues

No response

Questions

No response

Other

No response

Checklist

  • I have read and understood the Code of Conduct.
  • Searched for existing issues and pull requests.
  • The issue name begins with RFC:.
@titanism titanism added Feature Issue or pull request for adding a new feature. RFC Request for comments. Feature requests and proposed changes. labels Jun 13, 2022
@github-actions
Contributor

🎉 Welcome! 🎉

And thank you for opening your first issue! We will get back to you shortly. 🏃 💨

@titanism
Author

titanism commented Jun 13, 2022

Doing a review and will submit a PR to contractions.json with changes.

Caught some interesting bugs, such as the entry "what's": "what has/is" in the JSON (which is obviously a bug).
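As a rough illustration of how entries like this could be spotted during a review, here is a minimal sketch that flags expansions containing a slash. The `sample` object and the `findAmbiguous` helper are hypothetical and only mirror the shape of the entry quoted above; they are not the real contents of contractions.json or part of the package.

```javascript
// Hypothetical sample shaped like the quoted entry; NOT the real contractions.json.
var sample = {
    'what\'s': 'what has/is',
    'can\'t': 'cannot'
};

// Hypothetical helper: returns the keys whose expansion lists alternatives with "/".
function findAmbiguous( table ) {
    return Object.keys( table ).filter( function hasSlash( key ) {
        return table[ key ].indexOf( '/' ) !== -1;
    });
}

console.log( findAmbiguous( sample ) );
```

Whether a flagged entry is really a bug, or an intentional way of recording two readings, is something only the maintainers can confirm.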

The other question I wanted to raise is that we should probably handle ' and ’ interchangeably somehow.
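One possible approach, sketched here only as an assumption about how this could work, is to normalize typographic apostrophes to the ASCII apostrophe before looking up contractions. The `normalizeApostrophes` helper is hypothetical and not part of @stdlib, and the set of characters it maps is a guess at what shows up in practice.

```javascript
// Hypothetical helper, not part of @stdlib: maps typographic apostrophes to "'".
function normalizeApostrophes( str ) {
    // U+2018 LEFT SINGLE QUOTATION MARK
    // U+2019 RIGHT SINGLE QUOTATION MARK (the usual "fancy" apostrophe)
    // U+02BC MODIFIER LETTER APOSTROPHE
    return str.replace( /[\u2018\u2019\u02BC]/g, '\'' );
}

console.log( normalizeApostrophes( 'I won\u2019t be there' ) );
// => "I won't be there"
```

A pass like this would also rewrite single curly quotation marks used as quotes, so it may only be suitable when the output feeds a tokenizer rather than being shown to users.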

@kgryte
Member

kgryte commented Jun 13, 2022

Re: missing contractions. Some of the entries in your list are already present in the contractions file. E.g., wouldn't've, mightn't've.

@kgryte
Member

kgryte commented Jun 13, 2022

@Planeshifter Is there a reason for the what has/is entry?

@kgryte
Member

kgryte commented Jun 13, 2022

Re: fancy apostrophe. That should be possible to handle in the @stdlib/nlp/tokenize package.

@titanism
Author

I'm about to submit a PR, one moment @kgryte

@titanism
Author

See #497

cc @kgryte

@kgryte
Member

kgryte commented Jun 18, 2022

@titanism One recent update: @Planeshifter added initial support for expanding acronyms (see https://github.com/stdlib-js/stdlib/tree/c624a5eb4bca8f4f3d45e01bcc4eeee41652e3ba/lib/node_modules/%40stdlib/nlp/expand-acronyms). This may help to avoid mixing contraction/acronym concerns.

2 participants