tokenize phrasal-verbs #473

Open
ayman-ibrahim opened this issue Nov 14, 2018 · 7 comments

Comments

@ayman-ibrahim

Is there a way to tokenize a sentence while taking phrasal verbs into consideration?
example:

"The flight take off at three o'clock"

output should be:
[the, flight, take off, at, three, o'clock]

take off should be tokenized as one word.

@Hugo-ter-Doest
Collaborator

Imho that is not what tokenization is meant for. Tokenization splits a text into words (and punctuation, if necessary), and "take off" consists of two words. Combining them into a phrasal verb requires partial parsing or chunking.
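
To illustrate the split between the two steps, here is a minimal sketch of a merge pass that runs after tokenization. It is not part of natural's API: the `PHRASAL_VERBS` lexicon, the whitespace `tokenize` helper, and `mergePhrasalVerbs` are all hypothetical names made up for this example, and a lexicon lookup is only the crudest form of chunking.

```javascript
// Hypothetical lexicon of phrasal verbs; a real one would be much larger.
const PHRASAL_VERBS = new Set(['take off', 'look up', 'give up']);

// Naive whitespace tokenizer, used only to keep the sketch dependency-free.
function tokenize(sentence) {
  return sentence.toLowerCase().split(/\s+/).filter(Boolean);
}

// Merge adjacent token pairs that appear in the lexicon into a single token.
function mergePhrasalVerbs(tokens, lexicon = PHRASAL_VERBS) {
  const merged = [];
  for (let i = 0; i < tokens.length; i++) {
    const pair = i + 1 < tokens.length ? `${tokens[i]} ${tokens[i + 1]}` : null;
    if (pair !== null && lexicon.has(pair)) {
      merged.push(pair);
      i++; // skip the particle, it is already part of the merged token
    } else {
      merged.push(tokens[i]);
    }
  }
  return merged;
}

const result = mergePhrasalVerbs(tokenize("The flight take off at three o'clock"));
console.log(result.join(' | '));
// the | flight | take off | at | three | o'clock
```

The lookup is exact, so inflected forms ("takes off", "took off") and separated particles ("take your shoes off") are not caught. Those cases are where tagging or parsing becomes necessary.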

@ayman-ibrahim
Author

@Hugo-ter-Doest
Ok, do you know if there's a way to combine phrasal verbs in the natural library?

@Hugo-ter-Doest
Collaborator

It's not in natural yet, but I'm working on it so that it can be used for named entity recognition. You can preview the CYK and Earley parsers in this branch:
https://github.com/Hugo-ter-Doest/natural/tree/NER/

The parsers are in lib/natural/parsers.
A chunker based on the Earley parser is in lib/natural/NER.

Feel free to use it already, but it may still change.

@ayman-ibrahim
Author

cool, I'll have a look.
Thanks.

@lazharichir

You could tokenize your sentence, tag each token's part of speech, and then find patterns. For example, VERB + DET or VERB + PREPOSITION. I use that to find noun phrases (JJ|NN+).
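
A rough sketch of that idea, assuming the tokens have already been tagged. The tags below are written by hand in Penn Treebank style rather than produced by a real tagger, and `mergeVerbParticle` and `nounPhrases` are hypothetical helper names, not functions from natural. The noun-phrase helper is one reading of the (JJ|NN+) pattern: maximal runs of adjectives and nouns.

```javascript
// Hand-tagged tokens (Penn Treebank style); in practice the tags would come
// from a part-of-speech tagger.
const tagged = [
  { token: 'The', tag: 'DT' },
  { token: 'flight', tag: 'NN' },
  { token: 'takes', tag: 'VBZ' },
  { token: 'off', tag: 'RP' },
  { token: 'at', tag: 'IN' },
  { token: 'three', tag: 'CD' },
  { token: "o'clock", tag: 'RB' },
];

// Merge a verb (any VB* tag) with the particle (RP) or preposition (IN)
// that immediately follows it.
function mergeVerbParticle(taggedTokens) {
  const out = [];
  for (let i = 0; i < taggedTokens.length; i++) {
    const cur = taggedTokens[i];
    const next = taggedTokens[i + 1];
    const isVerb = cur.tag.startsWith('VB');
    const nextIsParticle = next !== undefined && (next.tag === 'RP' || next.tag === 'IN');
    if (isVerb && nextIsParticle) {
      out.push(`${cur.token} ${next.token}`);
      i++; // the following token has been consumed
    } else {
      out.push(cur.token);
    }
  }
  return out;
}

// Collect maximal runs of adjectives (JJ*) and nouns (NN*).
function nounPhrases(taggedTokens) {
  const phrases = [];
  let run = [];
  for (const { token, tag } of taggedTokens) {
    if (tag.startsWith('JJ') || tag.startsWith('NN')) {
      run.push(token);
    } else if (run.length > 0) {
      phrases.push(run.join(' '));
      run = [];
    }
  }
  if (run.length > 0) phrases.push(run.join(' '));
  return phrases;
}

console.log(mergeVerbParticle(tagged).join(' | '));
// The | flight | takes off | at | three | o'clock
console.log(nounPhrases(tagged).join(' | '));
// flight
```

Accepting IN after a verb will also merge ordinary verb + preposition pairs such as "went to", so the result depends heavily on how well the tagger separates particles from prepositions.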

@privateOmega

@Hugo-ter-Doest Do you have a set timeline as to when you would be able to integrate the code into Natural's codebase?

@lazharichir

For now, you can implement that with some sort of pattern matching (as spaCy does): walk the array of tokens and find whatever patterns you are looking for (e.g. a NOUN followed by a PREP, or any number of NOUNs/ADJs followed by a PREP, etc.).

You can look at spaCy's code (python) and port it to Node and Natural's token structure: https://github.com/explosion/spaCy/tree/master/spacy/matcher
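
As a starting point, here is a small hand-rolled matcher in the spirit of that approach. It is not a port of spaCy's matcher and does not use natural's token structure: the `{ text, pos }` token shape, the `{ pos, op }` pattern steps, and the names `matchAt` and `findMatches` are assumptions for this sketch. Matching is greedy with no backtracking, and only the `+` (one or more) operator is supported.

```javascript
// Try to match `pattern` starting at index `start`.
// Returns the end index (exclusive) on success, or -1 on failure.
function matchAt(tokens, pattern, start) {
  let i = start;
  for (const step of pattern) {
    const matches = (t) => t !== undefined && step.pos.includes(t.pos);
    if (!matches(tokens[i])) return -1; // every step needs at least one token
    i++;
    if (step.op === '+') {
      while (matches(tokens[i])) i++; // greedily consume further matches
    }
  }
  return i;
}

// Walk the token array and collect the text of every non-overlapping match.
function findMatches(tokens, pattern) {
  const spans = [];
  let i = 0;
  while (i < tokens.length) {
    const end = matchAt(tokens, pattern, i);
    if (end > i) {
      spans.push(tokens.slice(i, end).map((t) => t.text).join(' '));
      i = end;
    } else {
      i++;
    }
  }
  return spans;
}

// "Any number of NOUNs/ADJs followed by a PREP", with hand-assigned tags.
const nounsThenPrep = [
  { pos: ['NOUN', 'ADJ'], op: '+' },
  { pos: ['ADP'] },
];

const tokens = [
  { text: 'cheap', pos: 'ADJ' },
  { text: 'direct', pos: 'ADJ' },
  { text: 'flight', pos: 'NOUN' },
  { text: 'to', pos: 'ADP' },
  { text: 'Paris', pos: 'PROPN' },
];

console.log(findMatches(tokens, nounsThenPrep));
// one match: 'cheap direct flight to'
```

Because there is no backtracking, a pattern whose `+` step can swallow tokens needed by a later step will fail to match; a fuller port would need to handle that, along with optional steps.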
