NLP
Nagyfi Richárd edited this page Jun 8, 2018 · 6 revisions
Lara comes with a handful of NLP functions that can be used for information retrieval.
Most of the NLP functions tokenize text or clean artifacts out of it. If you want to find dates, timestamps, numbers, etc. in short texts, check the Wiki page on the Extract class as well.
Function | Description |
---|---|
lara.nlp.az(word) | Returns either 'a' or 'az' based on the first character of the word. Also works with numbers. |
lara.nlp.crop_text(text,limit=100,end='...',reverse=False) | Returns a maximum of limit letters without cutting words in half. If the text is longer than the maximum number of letters allowed, the end string is appended to the returned text. If reverse is True, the function starts from the end of the text and adds the end string to the beginning if needed. |
lara.nlp.is_vowel(letter) | Returns True if the letter is a vowel (vowels with accents also return True). Returns False otherwise. |
lara.nlp.is_consonant(letter) | Returns True if the letter is a consonant. Returns False otherwise. |
lara.nlp.is_gibberish(text) | Returns True if text is most likely just gibberish (like a chat message containing only random key presses). |
lara.nlp.consonant_beginning(word) | Returns True if word starts with a consonant. Returns False otherwise. |
lara.nlp.consonant_ending(word) | Returns True if word ends with a consonant. Returns False otherwise. |
lara.nlp.ngram(tokens,[n=2]) | Returns a list of n-grams generated from the list of tokens provided. |
lara.nlp.number_of_sentences(text) | Returns the number of sentences in text, based on the sentences received from the sent_tokenize() function. |
lara.nlp.number_of_words(text) | Returns the number of words in text, based on the words received from the tokenize() function. |
lara.nlp.remove_double_letters(text, [replace='']) | Collapses runs of a repeated character into a single character (kappan->kapan, busszal->buszal). Case sensitive. |
lara.nlp.remove_email_addresses(text,[replace='']) | Removes valid e-mail addresses from text. |
lara.nlp.remove_html_tags(text,[replace='']) | Removes possible HTML tags from text with regular expressions. Regular expressions are NOT the most efficient solution for parsing HTML, but this usually works for chat messages. |
lara.nlp.remove_line_breaks(text, [replace='']) | Removes line breaks from text. |
lara.nlp.remove_punctuation(text, [replace='']) | Removes punctuation from text. |
lara.nlp.remove_smileys(text) | Removes common smileys from text (does not remove emojis). |
lara.nlp.remove_spaces_between_numbers(text,[replace='']) | Removes whitespaces and hyphens between numbers (useful for parsing phone numbers). |
lara.nlp.remove_stopwords(text,negation=True) | Removes common Hungarian stopwords from text. If negation is True, the words ne, nem, se, sem, semmi, hanem will also be removed. |
lara.nlp.remove_urls(text,[replace='']) | Removes valid URLs from text. |
lara.nlp.sent_tokenize(text) | Returns sentences in text as a list. Note that this function only uses regular expressions. |
lara.nlp.strip_accents(text) | Returns text without accents (á->a, é->e, etc.). |
lara.nlp.tokenize(text) | Returns words in text as a list. Note that this function only uses regular expressions. |
lara.nlp.trim(text) | Trims text and also removes all whitespaces. |
lara.nlp.vowel_beggining(word) | Returns True if word starts with a vowel. Returns False otherwise. |
lara.nlp.vowel_ending(word) | Returns True if word ends with a vowel. Returns False otherwise. |
lara.nlp.vowel_harmony(word,[vegyes=True]) | Returns the vowel harmony of a word: 'magas' or 'mely', and also 'vegyes' if the optional vegyes parameter is set to True. |
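
To make the descriptions of ngram and crop_text above more concrete, here is a minimal standalone sketch of the behaviour they describe. It is not Lara's implementation and does not import Lara: the names sketch_ngram and sketch_crop_text are hypothetical. Returning n-grams as lists, splitting words on whitespace, and not counting the end string towards limit are all assumptions made for illustration.

```python
def sketch_ngram(tokens, n=2):
    """Return every run of n consecutive tokens as a list (assumed output shape)."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def sketch_crop_text(text, limit=100, end='...', reverse=False):
    """Keep whole words up to `limit` characters; mark the cut with `end`.

    Assumption: `end` itself is not counted towards `limit`.
    """
    if len(text) <= limit:
        return text  # nothing to crop, so no end string is added
    words = text.split()
    if reverse:
        words = words[::-1]  # walk the words from the end of the text
    kept = []
    length = 0
    for word in words:
        extra = len(word) + (1 if kept else 0)  # +1 for the joining space
        if length + extra > limit:
            break
        kept.append(word)
        length += extra
    if reverse:
        # Restore the original word order and put the marker in front.
        return end + ' '.join(kept[::-1])
    return ' '.join(kept) + end


print(sketch_ngram(['a', 'b', 'c']))                              # [['a', 'b'], ['b', 'c']]
print(sketch_crop_text('the quick brown fox jumps', limit=12))    # the quick...
print(sketch_crop_text('the quick brown fox jumps', limit=12, reverse=True))  # ...fox jumps
```

The real functions may differ in details such as how punctuation or repeated whitespace is handled, so treat the output above as illustrating the idea only.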
Further NLP functions to retrieve the rhythmic structure of Hungarian verse lines, because why not:
Function | Description |
---|---|
lara.nlp.is_hexameter(pattern) | Returns True if the metre list representing the rhythmic structure of a verse line is a valid hexameter. |
lara.nlp.is_pentameter(pattern) | Returns True if the metre list representing the rhythmic structure of a verse line is a valid pentameter. For example, is_pentameter(metre('Csend vala, felleg alól szállt fel az éjjeli hold.')) returns True. |
lara.nlp.metre(text) | Returns the rhythmic structure of a verse line as a list of 'u' and '-' characters (the former marking a short syllable, the latter a long one). For instance, metre('Bús düledékeiden, Husztnak romvára megállék;') returns ['-', 'u', 'u', '-', 'u', 'u', '-', '-', '-', '-', '-', 'u', 'u', '-', '-']. |
lara.nlp.metre_pattern(match,pattern) | Returns True if the list representing the rhythmic structure of a verse line matches the given pattern, defined as a list of 'u' and '-' characters. |
lara.nlp.number_of_syllables(word, [rhyme=False]) | Returns the number of syllables in a word based on the number of its vowels. If rhyme is set to True, the returned number of syllables is instead roughly based on pronunciation (by taking acronyms into account). |
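
As an illustration of the two ideas in the table above (counting syllables by counting vowels, and checking a list of 'u' and '-' marks against a metre), here is a small standalone sketch. It does not use Lara, and the names sketch_number_of_syllables and sketch_is_hexameter are hypothetical. The hexameter rule is an assumed textbook definition: four feet that are each a dactyl (- u u) or a spondee (- -), a dactylic fifth foot, and a final long syllable followed by one syllable of either length. Lara's own check may be stricter or looser.

```python
import re

# Hungarian vowels, including the accented forms.
HUNGARIAN_VOWELS = set("aáeéiíoóöőuúüű")

# Assumed textbook hexameter: 4 x (dactyl | spondee), dactyl, long + any.
HEXAMETER = re.compile(r"(?:-uu|--){4}-uu-[-u]")


def sketch_number_of_syllables(word):
    """Count syllables as the number of vowels in the word."""
    return sum(1 for letter in word.lower() if letter in HUNGARIAN_VOWELS)


def sketch_is_hexameter(pattern):
    """Check a list of 'u' and '-' marks against the assumed hexameter rule."""
    return HEXAMETER.fullmatch("".join(pattern)) is not None


# The metre list quoted above for 'Bús düledékeiden, Husztnak romvára megállék;'
line = ['-', 'u', 'u', '-', 'u', 'u', '-', '-', '-', '-', '-', 'u', 'u', '-', '-']

print(sketch_is_hexameter(line))                   # True
print(sketch_number_of_syllables('düledékeiden'))  # 6
```

Joining the list into a string keeps the check to a single regular expression. This sketch rejects a spondaic fifth foot, which some definitions of the hexameter allow.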