Skip to content

Extract

Richard Nagyfi edited this page Aug 24, 2018 · 20 revisions

The lara.parser.Extract() class lets you easily process short texts for further analysis. You can easily extract and standardize important information and numerical values, even with unstructured data. You can find further examples here.

from lara import parser

text	= 'feldolgozandó szöveg'
example	= parser.Extract(text)

Some functions will not just match certain parts of the text, but also normalize them (and return values with types other than strings). Some functions will automatically convert Hungarian word representations of numbers into digits.

Function Description Effect of normalize Effect of convert
example.currencies([normalize=True,convert=True]) Returns list of common money formats ($5.000, 5 000 dollár, 5000.-) as strings. Works with HUF, EUR, USD, GBP and JPY. "Ez a ruha 19 dollárba és 99 centbe kerül" would match "19 dollárba és 99 centbe" and return ['19.99 USD'] if normalized. "húsz forintos fagylalt" would return ['20.0 HUF'] if both normalize and convert are True
example.commands() Not to be confused with the commands entity. If a message starts with either \ or / character, it will return the following list of commands and arguments. For example the message \echo "hello world" would return ['echo','hello world']. If no command was found (string does not start with neither \ nor /) it will return a an empty list. Useful for detecting commands sent to a Chatbot.
example.dates([normalize=True,convert=True,current=False]) Returns list of common valid Hungarian date formats (2017.12.31, 17/12/31-én, december 31., 2017 XII 31-i) as strings. All dates will be converted to XXXX-XX-XX format. If the entire date can not be guessed from the string, the digits will be replaced with questionmark (for example: "2018 VII" would return ['2018-07-??']). Will also check for dates that are just days alone if no other formats were found. (You can pass current="YYYY-MM-DD" dates to this function if you want to use a day other than today to calculate relative dates) "2017 szeptember tizenkettedikei gyűlésen" would return ['2018-09-12'] if both normalize and convert are True
example.digits([n=0, normalize=True,convert=True]) Returns list of digits with n places (or all digits if n<=0). Allows spaces, dots, commas and hyphens to be in between the digits. Use for retrieving credit card numbers, etc. Normalized output will only return the digits without any special characters between them (for example: '00-00' would return '0000' for n=4 in this case). "kettő kettő tizenkettő" would return ['2212'] if both normalize and convert are True
example.durations([normalize=True,convert=True]) Returns list of valid time intervals, where a number is followed by a duration: 3 napig, 4.5 perc múlva Normalized durations are converted to seconds and returned as list of floats instead of stringss (for example: "3 óra és 4 perc múlva" would return [11040.0]). "Jöttömre öt nap múltán számítsatok, pirkadatkor, kelet felől" would return [432000.0] if both normalize and convert are True
example.emojis() Returns list of extracted emojis. 👌
example.hashtags([normalize=True]) Returns list of extracted #hashtags. All hashtags will be converted to lowercase if normalized.
example.mentions() Returns list of extracted @mentions.
example.numbers([decimals=True,convert=True]) Returns list of possible numbers (both integers and floats) as floats if decimals is True. Returns only valid integers otherwise as list of ints. Numbers declared as words are also returned.
example.percentages([normalize=True]) Returns list of possible percentages as either strings or floats (100%, 50,00 %, 123.45 %). ",7 %" would be 0.007 if normalized.
example.phone_numbers([normalize=True,convert=True]) Returns list of valid Hungarian phone numbers (20/123 4567, (06 20) 123/4567) All phone numbers would be converted to +36 XX 1234567 or +36 X 1234567 format if normalized. "a számom: harminc 123-4567" would return ['+36 30 1234567'] if both normalize and convert are True
example.relative_dates([normalize=True,current=False]) Returns a list of valid dates days mentioned in text like "hétfő" or "holnap" relative to today's date. Takes current time into account to solve ambiguous declarations. (You can pass current="YYYY-MM-DD" dates to this function if you want to use a day other than today to calculate relative timestamps) If normalized "múlt hét vasárnap" would retrun ["2018-03-25"]
example.smileys() Returns a list of extracted common ASCII smileys (does not match emojis).
example.times([normalize=True,convert=True,current=False]) Returns list of valid times. Hours and minutes must be numbers. Takes current time into account to solve ambiguous declarations. (You can turn this feature off, by setting current=0 or override the user's datetime by passing your own hour value as int, like current=23) If normalized, "holnap délután fél 2 után 7 perccel" would return ["13:37"] "nyolc perccel háromnegyed kettő előtt" would return ["13:37"] if both normalize and convert are True
example.timestamps([current=False]) Returns list of valid timestamps based on dates(), relative_dates() and times(). For example: "találkozzunk ma délben" could return ["2018-04-01 12:00"]. The output is always normalized, missing information will be returned as question marks: '????-??-?? ??:??'. Takes current time into account to solve ambiguous declarations. (Optional variable current accepts 'YYYY-MM-DD HH:MM' format string, uses today's date otherwise.)
example.urls() Returns list of extracted valid URLs.
Clone this wiki locally