wiki-en-train.word
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. Specifically, it is the process by which a computer extracts meaningful information from natural language input and/or produces natural language output. In theory, natural language processing is a very attractive method of human-computer interaction. Natural language understanding is sometimes referred to as an AI-complete problem, because it seems to require extensive knowledge about the outside world and the ability to manipulate it.

Whether NLP is distinct from, or identical to, the field of computational linguistics is a matter of perspective. The Association for Computational Linguistics defines the latter as focusing on the theoretical aspects of NLP. On the other hand, the open-access journal "Computational Linguistics" styles itself as "the longest running publication devoted exclusively to the design and analysis of natural language processing systems" (Computational Linguistics (Journal)).

Modern NLP algorithms are grounded in machine learning, especially statistical machine learning. Research into modern statistical NLP algorithms requires an understanding of a number of disparate fields, including linguistics, computer science, and statistics. For a discussion of the types of algorithms currently used in NLP, see the article on pattern recognition.
(Figure: An automated online assistant providing customer service on a web page, an example of an application in which natural language processing is a major component.)
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three to five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded its very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".

During the 1970s many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.
Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's Law and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.

Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.

Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by those systems, which was (and often continues to be) a major limitation on their success. As a result, a great deal of research has gone into methods of learning more effectively from limited amounts of data.

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or from a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
NLP using machine learning

As described above, modern approaches to natural language processing (NLP) are grounded in machine learning. The paradigm of machine learning is different from that of most prior attempts at language processing. Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules. The machine-learning paradigm calls instead for using general learning algorithms, often (although not always) grounded in statistical inference, to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural "corpora") is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.

Consider the task of part-of-speech tagging, i.e. determining the correct part of speech of each word in a given sentence, typically one that has never been seen before. A typical machine-learning-based implementation of a part-of-speech tagger proceeds in two steps, a training step and an evaluation step. The first step, the training step, makes use of a corpus of training data, which consists of a large number of sentences, each of which has the correct part of speech attached to each word. (An example of such a corpus in common use is the Penn Treebank. This includes, among other things, a set of 500 texts from the Brown Corpus, containing examples of various genres of text, and 2,500 articles from the Wall Street Journal.) This corpus is analyzed and a learning model is generated from it, consisting of automatically created rules for determining the part of speech for a word in a sentence, typically based on the nature of the word in question, the nature of surrounding words, and the most likely part of speech for those surrounding words. The model that is generated is typically the best model that can be found that simultaneously meets two conflicting objectives: to perform as well as possible on the training data, and to be as simple as possible (so that the model avoids overfitting the training data, i.e. so that it generalizes as well as possible to new data rather than only succeeding on sentences that have already been seen). In the second step (the evaluation step), the model that has been learned is used to process new sentences. An important part of the development of any learning algorithm is testing the model that has been learned on new, previously unseen data. It is critical that the data used for testing is not the same as the data used for training; otherwise, the testing accuracy will be unrealistically high.
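To make the two steps concrete, the sketch below trains a deliberately simple tagger that just memorizes each word's most frequent tag in the training corpus, then measures its accuracy on held-out sentences it has never seen. The tiny inline corpus and the most-frequent-tag baseline are assumptions made purely for illustration; a real tagger would be trained on something like the Penn Treebank with a far richer model.

```python
from collections import Counter, defaultdict

# A tiny hand-annotated "corpus": lists of (word, part-of-speech) pairs.
training_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
test_sentences = [
    [("a", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]

# Training step: count how often each word carries each tag.
tag_counts = defaultdict(Counter)
for sentence in training_sentences:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NOUN"):
    """Tag a word with its most frequent training tag (or a default)."""
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else default

# Evaluation step: tag previously unseen sentences and compare to the gold tags.
correct = total = 0
for sentence in test_sentences:
    for word, gold_tag in sentence:
        correct += (most_frequent_tag(word) == gold_tag)
        total += 1

print(f"held-out accuracy: {correct / total:.2f}")
```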
Many different classes of machine-learning algorithms have been applied to NLP tasks. Common to all of these algorithms is that they take as input a large set of "features" generated from the input data. As an example, for a part-of-speech tagger, typical features might be the identity of the word being processed, the identity of the words immediately to the left and right, the part-of-speech tag of the word to the left, and whether the word being considered or its immediate neighbors are content words or function words. The algorithms differ, however, in the nature of the rules generated. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system. In addition, models that make soft decisions are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data).
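The sketch below shows what such feature-based, probabilistic classification can look like in practice, assuming scikit-learn is available. The particular feature set (the word itself and its immediate neighbors) and the toy training data are illustrative assumptions; the point is that the classifier attaches a real-valued weight to every feature and can report a probability distribution over tags rather than a single hard answer.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def word_features(sentence, i):
    """Features for the i-th word: the word itself and its neighbors."""
    return {
        "word": sentence[i],
        "prev": sentence[i - 1] if i > 0 else "<s>",
        "next": sentence[i + 1] if i < len(sentence) - 1 else "</s>",
    }

# Toy training data: feature dicts paired with gold part-of-speech tags.
tagged = [
    (["the", "dog", "barks"], ["DET", "NOUN", "VERB"]),
    (["a", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
]
X_dicts, y = [], []
for words, tags in tagged:
    for i, tag in enumerate(tags):
        X_dicts.append(word_features(words, i))
        y.append(tag)

vectorizer = DictVectorizer()          # maps feature dicts to weighted columns
X = vectorizer.fit_transform(X_dicts)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# A soft, probabilistic decision for one word of an unseen sentence.
test = vectorizer.transform([word_features(["the", "fish", "swims"], 1)])
for tag, p in zip(clf.classes_, clf.predict_proba(test)[0]):
    print(f"{tag}: {p:.2f}")
```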
Systems based on machine-learning algorithms have many advantages over hand-produced rules. The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not at all obvious where the effort should be directed. Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted). Generally, handling such input gracefully with hand-written rules, or more generally creating systems of hand-written rules that make soft decisions, is extremely difficult, error-prone and time-consuming. Systems based on automatically learning the rules can be made more accurate simply by supplying more input data. However, systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task. In particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond which the systems become more and more unmanageable. By contrast, creating more data to feed machine-learning systems simply requires a corresponding increase in the number of person-hours worked, generally without significant increases in the complexity of the annotation process.
Major tasks in NLP

The following is a list of some of the most commonly researched tasks in NLP. Note that some of these tasks have direct real-world applications, while others more commonly serve as subtasks used to aid in solving larger tasks. What distinguishes these tasks from other potential and actual NLP tasks is not only the volume of research devoted to them, but the fact that for each one there is typically a well-defined problem setting, a standard metric for evaluating the task, standard corpora on which the task can be evaluated, and competitions devoted to the specific task.
Automatic summarization: Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.

Coreference resolution: Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names that they refer to. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression, and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to).

Discourse analysis: This rubric includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question, content question, statement, assertion, etc.).
Machine translation: Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.

Morphological segmentation: Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.

Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Note that, although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do not capitalize names that serve as adjectives.
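To see why capitalization alone is a weak cue, consider the naive rule "every capitalized token is a named entity". The sketch below implements exactly that rule on an invented example sentence and illustrates the problems described above: sentence-initial words are flagged spuriously, multi-word entities are not grouped, and the rule says nothing about entity types.

```python
import re

def naive_ner(sentence):
    """Flag every capitalized token as a candidate named entity."""
    return [token for token in re.findall(r"\w+", sentence) if token[0].isupper()]

print(naive_ner("Yesterday Angela Merkel visited New York."))
# ['Yesterday', 'Angela', 'Merkel', 'New', 'York']
# 'Yesterday' is capitalized only because it starts the sentence, 'New' and 'York'
# are not grouped into a single entity, and nothing here says whether an entity
# is a person, a location or an organization.
```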
Natural language generation: Convert information from computer databases into readable human language.

Natural language understanding: Convert chunks of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate. Natural language understanding involves identifying the intended semantics among the multiple possible semantics that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. Introducing and creating a language metamodel and ontology are efficient, though empirical, solutions. An explicit formalization of natural language semantics, without confusion with implicit assumptions such as the closed world assumption (CWA) vs. the open world assumption, or subjective Yes/No vs. objective True/False, is expected to form the basis of a formalization of semantics.
Optical character recognition (OCR): Given an image representing printed text, determine the corresponding text.

Part-of-speech tagging: Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or a verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Note that some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language during verbalization, and that tonal inflection is not readily conveyed by the entities employed within the orthography to convey the intended meaning.
Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous, and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).

Question answering: Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?").

Relationship extraction: Given a chunk of text, identify the relationships among named entities (e.g. who is the wife of whom).

Sentence breaking (also known as sentence boundary disambiguation): Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations).
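A minimal sketch of that ambiguity, assuming a small hand-picked abbreviation list: the splitter below treats a period, question mark or exclamation mark as a sentence boundary only when it is followed by whitespace and a capital letter and does not terminate a known abbreviation. Real systems use much richer cues or learned models, so this is illustrative only.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for the example.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "a.m.", "p.m.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split on '.', '!' or '?' followed by a space and a capital letter,
    unless the period belongs to a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        end = match.end()
        candidate = text[start:end].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period marks an abbreviation, not a boundary
        sentences.append(candidate)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith arrived at 10 a.m. He was late. Was the train delayed?"))
# ['Dr. Smith arrived at 10 a.m. He was late.', 'Was the train delayed?']
```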
Sentiment analysis: Extract subjective information, usually from a set of documents, often using online reviews to determine "polarity" about specific objects. It is especially useful for identifying trends of public opinion in social media, for example for the purposes of marketing.

Speech recognition: Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech, and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). Note also that in most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process.

Speech segmentation: Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition, and typically grouped with it.

Topic segmentation and recognition: Given a chunk of text, separate it into segments, each of which is devoted to a topic, and identify the topic of each segment.

Word segmentation: Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.
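One classic, deliberately simple way to segment such text is greedy maximum matching against a known vocabulary: repeatedly take the longest dictionary word that starts at the current position. The sketch below demonstrates the idea on a space-free English string; the tiny vocabulary and the single-character fallback are assumptions for the example, and real segmenters rely on large lexicons and statistical models.

```python
# Hypothetical vocabulary; a real system would use a large lexicon.
VOCAB = {"the", "cat", "sat", "on", "mat", "a"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)

def max_match(text):
    """Greedy left-to-right maximum-matching word segmentation."""
    words, i = [], 0
    while i < len(text):
        # Try the longest possible substring first.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in VOCAB or length == 1:
                # Fall back to a single character if nothing matches.
                words.append(candidate)
                i += length
                break
    return words

print(max_match("thecatsatonthemat"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```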
Word sense disambiguation: Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.
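A rough sketch of dictionary-based disambiguation in the spirit of the Lesk algorithm: choose the sense whose gloss shares the most words with the sentence context. The two-sense inventory for "bank" below is a made-up stand-in for a resource like WordNet, so treat it purely as an illustration of the idea.

```python
# Hypothetical sense inventory; a real system would pull glosses from WordNet.
SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land alongside a river or stream",
}

def disambiguate(context_sentence, senses=SENSES):
    """Pick the sense whose gloss overlaps most with the context (Lesk-style)."""
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("she sat on the bank of the river and watched the stream"))
# bank/river  (its gloss shares "the", "river" and "stream" with the context)
```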
In some cases, sets of related tasks are grouped into subfields of NLP that are often considered separately from NLP as a whole. Examples include:

Information retrieval (IR): This is concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP.

Information extraction (IE): This is concerned in general with the extraction of semantic information from text. This covers tasks such as named entity recognition, coreference resolution, relationship extraction, etc.

Speech processing: This covers speech recognition, text-to-speech and related tasks.

Other tasks include: stemming, text simplification, text-to-speech, text-proofing, natural language search, query expansion, automated essay scoring and truecasing.

Statistical NLP

Main article: statistical natural language processing

Statistical natural-language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties discussed above, especially those which arise because longer sentences are highly ambiguous when processed with realistic grammars, yielding thousands or millions of possible analyses. Methods for disambiguation often involve the use of corpora and Markov models. Statistical NLP comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra. The technology for statistical NLP comes mainly from machine learning and data mining, both of which are fields of artificial intelligence that involve learning from data.
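As a toy illustration of the Markov-model idea, the sketch below estimates add-one smoothed bigram probabilities from a tiny corpus and uses them to score two competing readings of the same spoken sentence, preferring the reading the corpus makes more probable. The corpus and the candidate readings are invented for the example; real statistical systems use vastly larger corpora and more careful smoothing.

```python
from collections import Counter

# Tiny training corpus of tokenized sentences; real models use millions of words.
corpus = [
    "they painted their house".split(),
    "their house is large".split(),
    "the park is over there".split(),
    "there is a park".split(),
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size, used for add-one smoothing

def bigram_prob(prev, word):
    """Add-one smoothed P(word | prev): a first-order Markov assumption."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def sentence_prob(tokens):
    prob = 1.0
    for prev, word in zip(["<s>"] + tokens, tokens):
        prob *= bigram_prob(prev, word)
    return prob

# Two competing readings of the same (spoken) sentence; the model prefers the
# one whose word sequence the corpus makes more probable ("painted their").
for reading in (["they", "painted", "their", "fence"],
                ["they", "painted", "there", "fence"]):
    print(" ".join(reading), "->", f"{sentence_prob(reading):.3e}")
```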
Evaluation of natural language processing

Objectives

The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system, in order to determine whether (or to what extent) the system answers the goals of its designers or meets the needs of its users. Research in NLP evaluation has received considerable attention, because the definition of proper evaluation criteria is one way to specify an NLP problem precisely, going beyond the vagueness of tasks defined only as language understanding or language generation. A precise set of evaluation criteria, which includes mainly evaluation data and evaluation metrics, enables several teams to compare their solutions to a given NLP problem.
Short history of evaluation in NLP

The first evaluation campaign on written texts seems to be a campaign dedicated to message understanding in 1987 (Pallet 1998). Then the Parseval/GEIG project compared phrase-structure grammars (Black 1991). A series of campaigns within the Tipster project were carried out on tasks like summarization, translation and searching (Hirschman 1998). In 1994, in Germany, the Morpholympics compared German taggers. Then the Senseval and Romanseval campaigns were conducted with the objective of semantic disambiguation. In 1996, the Sparkle campaign compared syntactic parsers in four different languages (English, French, German and Italian). In France, the Grace project compared a set of 21 taggers for French in 1997 (Adda 1999). In 2004, during the Technolangue/Easy project, 13 parsers for French were compared. Large-scale evaluations of dependency parsers were performed in the context of the CoNLL shared tasks in 2006 and 2007. In Italy, the EVALITA campaign was conducted in 2007 and 2009 to compare various NLP and speech tools for Italian; the 2011 campaign is in full progress (EVALITA web site). In France, within the ANR Passage project (end of 2007), 10 parsers for French were compared (Passage web site).
References

Adda G., Mariani J., Paroubek P., Rajman M. (1999). L'action GRACE d'évaluation de l'assignation des parties du discours pour le français. Langues, vol. 2.

Black E., Abney S., Flickinger D., Gdaniec C., Grishman R., Harrison P., Hindle D., Ingria R., Jelinek F., Klavans J., Liberman M., Marcus M., Roukos S., Santorini B., Strzalkowski T. (1991). A procedure for quantitatively comparing the syntactic coverage of English grammars. DARPA Speech and Natural Language Workshop.

Hirschman L. (1998). Language understanding evaluation: lessons learned from MUC and ATIS. LREC, Granada.

Pallet D.S. (1998). The NIST role in automatic speech recognition benchmark tests. LREC, Granada.

Different types of evaluation

Depending on the evaluation procedures, a number of distinctions are traditionally made in NLP evaluation.
Intrinsic vs. extrinsic evaluation

Intrinsic evaluation considers an isolated NLP system and characterizes its performance mainly with respect to a gold standard result, pre-defined by the evaluators. Extrinsic evaluation, also called evaluation in use, considers the NLP system in a more complex setting, either as an embedded system or serving a precise function for a human user. The extrinsic performance of the system is then characterized in terms of its utility with respect to the overall task of the complex system or the human user. For example, consider a syntactic parser that is based on the output of some new part-of-speech (POS) tagger. An intrinsic evaluation would run the POS tagger on some labeled data and compare the system output of the POS tagger to the gold standard (correct) output. An extrinsic evaluation would run the parser with some other POS tagger, and then with the new POS tagger, and compare the parsing accuracy.

Black-box vs. glass-box evaluation

Black-box evaluation requires one to run an NLP system on a given data set and to measure a number of parameters related to the quality of the process (speed, reliability, resource consumption) and, most importantly, to the quality of the result (e.g. the accuracy of data annotation or the fidelity of a translation). Glass-box evaluation looks at the design of the system, the algorithms that are implemented, the linguistic resources it uses (e.g. vocabulary size), etc. Given the complexity of NLP problems, it is often difficult to predict performance only on the basis of glass-box evaluation, but this type of evaluation is more informative with respect to error analysis or future developments of a system.

Automatic vs. manual evaluation

In many cases, automatic procedures can be defined to evaluate an NLP system by comparing its output with the gold standard (or desired) one. Although the cost of producing the gold standard can be quite high, automatic evaluation can be repeated as often as needed without much additional cost (on the same input data). However, for many NLP problems, the definition of a gold standard is a complex task, and can prove impossible when inter-annotator agreement is insufficient. Manual evaluation is performed by human judges, who are instructed to estimate the quality of a system, or most often of a sample of its output, based on a number of criteria. Although, thanks to their linguistic competence, human judges can be considered the reference for a number of language processing tasks, there is also considerable variation across their ratings. This is why automatic evaluation is sometimes referred to as objective evaluation, while the human kind appears to be more subjective.
Shared tasks (campaigns)

BioCreative, Message Understanding Conference, Technolangue/Easy, Text Retrieval Conference, Evaluation exercises on Semantic Evaluation (SemEval), MorphoChallenge Semi-supervised and Unsupervised Morpheme Analysis.

Standardization in NLP

An ISO sub-committee is working to ease interoperability between lexical resources and NLP programs. The sub-committee is part of ISO/TC37 and is called ISO/TC37/SC4. Some ISO standards are already published, but most of them are under construction, mainly on lexicon representation (see LMF), annotation and the data category registry.
Automatic summarization is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text.

Discourse analysis (DA), or discourse studies, is a general term for a number of approaches to analyzing written, spoken or signed language use, or any significant semiotic event.
Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation (MAHT) or interactive translation), is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another. On a basic level, MT performs simple substitution of words in one natural language for words in another, but that alone usually cannot produce a good translation of a text, because recognition of whole phrases and their closest counterparts in the target language is needed. Solving this problem with corpus and statistical techniques is a rapidly growing field that is leading to better translations, handling differences in linguistic typology, translation of idioms, and the isolation of anomalies.

Current machine translation software often allows for customisation by domain or profession (such as weather reports), improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used. It follows that machine translation of government and legal documents more readily produces usable output than conversation or less standardised text. Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators and, in a very limited number of cases, can even produce output that can be used as is (e.g. weather reports).

The progress and potential of machine translation have been much debated throughout its history. Since the 1950s, a number of scholars have questioned the possibility of achieving fully automatic machine translation of high quality. Some critics claim that there are in-principle obstacles to automating the translation process.

In 1629, René Descartes proposed a universal language, with equivalent ideas in different tongues sharing one symbol. In the 1950s, the Georgetown experiment (1954) involved fully automatic translation of over sixty Russian sentences into English. The experiment was a great success and ushered in an era of substantial funding for machine-translation research. The authors claimed that within three to five years, machine translation would be a solved problem. Real progress was much slower, however, and after the ALPAC report (1966), which found that the ten-year-long research effort had failed to fulfill expectations, funding was greatly reduced. Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for machine translation.

The idea of using digital computers for translation of natural languages was proposed as early as 1946 by A. D. Booth and possibly others. Warren Weaver wrote an important memorandum, "Translation", in 1949. The Georgetown experiment was by no means the first such application; a demonstration was made in 1954 on the APEXC machine at Birkbeck College (University of London) of a rudimentary translation of English into French. Several papers on the topic were published at the time, and even articles in popular journals (see for example Wireless World, Sept. 1955, Cleave and Zacharov). A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer.
Translation process

Main article: Translation process

The human translation process may be described as: decoding the meaning of the source text, and re-encoding this meaning in the target language. Behind this ostensibly simple procedure lies a complex cognitive operation. To decode the meaning of the source text in its entirety, the translator must interpret and analyze all the features of the text, a process that requires in-depth knowledge of the grammar, semantics, syntax, idioms, etc., of the source language, as well as the culture of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the target language. Therein lies the challenge in machine translation: how to program a computer that will "understand" a text as a person does, and that will "create" a new text in the target language that "sounds" as if it has been written by a person. This problem may be approached in a number of ways.
Approaches

(Figure: Bernard Vauquois' pyramid showing comparative depths of intermediary representation, with interlingual machine translation at the peak, followed by transfer-based, then direct translation.)

Machine translation can use a method based on linguistic rules, which means that words will be translated in a linguistic way: the most suitable (orally speaking) words of the target language will replace the ones in the source language. It is often argued that the success of machine translation requires the problem of natural language understanding to be solved first. Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated. According to the nature of the intermediary representation, an approach is described as interlingual machine translation or transfer-based machine translation. These methods require extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules.

Given enough data, machine translation programs often work well enough for a native speaker of one language to get the approximate meaning of what has been written by a native speaker of the other. The difficulty is getting enough data of the right kind to support the particular method. For example, the large multilingual corpus of data needed for statistical methods to work is not necessary for the grammar-based methods. But then, the grammar methods need a skilled linguist to carefully design the grammar that they use. To translate between closely related languages, a technique referred to as shallow-transfer machine translation may be used.
Rule-based

Main article: Rule-based machine translation

The rule-based machine translation paradigm includes transfer-based machine translation, interlingual machine translation and dictionary-based machine translation.

Transfer-based machine translation

Main article: Transfer-based machine translation

Interlingual

Main article: Interlingual machine translation

Interlingual machine translation is one instance of rule-based machine-translation approaches. In this approach, the source language, i.e. the text to be translated, is transformed into an interlingua, i.e. a source- and target-language-independent representation. The target language is then generated out of the interlingua.

Dictionary-based

Main article: Dictionary-based machine translation

Machine translation can use a method based on dictionary entries, which means that the words will be translated as they are by a dictionary.

Statistical

Main article: Statistical machine translation

Statistical machine translation tries to generate translations using statistical methods based on bilingual text corpora, such as the Canadian Hansard corpus, the English-French record of the Canadian parliament, and EUROPARL, the record of the European Parliament. Where such corpora are available, impressive results can be achieved translating texts of a similar kind, but such corpora are still very rare. The first statistical machine translation software was CANDIDE from IBM. Google used SYSTRAN for several years, but switched to a statistical translation method in October 2007. More recently, Google improved its translation capabilities by feeding approximately 200 billion words from United Nations materials into its system for training, and the accuracy of the translation has improved.
Example-based

Main article: Example-based machine translation

The example-based machine translation (EBMT) approach was proposed by Makoto Nagao in 1984. It is often characterised by its use of a bilingual corpus as its main knowledge base at run-time. It is essentially translation by analogy and can be viewed as an implementation of the case-based reasoning approach of machine learning.

Hybrid MT

Hybrid machine translation (HMT) leverages the strengths of statistical and rule-based translation methodologies. Several MT companies (Asia Online, LinguaSys, Systran, PangeaMT, UPV) claim to have a hybrid approach using both rules and statistics. The approaches differ in a number of ways:

Rules post-processed by statistics: Translations are performed using a rules-based engine. Statistics are then used in an attempt to adjust/correct the output from the rules engine.

Statistics guided by rules: Rules are used to pre-process data in an attempt to better guide the statistical engine. Rules are also used to post-process the statistical output to perform functions such as normalization. This approach has much more power, flexibility and control when translating.
Major issues

Disambiguation

Main article: Word sense disambiguation

Word-sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. The problem was first raised in the 1950s by Yehoshua Bar-Hillel. He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word. Today there are numerous approaches designed to overcome this problem. They can be approximately divided into "shallow" approaches and "deep" approaches. Shallow approaches assume no knowledge of the text; they simply apply statistical methods to the words surrounding the ambiguous word. Deep approaches presume a comprehensive knowledge of the word. So far, shallow approaches have been more successful.

The late Claude Piron, a long-time translator for the United Nations and the World Health Organization, wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve ambiguities in the source text, which the grammatical and lexical exigencies of the target language require to be resolved:

Why does a translator need a whole workday to translate five pages, and not an hour or two? ... About 90% of an average text corresponds to these simple conditions. But unfortunately, there's the other 10%. It's that part that requires six (more) hours of work. There are ambiguities one has to resolve. For instance, the author of the source text, an Australian physician, cited the example of an epidemic which was declared during World War II in a "Japanese prisoner of war camp". Was he talking about an American camp with Japanese prisoners or a Japanese camp with American prisoners? The English has two senses. It's necessary therefore to do research, maybe to the extent of a phone call to Australia.

The ideal deep approach would require the translation software to do all the research necessary for this kind of disambiguation on its own; but this would require a higher degree of AI than has yet been attained. A shallow approach which simply guessed at the sense of the ambiguous English phrase that Piron mentions (based, perhaps, on which kind of prisoner-of-war camp is more often mentioned in a given corpus) would have a reasonable chance of guessing wrong fairly often. A shallow approach that involves asking the user about each ambiguity would, by Piron's estimate, only automate about 25% of a professional translator's job, leaving the harder 75% still to be done by a human.
The objects of discourse analysis (discourse, writing, conversation, communicative events, etc.) are variously defined in terms of coherent sequences of sentences, propositions, speech acts or turns-at-talk. Contrary to much of traditional linguistics, discourse analysts not only study language use "beyond the sentence boundary", but also prefer to analyze "naturally occurring" language use rather than invented examples. Text linguistics is a related field; the essential difference between discourse analysis and text linguistics is that discourse analysis aims at revealing the socio-psychological characteristics of a person or persons rather than the structure of the text. Discourse analysis has been taken up in a variety of social science disciplines, including linguistics, sociology, anthropology, social work, cognitive psychology, social psychology, international relations, human geography, communication studies and translation studies, each of which is subject to its own assumptions, dimensions of analysis, and methodologies.
Some scholars consider the Austrian émigré Leo Spitzer's Stilstudien (Style Studies) of 1928 the earliest example of discourse analysis (DA); Michel Foucault himself translated it into French.
But the term first came into general use following the publication of a series of papers by Zellig Harris beginning in 1952, reporting on work from which he developed transformational grammar in the late 1930s. Formal equivalence relations among the sentences of a coherent discourse are made explicit by using sentence transformations to put the text in a canonical form. Words and sentences with equivalent information then appear in the same column of an array. This work progressed over the next four decades (see references) into a science of sublanguage analysis (Kittredge & Lehrberger 1982), culminating in a demonstration of the informational structures in texts of a sublanguage of science, that of immunology (Harris et al. 1989), and a fully articulated theory of linguistic informational content (Harris 1991). During this time, however, most linguists pursued a succession of elaborate theories of sentence-level syntax and semantics.

Although Harris had mentioned the analysis of whole discourses, he had not worked out a comprehensive model as of January 1952. A linguist working for the American Bible Society, James A. Lauriault/Loriot, needed to find answers to some fundamental errors in translating Quechua, in the Cuzco area of Peru. He took Harris's idea, recorded all of the legends and, after going over the meaning and placement of each word with a native speaker of Quechua, was able to form logical, mathematical rules that transcended the simple sentence structure. He then applied the process to another language of Eastern Peru, Shipibo. He taught the theory in Norman, Oklahoma, in the summers of 1956 and 1957 and entered the University of Pennsylvania in the intervening year. He tried to publish a paper, "Shipibo Paragraph Structure", but it was delayed until 1970 (Loriot & Hollenbach 1970). In the meantime, Dr. Kenneth Lee Pike, a professor at the University of Michigan, Ann Arbor, taught the theory, and one of his students, Robert E. Longacre, was able to disseminate it in a dissertation.

Harris's methodology was developed into a system for the computer-aided analysis of natural language by a team led by Naomi Sager at NYU, which has been applied to a number of sublanguage domains, most notably to medical informatics. The software for the Medical Language Processor is publicly available on SourceForge.

In the late 1960s and 1970s, and without reference to this prior work, a variety of other approaches to a new cross-discipline of DA began to develop in most of the humanities and social sciences concurrently with, and related to, other disciplines such as semiotics, psycholinguistics, sociolinguistics, and pragmatics. Many of these approaches, especially those influenced by the social sciences, favor a more dynamic study of oral talk-in-interaction. Mention must also be made of the term "conversation analysis", which was influenced by the sociologist Harold Garfinkel, the founder of ethnomethodology. In Europe, Michel Foucault became one of the key theorists of the subject, especially of discourse, and wrote The Archaeology of Knowledge on the subject.
Topics of interest

Topics of discourse analysis include:

The various levels or dimensions of discourse, such as sounds (intonation, etc.), gestures, syntax, the lexicon, style, rhetoric, meanings, speech acts, moves, strategies, turns and other aspects of interaction
Genres of discourse (various types of discourse in politics, the media, education, science, business, etc.)
The relations between discourse and the emergence of syntactic structure
The relations between text (discourse) and context
The relations between discourse and power
The relations between discourse and interaction
The relations between discourse and cognition and memory

Political discourse

Political discourse analysis is a field of discourse analysis which focuses on discourse in political forums (such as debates, speeches, and hearings) as the phenomenon of interest. Political discourse is the informal exchange of reasoned views as to which of several alternative courses of action should be taken to solve a societal problem. It has been used throughout the history of the United States and is regarded as the essence of democracy. Full of problems and persuasion, political discourse is used in many debates, candidacies and in everyday life.

Perspectives

The following are some of the specific theoretical perspectives and analytical approaches used in linguistic discourse analysis:

Emergent grammar
Text grammar (or "discourse grammar")
Cohesion and relevance theory
Functional grammar
Rhetoric
Stylistics (linguistics)
Interactional sociolinguistics
Ethnography of communication
Pragmatics, particularly speech act theory
Conversation analysis
Variation analysis
Applied linguistics
Cognitive psychology, often under the label discourse processing, studying the production and comprehension of discourse
Discursive psychology
Response based therapy (counselling)
Critical discourse analysis
Sublanguage analysis
Genre analysis and critical genre analysis

Although these approaches emphasize different aspects of language use, they all view language as social interaction, and are concerned with the social contexts in which discourse is embedded.
Often a distinction is made between "local" structures of discourse (such as relations among sentences, propositions, and turns) and "global" structures, such as overall topics and the schematic organization of discourses and conversations. For instance, many types of discourse begin with some kind of global "summary", in titles, headlines, leads, abstracts, and so on. A problem for the discourse analyst is to decide when a particular feature is relevant to the specification that is required, and whether there are general principles that determine the relevance or nature of that specification.
Prominent discourse analysts

Marc Angenot, Robert de Beaugrande, Jan Blommaert, Adriana Bolivar, Carmen Rosa Caldas-Coulthard, Robyn Carston, Wallace Chafe, Paul Chilton, Guy Cook, Malcolm Coulthard, James Deese, Paul Drew, John Du Bois, Alessandro Duranti, Brenton D. Faber, Norman Fairclough, Michel Foucault, Roger Fowler, James Paul Gee, Talmy Givón, Charles Goodwin, Art Graesser, Michael Halliday, Zellig Harris, John Heritage, Janet Holmes, David R. Howarth, Paul Hopper, Gail Jefferson, Barbara Johnstone, Walter Kintsch, Richard Kittredge, Adam Jaworski, William Labov, George Lakoff, Jay Lemke, Stephen H. Levinsohn, James A. Lauriault/Loriot, Robert E. Longacre, Jim Martin, Aletta Norval, David Nunan, Elinor Ochs, Gina Poncini, Jonathan Potter, Edward Robinson, Nikolas Rose, Harvey Sacks, Svenka Savic, Naomi Sager, Emanuel Schegloff, Deborah Schiffrin, Michael Schober, Stef Slembrouck, Michael Stubbs, John Swales, Deborah Tannen, Sandra Thompson, Teun A. van Dijk, Theo van Leeuwen, Jef Verschueren, Henry Widdowson, Carla Willig, Deirdre Wilson, Ruth Wodak, Margaret Wetherell, Ernesto Laclau, Chantal Mouffe, Judith M. De Guzman, Cynthia Hardy, Louise J. Phillips, V.J. Bhatia.

The phenomenon of information overload has meant that access to coherent and correctly developed summaries is vital.
As access to data has increased, so has interest in automatic summarization. An example of the use of summarization technology is search engines such as Google. Technologies that can make a coherent summary of any kind of text need to take into account several variables, such as length, writing style and syntax, to produce a useful summary.

Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate. Such a summary might contain words not explicitly present in the original. The state-of-the-art abstractive methods are still quite weak, so most research has focused on extractive methods, and this is what we will cover. Two particular types of summarization often addressed in the literature are keyphrase extraction, where the goal is to select individual words or phrases to "tag" a document, and document summarization, where the goal is to select whole sentences to create a short paragraph summary.

Extraction and abstraction

Broadly, one distinguishes two approaches: extraction and abstraction. Extraction techniques merely copy the information deemed most important by the system to the summary (for example, key clauses, sentences or paragraphs), while abstraction involves paraphrasing sections of the source document. In general, abstraction can condense a text more strongly than extraction, but the programs that can do this are harder to develop, as they require the use of natural language generation technology, which is itself a growing field.
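A minimal sketch of the extractive approach: score each sentence by how frequent its content words are in the document as a whole and keep the top-scoring sentences in their original order. The stop-word list, the scoring scheme and the naive punctuation-based sentence split are simplifying assumptions; production systems use much richer features.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "this"}

def extractive_summary(text, num_sentences=2):
    """Pick the sentences whose content words occur most often in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    # Keep the best-scoring sentences, then restore their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)

document = ("Automatic summarization shortens a text by computer. "
            "Extractive summarization selects sentences from the text itself. "
            "Abstractive summarization instead generates new sentences. "
            "Most deployed systems today are extractive.")
print(extractive_summary(document))
```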
Types of summaries There are different types of summaries depending on what the summarization program focuses on to make the summary of the text, for example generic summaries or query-relevant summaries (sometimes called query-biased summaries).
Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs .
Summarization of multimedia documents , e.g. pictures or movies , is also possible .
Some systems will generate a summary based on a single source document , while others can use multiple source documents -LRB- for example , a cluster of news stories on the same topic -RRB- .
These systems are known as multi-document summarization systems .
Keyphrase extraction Task description and example The task is the following .
You are given a piece of text , such as a journal article , and you must produce a list of keywords or keyphrases that capture the primary topics discussed in the text .
In the case of research articles , many authors provide manually assigned keywords , but most text lacks pre-existing keyphrases .
For example , news articles rarely have keyphrases attached , but it would be useful to be able to automatically do so for a number of applications discussed below .
Consider the example text from a recent news article : `` The Army Corps of Engineers , rushing to meet President Bush 's promise to protect New Orleans by the start of the 2006 hurricane season , installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm , according to documents obtained by The Associated Press '' .
An extractive keyphrase extractor might select `` Army Corps of Engineers '' , `` President Bush '' , `` New Orleans '' , and `` defective flood-control pumps '' as keyphrases .
These are pulled directly from the text .
In contrast , an abstractive keyphrase system would somehow internalize the content and generate keyphrases that might be more descriptive and more like what a human would produce , such as `` political negligence '' or `` inadequate protection from floods '' .
Note that these terms do not appear in the text and require a deep understanding , which makes it difficult for a computer to produce such keyphrases .
Keyphrases have many applications , such as to improve document browsing by providing a short summary .
Also , keyphrases can improve information retrieval -- if documents have keyphrases assigned , a user could search by keyphrase to produce more reliable hits than a full-text search .
Also , automatic keyphrase extraction can be useful in generating index entries for a large text corpus .
Keyphrase extraction as supervised learning Beginning with the Turney paper , many researchers have approached keyphrase extraction as a supervised machine learning problem .
Given a document , we construct an example for each unigram , bigram , and trigram found in the text -LRB- though other text units are also possible , as discussed below -RRB- .
We then compute various features describing each example -LRB- e.g. , does the phrase begin with an upper-case letter ? -RRB- .
We assume there are known keyphrases available for a set of training documents .
Using the known keyphrases , we can assign positive or negative labels to the examples .
Then we learn a classifier that can discriminate between positive and negative examples as a function of the features .
Some classifiers make a binary classification for a test example , while others assign a probability of being a keyphrase .
For instance , in the above text , we might learn a rule that says phrases with initial capital letters are likely to be keyphrases .
After training a learner , we can select keyphrases for test documents in the following manner .
We apply the same example-generation strategy to the test documents , then run each example through the learner .
We can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model .
If probabilities are given , a threshold is used to select the keyphrases .
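As a rough illustration of this train-then-score pipeline, the sketch below generates unigram, bigram and trigram candidates, computes a few simple surface features, and fits a classifier. The toy document, the assigned gold keyphrases, and the use of scikit-learn's logistic regression (standing in for "virtually any supervised learner") are all assumptions for illustration, not the feature set or learner of Turney's or Hulth's actual systems.

```python
# A minimal sketch of supervised keyphrase extraction (toy data; not the
# exact features or learner of Turney's GenEx or Hulth's system).
import re
from sklearn.linear_model import LogisticRegression  # stands in for any supervised learner

def candidates(text, max_n=3):
    """All unigrams, bigrams and trigrams of the text, lower-cased."""
    words = re.findall(r"[A-Za-z][A-Za-z-]*", text)
    return {" ".join(words[i:i + n]).lower()
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

def features(phrase, text):
    """A few simple surface features for one candidate phrase."""
    lower = text.lower()
    return [
        lower.count(phrase),                       # term frequency in this document
        len(phrase.split()),                       # length of the example in words
        lower.find(phrase) / max(len(lower), 1),   # relative position of first occurrence
        float(phrase.title() in text),             # appears capitalised in the text
    ]

# Hypothetical training document with known (assigned) keyphrases.
doc = ("The Army Corps of Engineers installed defective flood-control pumps "
       "in New Orleans before the hurricane season.")
gold = {"army corps", "new orleans", "flood-control pumps"}

cands = sorted(candidates(doc))
X = [features(c, doc) for c in cands]
y = [int(c in gold) for c in cands]            # positive / negative labels
model = LogisticRegression(max_iter=1000).fit(X, y)

# At test time: score the candidates of a new document and keep the top few.
test = "Engineers in New Orleans warned that the pumps would fail."
test_cands = sorted(candidates(test))
scores = model.predict_proba([features(c, test) for c in test_cands])[:, 1]
print(sorted(zip(scores, test_cands), reverse=True)[:5])
```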
Keyphrase extractors are generally evaluated using precision and recall .
Precision measures how many of the proposed keyphrases are actually correct .
Recall measures how many of the true keyphrases your system proposed .
The two measures can be combined in an F-score, which is the harmonic mean of the two: F = 2PR / (P + R).
Matches between the proposed keyphrases and the known keyphrases can be checked after stemming or applying some other text normalization .
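A minimal sketch of this evaluation, assuming a crude suffix-stripping normalizer in place of a real stemmer such as Porter's:

```python
# A minimal sketch of precision, recall and F-score for keyphrase evaluation.
# The crude suffix stripper below only stands in for a real stemmer.
def stem(word):
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def normalise(phrase):
    return " ".join(stem(w) for w in phrase.lower().split())

def precision_recall_f(proposed, gold):
    p_set = {normalise(k) for k in proposed}
    g_set = {normalise(k) for k in gold}
    matches = len(p_set & g_set)
    precision = matches / len(p_set) if p_set else 0.0
    recall = matches / len(g_set) if g_set else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(precision_recall_f(["flood-control pumps", "President Bush", "storms"],
                         ["flood-control pump", "President Bush", "New Orleans"]))
# -> roughly (0.67, 0.67, 0.67): "pumps" matches "pump" after stemming
```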
Design choices Designing a supervised keyphrase extraction system involves making several choices (some of which also apply to the unsupervised case): What are the examples?
The first choice is exactly how to generate examples .
Turney and others have used all possible unigrams , bigrams , and trigrams without intervening punctuation and after removing stopwords .
Hulth showed that you can get some improvement by selecting examples to be sequences of tokens that match certain patterns of part-of-speech tags .
Ideally , the mechanism for generating examples produces all the known labeled keyphrases as candidates , though this is often not the case .
For example , if we use only unigrams , bigrams , and trigrams , then we will never be able to extract a known keyphrase containing four words .
Thus , recall may suffer .
However , generating too many examples can also lead to low precision .
What are the features ?
We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non - keyphrases .
Typically features involve various term frequencies -LRB- how many times a phrase appears in the current text or in a larger corpus -RRB- , the length of the example , relative position of the first occurrence , various boolean syntactic features -LRB- e.g. , contains all caps -RRB- , etc. .
The Turney paper used about 12 such features .
Hulth uses a reduced set of features , which were found most successful in the KEA -LRB- Keyphrase Extraction Algorithm -RRB- work derived from Turney 's seminal paper .
How many keyphrases to return ?
In the end , the system will need to return a list of keyphrases for a test document , so we need to have a way to limit the number .
Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to provide a user-specified number of keyphrases. This is the technique used by Turney with C4.5 decision trees.
Hulth used a single binary classifier so the learning algorithm implicitly determines the appropriate number .
What learning algorithm ?
Once examples and features are created , we need a way to learn to predict keyphrases .
Virtually any supervised learning algorithm could be used , such as decision trees , Naive Bayes , and rule induction .
In the case of Turney 's GenEx algorithm , a genetic algorithm is used to learn parameters for a domain-specific keyphrase extraction algorithm .
The extractor follows a series of heuristics to identify keyphrases .
The genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known key phrases .
Unsupervised keyphrase extraction : TextRank While supervised methods have some nice properties , like being able to produce interpretable rules for what features characterize a keyphrase , they also require a large amount of training data .
Many documents with known keyphrases are needed .
Furthermore , training on a specific domain tends to customize the extraction process to that domain , so the resulting classifier is not necessarily portable , as some of Turney 's results demonstrate .
Unsupervised keyphrase extraction removes the need for training data .
It approaches the problem from a different angle .
Instead of trying to learn explicit features that characterize keyphrases , the TextRank algorithm exploits the structure of the text itself to determine keyphrases that appear `` central '' to the text in the same way that PageRank selects important Web pages .
Recall this is based on the notion of `` prestige '' or `` recommendation '' from social networks .
In this way , TextRank does not rely on any previous training data at all , but rather can be run on any arbitrary piece of text , and it can produce output simply based on the text 's intrinsic properties .
Thus the algorithm is easily portable to new domains and languages .
TextRank is a general purpose graph-based ranking algorithm for NLP .
Essentially , it runs PageRank on a graph specially designed for a particular NLP task .
For keyphrase extraction , it builds a graph using some set of text units as vertices .
Edges are based on some measure of semantic or lexical similarity between the text unit vertices .
Unlike PageRank , the edges are typically undirected and can be weighted to reflect a degree of similarity .
Once the graph is constructed , it is used to form a stochastic matrix , combined with a damping factor -LRB- as in the `` random surfer model '' -RRB- , and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 -LRB- i.e. , the stationary distribution of the random walk on the graph -RRB- .
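The ranking step can be sketched as plain power iteration with a damping factor over a weighted, undirected graph; the toy co-occurrence graph below is invented for illustration, and the stationary distribution is approximated by iterating rather than solving for the eigenvector exactly.

```python
# A minimal sketch of the ranking step: power iteration with a damping factor
# (the "random surfer model") over a weighted, undirected graph.
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping vertex -> {neighbour: edge weight}."""
    vertices = list(graph)
    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        new_rank = {}
        for v in vertices:
            incoming = sum(rank[u] * nbrs[v] / sum(nbrs.values())
                           for u, nbrs in graph.items() if v in nbrs)
            new_rank[v] = (1 - damping) / len(vertices) + damping * incoming
        rank = new_rank
    return rank

# Hypothetical toy co-occurrence graph.
toy = {
    "natural":    {"language": 2.0, "processing": 1.0},
    "language":   {"natural": 2.0, "processing": 2.0},
    "processing": {"natural": 1.0, "language": 2.0, "speech": 1.0},
    "speech":     {"processing": 1.0},
}
print(sorted(pagerank(toy).items(), key=lambda kv: -kv[1]))
```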
Design choices What should vertices be ?
The vertices should correspond to what we want to rank .
Potentially , we could do something similar to the supervised methods and create a vertex for each unigram , bigram , trigram , etc. .
However , to keep the graph small , the authors decide to rank individual unigrams in a first step , and then include a second step that merges highly ranked adjacent unigrams to form multi-word phrases .
This has a nice side effect of allowing us to produce keyphrases of arbitrary length .
For example , if we rank unigrams and find that `` advanced '' , `` natural '' , `` language '' , and `` processing '' all get high ranks , then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together .
Note that the unigrams placed in the graph can be filtered by part of speech .
The authors found that adjectives and nouns were the best to include .
Thus , some linguistic knowledge comes into play in this step .
How should we create edges ?
Edges are created based on word co-occurrence in this application of TextRank .
Two vertices are connected by an edge if the unigrams appear within a window of size N in the original text .
N is typically around 2 -- 10 .
Thus , `` natural '' and `` language '' might be linked in a text about NLP .
`` Natural '' and `` processing '' would also be linked because they would both appear in the same string of N words .
These edges build on the notion of `` text cohesion '' and the idea that words that appear near each other are likely related in a meaningful way and `` recommend '' each other to the reader .
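A minimal sketch of this edge-construction step; the part-of-speech filter mentioned above is omitted, and the tokenizer and window handling are simplifying assumptions.

```python
# A minimal sketch of building the co-occurrence graph: two unigrams are
# connected (with a weight counting co-occurrences) whenever they appear
# within the same window of `window` consecutive tokens.
import re
from collections import defaultdict

def cooccurrence_graph(text, window=2):
    words = [w.lower() for w in re.findall(r"[A-Za-z][A-Za-z-]*", text)]
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                graph[w][words[j]] += 1.0   # undirected: add the edge both ways
                graph[words[j]][w] += 1.0
    return graph

g = cooccurrence_graph("Natural language processing enables natural language interfaces", window=2)
print(dict(g["language"]))   # neighbours of "language" and their weights
```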
How are the final keyphrases formed ?
Since this method simply ranks the individual vertices , we need a way to threshold or produce a limited number of keyphrases .
The technique chosen is to set a count T to be a user-specified fraction of the total number of vertices in the graph .
Then the top T vertices\/unigrams are selected based on their stationary probabilities .
A post - processing step is then applied to merge adjacent instances of these T unigrams .
As a result , potentially more or less than T final keyphrases will be produced , but the number should be roughly proportional to the length of the original text .
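The post-processing merge can be sketched as a single pass over the original token sequence that joins runs of top-T unigrams; the example text and unigram set below are assumptions for illustration.

```python
# A minimal sketch of the post-processing step: adjacent words that both made
# the top-T cut are merged back into multi-word keyphrases.
import re

def merge_adjacent(text, top_unigrams):
    words = [w.lower() for w in re.findall(r"[A-Za-z][A-Za-z-]*", text)]
    phrases, current = set(), []
    for w in words:
        if w in top_unigrams:
            current.append(w)
        elif current:
            phrases.add(" ".join(current))
            current = []
    if current:
        phrases.add(" ".join(current))
    return phrases

print(merge_adjacent("Advanced natural language processing is applied to raw text",
                     {"advanced", "natural", "language", "processing", "text"}))
# -> {'advanced natural language processing', 'text'}
```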
Why it works It is not initially clear why applying PageRank to a co-occurrence graph would produce useful keyphrases .
One way to think about it is the following .
A word that appears multiple times throughout a text may have many different co-occurring neighbors .
For example, in a text about machine learning, the unigram "learning" might co-occur with "machine", "supervised", "un-supervised", and "semi-supervised" in four different sentences.
Thus , the `` learning '' vertex would be a central `` hub '' that connects to these other modifying words .
Running PageRank\/TextRank on the graph is likely to rank `` learning '' highly .
Similarly , if the text contains the phrase `` supervised classification '' , then there would be an edge between `` supervised '' and `` classification '' .
If "classification" appears in several other places and thus has many neighbors, its importance would contribute to the importance of "supervised". If "supervised" ends up with a high rank, it will be selected as one of the top T unigrams, along with "learning" and probably "classification".
In the final post-processing step , we would then end up with keyphrases `` supervised learning '' and `` supervised classification '' .
In short , the co-occurrence graph will contain densely connected regions for terms that appear often and in different contexts .
A random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters .
This is similar to densely connected Web pages getting ranked highly by PageRank .
Document summarization Like keyphrase extraction , document summarization hopes to identify the essence of a text .
The only real difference is that now we are dealing with larger text units -- whole sentences instead of words and phrases .
While some work has been done in abstractive summarization -LRB- creating an abstract synopsis like that of a human -RRB- , the majority of summarization systems are extractive -LRB- selecting a subset of sentences to place in a summary -RRB- .
Before getting into the details of some summarization methods , we will mention how summarization systems are typically evaluated .
The most common way is using the so-called ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measure (http://haydn.isi.edu/ROUGE/).
This is a recall-based measure that determines how well a system-generated summary covers the content present in one or more human-generated model summaries known as references .
It is recall-based to encourage systems to include all the important topics in the text .
Recall can be computed with respect to unigram , bigram , trigram , or 4-gram matching , though ROUGE-1 -LRB- unigram matching -RRB- has been shown to correlate best with human assessments of system-generated summaries -LRB- i.e. , the summaries with highest ROUGE-1 values correlate with the summaries humans deemed the best -RRB- .
ROUGE-1 is computed as the number of unigrams in the reference summary that also appear in the system summary, divided by the total number of unigrams in the reference summary.
If there are multiple references , the ROUGE-1 scores are averaged .
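A minimal sketch of ROUGE-1 recall under these definitions (clipped unigram counts, scores averaged over references); this is a simplification of the official ROUGE toolkit.

```python
# A minimal sketch of ROUGE-1 recall: the fraction of reference unigrams that
# also appear in the system summary, averaged over multiple references.
from collections import Counter

def rouge_1(system_summary, reference_summaries):
    system_counts = Counter(system_summary.lower().split())
    scores = []
    for reference in reference_summaries:
        ref_counts = Counter(reference.lower().split())
        overlap = sum(min(count, system_counts[word]) for word, count in ref_counts.items())
        scores.append(overlap / sum(ref_counts.values()))
    return sum(scores) / len(scores)

print(rouge_1("the pumps failed during the storm",
              ["defective pumps failed in the storm", "the flood pumps failed"]))
```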
Because ROUGE is based only on content overlap , it can determine if the same general concepts are discussed between an automatic summary and a reference summary , but it can not determine if the result is coherent or the sentences flow together in a sensible manner .
High-order n-gram ROUGE measures try to judge fluency to some degree .
Note that ROUGE is similar to the BLEU measure for machine translation , but BLEU is precision - based , because translation systems favor accuracy .
A promising line in document summarization is adaptive document\/text summarization .
The idea of adaptive summarization involves preliminary recognition of document\/text genre and subsequent application of summarization algorithms optimized for this genre .
The first summarizers that perform adaptive summarization have been created.
Overview of supervised learning approaches Supervised text summarization is very much like supervised keyphrase extraction , and we will not spend much time on it .
Basically , if you have a collection of documents and human-generated summaries for them , you can learn features of sentences that make them good candidates for inclusion in the summary .
Features might include the position in the document -LRB- i.e. , the first few sentences are probably important -RRB- , the number of words in the sentence , etc. .
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as `` in summary '' or `` not in summary '' .
This is not typically how people create summaries , so simply using journal abstracts or existing summaries is usually not sufficient .
The sentences in these summaries do not necessarily match up with sentences in the original text, so it would be difficult to assign labels to examples for training.
Note , however , that these natural summaries can still be used for evaluation purposes , since ROUGE-1 only cares about unigrams .
Unsupervised approaches : TextRank and LexRank The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data .
Some unsupervised summarization approaches are based on finding a `` centroid '' sentence , which is the mean word vector of all the sentences in the document .
Then the sentences can be ranked with regard to their similarity to this centroid sentence .
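A minimal sketch of this centroid idea, using raw bag-of-words counts rather than the TF-IDF weighting a real system would likely use:

```python
# A minimal sketch of centroid-based ranking: each sentence becomes a
# bag-of-words vector, the vectors are summed into a centroid (cosine is
# scale-invariant, so sum and mean rank identically), and sentences are
# ranked by cosine similarity to that centroid.
import math
import re
from collections import Counter

def bag(sentence):
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid_rank(sentences):
    vectors = [bag(s) for s in sentences]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)
    return sorted(((s, cosine(v, centroid)) for s, v in zip(sentences, vectors)),
                  key=lambda pair: -pair[1])

print(centroid_rank(["The pumps failed during the storm.",
                     "Engineers had warned that the pumps would fail.",
                     "The press conference was held on Friday."]))
```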
A more principled way to estimate sentence importance is using random walks and eigenvector centrality .
LexRank is an algorithm essentially identical to TextRank , and both use this approach for document summarization .
The two methods were developed by different groups at the same time , and LexRank simply focused on summarization , but could just as easily be used for keyphrase extraction or any other NLP ranking task .
Design choices What are the vertices ?
In both LexRank and TextRank , a graph is constructed by creating a vertex for each sentence in the document .
What are the edges ?
The edges between sentences are based on some form of semantic similarity or content overlap .
While LexRank uses cosine similarity of TF-IDF vectors , TextRank uses a very similar measure based on the number of words two sentences have in common -LRB- normalized by the sentences ' lengths -RRB- .
The LexRank paper explored using unweighted edges after applying a threshold to the cosine values , but also experimented with using edges with weights equal to the similarity score .
TextRank uses continuous similarity scores as weights .
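A small sketch of such a similarity measure; the normalisation by the sum of log sentence lengths is one common variant and is an assumption here, while LexRank would instead use cosine similarity of TF-IDF sentence vectors.

```python
# A small sketch of a TextRank-style edge weight between two sentences:
# shared words, normalised here by the sum of the log sentence lengths.
import math
import re

def tokens(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

def sentence_similarity(s1, s2):
    t1, t2 = tokens(s1), tokens(s2)
    if len(t1) < 2 or len(t2) < 2:      # avoid log(1) = 0 in the denominator
        return 0.0
    return len(t1 & t2) / (math.log(len(t1)) + math.log(len(t2)))

print(sentence_similarity("Engineers installed defective pumps.",
                          "The defective pumps failed during the storm."))
```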
How are summaries formed ?
In both algorithms , the sentences are ranked by applying PageRank to the resulting graph .
A summary is formed by combining the top ranking sentences , using a threshold or length cutoff to limit the size of the summary .
TextRank and LexRank differences It is worth noting that TextRank was applied to summarization exactly as described here , while LexRank was used as part of a larger summarization system -LRB- MEAD -RRB- that combines the LexRank score -LRB- stationary probability -RRB- with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights .
In this case , some training documents might be needed , though the TextRank results show the additional features are not absolutely necessary .
Another important distinction is that TextRank was used for single document summarization , while LexRank has been applied to multi-document summarization .
The task remains the same in both cases -- only the number of sentences to choose from has grown .
However , when summarizing multiple documents , there is a greater risk of selecting duplicate or highly redundant sentences to place in the same summary .
Imagine you have a cluster of news articles on a particular event , and you want to produce one summary .
Each article is likely to have many similar sentences , and you would only want to include distinct ideas in the summary .
To address this issue , LexRank applies a heuristic post-processing step that builds up a summary by adding sentences in rank order , but discards any sentences that are too similar to ones already placed in the summary .
The method used is called Cross-Sentence Information Subsumption -LRB- CSIS -RRB- .
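In the spirit of CSIS and MMR (not the exact published procedures), redundancy filtering can be sketched as a greedy pass over the ranked list that skips any sentence too similar to one already selected; the overlap measure and threshold below are assumptions for illustration.

```python
# A minimal sketch of redundancy filtering: walk down the ranked sentence
# list and skip anything too similar to what the summary already contains.
def word_overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def build_summary(ranked_sentences, max_sentences=3, threshold=0.5):
    summary = []
    for sentence in ranked_sentences:        # assumed to be in rank order already
        if all(word_overlap(sentence, kept) < threshold for kept in summary):
            summary.append(sentence)
        if len(summary) == max_sentences:
            break
    return summary

ranked = ["The pumps failed during the storm.",
          "Pumps failed in the storm.",                 # near-duplicate, gets dropped
          "Engineers had warned of the failure."]
print(build_summary(ranked))
```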
Why unsupervised summarization works These methods work based on the idea that sentences `` recommend '' other similar sentences to the reader .
Thus , if one sentence is very similar to many others , it will likely be a sentence of great importance .
The importance of this sentence also stems from the importance of the sentences `` recommending '' it .
Thus , to get ranked highly and placed in a summary , a sentence must be similar to many sentences that are in turn also similar to many other sentences .
This makes intuitive sense and allows the algorithms to be applied to any arbitrary new text .
The methods are domain-independent and easily portable .
One could imagine the features indicating important sentences in the news domain might vary considerably from the biomedical domain .
However , the unsupervised `` recommendation '' - based approach applies to any domain .
Incorporating diversity : GRASSHOPPER algorithm As mentioned above , multi-document extractive summarization faces a problem of potential redundancy .
Ideally , we would like to extract sentences that are both `` central '' -LRB- i.e. , contain the main ideas -RRB- and `` diverse '' -LRB- i.e. , they differ from one another -RRB- .
LexRank deals with diversity as a heuristic final stage using CSIS , and other systems have used similar methods , such as Maximal Marginal Relevance -LRB- MMR -RRB- , in trying to eliminate redundancy in information retrieval results .
We have developed a general purpose graph-based ranking algorithm like Page\/Lex\/TextRank that handles both `` centrality '' and `` diversity '' in a unified mathematical framework based on absorbing Markov chain random walks .
-LRB- An absorbing random walk is like a standard random walk , except some states are now absorbing states that act as `` black holes '' that cause the walk to end abruptly at that state . -RRB-
The algorithm is called GRASSHOPPER for reasons that should soon become clear .
In addition to explicitly promoting diversity during the ranking process , GRASSHOPPER incorporates a prior ranking -LRB- based on sentence position in the case of summarization -RRB- .
Maximum entropy-based summarization It is an abstractive method .
Even though automating abstractive summarization is the goal of summarization research , most practical systems are based on some form of extractive summarization .
Extracted sentences can form a valid summary in themselves or form a basis for further condensation operations.
Furthermore , evaluation of extracted summaries can be automated , since it is essentially a classification task .
During the DUC 2001 and 2002 evaluation workshops , TNO developed a sentence extraction system for multi-document summarization in the news domain .
The system was based on a hybrid system using a naive Bayes classifier and statistical language models for modeling salience .
Although the system exhibited good results , we wanted to explore the effectiveness of a maximum entropy -LRB- ME -RRB- classifier for the meeting summarization task , as ME is known to be robust against feature dependencies .
Maximum entropy has also been applied successfully for summarization in the broadcast news domain .
Aided summarization Machine learning techniques from closely related fields such as information retrieval or text mining have been successfully adapted to help automatic summarization .
Apart from Fully Automated Summarizers -LRB- FAS -RRB- , there are systems that aid users with the task of summarization -LRB- MAHS = Machine Aided Human Summarization -RRB- , for example by highlighting candidate passages to be included in the summary , and there are systems that depend on post-processing by a human -LRB- HAMS = Human Aided Machine Summarization -RRB- .
Evaluation An ongoing issue in this field is that of evaluation .
Evaluation techniques fall into intrinsic and extrinsic, and into inter-textual and intra-textual. An intrinsic evaluation tests the summarization system in and of itself, while an extrinsic evaluation tests the summarization based on how it affects the completion of some other task.
Intrinsic evaluations have assessed mainly the coherence and informativeness of summaries .
Extrinsic evaluations , on the other hand , have tested the impact of summarization on tasks like relevance assessment , reading comprehension , etc. .
Intra-textual methods assess the output of a specific summarization system, while inter-textual ones focus on contrastive analysis of the outputs of several summarization systems.
Human judgement often has wide variance on what is considered a `` good '' summary , which means that making the evaluation process automatic is particularly difficult .
Manual evaluation can be used , but this is both time and labor intensive as it requires humans to read not only the summaries but also the source documents .
Other issues are those concerning coherence and coverage .
One of the metrics used in NIST 's annual Document Understanding Conferences , in which research groups submit their systems for both summarization and translation tasks , is the ROUGE metric -LRB- Recall-Oriented Understudy for Gisting Evaluation -RRB- .
It essentially calculates n-gram overlaps between automatically generated summaries and previously-written human summaries .
A high level of overlap should indicate a high level of shared concepts between the two summaries .
Note that overlap metrics like this are unable to provide any feedback on a summary 's coherence .
Anaphor resolution remains another problem yet to be fully solved .
Evaluating summaries , either manually or automatically , is a hard task .
The main difficulty in evaluation comes from the impossibility of building a fair gold-standard against which the results of the systems can be compared .
Furthermore, it is also very hard to determine what a correct summary is, because a system may generate a good summary that is quite different from any human summary used as an approximation to the correct output.
Current difficulties in evaluating summaries automatically The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries .
However, as content selection is not a deterministic problem, different people would choose different sentences, and even the same person may choose different sentences at different times, showing evidence of low agreement among humans as to which sentences are good summary sentences. Besides this human variability, semantic equivalence is another problem, because two distinct sentences can express the same meaning without using the same words.
This phenomenon is known as paraphrase .
We can find an approach to automatically evaluating summaries using paraphrases -LRB- ParaEval -RRB- .
Moreover , most summarization systems perform an extractive approach , selecting and copying important sentences from the source documents .
Although humans can also cut and paste relevant information of a text , most of the times they rephrase sentences when necessary , or they join different related information into one sentence .
Evaluating summaries qualitatively The main drawback of the evaluation systems existing so far is that we need at least one reference summary , and for some methods more than one , to be able to compare automatic summaries with models .
This is a hard and expensive task .
Considerable effort is required to build corpora of texts and their corresponding summaries.
Furthermore , for some methods presented in the previous Section , not only do we need to have human-made summaries available for comparison , but also manual annotation has to be performed in some of them -LRB- e.g. SCU in the Pyramid Method -RRB- .
In any case, what the evaluation methods need as input is a set of summaries to serve as gold standards and a set of automatic summaries.
Moreover , they all perform a quantitative evaluation with regard to different similarity metrics .
To overcome these problems, we think that quantitative evaluation might not be the only way to evaluate summaries and that an automatic qualitative evaluation would also be important. Therefore, the second aim of this work is to suggest a novel proposal for evaluating the quality of a summary automatically in a qualitative rather than a quantitative manner. Our evaluation approach is a preliminary one that has to be studied more deeply and developed further. Its main underlying idea is to define several quality criteria and check how a generated summary tackles each of them, so that a reference model would no longer be necessary, taking into consideration only the automatic summary and the original source. Once performed, such an evaluation could be used together with any other automatic methodology to measure a summary's informativeness.
Natural Language Generation -LRB- NLG -RRB- is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form .
Psycholinguists prefer the term language production when such formal representations are interpreted as models for mental representations .
In a sense , one can say that an NLG system is like a translator that converts a computer based representation into a natural language representation .
However , the methods to produce the final language are very different from those of a compiler due to the inherent expressivity of natural languages .
NLG may be viewed as the opposite of natural language understanding .
The difference can be put this way : whereas in natural language understanding the system needs to disambiguate the input sentence to produce the machine representation language , in NLG the system needs to make decisions about how to put a concept into words .
The simplest -LRB- and perhaps trivial -RRB- examples are systems that generate form letters .
Such systems do not typically involve grammar rules , but may generate a letter to a consumer , e.g. stating that a credit card spending limit is about to be reached .
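A minimal sketch of such a form-letter generator; the wording, field names, and amounts are invented for illustration, and no grammar rules are involved.

```python
# A minimal sketch of a "form letter" generator: canned text with slots
# filled from data. Names, amounts and wording are invented for illustration.
def credit_limit_letter(name, balance, limit):
    remaining = limit - balance
    return (f"Dear {name},\n\n"
            f"Your current balance is ${balance:,.2f}. You are within "
            f"${remaining:,.2f} of your ${limit:,.2f} spending limit.\n\n"
            "Sincerely,\nYour Card Provider")

print(credit_limit_letter("A. Customer", balance=4820.00, limit=5000.00))
```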
More complex NLG systems dynamically create texts to meet a communicative goal .
As in other areas of natural language processing , this can be done using either explicit models of language -LRB- e.g. , grammars -RRB- and the domain , or using statistical models derived by analyzing human-written texts .
NLG is a fast-evolving field .
The best single source for up-to-date research in the area is the SIGGEN portion of the ACL Anthology .
Perhaps the closest the field comes to a specialist textbook is Reiter and Dale -LRB- 2000 -RRB- , but this book does not describe developments in the field since 2000 .
Example As a simple example, consider a system that takes as input six numbers, which give predicted pollen levels in different parts of Scotland.
From these numbers , the system generates a short textual summary of pollen levels as its output .
For example, using the historical data for 1 July 2005, the software produces: "Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country. However, in Northern areas, pollen levels will be moderate with values of 4."
In contrast, the actual forecast (written by a human meteorologist) from this data was: "Pollen counts are expected to remain high at level 6 over most of Scotland, and even level 7 in the south east. The only relief is in the Northern Isles and far northeast of mainland Scotland with medium levels of pollen count."
Comparing these two illustrates some of the choices that NLG systems must make ; these are further discussed below .
Stages The process to generate text can be as simple as keeping a list of canned text that is copied and pasted , possibly linked with some glue text .
The results may be satisfactory in simple domains such as horoscope machines or generators of personalised business letters .
However , a sophisticated NLG system needs to include stages of planning and merging of information to enable the generation of text that looks natural and does not become repetitive .
Typical stages are : Content determination : Deciding what information to mention in the text .
For instance , in the pollen example above , deciding whether to explicitly mention that pollen level is 7 in the south east .
Document structuring : Overall organization of the information to convey .
For example , deciding to describe the areas with high pollen levels first , instead of the areas with low pollen levels .
Aggregation : Merging of similar sentences to improve readability and naturalness .
For instance , merging the two sentences Grass pollen levels for Friday have increased from the moderate to high levels of yesterday and Grass pollen levels will be around 6 to 7 across most parts of the country into the single sentence Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country .
Lexical choice : Putting words to the concepts .
For example , deciding whether medium or moderate should be used when describing a pollen level of 4 .
Referring expression generation : Creating referring expressions that identify objects and regions .
For example , deciding to use in the Northern Isles and far northeast of mainland Scotland to refer to a certain region in Scotland .
This task also includes making decisions about pronouns and other types of anaphora .
Realisation : Creating the actual text , which should be correct according to the rules of syntax , morphology , and orthography .
For example , using will be for the future tense of to be .
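As a rough illustration of how the later stages above (document structuring, lexical choice, and realisation) might look for the pollen example, here is a template-based sketch; the thresholds and wording are assumptions, not the rules of any actual forecast generator.

```python
# A template-based sketch of the later NLG stages for the pollen example:
# document structuring (high-pollen regions first), lexical choice
# ("high" vs "moderate" vs "low"), and a simple realisation template.
def lexical_choice(level):
    if level >= 6:
        return "high"
    if level >= 4:
        return "moderate"
    return "low"

def realise(region, level):
    return (f"In {region}, grass pollen levels will be "
            f"{lexical_choice(level)} with values of around {level}.")

data = {"most parts of the country": 7, "Northern areas": 4}
for region, level in sorted(data.items(), key=lambda item: -item[1]):
    print(realise(region, level))
```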
Applications The popular media has been especially interested in NLG systems which generate jokes -LRB- see computational humor -RRB- .
But from a commercial perspective , the most successful NLG applications have been data-to-text systems which generate textual summaries of databases and data sets ; these systems usually perform data analysis as well as text generation .
In particular , several systems have been built that produce textual weather forecasts from weather data .
The earliest such system to be deployed was FoG , which was used by Environment Canada to generate weather forecasts in French and English in the early 1990s .
The success of FoG triggered other work , both research and commercial .
Recent research in this area includes an experiment which showed that users sometimes preferred computer-generated weather forecasts to human-written ones, in part because the computer forecasts used more consistent terminology, and a demonstration that statistical techniques could be used to generate high-quality weather forecasts.
Recent applications include the ARNS system used to summarise conditions in US ports .
In the 1990s there was considerable interest in using NLG to summarise financial and business data .
For example the SPOTLIGHT system developed at A.C. Nielsen automatically generated readable English text based on the analysis of large amounts of retail sales data .
More recently there is growing interest in using NLG to summarise electronic medical records .
Commercial applications in this area are starting to appear , and researchers have shown that NLG summaries of medical data can be effective decision-support aids for medical professionals .
There is also growing interest in using NLG to enhance accessibility , for example by describing graphs and data sets to blind people .
An example for a highly interactive use of NLG is the WYSIWYM framework .
It stands for What you see is what you meant and allows users to see and manipulate the continuously rendered view -LRB- NLG output -RRB- of an underlying formal language document -LRB- NLG input -RRB- , thereby editing the formal language without having to learn it .
Evaluation As in other scientific fields , NLG researchers need to be able to test how well their systems , modules , and algorithms work .
This is called evaluation .
There are three basic techniques for evaluating NLG systems : task-based -LRB- extrinsic -RRB- evaluation : give the generated text to a person , and assess how well it helps him perform a task -LRB- or otherwise achieves its communicative goal -RRB- .
For example, a system which generates summaries of medical data can be evaluated by giving these summaries to doctors and assessing whether the summaries help doctors make better decisions.
human ratings : give the generated text to a person , and ask him or her to rate the quality and usefulness of the text .
metrics : compare generated texts to texts written by people from the same input data , using an automatic metric such as BLEU .
Generally speaking , what we ultimately want to know is how useful NLG systems are at helping people , which is the first of the above techniques .
However , task-based evaluations are time-consuming and expensive , and can be difficult to carry out -LRB- especially if they require subjects with specialised expertise , such as doctors -RRB- .
Hence -LRB- as in other areas of NLP -RRB- task-based evaluations are the exception , not the norm .
In recent years researchers have started trying to assess how well human-ratings and metrics correlate with -LRB- predict -RRB- task-based evaluations .
Much of this work is being conducted in the context of Generation Challenges shared-task events .
Initial results suggest that human ratings are much better than metrics in this regard .
In other words , human ratings usually do predict task-effectiveness at least to some degree -LRB- although there are exceptions -RRB- , while ratings produced by metrics often do not predict task-effectiveness well .
These results are very preliminary; better data will hopefully be available soon. In any case, human ratings are currently the most popular evaluation technique in NLG; this is in contrast to machine translation, where metrics are very widely used.
Natural language understanding is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension .
The process of disassembling and parsing input is more complex than the reverse process of assembling output in natural language generation because of the occurrence of unknown and unexpected features in the input and the need to determine the appropriate syntactic and semantic schemes to apply to it , factors which are pre-determined when outputting language .
There is considerable commercial interest in the field because of its application to news-gathering , text categorization , voice-activation , archiving and large-scale content-analysis .
Eight years after John McCarthy coined the term artificial intelligence , Bobrow 's dissertation -LRB- titled Natural Language Input for a Computer Problem Solving System -RRB- showed how a computer can understand simple natural language input to solve algebra word problems .
A year later , in 1965 , Joseph Weizenbaum at MIT wrote ELIZA , an interactive program that carried on a dialogue in English on any topic , the most popular being psychotherapy .
ELIZA worked by simple parsing and substitution of key words into canned phrases and Weizenbaum sidestepped the problem of giving the program a database of real-world knowledge or a rich lexicon .
Yet ELIZA gained surprising popularity as a toy project and can be seen as a very early precursor to current commercial systems such as those used by Ask.com .
In 1969 Roger Schank at Stanford University introduced the conceptual dependency theory for natural language understanding .
This model , partially influenced by the work of Sydney Lamb , was extensively used by Schank 's students at Yale University , such as Robert Wilensky , Wendy Lehnert , and Janet Kolodner .
In 1970 , William A. Woods introduced the augmented transition network -LRB- ATN -RRB- to represent natural language input .
Instead of phrase structure rules ATNs used an equivalent set of finite state automata that were called recursively .
ATNs and their more general format called `` generalized ATNs '' continued to be used for a number of years .
In 1971 Terry Winograd finished writing SHRDLU for his PhD thesis at MIT .
SHRDLU could understand simple English sentences in a restricted world of children 's blocks to direct a robotic arm to move items .
The successful demonstration of SHRDLU provided significant momentum for continued research in the field .
Winograd continued to be a major influence in the field with the publication of his book Language as a Cognitive Process .
At Stanford , Winograd was later the adviser for Larry Page , who co-founded Google .
In the 1970s and 1980s the natural language processing group at SRI International continued research and development in the field .
A number of commercial efforts based on the research were undertaken , e.g. , in 1982 Gary Hendrix formed Symantec Corporation originally as a company for developing a natural language interface for database queries on personal computers .
However, with the advent of mouse-driven graphical user interfaces, Symantec changed direction.
A number of other commercial efforts were started around the same time , e.g. , Larry R. Harris at the Artificial Intelligence Corporation and Roger Schank and his students at Cognitive Systems corp. .
In 1983 , Michael Dyer developed the BORIS system at Yale which bore similarities to the work of Roger Schank and W. G. Lehnart .
Scope and context The umbrella term `` natural language understanding '' can be applied to a diverse set of computer applications , ranging from small , relatively simple tasks such as short commands issued to robots , to highly complex endeavors such as the full comprehension of newspaper articles or poetry passages .
Many real-world applications fall between the two extremes; for instance, text classification for the automatic analysis of emails and their routing to a suitable department in a corporation does not require in-depth understanding of the text, but is far more complex than the management of simple queries to database tables with fixed schemata.
Throughout the years various attempts at processing natural language or English-like sentences presented to computers have taken place at varying degrees of complexity .
Some attempts have not resulted in systems with deep understanding , but have helped overall system usability .
For example , Wayne Ratliff originally developed the Vulcan program with an English-like syntax to mimic the English speaking computer in Star Trek .
Vulcan later became the dBase system whose easy-to-use syntax effectively launched the personal computer database industry .
Systems with an easy to use or English like syntax are , however , quite distinct from systems that use a rich lexicon and include an internal representation -LRB- often as first order logic -RRB- of the semantics of natural language sentences .
Hence the breadth and depth of `` understanding '' aimed at by a system determine both the complexity of the system -LRB- and the implied challenges -RRB- and the types of applications it can deal with .
The `` breadth '' of a system is measured by the sizes of its vocabulary and grammar .
The `` depth '' is measured by the degree to which its understanding approximates that of a fluent native speaker .
At the narrowest and shallowest , English-like command interpreters require minimal complexity , but have a small range of applications .
Narrow but deep systems explore and model mechanisms of understanding , but they still have limited application .
Systems that attempt to understand the contents of a document such as a news release beyond simple keyword matching and to judge its suitability for a user are broader and require significant complexity , but they are still somewhat shallow .
Systems that are both very broad and very deep are beyond the current state of the art .
Components and architecture Regardless of the approach used , some common components can be identified in most natural language understanding systems .
The system needs a lexicon of the language and a parser and grammar rules to break sentences into an internal representation .
The construction of a rich lexicon with a suitable ontology requires significant effort , e.g. , the Wordnet lexicon required many person-years of effort .
The system also needs a semantic theory to guide the comprehension .
The interpretation capabilities of a language understanding system depend on the semantic theory it uses .
Competing semantic theories of language have specific trade offs in their suitability as the basis of computer automated semantic interpretation .
These range from naive semantics or stochastic semantic analysis to the use of pragmatics to derive meaning from context .
Advanced applications of natural language understanding also attempt to incorporate logical inference within their framework .
This is generally achieved by mapping the derived meaning into a set of assertions in predicate logic , then using logical deduction to arrive at conclusions .
Systems based on functional languages such as Lisp hence need to include a subsystem for the representation of logical assertions , while logic oriented systems such as those using the language Prolog generally rely on an extension of the built in logical representation framework .
The management of context in natural language understanding can present special challenges .
A large variety of examples and counter examples have resulted in multiple approaches to the formal modeling of context , each with specific strengths and weaknesses .
Optical character recognition , usually abbreviated to OCR , is the mechanical or electronic conversion of scanned images of handwritten , typewritten or printed text into machine-encoded text .
It is widely used as a form of data entry from some sort of original paper data source , whether documents , sales receipts , mail , or any number of printed records .
It is crucial to the computerization of printed texts so that they can be electronically searched , stored more compactly , displayed on-line , and used in machine processes such as machine translation , text-to-speech and text mining .
OCR is a field of research in pattern recognition , artificial intelligence and computer vision .
Early versions needed to be programmed with images of each character , and worked on one font at a time .
`` Intelligent '' systems with a high degree of recognition accuracy for most fonts are now common .
Some systems are capable of reproducing formatted output that closely approximates the original scanned page including images , columns and other non-textual components .
In 1914 , Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code .
Around the same time, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters.
Goldberg continued to develop OCR technology for data entry .
Later , he proposed photographing data records and then , using photocells , matching the photos against a template containing the desired identification pattern .
In 1929 Gustav Tauschek had similar ideas , and obtained a patent on OCR in Germany .
Paul W. Handel also obtained a US patent on such template-matching OCR technology in USA in 1933 -LRB- U.S. Patent 1,915,993 -RRB- .
In 1935 Tauschek was also granted a US patent on his method -LRB- U.S. Patent 2,026,329 -RRB- .
In 1949, RCA engineers worked on the first primitive computer-type OCR to help blind people for the US Veterans Administration; their device converted printed characters to machine language and then spoke the letters aloud, an early text-to-speech technology.
It proved far too expensive and was not pursued after testing .
In 1950 , David H. Shepard , a cryptanalyst at the Armed Forces Security Agency in the United States , addressed the problem of converting printed messages into machine language for computer processing and built a machine to do this , called `` Gismo . '' .
He received a patent for his `` Apparatus for Reading '' in 1953 U.S. Patent 2,663,758 .
`` Gismo '' could read 23 letters of the English alphabet , comprehend Morse Code , read musical notations , read aloud from printed pages , and duplicate typewritten pages .
Shepard went on to found Intelligent Machines Research Corporation -LRB- IMR -RRB- , which soon developed the world 's first commercial OCR systems .
In 1955 , the first commercial system was installed at the Reader 's Digest , which used OCR to input sales reports into a computer .
It converted the typewritten reports into punched cards for input into the computer in the magazine 's subscription department , for help in processing the shipment of 15-20 million books a year .
The second system was sold to the Standard Oil Company for reading credit card imprints for billing purposes .
Other systems sold by IMR during the late 1950s included a bill stub reader for the Ohio Bell Telephone Company and a page scanner for the United States Air Force that read typewritten messages and transmitted them by teletype.
IBM and others were later licensed on Shepard 's OCR patents .
In about 1965 , Reader 's Digest and RCA collaborated to build an OCR Document reader designed to digitize the serial numbers on Reader 's Digest coupons returned from advertisements .
The fonts used on the documents were printed by an RCA Drum printer using the OCR-A font .
The reader was connected directly to an RCA 301 computer -LRB- one of the first solid state computers -RRB- .
This reader was followed by a specialised document reader installed at TWA where the reader processed Airline Ticket stock .
The readers processed documents at a rate of 1,500 documents per minute, and checked each document, rejecting those they were not able to process correctly.
The product became part of the RCA product line as a reader designed to process `` Turn around Documents '' such as those utility and insurance bills returned with payments .
The United States Postal Service has been using OCR machines to sort mail since 1965 based on technology devised primarily by the prolific inventor Jacob Rabinow .
The first use of OCR in Europe was by the British General Post Office -LRB- GPO -RRB- .
In 1965 it began planning an entire banking system , the National Giro , using OCR technology , a process that revolutionized bill payment systems in the UK .
Canada Post has been using OCR systems since 1971.
OCR systems read the name and address of the addressee at the first mechanized sorting center , and print a routing bar code on the envelope based on the postal code .
To avoid confusion with the human-readable address field which can be located anywhere on the letter , special ink -LRB- orange in visible light -RRB- is used that is clearly visible under ultraviolet light .
Envelopes may then be processed with equipment based on simple bar code readers .
Importance of OCR to the Blind In 1974 Ray Kurzweil started the company Kurzweil Computer Products , Inc. and continued development of omni-font OCR , which could recognize text printed in virtually any font .
He decided that the best application of this technology would be to create a reading machine for the blind , which would allow blind people to have a computer read text to them out loud .
This device required the invention of two enabling technologies -- the CCD flatbed scanner and the text-to-speech synthesizer .
On January 13, 1976, the successful finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind.
In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program .
LexisNexis was one of the first customers , and bought the program to upload paper legal and news documents onto its nascent online databases .
Two years later , Kurzweil sold his company to Xerox , which had an interest in further commercializing paper-to-computer text conversion .
Xerox eventually spun it off as Scansoft, which merged with Nuance Communications.
OCR software Desktop & Server OCR Software OCR software and ICR software technology are analytical artificial intelligence systems that consider sequences of characters rather than whole words or phrases .
Based on the analysis of sequential lines and curves , OCR and ICR make ` best guesses ' at characters using database look-up tables to closely associate or match the strings of characters that form words .
WebOCR & OnlineOCR With the development of information technology, the platforms on which people use software have expanded from the single PC to multi-platform environments spanning the PC, the Web, cloud computing, and mobile devices. After some 30 years of development, OCR software has begun to adapt to these new application requirements. WebOCR, also known as OnlineOCR or Web-based OCR service, has emerged as a new trend to serve larger volumes and larger groups of users than the desktop OCR of the previous three decades. Internet and broadband technologies have made WebOCR and OnlineOCR practically available to both individual users and enterprise customers. Since 2000, some major OCR vendors have begun offering WebOCR and online software, and a number of new entrants have seized the opportunity to develop innovative Web-based OCR services, some of which are free of charge.
Application-Oriented OCR As OCR technology has been applied more and more widely in paper-intensive industries, it faces increasingly complex imaging conditions in the real world: complicated backgrounds, degraded images, heavy noise, paper skew, picture distortion, low resolution, interference from grids and lines, and text images containing special fonts, symbols, or glossary words. All of these factors affect the stability of OCR products' recognition accuracy.
In recent years , the major OCR technology providers began to develop dedicated OCR systems , each for special types of images .
They combine various optimization methods related to the special image , such as business rules , standard expression , glossary or dictionary and rich information contained in color images , to improve the recognition accuracy .
This strategy of customizing OCR technology is called "Application-Oriented OCR" or "Customized OCR", and it is widely used in fields such as business-card OCR, invoice OCR, screenshot OCR, ID-card OCR, driver's-license OCR, auto plant OCR, and so on.
See also: List of optical character recognition software
Current state of OCR technology Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission to foster the improvement of automated technologies for understanding machine-printed documents, and it conducted the most authoritative of the Annual Tests of OCR Accuracy for five consecutive years in the mid-1990s.
Recognition of Latin-script , typewritten text is still not 100 % accurate even where clear imaging is available .
One study based on recognition of 19th - and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 71 % to 98 % ; total accuracy can be achieved only by human review .
Other areas -- including recognition of hand printing , cursive handwriting , and printed text in other scripts -LRB- especially those East Asian language characters which have many strokes for a single character -RRB- -- are still the subject of active research .
Accuracy rates can be measured in several ways , and how they are measured can greatly affect the reported accuracy rate .
For example , if word context -LRB- basically a lexicon of words -RRB- is not used to correct software finding non-existent words , a character error rate of 1 % -LRB- 99 % accuracy -RRB- may result in an error rate of 5 % -LRB- 95 % accuracy -RRB- or worse if the measurement is based on whether each whole word was recognized with no incorrect letters .
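A rough back-of-the-envelope illustration of this effect, under the simplifying assumption that character errors are independent: a word is counted as correct only if every character in it is correct, so a 1 % character error rate already implies roughly a 5 % error rate on five-letter words.

```python
# Back-of-the-envelope: if character errors were independent, a word is read
# correctly only when every one of its characters is, so a 1 % character
# error rate implies a much larger word error rate.
char_accuracy = 0.99
for word_length in (5, 7, 10):
    word_accuracy = char_accuracy ** word_length
    print(f"{word_length}-letter words: ~{word_accuracy:.1%} word accuracy "
          f"({1 - word_accuracy:.1%} word error rate)")
```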
On-line character recognition is sometimes confused with Optical Character Recognition -LRB- see Handwriting recognition -RRB- .
OCR is an instance of off-line character recognition , where the system recognizes the fixed static shape of the character , while on-line character recognition instead recognizes the dynamic motion during handwriting .
For example , on-line recognition , such as that used for gestures in the Penpoint OS or the Tablet PC can tell whether a horizontal mark was drawn right-to-left , or left-to-right .
On-line character recognition is also referred to by other terms such as dynamic character recognition , real-time character recognition , and Intelligent Character Recognition or ICR .
On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years -LRB- see Tablet PC history -RRB- .
Among these are the input devices for personal digital assistants such as those running Palm OS .
The Apple Newton pioneered this product .
The algorithms used in these devices take advantage of the fact that the order , speed , and direction of individual line segments at input are known .
Also , the user can be retrained to use only specific letter shapes .
These methods can not be used in software that scans paper documents , so accurate recognition of hand-printed documents is still largely an open problem .
Accuracy rates of 80 % to 90 % on neat , clean hand-printed characters can be achieved , but that accuracy rate still translates to dozens of errors per page , making the technology useful only in very limited applications .
Recognition of cursive text is an active area of research , with recognition rates even lower than that of hand-printed text .
Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information .
For example , recognizing entire words from a dictionary is easier than trying to parse individual characters from script .
Reading the Amount line of a cheque -LRB- which is always a written-out number -RRB- is an example where using a smaller dictionary can increase recognition rates greatly .
Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun , for example , allowing greater accuracy .
The shapes of individual cursive characters themselves simply do not contain enough information to accurately -LRB- greater than 98 % -RRB- recognize all handwritten cursive script .
It is necessary to understand that OCR technology is a basic technology also used in advanced scanning applications .
Due to this , an advanced scanning solution can be unique and patented and not easily copied despite being based on this basic OCR technology .
For more complex recognition problems , intelligent character recognition systems are generally used , as artificial neural networks can be made indifferent to both affine and non-linear transformations .
A technique which has had considerable success in recognizing difficult words and character groups within documents generally amenable to computer OCR is to submit them automatically to humans in the reCAPTCHA system .
In corpus linguistics , part-of-speech tagging -LRB- POS tagging or POST -RRB- , also called grammatical tagging or word-category disambiguation , is the process of marking up a word in a text -LRB- corpus -RRB- as corresponding to a particular part of speech , based on both its definition , as well as its context -- i.e. relationship with adjacent and related words in a phrase , sentence , or paragraph .
A simplified form of this is commonly taught to school-age children , in the identification of words as nouns , verbs , adjectives , adverbs , etc. .
Once performed by hand , POS tagging is now done in the context of computational linguistics , using algorithms which associate discrete terms , as well as hidden parts of speech , in accordance with a set of descriptive tags .
POS-tagging algorithms fall into two distinct groups : rule-based and stochastic .
E. Brill 's tagger , one of the first and most widely used English POS-taggers , employs rule-based algorithms .
Part-of-speech ambiguity is not rare : in natural languages -LRB- as opposed to many artificial languages -RRB- , a large percentage of word-forms are ambiguous .
For example , even `` dogs '' , which is usually thought of as just a plural noun , can also be a verb : The sailor dogs the barmaid .
Performing grammatical tagging will indicate that `` dogs '' is a verb , and not the more common plural noun , since one of the words must be the main verb , and the noun reading is less likely following `` sailor '' -LRB- sailor !→ dogs -RRB- .
Semantic analysis can then extrapolate that `` sailor '' and `` barmaid '' implicate `` dogs '' as 1 -RRB- in the nautical context -LRB- sailor → <verb> ← barmaid -RRB- and 2 -RRB- an action applied to the object `` barmaid '' -LRB- -LRB- subject -RRB- dogs → barmaid -RRB- .
In this context , `` dogs '' is a nautical term meaning `` fastens -LRB- a watertight door -RRB- securely ; applies a dog to '' .
`` Dogged '' , on the other hand , can be either an adjective or a past-tense verb .
Just which parts of speech a word can represent varies greatly .
Trained linguists can identify the grammatical parts of speech to various fine degrees depending on the tagging system .
Schools commonly teach that there are 9 parts of speech in English : noun , verb , article , adjective , preposition , pronoun , adverb , conjunction , and interjection .
However , there are clearly many more categories and sub-categories .
For nouns , plural , possessive , and singular forms can be distinguished .
In many languages words are also marked for their `` case '' -LRB- role as subject , object , etc. -RRB- , grammatical gender , and so on ; while verbs are marked for tense , aspect , and other things .
In part-of-speech tagging by computer , it is typical to distinguish from 50 to 150 separate parts of speech for English , for example , NN for singular common nouns , NNS for plural common nouns , NP for singular proper nouns -LRB- see the POS tags used in the Brown Corpus -RRB- .
Work on stochastic methods for tagging Koine Greek -LRB- DeRose 1990 -RRB- has used over 1,000 parts of speech , and found that about as many words were ambiguous there as in English .
A morphosyntactic descriptor in the case of morphologically rich languages can be expressed like Ncmsan , which means Category = Noun , Type = common , Gender = masculine , Number = singular , Case = accusative , Animate = no .
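A minimal sketch of how such a positional descriptor can be decoded into named features ; the position and value tables here are invented for illustration and are not taken from any particular tagset specification .
    # Decode a positional morphosyntactic descriptor such as "Ncmsan".
    NOUN_POSITIONS = ["Type", "Gender", "Number", "Case", "Animate"]
    VALUES = {
        "Type": {"c": "common", "p": "proper"},
        "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
        "Number": {"s": "singular", "p": "plural"},
        "Case": {"n": "nominative", "a": "accusative", "d": "dative", "g": "genitive"},
        "Animate": {"y": "yes", "n": "no"},
    }

    def decode_msd(tag):
        assert tag[0] == "N", "this sketch only handles noun descriptors"
        features = {"Category": "Noun"}
        for name, code in zip(NOUN_POSITIONS, tag[1:]):
            features[name] = VALUES[name].get(code, code)
        return features

    print(decode_msd("Ncmsan"))
    # {'Category': 'Noun', 'Type': 'common', 'Gender': 'masculine',
    #  'Number': 'singular', 'Case': 'accusative', 'Animate': 'no'}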
History The Brown Corpus Research on part-of-speech tagging has been closely tied to corpus linguistics .
The first major corpus of English for computer analysis was the Brown Corpus developed at Brown University by Henry Kucera and Nelson Francis , in the mid-1960s .
It consists of about 1,000,000 words of running English prose text , made up of 500 samples from randomly chosen publications .
Each sample is 2,000 or more words -LRB- ending at the first sentence-end after 2,000 words , so that the corpus contains only complete sentences -RRB- .
The Brown Corpus was painstakingly `` tagged '' with part-of-speech markers over many years .
A first approximation was done with a program by Greene and Rubin , which consisted of a huge handmade list of what categories could co-occur at all .
For example , article then noun can occur , but article verb -LRB- arguably -RRB- can not .
The program got about 70 % correct .
Its results were repeatedly reviewed and corrected by hand , and later users sent in errata , so that by the late 70s the tagging was nearly perfect -LRB- allowing for some cases on which even human speakers might not agree -RRB- .
This corpus has been used for innumerable studies of word-frequency and of part-of-speech , and inspired the development of similar `` tagged '' corpora in many other languages .
Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems , such as CLAWS -LRB- linguistics -RRB- and VOLSUNGA .
However , by this time -LRB- 2005 -RRB- it had been superseded by larger corpora such as the 100-million-word British National Corpus .
For some time , part-of-speech tagging was considered an inseparable part of natural language processing , because there are certain cases where the correct part of speech can not be decided without understanding the semantics or even the pragmatics of the context .
This is extremely expensive , especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word .
Use of Hidden Markov Models In the mid 1980s , researchers in Europe began to use hidden Markov models -LRB- HMMs -RRB- to disambiguate parts of speech , when working to tag the Lancaster-Oslo-Bergen Corpus of British English .
HMMs involve counting cases -LRB- such as from the Brown Corpus -RRB- , and making a table of the probabilities of certain sequences .
For example , once you 've seen an article such as ` the ' , perhaps the next word is a noun 40 % of the time , an adjective 40 % , and a number 20 % .
Knowing this , a program can decide that `` can '' in `` the can '' is far more likely to be a noun than a verb or a modal .
The same method can of course be used to benefit from knowledge about following words .
More advanced -LRB- `` higher order '' -RRB- HMMs learn the probabilities not only of pairs , but triples or even larger sequences .
So , for example , if you 've just seen an article and a verb , the next item may be very likely a preposition , article , or noun , but much less likely another verb .
When several ambiguous words occur together , the possibilities multiply .
However , it is easy to enumerate every combination and to assign a relative probability to each one , by multiplying together the probabilities of each choice in turn .
The combination with highest probability is then chosen .
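The enumerate-and-multiply idea can be shown with a toy sketch ; the bigram probability table and candidate tags below are invented for illustration and do not come from any real corpus .
    from itertools import product

    # Score every combination of candidate tags for "the can holds" by
    # multiplying bigram (previous-tag, next-tag) probabilities together.
    BIGRAM = {
        ("DET", "NOUN"): 0.40, ("DET", "ADJ"): 0.40, ("DET", "NUM"): 0.20,
        ("DET", "VERB"): 0.01, ("DET", "MODAL"): 0.01,
        ("NOUN", "VERB"): 0.35, ("NOUN", "NOUN"): 0.10,
        ("VERB", "NOUN"): 0.20, ("VERB", "VERB"): 0.02,
        ("MODAL", "VERB"): 0.60, ("MODAL", "NOUN"): 0.02,
    }
    candidates = [["DET"], ["NOUN", "VERB", "MODAL"], ["VERB", "NOUN"]]

    def score(tags):
        p = 1.0
        for prev, nxt in zip(tags, tags[1:]):
            p *= BIGRAM.get((prev, nxt), 0.001)  # small default for unseen pairs
        return p

    best = max(product(*candidates), key=score)
    print(best, score(best))  # ('DET', 'NOUN', 'VERB') wins with 0.40 * 0.35 = 0.14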
The European group developed CLAWS , a tagging program that did exactly this , and achieved accuracy in the 93-95 % range .
It is worth remembering , as Eugene Charniak points out in Statistical techniques for natural language parsing , that merely assigning the most common tag to each known word and the tag `` proper noun '' to all unknowns , will approach 90 % accuracy because many words are unambiguous .
CLAWS pioneered the field of HMM-based part of speech tagging , but was quite expensive since it enumerated all possibilities .
It sometimes had to resort to backup methods when there were simply too many -LRB- the Brown Corpus contains a case with 17 ambiguous words in a row , and there are words such as `` still '' that can represent as many as 7 distinct parts of speech -RRB- .
HMMs underlie the functioning of stochastic taggers and are used in various algorithms , one of the most widely used being the bi-directional inference algorithm .
Dynamic Programming methods In 1987 , Steven DeRose and Ken Church independently developed dynamic programming algorithms to solve the same problem in vastly less time .
Their methods were similar to the Viterbi algorithm known for some time in other fields .
DeRose used a table of pairs , while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus -LRB- actual measurement of triple probabilities would require a much larger corpus -RRB- .
Both methods achieved accuracy over 95 % .
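A compact Viterbi-style sketch of the dynamic-programming idea follows ; it is not DeRose 's or Church 's implementation , and the tiny transition and emission tables are invented for illustration .
    import math

    TAGS = ["DET", "NOUN", "VERB"]
    START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
    TRANS = {("DET", "NOUN"): 0.7, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
             ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2, ("NOUN", "DET"): 0.1,
             ("VERB", "DET"): 0.4, ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05}
    EMIT = {("DET", "the"): 0.9,
            ("NOUN", "can"): 0.002, ("VERB", "can"): 0.001,
            ("NOUN", "rusts"): 0.0005, ("VERB", "rusts"): 0.001}

    def viterbi(words):
        # best[t][tag] = (log-probability of best path ending in tag, back-pointer)
        best = [{t: (math.log(START[t] * EMIT.get((t, words[0]), 1e-8)), None)
                 for t in TAGS}]
        for w in words[1:]:
            row = {}
            for t in TAGS:
                cand = [(best[-1][p][0]
                         + math.log(TRANS.get((p, t), 1e-8))
                         + math.log(EMIT.get((t, w), 1e-8)), p) for p in TAGS]
                row[t] = max(cand)
            best.append(row)
        # trace back the highest-scoring path
        tag = max(best[-1], key=lambda t: best[-1][t][0])
        path = [tag]
        for row in reversed(best[1:]):
            tag = row[tag][1]
            path.append(tag)
        return list(reversed(path))

    print(viterbi(["the", "can", "rusts"]))  # ['DET', 'NOUN', 'VERB']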
DeRose 's 1990 dissertation at Brown University included analyses of the specific error types , probabilities , and other related data , and replicated his work for Greek , where it proved similarly effective .
These findings were surprisingly disruptive to the field of natural language processing .
The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis : syntax , morphology , semantics , and so on .
CLAWS , DeRose 's and Church 's methods did fail for some of the known cases where semantics is required , but those proved negligibly rare .
This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing ; this in turn simplified the theory and practice of computerized language analysis , and encouraged researchers to find ways to separate out other pieces as well .
Markov Models are now the standard method for part-of-speech assignment .
Unsupervised taggers The methods already discussed involve working from a pre-existing corpus to learn tag probabilities .
It is , however , also possible to bootstrap using `` unsupervised '' tagging .
Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction .
That is , they observe patterns in word use , and derive part-of-speech categories themselves .
For example , statistics readily reveal that `` the '' , `` a '' , and `` an '' occur in similar contexts , while `` eat '' occurs in very different ones .
With sufficient iteration , similarity classes of words emerge that are remarkably similar to those human linguists would expect ; and the differences themselves sometimes suggest valuable new insights .
Both of these categories -LRB- supervised and unsupervised tagging -RRB- can be further subdivided into rule-based , stochastic , and neural approaches .
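A rough sketch of the induction idea : represent each word by the words that appear next to it and compare those context vectors . The corpus and the single-neighbour representation below are invented simplifications ; a real system would iterate and refine the emerging classes .
    from collections import Counter, defaultdict
    import math

    corpus = ("the cat ate the fish . a dog ate an apple . "
              "the dog saw a cat . an owl ate the mouse .").split()

    contexts = defaultdict(Counter)
    for i, w in enumerate(corpus):
        if i + 1 < len(corpus):
            contexts[w][corpus[i + 1]] += 1  # right neighbour only, for brevity

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a)
        norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm(a) * norm(b) or 1.0)

    print(cosine(contexts["the"], contexts["a"]))    # high: similar contexts
    print(cosine(contexts["the"], contexts["ate"]))  # low: different contexts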
Other taggers and methods Some current major algorithms for part-of-speech tagging include the Viterbi algorithm , Brill Tagger , Constraint Grammar , and the Baum-Welch algorithm -LRB- also known as the forward-backward algorithm -RRB- .
Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm .
The Brill tagger is unusual in that it learns a set of patterns , and then applies those patterns rather than optimizing a statistical quantity .
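The application step of such a transformation-based tagger can be sketched as follows ; the lexicon and the single rule here are hand-written toy examples -LRB- a real Brill tagger learns its rules from a corpus -RRB- .
    # Start from each word's most frequent tag, then apply
    # "change tag X to Y in context C" patterns.
    LEXICON = {"the": "DET", "sailor": "NOUN", "dogs": "NOUN", "barmaid": "NOUN"}

    # (from_tag, to_tag, condition on the previous tag)
    RULES = [("NOUN", "VERB", lambda prev: prev == "NOUN")]

    def brill_tag(words):
        tags = [LEXICON.get(w, "NOUN") for w in words]  # initial guess
        for from_tag, to_tag, cond in RULES:
            for i in range(1, len(tags)):
                if tags[i] == from_tag and cond(tags[i - 1]):
                    tags[i] = to_tag
        return list(zip(words, tags))

    print(brill_tag("the sailor dogs the barmaid".split()))
    # [('the', 'DET'), ('sailor', 'NOUN'), ('dogs', 'VERB'),
    #  ('the', 'DET'), ('barmaid', 'NOUN')]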
Many machine learning methods have also been applied to the problem of POS tagging .
Methods such as SVM , Maximum entropy classifier , Perceptron , and Nearest-neighbor have all been tried , and most can achieve accuracy above 95 % .
A direct comparison of several methods is reported -LRB- with references -RRB- at .
This comparison uses the Penn tag set on some of the Penn Treebank data , so the results are directly comparable .
However , many significant taggers are not included -LRB- perhaps because of the labor involved in reconfiguring them for this particular dataset -RRB- .
Thus , it should not be assumed that the results reported there are the best that can be achieved with a given approach ; nor even the best that have been achieved with a given approach .
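For concreteness , a hedged sketch of the machine-learning formulation mentioned above , using a maximum-entropy-style classifier -LRB- logistic regression -RRB- over simple per-word features ; the tiny training set is invented and scikit-learn is assumed to be available .
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def features(sentence, i):
        return {"word": sentence[i].lower(),
                "suffix3": sentence[i][-3:],
                "prev": sentence[i - 1].lower() if i else "<s>"}

    train = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
             ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
    sent = [w for w, _ in train]
    X = [features(sent, i) for i in range(len(sent))]
    y = [t for _, t in train]

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)

    test = ["the", "bird", "sings"]
    # Predictions on unseen words are only as good as the toy data allows.
    print(model.predict([features(test, i) for i in range(len(test))]))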
Issues While there is broad agreement about basic categories , a number of edge cases make it difficult to settle on a single `` correct '' set of tags , even in a single language such as English .
For example , it is hard to say whether `` fire '' is functioning as an adjective or a noun in `` the big green fire truck '' .
A second important example is the use\/mention distinction , as in the following example , where `` blue '' is clearly not functioning as an adjective -LRB- the Brown Corpus tag set appends the suffix `` - NC '' in such cases -RRB- : the word `` blue '' has 4 letters .
Words in a language other than that of the `` main '' text , are commonly tagged as `` foreign '' , usually in addition to a tag for the role the foreign word is actually playing in context .
There are also many cases where POS categories and `` words '' do not map one to one , for example : `` David 's '' , `` gonna '' , `` do n't '' , `` vice versa '' , `` first-cut '' , `` can not '' , `` pre - and post-secondary '' , and `` look -LRB- a word -RRB- up '' .
In the last example , `` look '' and `` up '' arguably function as a single verbal unit , despite the possibility of other words coming between them .
Some tag sets -LRB- such as Penn -RRB- break hyphenated words , contractions , and possessives into separate tokens , thus avoiding some but far from all such problems .
It is unclear whether it is best to treat words such as `` be '' , `` have '' , and `` do '' as categories in their own right -LRB- as in the Brown Corpus -RRB- , or as simply verbs -LRB- as in the LOB Corpus and the Penn Treebank -RRB- .
`` be '' has more forms than other English verbs , and occurs in quite different grammatical contexts , complicating the issue .
The most popular `` tag set '' for POS tagging for American English is probably the Penn tag set , developed in the Penn Treebank project .
It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets , though much smaller .
In Europe , tag sets from the Eagles Guidelines see wide use , and include versions for multiple languages .
POS tagging work has been done in a variety of languages , and the set of POS tags used varies greatly with language .
Tags usually are designed to include overt morphological distinctions -LRB- this makes the tag sets for heavily inflected languages such as Greek and Latin very large , and makes tagging words in agglutinative languages such as Inuit virtually impossible -RRB- .
However , Petrov , D. Das , and R. McDonald -LRB- `` A Universal Part-of-Speech Tagset '' http:\/\/arxiv.org\/abs\/1104.2086 -RRB- have proposed a `` universal '' tag set , with 12 categories -LRB- for example , no subtypes of nouns , verbs , punctuation , etc. ; no distinction of `` to '' as an infinitive marker vs. preposition , etc. -RRB- .
Whether a very small set of very broad tags , or a much larger set of more precise ones , is preferable , depends on the purpose at hand .
Automatic tagging is easier on smaller tag-sets .
A different issue is that some cases are in fact ambiguous .
Beatrice Santorini gives examples in `` Part-of-speech Tagging Guidelines for the Penn Treebank Project , '' -LRB- 3rd rev , June 1990 -RRB- , including the following -LRB- p. 32 -RRB- case in which entertaining can function either as an adjective or a verb , and there is no evident way to decide : The Duchess was entertaining last night .
In computer science and linguistics , parsing , or , more formally , syntactic analysis , is the process of analyzing a text , made of a sequence of tokens -LRB- for example , words -RRB- , to determine its grammatical structure with respect to a given -LRB- more or less -RRB- formal grammar .
Parsing can also be used as a linguistic term , for instance when discussing how phrases are divided up in garden path sentences .
Parsing is also an earlier term for the diagramming of sentences of natural languages , and is still used for the diagramming of inflected languages , such as the Romance languages or Latin .
The term parsing comes from Latin pars -LRB- ōrātiōnis -RRB- , meaning part -LRB- of speech -RRB- .
Parsing is a common term used in psycholinguistics when describing language comprehension .
In this context , parsing refers to the way that human beings , rather than computers , analyze a sentence or phrase -LRB- in spoken language or text -RRB- `` in terms of grammatical constituents , identifying the parts of speech , syntactic relations , etc. '' This term is especially common when discussing what linguistic cues help speakers to parse garden-path sentences .
The parser often uses a separate lexical analyser to create tokens from the sequence of input characters .
Parsers may be programmed by hand or may be -LRB- semi - -RRB- automatically generated -LRB- in some programming languages -RRB- by a tool .
Human languages See also : Category : Natural language parsing In some machine translation and natural language processing systems , human languages are parsed by computer programs .
Human sentences are not easily parsed by programs , as there is substantial ambiguity in the structure of human language , whose usage is to convey meaning -LRB- or semantics -RRB- among a potentially unlimited range of possibilities , only some of which are germane to the particular case .
So an utterance `` Man bites dog '' versus `` Dog bites man '' is definite on one detail but in another language might appear as `` Man dog bites '' with a reliance on the larger context to distinguish between those two possibilities , if indeed that difference was of concern .
It is difficult to prepare formal rules to describe informal behavior even though it is clear that some rules are being followed .
In order to parse natural language data , researchers must first agree on the grammar to be used .
The choice of syntax is affected by both linguistic and computational concerns ; for instance some parsing systems use lexical functional grammar , but in general , parsing for grammars of this type is known to be NP-complete .
Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community , but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank .
Shallow parsing aims to find only the boundaries of major constituents such as noun phrases .
Another popular strategy for avoiding linguistic controversy is dependency grammar parsing .
Most modern parsers are at least partly statistical ; that is , they rely on a corpus of training data which has already been annotated -LRB- parsed by hand -RRB- .
This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts .
-LRB- See machine learning . -RRB-
Approaches which have been used include straightforward PCFGs -LRB- probabilistic context-free grammars -RRB- , maximum entropy , and neural nets .
Most of the more successful systems use lexical statistics -LRB- that is , they consider the identities of the words involved , as well as their part of speech -RRB- .
However such systems are vulnerable to overfitting and require some kind of smoothing to be effective .
Parsing algorithms for natural language can not rely on the grammar having ` nice ' properties as with manually designed grammars for programming languages .
As mentioned earlier some grammar formalisms are very difficult to parse computationally ; in general , even if the desired structure is not context-free , some kind of context-free approximation to the grammar is used to perform a first pass .
Algorithms which use context-free grammars often rely on some variant of the CKY algorithm , usually with some heuristic to prune away unlikely analyses to save time .
-LRB- See chart parsing . -RRB-
However some systems trade speed for accuracy using , e.g. , linear-time versions of the shift-reduce algorithm .
A somewhat recent development has been parse reranking in which the parser proposes some large number of analyses , and a more complex system selects the best option .
Programming languages The most common use of a parser is as a component of a compiler or interpreter .
This parses the source code of a computer programming language to create some form of internal representation .
Programming languages tend to be specified in terms of a context-free grammar because fast and efficient parsers can be written for them .
Parsers are written by hand or generated by parser generators .
Context-free grammars are limited in the extent to which they can express all of the requirements of a language .
Informally , the reason is that the memory of such a language is limited .
The grammar can not remember the presence of a construct over an arbitrarily long input ; this is necessary for a language in which , for example , a name must be declared before it may be referenced .
More powerful grammars that can express this constraint , however , can not be parsed efficiently .
Thus , it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs -LRB- that is , it accepts some invalid constructs -RRB- ; later , the unwanted constructs can be filtered out .
Overview of process Flow of data in a typical parser The following example demonstrates the common case of parsing a computer language with two levels of grammar : lexical and syntactic .
The first stage is the token generation , or lexical analysis , by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions .
For example , a calculator program would look at an input such as `` 12 \* -LRB- 3 +4 -RRB- ^ 2 '' and split it into the tokens 12 , \* , -LRB- , 3 , + , 4 , -RRB- , ^ , 2 , each of which is a meaningful symbol in the context of an arithmetic expression .
The lexer would contain rules to tell it that the characters \* , + , ^ , -LRB- and -RRB- mark the start of a new token , so meaningless tokens like `` 12 \* '' or '' -LRB- 3 '' will not be generated .
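A minimal lexer sketch for this arithmetic example follows ; the regular expression and token representation are illustrative choices , and the input spacing is normalized for readability .
    import re

    # Split the character stream into number and operator tokens.
    TOKEN_RE = re.compile(r"\s*(?:(\d+)|([*+^()]))")

    def tokenize(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if not m:
                raise SyntaxError("unexpected character at position %d" % pos)
            number, operator = m.groups()
            tokens.append(int(number) if number else operator)
            pos = m.end()
        return tokens

    print(tokenize("12 * (3 + 4) ^ 2"))  # [12, '*', '(', 3, '+', 4, ')', '^', 2]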
The next stage is parsing or syntactic analysis , which is checking that the tokens form an allowable expression .
This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear .
However , not all rules defining programming languages can be expressed by context-free grammars alone , for example type validity and proper declaration of identifiers .
These rules can be formally expressed with attribute grammars .
The final phase is semantic parsing or analysis , which is working out the implications of the expression just validated and taking the appropriate action .
In the case of a calculator or interpreter , the action is to evaluate the expression or program ; a compiler , on the other hand , would generate some kind of code .
Attribute grammars can also be used to define these actions .
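The syntactic and semantic phases for the calculator example can be sketched together with a small recursive-descent parser ; the grammar -LRB- expr , term , factor , base -RRB- is an invented toy grammar , and each function mirrors one context-free rule while evaluating as it recognizes .
    def parse(tokens):
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def eat(expected=None):
            nonlocal pos
            tok = tokens[pos]
            if expected is not None and tok != expected:
                raise SyntaxError("expected %r, got %r" % (expected, tok))
            pos += 1
            return tok

        def expr():                  # expr -> term ('+' term)*
            value = term()
            while peek() == "+":
                eat("+")
                value += term()
            return value

        def term():                  # term -> factor ('*' factor)*
            value = factor()
            while peek() == "*":
                eat("*")
                value *= factor()
            return value

        def factor():                # factor -> base ('^' factor)?  right-assoc
            value = base()
            if peek() == "^":
                eat("^")
                value **= factor()
            return value

        def base():                  # base -> NUMBER | '(' expr ')'
            if peek() == "(":
                eat("(")
                value = expr()
                eat(")")
                return value
            return eat()             # a number token

        result = expr()
        if pos != len(tokens):
            raise SyntaxError("unexpected trailing tokens")
        return result

    print(parse([12, "*", "(", 3, "+", 4, ")", "^", 2]))  # 12 * 7 ** 2 = 588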
Types of parser The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar .
This can be done in essentially two ways : Top-down parsing - Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream by searching for parse trees using a top-down expansion of the given formal grammar rules .
Tokens are consumed from left to right .
Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand-sides of grammar rules .
Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol .
Intuitively , the parser attempts to locate the most basic elements , then the elements containing these , and so on .
LR parsers are examples of bottom-up parsers .
Another term used for this type of parser is Shift-Reduce parsing .
LL parsers and recursive-descent parsers are examples of top-down parsers which can not accommodate left-recursive productions .
Although it has been believed that simple implementations of top-down parsing can not accommodate direct and indirect left-recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars , more sophisticated algorithms for top-down parsing have been created by Frost , Hafiz , and Callaghan which accommodate ambiguity and left recursion in polynomial time and which generate polynomial-size representations of the potentially exponential number of parse trees .
Their algorithm is able to produce both left-most and right-most derivations of an input with regard to a given CFG -LRB- context-free grammar -RRB- .
An important distinction with regard to parsers is whether a parser generates a leftmost derivation or a rightmost derivation -LRB- see context-free grammar -RRB- .
LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation -LRB- although usually in reverse -RRB- .
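A tiny illustration of the leftmost\/rightmost distinction , using an invented toy grammar S -> A B , A -> a , B -> b .
    # Each list gives the sequence of sentential forms in the derivation.
    leftmost  = ["S", "A B", "a B", "a b"]   # always expand the leftmost nonterminal
    rightmost = ["S", "A B", "A b", "a b"]   # always expand the rightmost nonterminal
    print(" => ".join(leftmost))
    print(" => ".join(rightmost))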
In information retrieval and natural language processing -LRB- NLP -RRB- , question answering -LRB- QA -RRB- is the task of automatically answering a question posed in natural language .
To find the answer to a question , a QA computer program may use either a pre-structured database or a collection of natural language documents -LRB- a text corpus such as the World Wide Web or some local collection -RRB- .
QA research attempts to deal with a wide range of question types including : fact , list , definition , How , Why , hypothetical , semantically constrained , and cross-lingual questions .
Search collections vary from small local document collections , to internal organization documents , to compiled newswire reports , to the World Wide Web .
Closed-domain question answering deals with questions under a specific domain -LRB- for example , medicine or automotive maintenance -RRB- , and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies .
Alternatively , closed-domain might refer to a situation where only limited types of questions are accepted , such as questions asking for descriptive rather than procedural information .
Open-domain question answering deals with questions about nearly anything , and can only rely on general ontologies and world knowledge .
On the other hand , these systems usually have much more data available from which to extract the answer .
In contrast to earlier systems that relied on structured databases , current QA systems use text documents as their underlying knowledge source and combine various natural language processing techniques to search for the answers .
Current QA systems typically include a question classifier module that determines the type of question and the type of answer .
After the question is analyzed , the system typically uses several modules that apply increasingly complex NLP techniques on a gradually reduced amount of text .
Thus , a document retrieval module uses search engines to identify the documents or paragraphs in the document set that are likely to contain the answer .
Subsequently a filter preselects small text fragments that contain strings of the same type as the expected answer .
For example , if the question is `` Who invented Penicillin '' , the filter returns text that contains names of people .
Finally , an answer extraction module looks for further clues in the text to determine if the answer candidate can indeed answer the question .
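The pipeline just described -LRB- question classification , retrieval , filtering , answer extraction -RRB- can be sketched schematically as follows ; every component here is a deliberately crude stub , and a real system would plug in a search engine , named-entity recognition , and so on .
    import re

    def classify(question):
        # crude answer-type classifier
        return "PERSON" if question.lower().startswith("who") else "OTHER"

    def retrieve(question, documents):
        # keyword overlap stands in for a document-retrieval module
        terms = set(re.findall(r"\w+", question.lower()))
        return sorted(documents,
                      key=lambda d: -len(terms & set(re.findall(r"\w+", d.lower()))))[:3]

    def extract(answer_type, passages):
        # filter: keep capitalized name-like strings when a PERSON is expected
        if answer_type == "PERSON":
            for p in passages:
                m = re.search(r"[A-Z][a-z]+ [A-Z][a-z]+", p)
                if m:
                    return m.group()
        return None

    docs = ["Penicillin was discovered by Alexander Fleming in 1928 .",
            "The Apollo missions returned lunar rocks ."]
    question = "Who invented Penicillin ?"
    print(extract(classify(question), retrieve(question, docs)))  # Alexander Fleming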
Question answering methods QA is very dependent on a good search corpus - for without documents containing the answer , there is little any QA system can do .
It thus makes sense that larger collection sizes generally lend themselves well to better QA performance , unless the question domain is orthogonal to the collection .
The notion of data redundancy in massive collections , such as the web , means that nuggets of information are likely to be phrased in many different ways in differing contexts and documents , leading to two benefits : By having the right information appear in many forms , the burden on the QA system to perform complex NLP techniques to understand the text is lessened .
Correct answers can be filtered from false positives by relying on the correct answer to appear more times in the documents than instances of incorrect ones .
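This redundancy-based filtering amounts to simple voting over extracted candidates , as in the sketch below ; the candidate list is invented for illustration .
    from collections import Counter

    candidates = ["Alexander Fleming", "Howard Florey", "Alexander Fleming",
                  "Alexander Fleming", "Ernst Chain"]
    print(Counter(candidates).most_common(1)[0])  # ('Alexander Fleming', 3)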
Issues In 2002 a group of researchers wrote a roadmap of research in question answering .
The following issues were identified .
Question classes Different types of questions -LRB- e.g. , `` What is the capital of Lichtenstein ? ''
vs. `` Why does a rainbow form ? ''
vs. `` Did Marilyn Monroe and Cary Grant ever appear in a movie together ? '' -RRB-
require the use of different strategies to find the answer .
Question classes are arranged hierarchically in taxonomies .
Question processing The same information request can be expressed in various ways , some interrogative -LRB- `` Who is the president of the United States ? '' -RRB-
and some assertive -LRB- `` Tell me the name of the president of the United States . '' -RRB- .
A semantic model of question understanding and processing would recognize equivalent questions , regardless of how they are presented .
This model would enable the translation of a complex question into a series of simpler questions , would identify ambiguities and treat them in context or by interactive clarification .
Context and QA Questions are usually asked within a context and answers are provided within that specific context .
The context can be used to clarify a question , resolve ambiguities or keep track of an investigation performed through a series of questions .
-LRB- For example , the question , `` Why did Joe Biden visit Iraq in January 2010 ? ''
might be asking why Vice President Biden visited and not President Obama , why he went to Iraq and not Afghanistan or some other country , why he went in January 2010 and not before or after , or what Biden was hoping to accomplish with his visit .
If the question is one of a series of related questions , the previous questions and their answers might shed light on the questioner 's intent . -RRB-
Data sources for QA Before a question can be answered , it must be known what knowledge sources are available and relevant .
If the answer to a question is not present in the data sources , no matter how well the question processing , information retrieval and answer extraction is performed , a correct result will not be obtained .
Answer extraction Answer extraction depends on the complexity of the question , on the answer type provided by question processing , on the actual data where the answer is searched , on the search method and on the question focus and context .
Answer formulation The result of a QA system should be presented in a way as natural as possible .
In some cases , simple extraction is sufficient .
For example , when the question classification indicates that the answer type is a name -LRB- of a person , organization , shop or disease , etc. -RRB- , a quantity -LRB- monetary value , length , size , distance , etc. -RRB- or a date -LRB- e.g. the answer to the question , `` On what day did Christmas fall in 1989 ? '' -RRB-
the extraction of a single datum is sufficient .
For other cases , the presentation of the answer may require the use of fusion techniques that combine the partial answers from multiple documents .
Real time question answering There is a need to develop Q&A systems that are capable of extracting answers from large data sets in several seconds , regardless of the complexity of the question , the size and multitude of the data sources or the ambiguity of the question .
Multilingual -LRB- or cross-lingual -RRB- question answering The ability to answer a question posed in one language using an answer corpus in another language -LRB- or even several -RRB- .
This allows users to consult information that they can not use directly .
-LRB- See also Machine translation . -RRB-
Interactive QA It is often the case that the information need is not well captured by a QA system , as the question processing part may fail to classify properly the question or the information needed for extracting and generating the answer is not easily retrieved .
In such cases , the questioner might want not only to reformulate the question , but to have a dialogue with the system .
-LRB- For example , the system might ask for a clarification of what sense a word is being used , or what type of information is being asked for . -RRB-
Advanced reasoning for QA More sophisticated questioners expect answers that are outside the scope of written texts or structured databases .
To upgrade a QA system with such capabilities , it would be necessary to integrate reasoning components operating on a variety of knowledge bases , encoding world knowledge and common-sense reasoning mechanisms , as well as knowledge specific to a variety of domains .
User profiling for QA The user profile captures data about the questioner , comprising context data , domain of interest , reasoning schemes frequently used by the questioner , common ground established within different dialogues between the system and the user , and so forth .
The profile may be represented as a predefined template , where each template slot represents a different profile feature .
Profile templates may be nested one within another .
History Some of the early AI systems were question answering systems .
Two of the most famous QA systems of that time are BASEBALL and LUNAR , both of which were developed in the 1960s .
BASEBALL answered questions about the US baseball league over a period of one year .
LUNAR , in turn , answered questions about the geological analysis of rocks returned by the Apollo moon missions .
Both QA systems were very effective in their chosen domains .
In fact , LUNAR was demonstrated at a lunar science convention in 1971 and it was able to answer 90 % of the questions in its domain posed by people untrained on the system .
Further restricted-domain QA systems were developed in the following years .
The common feature of all these systems is that they had a core database or knowledge system that was hand-written by experts of the chosen domain .
Some of the early AI systems included question-answering abilities .
Two of the most famous early systems are SHRDLU and ELIZA .
SHRDLU simulated the operation of a robot in a toy world -LRB- the `` blocks world '' -RRB- , and it offered the possibility to ask the robot questions about the state of the world .
Again , the strength of this system was the choice of a very specific domain and a very simple world with rules of physics that were easy to encode in a computer program .
ELIZA , in contrast , simulated a conversation with a psychologist .
ELIZA was able to converse on any topic by resorting to very simple rules that detected important words in the person 's input .
It had a very rudimentary way to answer questions , and on its own it led to a series of chatterbots such as the ones that participate in the annual Loebner prize .
The 1970s and 1980s saw the development of comprehensive theories in computational linguistics , which led to the development of ambitious projects in text comprehension and question answering .
One example of such a system was the Unix Consultant -LRB- UC -RRB- , a system that answered questions pertaining to the Unix operating system .
The system had a comprehensive hand-crafted knowledge base of its domain , and it aimed at phrasing the answer to accommodate various types of users .
Another project was LILOG , a text-understanding system that operated on the domain of tourism information in a German city .
The systems developed in the UC and LILOG projects never went past the stage of simple demonstrations , but they helped the development of theories on computational linguistics and reasoning .
An increasing number of systems include the World Wide Web as one more corpus of text .
However , these tools mostly work by using shallow methods , as described above -- thus returning a list of documents , usually with an excerpt containing the probable answer highlighted , plus some context .
Furthermore , highly-specialized natural language question-answering engines , such as EAGLi for health and life scientists , have been made available .
The Future of Question Answering QA systems have been extended in recent years to explore critical new scientific and practical dimensions .
For example , systems have been developed to automatically answer temporal and geospatial questions , definitional questions , biographical questions , multilingual questions , and questions from multimedia -LRB- e.g. , audio , imagery , video -RRB- .
Additional aspects such as interactivity -LRB- often required for clarification of questions or answers -RRB- , answer reuse , and knowledge representation and reasoning to support question answering have been explored .
Future research may explore what kinds of questions can be asked and answered about social media , including sentiment analysis .
A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts , typically from text or XML documents .
The task is very similar to that of information extraction -LRB- IE -RRB- , but IE additionally requires the removal of repeated relations -LRB- disambiguation -RRB- and generally refers to the extraction of many different relationships .
Approaches One approach to this problem involves the use of domain ontologies .
Another approach involves visual detection of meaningful relationships in parametric values of objects listed on a data table that shift positions as the table is permuted automatically as controlled by the software user .
The poor coverage , rarity and development cost of structured resources such as semantic lexicons -LRB- e.g. WordNet , UMLS -RRB- and domain ontologies -LRB- e.g. the Gene Ontology -RRB- have given rise to new approaches based on broad , dynamic background knowledge on the Web .
For instance , the ARCHILES technique uses only Wikipedia and search engine page count for acquiring coarse-grained relations to construct lightweight ontologies .
The relationships can be represented using a variety of formalisms\/languages .
One such representation language for data on the Web is RDF .
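A minimal sketch of representing an extracted relationship as RDF triples , assuming the rdflib package is available ; the example namespace, entities and property names are invented .
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX["AlexanderFleming"], EX["discovered"], EX["Penicillin"]))
    g.add((EX["Penicillin"], EX["label"], Literal("penicillin")))

    for subj, pred, obj in g:
        print(subj, pred, obj)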
Sentence boundary disambiguation -LRB- SBD -RRB- , also known as sentence breaking , is the problem in natural language processing of deciding where sentences begin and end .
Often natural language processing tools require their input to be divided into sentences for a number of reasons .
However sentence boundary identification is challenging because punctuation marks are often ambiguous .
For example , a period may denote an abbreviation , decimal point , an ellipsis , or an email address - not the end of a sentence .
About 47 % of the periods in the Wall Street Journal corpus denote abbreviations .
As well , question marks and exclamation marks may appear in embedded quotations , emoticons , computer code , and slang .
Languages like Japanese and Chinese have unambiguous sentence-ending markers .
The standard ` vanilla ' approach to locating the end of a sentence is : -LRB- a -RRB- If it 's a period , it ends a sentence .
-LRB- b -RRB- If the preceding token is in a hand-compiled list of abbreviations , then it does n't end a sentence .
-LRB- c -RRB- If the next token is capitalized , then it ends a sentence .
This strategy gets about 95 % of sentences correct .
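The three hand-written rules above can be implemented roughly as follows ; the abbreviation list is a small invented sample , not an exhaustive one .
    ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e.", "p.m."}

    def split_sentences(text):
        tokens = text.split()
        sentences, current = [], []
        for i, tok in enumerate(tokens):
            current.append(tok)
            if tok.endswith((".", "!", "?")):
                nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
                if tok.lower() in ABBREVIATIONS:
                    continue                        # rule (b): abbreviation, no break
                if nxt and not nxt[0].isupper():
                    continue                        # rule (c): next token not capitalized
                sentences.append(" ".join(current)) # rule (a): default break at a period
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(split_sentences("Dr. Smith met Mr. Jones. He arrived at 5 p.m. yesterday. It was fun."))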
Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked .
Solutions have been based on a maximum entropy model .
The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5 % accuracy .
Sentiment analysis or opinion mining refers to the application of natural language processing , computational linguistics , and text analytics to identify and extract subjective information in source materials .
Generally speaking , sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document .
The attitude may be his or her judgement or evaluation -LRB- see appraisal theory -RRB- , affective state -LRB- that is to say , the emotional state of the author when writing -RRB- , or the intended emotional communication -LRB- that is to say , the emotional effect the author wishes to have on the reader -RRB- .
Advanced , `` beyond polarity '' sentiment classification looks , for instance , at emotional states such as `` angry , '' `` sad , '' and `` happy . ''
Early work in that area includes Turney and Pang who applied different methods for detecting the polarity of product reviews and movie reviews respectively .
This work is at the document level .
One can also classify a document 's polarity on a multi-way scale , which was attempted by Pang and Snyder -LRB- among others -RRB- : Pang expanded the basic task of classifying a movie review as either positive or negative to predicting star ratings on either a 3 - or a 4-star scale , while Snyder performed an in-depth analysis of restaurant reviews , predicting ratings for various aspects of the given restaurant , such as the food and atmosphere -LRB- on a five-star scale -RRB- .
A different method for determining sentiment is the use of a scaling system whereby words commonly associated with a negative , neutral or positive sentiment are given an associated number on a -5 to +5 scale -LRB- most negative up to most positive -RRB- ; when a piece of unstructured text is analyzed using natural language processing , the concepts it contains are analyzed for an understanding of these sentiment words and how they relate to each concept .
Each concept is then given a score based on the way sentiment words relate to the concept , and their associated score .
This allows movement to a more sophisticated understanding of sentiment based on an 11 point scale .
Alternatively , texts can be given a positive and negative sentiment strength score if the goal is to determine the sentiment in a text rather than the overall polarity and strength of the text .
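A sketch of the word-scoring approach described above : each sentiment word carries a weight on a scale such as -5 to +5 , and a text 's score aggregates those weights . The lexicon values and the crude negation rule are invented for illustration .
    LEXICON = {"good": 3, "great": 4, "excellent": 5,
               "bad": -3, "terrible": -5, "slow": -2}
    NEGATIONS = {"not", "never", "no"}

    def sentiment(text):
        words = text.lower().split()
        score = 0
        for i, w in enumerate(words):
            if w in LEXICON:
                weight = LEXICON[w]
                if i > 0 and words[i - 1] in NEGATIONS:
                    weight = -weight  # crude negation handling
                score += weight
        return score

    print(sentiment("the food was great but the service was not good"))  # 4 - 3 = 1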
Another research direction is subjectivity\/objectivity identification .
This task is commonly defined as classifying a given text -LRB- usually a sentence -RRB- into one of two classes : objective or subjective .
This problem can sometimes be more difficult than polarity classification : the subjectivity of words and phrases may depend on their context and an objective document may contain subjective sentences -LRB- e.g. , a news article quoting people 's opinions -RRB- .
Moreover , as mentioned by Su , results are largely dependent on the definition of subjectivity used when annotating texts .