wiki-en-train.word
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. Specifically, it is the process by which a computer extracts meaningful information from natural language input and/or produces natural language output. In theory, natural language processing is a very attractive method of human-computer interaction. Natural language understanding is sometimes referred to as an AI-complete problem, because it seems to require extensive knowledge about the outside world and the ability to manipulate it.

Whether NLP is distinct from, or identical to, the field of computational linguistics is a matter of perspective. The Association for Computational Linguistics defines the latter as focusing on the theoretical aspects of NLP. On the other hand, the open-access journal "Computational Linguistics" styles itself as "the longest running publication devoted exclusively to the design and analysis of natural language processing systems" (Computational Linguistics (Journal)).

Modern NLP algorithms are grounded in machine learning, especially statistical machine learning. Research into modern statistical NLP algorithms requires an understanding of a number of disparate fields, including linguistics, computer science, and statistics. For a discussion of the types of algorithms currently used in NLP, see the article on pattern recognition.
(Figure: An automated online assistant providing customer service on a web page, an example of an application in which natural language processing is a major component.)
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three to five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded its very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".

During the 1970s many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.
Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's Law and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.

Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.

Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by those systems, which was (and often continues to be) a major limitation on their success. As a result, a great deal of research has gone into methods of learning more effectively from limited amounts of data.

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or from a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
NLP using machine learning

As described above, modern approaches to natural language processing (NLP) are grounded in machine learning. The paradigm of machine learning is different from that of most prior attempts at language processing. Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules. The machine-learning paradigm calls instead for using general learning algorithms, often (although not always) grounded in statistical inference, to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural "corpora") is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.

Consider the task of part-of-speech tagging, i.e. determining the correct part of speech of each word in a given sentence, typically one that has never been seen before. A typical machine-learning-based implementation of a part-of-speech tagger proceeds in two steps, a training step and an evaluation step. The first step, the training step, makes use of a corpus of training data, which consists of a large number of sentences, each of which has the correct part of speech attached to each word. (An example of such a corpus in common use is the Penn Treebank. This includes, among other things, a set of 500 texts from the Brown Corpus, containing examples of various genres of text, and 2,500 articles from the Wall Street Journal.) This corpus is analyzed and a learning model is generated from it, consisting of automatically created rules for determining the part of speech for a word in a sentence, typically based on the nature of the word in question, the nature of surrounding words, and the most likely part of speech for those surrounding words. The model that is generated is typically the best model that can be found that simultaneously meets two conflicting objectives: to perform as well as possible on the training data, and to be as simple as possible (so that the model avoids overfitting the training data, i.e. so that it generalizes as well as possible to new data rather than only succeeding on sentences that have already been seen). In the second step (the evaluation step), the model that has been learned is used to process new sentences. An important part of the development of any learning algorithm is testing the model that has been learned on new, previously unseen data. It is critical that the data used for testing is not the same as the data used for training; otherwise, the testing accuracy will be unrealistically high.
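To make the two steps concrete, the sketch below trains a deliberately simple tagger that just memorizes each word's most frequent tag in the training corpus, then measures its accuracy on held-out sentences it has never seen. The tiny inline corpus and the most-frequent-tag baseline are assumptions made purely for illustration; a real tagger would be trained on something like the Penn Treebank with a far richer model.

```python
from collections import Counter, defaultdict

# A tiny hand-annotated "corpus": lists of (word, part-of-speech) pairs.
training_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
test_sentences = [
    [("a", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]

# Training step: count how often each word carries each tag.
tag_counts = defaultdict(Counter)
for sentence in training_sentences:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NOUN"):
    """Tag a word with its most frequent training tag (or a default)."""
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else default

# Evaluation step: tag previously unseen sentences and compare to the gold tags.
correct = total = 0
for sentence in test_sentences:
    for word, gold_tag in sentence:
        correct += (most_frequent_tag(word) == gold_tag)
        total += 1

print(f"held-out accuracy: {correct / total:.2f}")
```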
Many different classes of machine-learning algorithms have been applied to NLP tasks. Common to all of these algorithms is that they take as input a large set of "features" generated from the input data. As an example, for a part-of-speech tagger, typical features might be the identity of the word being processed, the identity of the words immediately to the left and right, the part-of-speech tag of the word to the left, and whether the word being considered or its immediate neighbors are content words or function words. The algorithms differ, however, in the nature of the rules generated. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system. In addition, models that make soft decisions are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data).
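The sketch below shows what such feature-based, probabilistic classification can look like in practice, assuming scikit-learn is available. The particular feature set (the word itself and its immediate neighbors) and the toy training data are illustrative assumptions; the point is that the classifier attaches a real-valued weight to every feature and can report a probability distribution over tags rather than a single hard answer.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def word_features(sentence, i):
    """Features for the i-th word: the word itself and its neighbors."""
    return {
        "word": sentence[i],
        "prev": sentence[i - 1] if i > 0 else "<s>",
        "next": sentence[i + 1] if i < len(sentence) - 1 else "</s>",
    }

# Toy training data: feature dicts paired with gold part-of-speech tags.
tagged = [
    (["the", "dog", "barks"], ["DET", "NOUN", "VERB"]),
    (["a", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
]
X_dicts, y = [], []
for words, tags in tagged:
    for i, tag in enumerate(tags):
        X_dicts.append(word_features(words, i))
        y.append(tag)

vectorizer = DictVectorizer()          # maps feature dicts to weighted columns
X = vectorizer.fit_transform(X_dicts)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# A soft, probabilistic decision for one word of an unseen sentence.
test = vectorizer.transform([word_features(["the", "fish", "swims"], 1)])
for tag, p in zip(clf.classes_, clf.predict_proba(test)[0]):
    print(f"{tag}: {p:.2f}")
```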
Systems based on machine-learning algorithms have many advantages over hand-produced rules. The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not at all obvious where the effort should be directed. Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted). Generally, handling such input gracefully with hand-written rules, or more generally creating systems of hand-written rules that make soft decisions, is extremely difficult, error-prone and time-consuming. Systems based on automatically learning the rules can be made more accurate simply by supplying more input data. However, systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task. In particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond which the systems become more and more unmanageable. By contrast, creating more data to feed machine-learning systems simply requires a corresponding increase in the number of person-hours worked, generally without significant increases in the complexity of the annotation process.
Major tasks in NLP

The following is a list of some of the most commonly researched tasks in NLP. Note that some of these tasks have direct real-world applications, while others more commonly serve as subtasks used to aid in solving larger tasks. What distinguishes these tasks from other potential and actual NLP tasks is not only the volume of research devoted to them, but the fact that for each one there is typically a well-defined problem setting, a standard metric for evaluating the task, standard corpora on which the task can be evaluated, and competitions devoted to the specific task.
Automatic summarization: Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.

Coreference resolution: Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names that they refer to. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression, and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to).

Discourse analysis: This rubric includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question, content question, statement, assertion, etc.).
Machine translation: Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.

Morphological segmentation: Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.

Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Note that, although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do not capitalize names that serve as adjectives.
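To see why capitalization alone is a weak cue, consider the naive rule "every capitalized token is a named entity". The sketch below implements exactly that rule on an invented example sentence and illustrates the problems described above: sentence-initial words are flagged spuriously, multi-word entities are not grouped, and the rule says nothing about entity types.

```python
import re

def naive_ner(sentence):
    """Flag every capitalized token as a candidate named entity."""
    return [token for token in re.findall(r"\w+", sentence) if token[0].isupper()]

print(naive_ner("Yesterday Angela Merkel visited New York."))
# ['Yesterday', 'Angela', 'Merkel', 'New', 'York']
# 'Yesterday' is capitalized only because it starts the sentence, 'New' and 'York'
# are not grouped into a single entity, and nothing here says whether an entity
# is a person, a location or an organization.
```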
Natural language generation: Convert information from computer databases into readable human language.

Natural language understanding: Convert chunks of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate. Natural language understanding involves identifying the intended semantics among the multiple possible semantics that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. Introducing and creating a language metamodel and ontology are efficient, though empirical, solutions. An explicit formalization of natural language semantics, without confusion with implicit assumptions such as the closed world assumption (CWA) vs. the open world assumption, or subjective Yes/No vs. objective True/False, is expected to form the basis of a formalization of semantics.
Optical character recognition (OCR): Given an image representing printed text, determine the corresponding text.

Part-of-speech tagging: Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or a verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Note that some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language during verbalization, and that tonal inflection is not readily conveyed by the entities employed within the orthography to convey the intended meaning.
Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous, and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).

Question answering: Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?").

Relationship extraction: Given a chunk of text, identify the relationships among named entities (e.g. who is the wife of whom).

Sentence breaking (also known as sentence boundary disambiguation): Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations).
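A minimal sketch of that ambiguity, assuming a small hand-picked abbreviation list: the splitter below treats a period, question mark or exclamation mark as a sentence boundary only when it is followed by whitespace and a capital letter and does not terminate a known abbreviation. Real systems use much richer cues or learned models, so this is illustrative only.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for the example.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "a.m.", "p.m.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split on '.', '!' or '?' followed by a space and a capital letter,
    unless the period belongs to a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        end = match.end()
        candidate = text[start:end].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period marks an abbreviation, not a boundary
        sentences.append(candidate)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith arrived at 10 a.m. He was late. Was the train delayed?"))
# ['Dr. Smith arrived at 10 a.m. He was late.', 'Was the train delayed?']
```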
Sentiment analysis: Extract subjective information, usually from a set of documents, often using online reviews to determine "polarity" about specific objects. It is especially useful for identifying trends of public opinion in social media, for example for the purposes of marketing.

Speech recognition: Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech, and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). Note also that in most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process.

Speech segmentation: Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition, and typically grouped with it.

Topic segmentation and recognition: Given a chunk of text, separate it into segments, each of which is devoted to a topic, and identify the topic of each segment.

Word segmentation: Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.
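One classic, deliberately simple way to segment such text is greedy maximum matching against a known vocabulary: repeatedly take the longest dictionary word that starts at the current position. The sketch below demonstrates the idea on a space-free English string; the tiny vocabulary and the single-character fallback are assumptions for the example, and real segmenters rely on large lexicons and statistical models.

```python
# Hypothetical vocabulary; a real system would use a large lexicon.
VOCAB = {"the", "cat", "sat", "on", "mat", "a"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)

def max_match(text):
    """Greedy left-to-right maximum-matching word segmentation."""
    words, i = [], 0
    while i < len(text):
        # Try the longest possible substring first.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in VOCAB or length == 1:
                # Fall back to a single character if nothing matches.
                words.append(candidate)
                i += length
                break
    return words

print(max_match("thecatsatonthemat"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```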
Word sense disambiguation: Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.
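A rough sketch of dictionary-based disambiguation in the spirit of the Lesk algorithm: choose the sense whose gloss shares the most words with the sentence context. The two-sense inventory for "bank" below is a made-up stand-in for a resource like WordNet, so treat it purely as an illustration of the idea.

```python
# Hypothetical sense inventory; a real system would pull glosses from WordNet.
SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land alongside a river or stream",
}

def disambiguate(context_sentence, senses=SENSES):
    """Pick the sense whose gloss overlaps most with the context (Lesk-style)."""
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("she sat on the bank of the river and watched the stream"))
# bank/river  (its gloss shares "the", "river" and "stream" with the context)
```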
In some cases, sets of related tasks are grouped into subfields of NLP that are often considered separately from NLP as a whole. Examples include:

Information retrieval (IR): This is concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP.

Information extraction (IE): This is concerned in general with the extraction of semantic information from text. This covers tasks such as named entity recognition, coreference resolution, relationship extraction, etc.

Speech processing: This covers speech recognition, text-to-speech and related tasks.

Other tasks include: stemming, text simplification, text-to-speech, text-proofing, natural language search, query expansion, automated essay scoring and truecasing.

Statistical NLP

Main article: statistical natural language processing

Statistical natural-language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties discussed above, especially those which arise because longer sentences are highly ambiguous when processed with realistic grammars, yielding thousands or millions of possible analyses. Methods for disambiguation often involve the use of corpora and Markov models. Statistical NLP comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra. The technology for statistical NLP comes mainly from machine learning and data mining, both of which are fields of artificial intelligence that involve learning from data.
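As a toy illustration of the Markov-model idea, the sketch below estimates add-one smoothed bigram probabilities from a tiny corpus and uses them to score two competing readings of the same spoken sentence, preferring the reading the corpus makes more probable. The corpus and the candidate readings are invented for the example; real statistical systems use vastly larger corpora and more careful smoothing.

```python
from collections import Counter

# Tiny training corpus of tokenized sentences; real models use millions of words.
corpus = [
    "they painted their house".split(),
    "their house is large".split(),
    "the park is over there".split(),
    "there is a park".split(),
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size, used for add-one smoothing

def bigram_prob(prev, word):
    """Add-one smoothed P(word | prev): a first-order Markov assumption."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def sentence_prob(tokens):
    prob = 1.0
    for prev, word in zip(["<s>"] + tokens, tokens):
        prob *= bigram_prob(prev, word)
    return prob

# Two competing readings of the same (spoken) sentence; the model prefers the
# one whose word sequence the corpus makes more probable ("painted their").
for reading in (["they", "painted", "their", "fence"],
                ["they", "painted", "there", "fence"]):
    print(" ".join(reading), "->", f"{sentence_prob(reading):.3e}")
```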
Evaluation of natural language processing

Objectives

The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system, in order to determine whether (or to what extent) the system answers the goals of its designers or meets the needs of its users. Research in NLP evaluation has received considerable attention, because the definition of proper evaluation criteria is one way to specify an NLP problem precisely, going beyond the vagueness of tasks defined only as language understanding or language generation. A precise set of evaluation criteria, which includes mainly evaluation data and evaluation metrics, enables several teams to compare their solutions to a given NLP problem.
Short history of evaluation in NLP

The first evaluation campaign on written texts seems to be a campaign dedicated to message understanding in 1987 (Pallet 1998). Then the Parseval/GEIG project compared phrase-structure grammars (Black 1991). A series of campaigns within the Tipster project were carried out on tasks like summarization, translation and searching (Hirschman 1998). In 1994, in Germany, the Morpholympics compared German taggers. Then the Senseval and Romanseval campaigns were conducted with the objective of semantic disambiguation. In 1996, the Sparkle campaign compared syntactic parsers in four different languages (English, French, German and Italian). In France, the Grace project compared a set of 21 taggers for French in 1997 (Adda 1999). In 2004, during the Technolangue/Easy project, 13 parsers for French were compared. Large-scale evaluations of dependency parsers were performed in the context of the CoNLL shared tasks in 2006 and 2007. In Italy, the EVALITA campaign was conducted in 2007 and 2009 to compare various NLP and speech tools for Italian; the 2011 campaign is in full progress (EVALITA web site). In France, within the ANR Passage project (end of 2007), 10 parsers for French were compared (Passage web site).
References

Adda G., Mariani J., Paroubek P., Rajman M. (1999). L'action GRACE d'évaluation de l'assignation des parties du discours pour le français. Langues, vol. 2.

Black E., Abney S., Flickinger D., Gdaniec C., Grishman R., Harrison P., Hindle D., Ingria R., Jelinek F., Klavans J., Liberman M., Marcus M., Roukos S., Santorini B., Strzalkowski T. (1991). A procedure for quantitatively comparing the syntactic coverage of English grammars. DARPA Speech and Natural Language Workshop.

Hirschman L. (1998). Language understanding evaluation: lessons learned from MUC and ATIS. LREC, Granada.

Pallet D.S. (1998). The NIST role in automatic speech recognition benchmark tests. LREC, Granada.

Different types of evaluation

Depending on the evaluation procedures, a number of distinctions are traditionally made in NLP evaluation.
Intrinsic vs. extrinsic evaluation

Intrinsic evaluation considers an isolated NLP system and characterizes its performance mainly with respect to a gold standard result, pre-defined by the evaluators. Extrinsic evaluation, also called evaluation in use, considers the NLP system in a more complex setting, either as an embedded system or serving a precise function for a human user. The extrinsic performance of the system is then characterized in terms of its utility with respect to the overall task of the complex system or the human user. For example, consider a syntactic parser that is based on the output of some new part-of-speech (POS) tagger. An intrinsic evaluation would run the POS tagger on some labeled data and compare the system output of the POS tagger to the gold standard (correct) output. An extrinsic evaluation would run the parser with some other POS tagger, and then with the new POS tagger, and compare the parsing accuracy.

Black-box vs. glass-box evaluation

Black-box evaluation requires one to run an NLP system on a given data set and to measure a number of parameters related to the quality of the process (speed, reliability, resource consumption) and, most importantly, to the quality of the result (e.g. the accuracy of data annotation or the fidelity of a translation). Glass-box evaluation looks at the design of the system, the algorithms that are implemented, the linguistic resources it uses (e.g. vocabulary size), etc. Given the complexity of NLP problems, it is often difficult to predict performance only on the basis of glass-box evaluation, but this type of evaluation is more informative with respect to error analysis or future developments of a system.

Automatic vs. manual evaluation

In many cases, automatic procedures can be defined to evaluate an NLP system by comparing its output with the gold standard (or desired) one. Although the cost of producing the gold standard can be quite high, automatic evaluation can be repeated as often as needed without much additional cost (on the same input data). However, for many NLP problems, the definition of a gold standard is a complex task, and can prove impossible when inter-annotator agreement is insufficient. Manual evaluation is performed by human judges, who are instructed to estimate the quality of a system, or most often of a sample of its output, based on a number of criteria. Although, thanks to their linguistic competence, human judges can be considered the reference for a number of language processing tasks, there is also considerable variation across their ratings. This is why automatic evaluation is sometimes referred to as objective evaluation, while the human kind appears to be more subjective.
Shared tasks (campaigns)

BioCreative, Message Understanding Conference, Technolangue/Easy, Text Retrieval Conference, Evaluation exercises on Semantic Evaluation (SemEval), MorphoChallenge Semi-supervised and Unsupervised Morpheme Analysis.

Standardization in NLP

An ISO sub-committee is working to ease interoperability between lexical resources and NLP programs. The sub-committee is part of ISO/TC37 and is called ISO/TC37/SC4. Some ISO standards are already published, but most of them are under construction, mainly on lexicon representation (see LMF), annotation and the data category registry.
Automatic summarization is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text.

Discourse analysis (DA), or discourse studies, is a general term for a number of approaches to analyzing written, spoken or signed language use, or any significant semiotic event.
Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation (MAHT) or interactive translation), is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another. On a basic level, MT performs simple substitution of words in one natural language for words in another, but that alone usually cannot produce a good translation of a text, because recognition of whole phrases and their closest counterparts in the target language is needed. Solving this problem with corpus and statistical techniques is a rapidly growing field that is leading to better translations, handling differences in linguistic typology, translation of idioms, and the isolation of anomalies.

Current machine translation software often allows for customisation by domain or profession (such as weather reports), improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used. It follows that machine translation of government and legal documents more readily produces usable output than conversation or less standardised text. Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators and, in a very limited number of cases, can even produce output that can be used as is (e.g. weather reports).

The progress and potential of machine translation have been much debated throughout its history. Since the 1950s, a number of scholars have questioned the possibility of achieving fully automatic machine translation of high quality. Some critics claim that there are in-principle obstacles to automating the translation process.

In 1629, René Descartes proposed a universal language, with equivalent ideas in different tongues sharing one symbol. In the 1950s, the Georgetown experiment (1954) involved fully automatic translation of over sixty Russian sentences into English. The experiment was a great success and ushered in an era of substantial funding for machine-translation research. The authors claimed that within three to five years, machine translation would be a solved problem. Real progress was much slower, however, and after the ALPAC report (1966), which found that the ten-year-long research effort had failed to fulfill expectations, funding was greatly reduced. Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for machine translation.

The idea of using digital computers for translation of natural languages was proposed as early as 1946 by A. D. Booth and possibly others. Warren Weaver wrote an important memorandum, "Translation", in 1949. The Georgetown experiment was by no means the first such application; a demonstration was made in 1954 on the APEXC machine at Birkbeck College (University of London) of a rudimentary translation of English into French. Several papers on the topic were published at the time, and even articles in popular journals (see for example Wireless World, Sept. 1955, Cleave and Zacharov). A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer.
Translation process

Main article: Translation process

The human translation process may be described as: decoding the meaning of the source text, and re-encoding this meaning in the target language. Behind this ostensibly simple procedure lies a complex cognitive operation. To decode the meaning of the source text in its entirety, the translator must interpret and analyze all the features of the text, a process that requires in-depth knowledge of the grammar, semantics, syntax, idioms, etc., of the source language, as well as the culture of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the target language. Therein lies the challenge in machine translation: how to program a computer that will "understand" a text as a person does, and that will "create" a new text in the target language that "sounds" as if it has been written by a person. This problem may be approached in a number of ways.
Approaches

(Figure: Bernard Vauquois' pyramid showing comparative depths of intermediary representation, with interlingual machine translation at the peak, followed by transfer-based, then direct translation.)

Machine translation can use a method based on linguistic rules, which means that words will be translated in a linguistic way: the most suitable (orally speaking) words of the target language will replace the ones in the source language. It is often argued that the success of machine translation requires the problem of natural language understanding to be solved first. Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated. According to the nature of the intermediary representation, an approach is described as interlingual machine translation or transfer-based machine translation. These methods require extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules.

Given enough data, machine translation programs often work well enough for a native speaker of one language to get the approximate meaning of what has been written by a native speaker of the other. The difficulty is getting enough data of the right kind to support the particular method. For example, the large multilingual corpus of data needed for statistical methods to work is not necessary for the grammar-based methods. But then, the grammar methods need a skilled linguist to carefully design the grammar that they use. To translate between closely related languages, a technique referred to as shallow-transfer machine translation may be used.
Rule-based

Main article: Rule-based machine translation

The rule-based machine translation paradigm includes transfer-based machine translation, interlingual machine translation and dictionary-based machine translation.

Transfer-based machine translation

Main article: Transfer-based machine translation

Interlingual

Main article: Interlingual machine translation

Interlingual machine translation is one instance of rule-based machine-translation approaches. In this approach, the source language, i.e. the text to be translated, is transformed into an interlingua, i.e. a source- and target-language-independent representation. The target language is then generated out of the interlingua.

Dictionary-based

Main article: Dictionary-based machine translation

Machine translation can use a method based on dictionary entries, which means that the words will be translated as they are by a dictionary.

Statistical

Main article: Statistical machine translation

Statistical machine translation tries to generate translations using statistical methods based on bilingual text corpora, such as the Canadian Hansard corpus, the English-French record of the Canadian parliament, and EUROPARL, the record of the European Parliament. Where such corpora are available, impressive results can be achieved translating texts of a similar kind, but such corpora are still very rare. The first statistical machine translation software was CANDIDE from IBM. Google used SYSTRAN for several years, but switched to a statistical translation method in October 2007. More recently, Google improved its translation capabilities by feeding approximately 200 billion words from United Nations materials into its system for training, and the accuracy of the translation has improved.
Example-based

Main article: Example-based machine translation

The example-based machine translation (EBMT) approach was proposed by Makoto Nagao in 1984. It is often characterised by its use of a bilingual corpus as its main knowledge base at run-time. It is essentially translation by analogy and can be viewed as an implementation of the case-based reasoning approach of machine learning.

Hybrid MT

Hybrid machine translation (HMT) leverages the strengths of statistical and rule-based translation methodologies. Several MT companies (Asia Online, LinguaSys, Systran, PangeaMT, UPV) claim to have a hybrid approach using both rules and statistics. The approaches differ in a number of ways:

Rules post-processed by statistics: Translations are performed using a rules-based engine. Statistics are then used in an attempt to adjust/correct the output from the rules engine.

Statistics guided by rules: Rules are used to pre-process data in an attempt to better guide the statistical engine. Rules are also used to post-process the statistical output to perform functions such as normalization. This approach has much more power, flexibility and control when translating.
Major issues

Disambiguation

Main article: Word sense disambiguation

Word-sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. The problem was first raised in the 1950s by Yehoshua Bar-Hillel. He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word. Today there are numerous approaches designed to overcome this problem. They can be approximately divided into "shallow" approaches and "deep" approaches. Shallow approaches assume no knowledge of the text; they simply apply statistical methods to the words surrounding the ambiguous word. Deep approaches presume a comprehensive knowledge of the word. So far, shallow approaches have been more successful.

The late Claude Piron, a long-time translator for the United Nations and the World Health Organization, wrote that machine translation, at its best, automates the easier part of a translator's job; the harder and more time-consuming part usually involves doing extensive research to resolve ambiguities in the source text, which the grammatical and lexical exigencies of the target language require to be resolved:

Why does a translator need a whole workday to translate five pages, and not an hour or two? ... About 90% of an average text corresponds to these simple conditions. But unfortunately, there's the other 10%. It's that part that requires six (more) hours of work. There are ambiguities one has to resolve. For instance, the author of the source text, an Australian physician, cited the example of an epidemic which was declared during World War II in a "Japanese prisoner of war camp". Was he talking about an American camp with Japanese prisoners or a Japanese camp with American prisoners? The English has two senses. It's necessary therefore to do research, maybe to the extent of a phone call to Australia.

The ideal deep approach would require the translation software to do all the research necessary for this kind of disambiguation on its own; but this would require a higher degree of AI than has yet been attained. A shallow approach which simply guessed at the sense of the ambiguous English phrase that Piron mentions (based, perhaps, on which kind of prisoner-of-war camp is more often mentioned in a given corpus) would have a reasonable chance of guessing wrong fairly often. A shallow approach that involves asking the user about each ambiguity would, by Piron's estimate, only automate about 25% of a professional translator's job, leaving the harder 75% still to be done by a human.
The objects of discourse analysis (discourse, writing, conversation, communicative events, etc.) are variously defined in terms of coherent sequences of sentences, propositions, speech acts or turns-at-talk. Contrary to much of traditional linguistics, discourse analysts not only study language use "beyond the sentence boundary", but also prefer to analyze "naturally occurring" language use rather than invented examples. Text linguistics is a related field; the essential difference between discourse analysis and text linguistics is that discourse analysis aims at revealing the socio-psychological characteristics of a person or persons rather than the structure of the text. Discourse analysis has been taken up in a variety of social science disciplines, including linguistics, sociology, anthropology, social work, cognitive psychology, social psychology, international relations, human geography, communication studies and translation studies, each of which is subject to its own assumptions, dimensions of analysis, and methodologies.
Some scholars consider the Austrian émigré Leo Spitzer's Stilstudien (Style Studies) of 1928 the earliest example of discourse analysis (DA); Michel Foucault himself translated it into French.
But the term first came into general use following the publication of a series of papers by Zellig Harris beginning in 1952, reporting on work from which he developed transformational grammar in the late 1930s. Formal equivalence relations among the sentences of a coherent discourse are made explicit by using sentence transformations to put the text in a canonical form. Words and sentences with equivalent information then appear in the same column of an array. This work progressed over the next four decades (see references) into a science of sublanguage analysis (Kittredge & Lehrberger 1982), culminating in a demonstration of the informational structures in texts of a sublanguage of science, that of immunology (Harris et al. 1989), and a fully articulated theory of linguistic informational content (Harris 1991). During this time, however, most linguists pursued a succession of elaborate theories of sentence-level syntax and semantics.

Although Harris had mentioned the analysis of whole discourses, he had not worked out a comprehensive model as of January 1952. A linguist working for the American Bible Society, James A. Lauriault/Loriot, needed to find answers to some fundamental errors in translating Quechua, in the Cuzco area of Peru. He took Harris's idea, recorded all of the legends and, after going over the meaning and placement of each word with a native speaker of Quechua, was able to form logical, mathematical rules that transcended the simple sentence structure. He then applied the process to another language of Eastern Peru, Shipibo. He taught the theory in Norman, Oklahoma, in the summers of 1956 and 1957 and entered the University of Pennsylvania in the intervening year. He tried to publish a paper, "Shipibo Paragraph Structure", but it was delayed until 1970 (Loriot & Hollenbach 1970). In the meantime, Dr. Kenneth Lee Pike, a professor at the University of Michigan, Ann Arbor, taught the theory, and one of his students, Robert E. Longacre, was able to disseminate it in a dissertation.

Harris's methodology was developed into a system for the computer-aided analysis of natural language by a team led by Naomi Sager at NYU, which has been applied to a number of sublanguage domains, most notably to medical informatics. The software for the Medical Language Processor is publicly available on SourceForge.

In the late 1960s and 1970s, and without reference to this prior work, a variety of other approaches to a new cross-discipline of DA began to develop in most of the humanities and social sciences concurrently with, and related to, other disciplines such as semiotics, psycholinguistics, sociolinguistics, and pragmatics. Many of these approaches, especially those influenced by the social sciences, favor a more dynamic study of oral talk-in-interaction. Mention must also be made of the term "conversation analysis", which was influenced by the sociologist Harold Garfinkel, the founder of ethnomethodology. In Europe, Michel Foucault became one of the key theorists of the subject, especially of discourse, and wrote The Archaeology of Knowledge on the subject.
Topics of interest

Topics of discourse analysis include:

The various levels or dimensions of discourse, such as sounds (intonation, etc.), gestures, syntax, the lexicon, style, rhetoric, meanings, speech acts, moves, strategies, turns and other aspects of interaction
Genres of discourse (various types of discourse in politics, the media, education, science, business, etc.)
The relations between discourse and the emergence of syntactic structure
The relations between text (discourse) and context
The relations between discourse and power
The relations between discourse and interaction
The relations between discourse and cognition and memory

Political discourse

Political discourse analysis is a field of discourse analysis which focuses on discourse in political forums (such as debates, speeches, and hearings) as the phenomenon of interest. Political discourse is the informal exchange of reasoned views as to which of several alternative courses of action should be taken to solve a societal problem. It has been used throughout the history of the United States and is regarded as the essence of democracy. Full of problems and persuasion, political discourse is used in many debates, candidacies and in everyday life.

Perspectives

The following are some of the specific theoretical perspectives and analytical approaches used in linguistic discourse analysis:

Emergent grammar
Text grammar (or "discourse grammar")
Cohesion and relevance theory
Functional grammar
Rhetoric
Stylistics (linguistics)
Interactional sociolinguistics
Ethnography of communication
Pragmatics, particularly speech act theory
Conversation analysis
Variation analysis
Applied linguistics
Cognitive psychology, often under the label discourse processing, studying the production and comprehension of discourse
Discursive psychology
Response based therapy (counselling)
Critical discourse analysis
Sublanguage analysis
Genre analysis and critical genre analysis

Although these approaches emphasize different aspects of language use, they all view language as social interaction, and are concerned with the social contexts in which discourse is embedded.
Often a distinction is made between "local" structures of discourse (such as relations among sentences, propositions, and turns) and "global" structures, such as overall topics and the schematic organization of discourses and conversations. For instance, many types of discourse begin with some kind of global "summary", in titles, headlines, leads, abstracts, and so on. A problem for the discourse analyst is to decide when a particular feature is relevant to the specification that is required, and whether there are general principles that determine the relevance or nature of that specification.
Prominent discourse analysts

Marc Angenot, Robert de Beaugrande, Jan Blommaert, Adriana Bolivar, Carmen Rosa Caldas-Coulthard, Robyn Carston, Wallace Chafe, Paul Chilton, Guy Cook, Malcolm Coulthard, James Deese, Paul Drew, John Du Bois, Alessandro Duranti, Brenton D. Faber, Norman Fairclough, Michel Foucault, Roger Fowler, James Paul Gee, Talmy Givón, Charles Goodwin, Art Graesser, Michael Halliday, Zellig Harris, John Heritage, Janet Holmes, David R. Howarth, Paul Hopper, Gail Jefferson, Barbara Johnstone, Walter Kintsch, Richard Kittredge, Adam Jaworski, William Labov, George Lakoff, Jay Lemke, Stephen H. Levinsohn, James A. Lauriault/Loriot, Robert E. Longacre, Jim Martin, Aletta Norval, David Nunan, Elinor Ochs, Gina Poncini, Jonathan Potter, Edward Robinson, Nikolas Rose, Harvey Sacks, Svenka Savic, Naomi Sager, Emanuel Schegloff, Deborah Schiffrin, Michael Schober, Stef Slembrouck, Michael Stubbs, John Swales, Deborah Tannen, Sandra Thompson, Teun A. van Dijk, Theo van Leeuwen, Jef Verschueren, Henry Widdowson, Carla Willig, Deirdre Wilson, Ruth Wodak, Margaret Wetherell, Ernesto Laclau, Chantal Mouffe, Judith M. De Guzman, Cynthia Hardy, Louise J. Phillips, V.J. Bhatia.

The phenomenon of information overload has meant that access to coherent and correctly developed summaries is vital.
As access to data has increased, so has interest in automatic summarization. An example of the use of summarization technology is search engines such as Google. Technologies that can make a coherent summary of any kind of text need to take into account several variables, such as length, writing style and syntax, to produce a useful summary.

Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate. Such a summary might contain words not explicitly present in the original. The state-of-the-art abstractive methods are still quite weak, so most research has focused on extractive methods, and this is what we will cover. Two particular types of summarization often addressed in the literature are keyphrase extraction, where the goal is to select individual words or phrases to "tag" a document, and document summarization, where the goal is to select whole sentences to create a short paragraph summary.

Extraction and abstraction

Broadly, one distinguishes two approaches: extraction and abstraction. Extraction techniques merely copy the information deemed most important by the system to the summary (for example, key clauses, sentences or paragraphs), while abstraction involves paraphrasing sections of the source document. In general, abstraction can condense a text more strongly than extraction, but the programs that can do this are harder to develop, as they require the use of natural language generation technology, which is itself a growing field.
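A minimal sketch of the extractive approach: score each sentence by how frequent its content words are in the document as a whole and keep the top-scoring sentences in their original order. The stop-word list, the scoring scheme and the naive punctuation-based sentence split are simplifying assumptions; production systems use much richer features.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "this"}

def extractive_summary(text, num_sentences=2):
    """Pick the sentences whose content words occur most often in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    # Keep the best-scoring sentences, then restore their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)

document = ("Automatic summarization shortens a text by computer. "
            "Extractive summarization selects sentences from the text itself. "
            "Abstractive summarization instead generates new sentences. "
            "Most deployed systems today are extractive.")
print(extractive_summary(document))
```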
Types of summaries There are different types of summaries depending on what the summarization program focuses on to make the summary of the text, for example generic summaries or query-relevant summaries (sometimes called query-biased summaries).
Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs .
Summarization of multimedia documents , e.g. pictures or movies , is also possible .
Some systems will generate a summary based on a single source document , while others can use multiple source documents -LRB- for example , a cluster of news stories on the same topic -RRB- .
These systems are known as multi-document summarization systems .
Keyphrase extraction Task description and example The task is the following .
You are given a piece of text , such as a journal article , and you must produce a list of keywords or keyphrases that capture the primary topics discussed in the text .
In the case of research articles , many authors provide manually assigned keywords , but most text lacks pre-existing keyphrases .
For example , news articles rarely have keyphrases attached , but it would be useful to be able to automatically do so for a number of applications discussed below .
Consider the example text from a recent news article : `` The Army Corps of Engineers , rushing to meet President Bush 's promise to protect New Orleans by the start of the 2006 hurricane season , installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm , according to documents obtained by The Associated Press '' .
An extractive keyphrase extractor might select `` Army Corps of Engineers '' , `` President Bush '' , `` New Orleans '' , and `` defective flood-control pumps '' as keyphrases .
These are pulled directly from the text .
In contrast , an abstractive keyphrase system would somehow internalize the content and generate keyphrases that might be more descriptive and more like what a human would produce , such as `` political negligence '' or `` inadequate protection from floods '' .
Note that these terms do not appear in the text and require a deep understanding , which makes it difficult for a computer to produce such keyphrases .
Keyphrases have many applications , such as to improve document browsing by providing a short summary .
Also , keyphrases can improve information retrieval -- if documents have keyphrases assigned , a user could search by keyphrase to produce more reliable hits than a full-text search .
Also , automatic keyphrase extraction can be useful in generating index entries for a large text corpus .
Keyphrase extraction as supervised learning Beginning with the Turney paper , many researchers have approached keyphrase extraction as a supervised machine learning problem .
Given a document , we construct an example for each unigram , bigram , and trigram found in the text -LRB- though other text units are also possible , as discussed below -RRB- .
We then compute various features describing each example -LRB- e.g. , does the phrase begin with an upper-case letter ? -RRB- .
We assume there are known keyphrases available for a set of training documents .
Using the known keyphrases , we can assign positive or negative labels to the examples .
Then we learn a classifier that can discriminate between positive and negative examples as a function of the features .
Some classifiers make a binary classification for a test example , while others assign a probability of being a keyphrase .
For instance , in the above text , we might learn a rule that says phrases with initial capital letters are likely to be keyphrases .
After training a learner , we can select keyphrases for test documents in the following manner .
We apply the same example-generation strategy to the test documents , then run each example through the learner .
We can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model .
If probabilities are given , a threshold is used to select the keyphrases .
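As a rough illustration of this train-then-score pipeline, the sketch below generates unigram, bigram and trigram candidates, computes a few simple surface features, and fits a classifier. The toy document, the assigned gold keyphrases, and the use of scikit-learn's logistic regression (standing in for "virtually any supervised learner") are all assumptions for illustration, not the feature set or learner of Turney's or Hulth's actual systems.

```python
# A minimal sketch of supervised keyphrase extraction (toy data; not the
# exact features or learner of Turney's GenEx or Hulth's system).
import re
from sklearn.linear_model import LogisticRegression  # stands in for any supervised learner

def candidates(text, max_n=3):
    """All unigrams, bigrams and trigrams of the text, lower-cased."""
    words = re.findall(r"[A-Za-z][A-Za-z-]*", text)
    return {" ".join(words[i:i + n]).lower()
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

def features(phrase, text):
    """A few simple surface features for one candidate phrase."""
    lower = text.lower()
    return [
        lower.count(phrase),                       # term frequency in this document
        len(phrase.split()),                       # length of the example in words
        lower.find(phrase) / max(len(lower), 1),   # relative position of first occurrence
        float(phrase.title() in text),             # appears capitalised in the text
    ]

# Hypothetical training document with known (assigned) keyphrases.
doc = ("The Army Corps of Engineers installed defective flood-control pumps "
       "in New Orleans before the hurricane season.")
gold = {"army corps", "new orleans", "flood-control pumps"}

cands = sorted(candidates(doc))
X = [features(c, doc) for c in cands]
y = [int(c in gold) for c in cands]            # positive / negative labels
model = LogisticRegression(max_iter=1000).fit(X, y)

# At test time: score the candidates of a new document and keep the top few.
test = "Engineers in New Orleans warned that the pumps would fail."
test_cands = sorted(candidates(test))
scores = model.predict_proba([features(c, test) for c in test_cands])[:, 1]
print(sorted(zip(scores, test_cands), reverse=True)[:5])
```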
Keyphrase extractors are generally evaluated using precision and recall .
Precision measures how many of the proposed keyphrases are actually correct .
Recall measures how many of the true keyphrases your system proposed .
The two measures can be combined in an F-score, which is the harmonic mean of the two: F = 2PR / (P + R).
Matches between the proposed keyphrases and the known keyphrases can be checked after stemming or applying some other text normalization .
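A minimal sketch of this evaluation, assuming a crude suffix-stripping normalizer in place of a real stemmer such as Porter's:

```python
# A minimal sketch of precision, recall and F-score for keyphrase evaluation.
# The crude suffix stripper below only stands in for a real stemmer.
def stem(word):
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def normalise(phrase):
    return " ".join(stem(w) for w in phrase.lower().split())

def precision_recall_f(proposed, gold):
    p_set = {normalise(k) for k in proposed}
    g_set = {normalise(k) for k in gold}
    matches = len(p_set & g_set)
    precision = matches / len(p_set) if p_set else 0.0
    recall = matches / len(g_set) if g_set else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(precision_recall_f(["flood-control pumps", "President Bush", "storms"],
                         ["flood-control pump", "President Bush", "New Orleans"]))
# -> roughly (0.67, 0.67, 0.67): "pumps" matches "pump" after stemming
```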
Design choices Designing a supervised keyphrase extraction system involves making several choices (some of which also apply to the unsupervised case): What are the examples?
The first choice is exactly how to generate examples .
Turney and others have used all possible unigrams , bigrams , and trigrams without intervening punctuation and after removing stopwords .
Hulth showed that you can get some improvement by selecting examples to be sequences of tokens that match certain patterns of part-of-speech tags .
Ideally , the mechanism for generating examples produces all the known labeled keyphrases as candidates , though this is often not the case .
For example , if we use only unigrams , bigrams , and trigrams , then we will never be able to extract a known keyphrase containing four words .
Thus , recall may suffer .
However , generating too many examples can also lead to low precision .
What are the features ?
We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non - keyphrases .
Typically features involve various term frequencies -LRB- how many times a phrase appears in the current text or in a larger corpus -RRB- , the length of the example , relative position of the first occurrence , various boolean syntactic features -LRB- e.g. , contains all caps -RRB- , etc. .
The Turney paper used about 12 such features .
Hulth uses a reduced set of features , which were found most successful in the KEA -LRB- Keyphrase Extraction Algorithm -RRB- work derived from Turney 's seminal paper .
How many keyphrases to return ?
In the end , the system will need to return a list of keyphrases for a test document , so we need to have a way to limit the number .
Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to provide a user-specified number of keyphrases. This is the technique used by Turney with C4.5 decision trees.
Hulth used a single binary classifier so the learning algorithm implicitly determines the appropriate number .
What learning algorithm ?
Once examples and features are created , we need a way to learn to predict keyphrases .
Virtually any supervised learning algorithm could be used , such as decision trees , Naive Bayes , and rule induction .
In the case of Turney 's GenEx algorithm , a genetic algorithm is used to learn parameters for a domain-specific keyphrase extraction algorithm .
The extractor follows a series of heuristics to identify keyphrases .
The genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known key phrases .
Unsupervised keyphrase extraction : TextRank While supervised methods have some nice properties , like being able to produce interpretable rules for what features characterize a keyphrase , they also require a large amount of training data .
Many documents with known keyphrases are needed .
Furthermore , training on a specific domain tends to customize the extraction process to that domain , so the resulting classifier is not necessarily portable , as some of Turney 's results demonstrate .
Unsupervised keyphrase extraction removes the need for training data .
It approaches the problem from a different angle .
Instead of trying to learn explicit features that characterize keyphrases , the TextRank algorithm exploits the structure of the text itself to determine keyphrases that appear `` central '' to the text in the same way that PageRank selects important Web pages .
Recall this is based on the notion of `` prestige '' or `` recommendation '' from social networks .
In this way , TextRank does not rely on any previous training data at all , but rather can be run on any arbitrary piece of text , and it can produce output simply based on the text 's intrinsic properties .
Thus the algorithm is easily portable to new domains and languages .
TextRank is a general purpose graph-based ranking algorithm for NLP .
Essentially , it runs PageRank on a graph specially designed for a particular NLP task .
For keyphrase extraction , it builds a graph using some set of text units as vertices .
Edges are based on some measure of semantic or lexical similarity between the text unit vertices .
Unlike PageRank , the edges are typically undirected and can be weighted to reflect a degree of similarity .
Once the graph is constructed , it is used to form a stochastic matrix , combined with a damping factor -LRB- as in the `` random surfer model '' -RRB- , and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 -LRB- i.e. , the stationary distribution of the random walk on the graph -RRB- .
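The ranking step can be sketched as plain power iteration with a damping factor over a weighted, undirected graph; the toy co-occurrence graph below is invented for illustration, and the stationary distribution is approximated by iterating rather than solving for the eigenvector exactly.

```python
# A minimal sketch of the ranking step: power iteration with a damping factor
# (the "random surfer model") over a weighted, undirected graph.
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping vertex -> {neighbour: edge weight}."""
    vertices = list(graph)
    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        new_rank = {}
        for v in vertices:
            incoming = sum(rank[u] * nbrs[v] / sum(nbrs.values())
                           for u, nbrs in graph.items() if v in nbrs)
            new_rank[v] = (1 - damping) / len(vertices) + damping * incoming
        rank = new_rank
    return rank

# Hypothetical toy co-occurrence graph.
toy = {
    "natural":    {"language": 2.0, "processing": 1.0},
    "language":   {"natural": 2.0, "processing": 2.0},
    "processing": {"natural": 1.0, "language": 2.0, "speech": 1.0},
    "speech":     {"processing": 1.0},
}
print(sorted(pagerank(toy).items(), key=lambda kv: -kv[1]))
```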
Design choices What should vertices be ?
The vertices should correspond to what we want to rank .
Potentially , we could do something similar to the supervised methods and create a vertex for each unigram , bigram , trigram , etc. .
However , to keep the graph small , the authors decide to rank individual unigrams in a first step , and then include a second step that merges highly ranked adjacent unigrams to form multi-word phrases .
This has a nice side effect of allowing us to produce keyphrases of arbitrary length .
For example , if we rank unigrams and find that `` advanced '' , `` natural '' , `` language '' , and `` processing '' all get high ranks , then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together .
Note that the unigrams placed in the graph can be filtered by part of speech .
The authors found that adjectives and nouns were the best to include .
Thus , some linguistic knowledge comes into play in this step .
How should we create edges ?
Edges are created based on word co-occurrence in this application of TextRank .
Two vertices are connected by an edge if the unigrams appear within a window of size N in the original text .
N is typically around 2 -- 10 .
Thus , `` natural '' and `` language '' might be linked in a text about NLP .
`` Natural '' and `` processing '' would also be linked because they would both appear in the same string of N words .
These edges build on the notion of `` text cohesion '' and the idea that words that appear near each other are likely related in a meaningful way and `` recommend '' each other to the reader .
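A minimal sketch of this edge-construction step; the part-of-speech filter mentioned above is omitted, and the tokenizer and window handling are simplifying assumptions.

```python
# A minimal sketch of building the co-occurrence graph: two unigrams are
# connected (with a weight counting co-occurrences) whenever they appear
# within the same window of `window` consecutive tokens.
import re
from collections import defaultdict

def cooccurrence_graph(text, window=2):
    words = [w.lower() for w in re.findall(r"[A-Za-z][A-Za-z-]*", text)]
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                graph[w][words[j]] += 1.0   # undirected: add the edge both ways
                graph[words[j]][w] += 1.0
    return graph

g = cooccurrence_graph("Natural language processing enables natural language interfaces", window=2)
print(dict(g["language"]))   # neighbours of "language" and their weights
```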
How are the final keyphrases formed ?
Since this method simply ranks the individual vertices , we need a way to threshold or produce a limited number of keyphrases .
The technique chosen is to set a count T to be a user-specified fraction of the total number of vertices in the graph .
Then the top T vertices\/unigrams are selected based on their stationary probabilities .
A post - processing step is then applied to merge adjacent instances of these T unigrams .
As a result , potentially more or less than T final keyphrases will be produced , but the number should be roughly proportional to the length of the original text .
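The post-processing merge can be sketched as a single pass over the original token sequence that joins runs of top-T unigrams; the example text and unigram set below are assumptions for illustration.

```python
# A minimal sketch of the post-processing step: adjacent words that both made
# the top-T cut are merged back into multi-word keyphrases.
import re

def merge_adjacent(text, top_unigrams):
    words = [w.lower() for w in re.findall(r"[A-Za-z][A-Za-z-]*", text)]
    phrases, current = set(), []
    for w in words:
        if w in top_unigrams:
            current.append(w)
        elif current:
            phrases.add(" ".join(current))
            current = []
    if current:
        phrases.add(" ".join(current))
    return phrases

print(merge_adjacent("Advanced natural language processing is applied to raw text",
                     {"advanced", "natural", "language", "processing", "text"}))
# -> {'advanced natural language processing', 'text'}
```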
Why it works It is not initially clear why applying PageRank to a co-occurrence graph would produce useful keyphrases .
One way to think about it is the following .
A word that appears multiple times throughout a text may have many different co-occurring neighbors .
For example, in a text about machine learning, the unigram "learning" might co-occur with "machine", "supervised", "un-supervised", and "semi-supervised" in four different sentences.
Thus , the `` learning '' vertex would be a central `` hub '' that connects to these other modifying words .
Running PageRank\/TextRank on the graph is likely to rank `` learning '' highly .
Similarly , if the text contains the phrase `` supervised classification '' , then there would be an edge between `` supervised '' and `` classification '' .
If "classification" appears in several other places and thus has many neighbors, its importance would contribute to the importance of "supervised". If "supervised" ends up with a high rank, it will be selected as one of the top T unigrams, along with "learning" and probably "classification".
In the final post-processing step , we would then end up with keyphrases `` supervised learning '' and `` supervised classification '' .
In short , the co-occurrence graph will contain densely connected regions for terms that appear often and in different contexts .
A random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters .
This is similar to densely connected Web pages getting ranked highly by PageRank .
Document summarization Like keyphrase extraction , document summarization hopes to identify the essence of a text .
The only real difference is that now we are dealing with larger text units -- whole sentences instead of words and phrases .
While some work has been done in abstractive summarization -LRB- creating an abstract synopsis like that of a human -RRB- , the majority of summarization systems are extractive -LRB- selecting a subset of sentences to place in a summary -RRB- .
Before getting into the details of some summarization methods , we will mention how summarization systems are typically evaluated .
The most common way is using the so-called ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measure (http://haydn.isi.edu/ROUGE/).
This is a recall-based measure that determines how well a system-generated summary covers the content present in one or more human-generated model summaries known as references .
It is recall-based to encourage systems to include all the important topics in the text .
Recall can be computed with respect to unigram , bigram , trigram , or 4-gram matching , though ROUGE-1 -LRB- unigram matching -RRB- has been shown to correlate best with human assessments of system-generated summaries -LRB- i.e. , the summaries with highest ROUGE-1 values correlate with the summaries humans deemed the best -RRB- .
ROUGE-1 is computed as the number of unigrams in the reference summary that also appear in the system summary, divided by the total number of unigrams in the reference summary.
If there are multiple references , the ROUGE-1 scores are averaged .
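A minimal sketch of ROUGE-1 recall under these definitions (clipped unigram counts, scores averaged over references); this is a simplification of the official ROUGE toolkit.

```python
# A minimal sketch of ROUGE-1 recall: the fraction of reference unigrams that
# also appear in the system summary, averaged over multiple references.
from collections import Counter

def rouge_1(system_summary, reference_summaries):
    system_counts = Counter(system_summary.lower().split())
    scores = []
    for reference in reference_summaries:
        ref_counts = Counter(reference.lower().split())
        overlap = sum(min(count, system_counts[word]) for word, count in ref_counts.items())
        scores.append(overlap / sum(ref_counts.values()))
    return sum(scores) / len(scores)

print(rouge_1("the pumps failed during the storm",
              ["defective pumps failed in the storm", "the flood pumps failed"]))
```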
Because ROUGE is based only on content overlap , it can determine if the same general concepts are discussed between an automatic summary and a reference summary , but it can not determine if the result is coherent or the sentences flow together in a sensible manner .
High-order n-gram ROUGE measures try to judge fluency to some degree .
Note that ROUGE is similar to the BLEU measure for machine translation , but BLEU is precision - based , because translation systems favor accuracy .
A promising line in document summarization is adaptive document\/text summarization .
The idea of adaptive summarization involves preliminary recognition of document\/text genre and subsequent application of summarization algorithms optimized for this genre .
The first summarizers that perform adaptive summarization have been created.
Overview of supervised learning approaches Supervised text summarization is very much like supervised keyphrase extraction , and we will not spend much time on it .
Basically , if you have a collection of documents and human-generated summaries for them , you can learn features of sentences that make them good candidates for inclusion in the summary .
Features might include the position in the document -LRB- i.e. , the first few sentences are probably important -RRB- , the number of words in the sentence , etc. .
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as `` in summary '' or `` not in summary '' .
This is not typically how people create summaries , so simply using journal abstracts or existing summaries is usually not sufficient .
The sentences in these summaries do not necessarily match up with sentences in the original text, so it would be difficult to assign labels to examples for training.
Note , however , that these natural summaries can still be used for evaluation purposes , since ROUGE-1 only cares about unigrams .
Unsupervised approaches : TextRank and LexRank The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data .
Some unsupervised summarization approaches are based on finding a `` centroid '' sentence , which is the mean word vector of all the sentences in the document .
Then the sentences can be ranked with regard to their similarity to this centroid sentence .
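A minimal sketch of this centroid idea, using raw bag-of-words counts rather than the TF-IDF weighting a real system would likely use:

```python
# A minimal sketch of centroid-based ranking: each sentence becomes a
# bag-of-words vector, the vectors are summed into a centroid (cosine is
# scale-invariant, so sum and mean rank identically), and sentences are
# ranked by cosine similarity to that centroid.
import math
import re
from collections import Counter

def bag(sentence):
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid_rank(sentences):
    vectors = [bag(s) for s in sentences]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)
    return sorted(((s, cosine(v, centroid)) for s, v in zip(sentences, vectors)),
                  key=lambda pair: -pair[1])

print(centroid_rank(["The pumps failed during the storm.",
                     "Engineers had warned that the pumps would fail.",
                     "The press conference was held on Friday."]))
```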
A more principled way to estimate sentence importance is using random walks and eigenvector centrality .
LexRank is an algorithm essentially identical to TextRank , and both use this approach for document summarization .
The two methods were developed by different groups at the same time , and LexRank simply focused on summarization , but could just as easily be used for keyphrase extraction or any other NLP ranking task .
Design choices What are the vertices ?
In both LexRank and TextRank , a graph is constructed by creating a vertex for each sentence in the document .
What are the edges ?
The edges between sentences are based on some form of semantic similarity or content overlap .
While LexRank uses cosine similarity of TF-IDF vectors , TextRank uses a very similar measure based on the number of words two sentences have in common -LRB- normalized by the sentences ' lengths -RRB- .
The LexRank paper explored using unweighted edges after applying a threshold to the cosine values , but also experimented with using edges with weights equal to the similarity score .
TextRank uses continuous similarity scores as weights .
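A small sketch of such a similarity measure; the normalisation by the sum of log sentence lengths is one common variant and is an assumption here, while LexRank would instead use cosine similarity of TF-IDF sentence vectors.

```python
# A small sketch of a TextRank-style edge weight between two sentences:
# shared words, normalised here by the sum of the log sentence lengths.
import math
import re

def tokens(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

def sentence_similarity(s1, s2):
    t1, t2 = tokens(s1), tokens(s2)
    if len(t1) < 2 or len(t2) < 2:      # avoid log(1) = 0 in the denominator
        return 0.0
    return len(t1 & t2) / (math.log(len(t1)) + math.log(len(t2)))

print(sentence_similarity("Engineers installed defective pumps.",
                          "The defective pumps failed during the storm."))
```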
How are summaries formed ?
In both algorithms , the sentences are ranked by applying PageRank to the resulting graph .
A summary is formed by combining the top ranking sentences , using a threshold or length cutoff to limit the size of the summary .
TextRank and LexRank differences It is worth noting that TextRank was applied to summarization exactly as described here , while LexRank was used as part of a larger summarization system -LRB- MEAD -RRB- that combines the LexRank score -LRB- stationary probability -RRB- with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights .
In this case , some training documents might be needed , though the TextRank results show the additional features are not absolutely necessary .
Another important distinction is that TextRank was used for single document summarization , while LexRank has been applied to multi-document summarization .
The task remains the same in both cases -- only the number of sentences to choose from has grown .
However , when summarizing multiple documents , there is a greater risk of selecting duplicate or highly redundant sentences to place in the same summary .
Imagine you have a cluster of news articles on a particular event , and you want to produce one summary .
Each article is likely to have many similar sentences , and you would only want to include distinct ideas in the summary .
To address this issue , LexRank applies a heuristic post-processing step that builds up a summary by adding sentences in rank order , but discards any sentences that are too similar to ones already placed in the summary .
The method used is called Cross-Sentence Information Subsumption -LRB- CSIS -RRB- .
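In the spirit of CSIS and MMR (not the exact published procedures), redundancy filtering can be sketched as a greedy pass over the ranked list that skips any sentence too similar to one already selected; the overlap measure and threshold below are assumptions for illustration.

```python
# A minimal sketch of redundancy filtering: walk down the ranked sentence
# list and skip anything too similar to what the summary already contains.
def word_overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def build_summary(ranked_sentences, max_sentences=3, threshold=0.5):
    summary = []
    for sentence in ranked_sentences:        # assumed to be in rank order already
        if all(word_overlap(sentence, kept) < threshold for kept in summary):
            summary.append(sentence)
        if len(summary) == max_sentences:
            break
    return summary

ranked = ["The pumps failed during the storm.",
          "Pumps failed in the storm.",                 # near-duplicate, gets dropped
          "Engineers had warned of the failure."]
print(build_summary(ranked))
```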
Why unsupervised summarization works These methods work based on the idea that sentences `` recommend '' other similar sentences to the reader .
Thus , if one sentence is very similar to many others , it will likely be a sentence of great importance .
The importance of this sentence also stems from the importance of the sentences `` recommending '' it .
Thus , to get ranked highly and placed in a summary , a sentence must be similar to many sentences that are in turn also similar to many other sentences .
This makes intuitive sense and allows the algorithms to be applied to any arbitrary new text .
The methods are domain-independent and easily portable .
One could imagine the features indicating important sentences in the news domain might vary considerably from the biomedical domain .
However , the unsupervised `` recommendation '' - based approach applies to any domain .
Incorporating diversity : GRASSHOPPER algorithm As mentioned above , multi-document extractive summarization faces a problem of potential redundancy .
Ideally , we would like to extract sentences that are both `` central '' -LRB- i.e. , contain the main ideas -RRB- and `` diverse '' -LRB- i.e. , they differ from one another -RRB- .
LexRank deals with diversity as a heuristic final stage using CSIS , and other systems have used similar methods , such as Maximal Marginal Relevance -LRB- MMR -RRB- , in trying to eliminate redundancy in information retrieval results .
We have developed a general purpose graph-based ranking algorithm like Page\/Lex\/TextRank that handles both `` centrality '' and `` diversity '' in a unified mathematical framework based on absorbing Markov chain random walks .
-LRB- An absorbing random walk is like a standard random walk , except some states are now absorbing states that act as `` black holes '' that cause the walk to end abruptly at that state . -RRB-
The algorithm is called GRASSHOPPER for reasons that should soon become clear .
In addition to explicitly promoting diversity during the ranking process , GRASSHOPPER incorporates a prior ranking -LRB- based on sentence position in the case of summarization -RRB- .
Maximum entropy-based summarization It is an abstractive method .
Even though automating abstractive summarization is the goal of summarization research , most practical systems are based on some form of extractive summarization .
Extracted sentences can form a valid summary in themselves or form a basis for further condensation operations.
Furthermore , evaluation of extracted summaries can be automated , since it is essentially a classification task .
During the DUC 2001 and 2002 evaluation workshops , TNO developed a sentence extraction system for multi-document summarization in the news domain .
The system was based on a hybrid system using a naive Bayes classifier and statistical language models for modeling salience .
Although the system exhibited good results , we wanted to explore the effectiveness of a maximum entropy -LRB- ME -RRB- classifier for the meeting summarization task , as ME is known to be robust against feature dependencies .
Maximum entropy has also been applied successfully for summarization in the broadcast news domain .
Aided summarization Machine learning techniques from closely related fields such as information retrieval or text mining have been successfully adapted to help automatic summarization .
Apart from Fully Automated Summarizers -LRB- FAS -RRB- , there are systems that aid users with the task of summarization -LRB- MAHS = Machine Aided Human Summarization -RRB- , for example by highlighting candidate passages to be included in the summary , and there are systems that depend on post-processing by a human -LRB- HAMS = Human Aided Machine Summarization -RRB- .
Evaluation An ongoing issue in this field is that of evaluation .
Evaluation techniques fall into intrinsic and extrinsic, and into inter-textual and intra-textual. An intrinsic evaluation tests the summarization system in and of itself, while an extrinsic evaluation tests the summarization based on how it affects the completion of some other task.
Intrinsic evaluations have assessed mainly the coherence and informativeness of summaries .
Extrinsic evaluations , on the other hand , have tested the impact of summarization on tasks like relevance assessment , reading comprehension , etc. .
Intra-textual methods assess the output of a specific summarization system, while inter-textual ones focus on contrastive analysis of the outputs of several summarization systems.
Human judgement often has wide variance on what is considered a `` good '' summary , which means that making the evaluation process automatic is particularly difficult .
Manual evaluation can be used , but this is both time and labor intensive as it requires humans to read not only the summaries but also the source documents .
Other issues are those concerning coherence and coverage .
One of the metrics used in NIST 's annual Document Understanding Conferences , in which research groups submit their systems for both summarization and translation tasks , is the ROUGE metric -LRB- Recall-Oriented Understudy for Gisting Evaluation -RRB- .
It essentially calculates n-gram overlaps between automatically generated summaries and previously-written human summaries .
A high level of overlap should indicate a high level of shared concepts between the two summaries .
Note that overlap metrics like this are unable to provide any feedback on a summary 's coherence .
Anaphor resolution remains another problem yet to be fully solved .
Evaluating summaries , either manually or automatically , is a hard task .
The main difficulty in evaluation comes from the impossibility of building a fair gold-standard against which the results of the systems can be compared .
Furthermore, it is also very hard to determine what a correct summary is, because a system may generate a good summary that is quite different from any human summary used as an approximation to the correct output.
Current difficulties in evaluating summaries automatically The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries .
However, as content selection is not a deterministic problem, different people would choose different sentences, and even the same person may choose different sentences at different times, showing evidence of low agreement among humans as to which sentences are good summary sentences. Besides this human variability, semantic equivalence is another problem, because two distinct sentences can express the same meaning without using the same words.
This phenomenon is known as paraphrase .
We can find an approach to automatically evaluating summaries using paraphrases -LRB- ParaEval -RRB- .
Moreover , most summarization systems perform an extractive approach , selecting and copying important sentences from the source documents .
Although humans can also cut and paste relevant information of a text , most of the times they rephrase sentences when necessary , or they join different related information into one sentence .
Evaluating summaries qualitatively The main drawback of the evaluation systems existing so far is that we need at least one reference summary , and for some methods more than one , to be able to compare automatic summaries with models .
This is a hard and expensive task .
Considerable effort is required to build corpora of texts and their corresponding summaries.
Furthermore , for some methods presented in the previous Section , not only do we need to have human-made summaries available for comparison , but also manual annotation has to be performed in some of them -LRB- e.g. SCU in the Pyramid Method -RRB- .
In any case, what the evaluation methods need as input is a set of summaries to serve as gold standards and a set of automatic summaries.
Moreover , they all perform a quantitative evaluation with regard to different similarity metrics .
To overcome these problems, we think that quantitative evaluation might not be the only way to evaluate summaries and that an automatic qualitative evaluation would also be important. Therefore, the second aim of this work is to suggest a novel proposal for evaluating the quality of a summary automatically in a qualitative rather than a quantitative manner. Our evaluation approach is a preliminary one that has to be studied more deeply and developed further. Its main underlying idea is to define several quality criteria and check how a generated summary tackles each of them, so that a reference model would no longer be necessary, taking into consideration only the automatic summary and the original source. Once performed, such an evaluation could be used together with any other automatic methodology to measure a summary's informativeness.
Natural Language Generation -LRB- NLG -RRB- is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form .
Psycholinguists prefer the term language production when such formal representations are interpreted as models for mental representations .
In a sense , one can say that an NLG system is like a translator that converts a computer based representation into a natural language representation .
However , the methods to produce the final language are very different from those of a compiler due to the inherent expressivity of natural languages .
NLG may be viewed as the opposite of natural language understanding .
The difference can be put this way : whereas in natural language understanding the system needs to disambiguate the input sentence to produce the machine representation language , in NLG the system needs to make decisions about how to put a concept into words .
The simplest -LRB- and perhaps trivial -RRB- examples are systems that generate form letters .
Such systems do not typically involve grammar rules , but may generate a letter to a consumer , e.g. stating that a credit card spending limit is about to be reached .
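A minimal sketch of such a form-letter generator; the wording, field names, and amounts are invented for illustration, and no grammar rules are involved.

```python
# A minimal sketch of a "form letter" generator: canned text with slots
# filled from data. Names, amounts and wording are invented for illustration.
def credit_limit_letter(name, balance, limit):
    remaining = limit - balance
    return (f"Dear {name},\n\n"
            f"Your current balance is ${balance:,.2f}. You are within "
            f"${remaining:,.2f} of your ${limit:,.2f} spending limit.\n\n"
            "Sincerely,\nYour Card Provider")

print(credit_limit_letter("A. Customer", balance=4820.00, limit=5000.00))
```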
More complex NLG systems dynamically create texts to meet a communicative goal .
As in other areas of natural language processing , this can be done using either explicit models of language -LRB- e.g. , grammars -RRB- and the domain , or using statistical models derived by analyzing human-written texts .
NLG is a fast-evolving field .
The best single source for up-to-date research in the area is the SIGGEN portion of the ACL Anthology .
Perhaps the closest the field comes to a specialist textbook is Reiter and Dale -LRB- 2000 -RRB- , but this book does not describe developments in the field since 2000 .
Example As a simple example, consider a system that takes as input six numbers, which give predicted pollen levels in different parts of Scotland.
From these numbers , the system generates a short textual summary of pollen levels as its output .
For example, using the historical data for 1 July 2005, the software produces: "Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country. However, in Northern areas, pollen levels will be moderate with values of 4."
In contrast, the actual forecast (written by a human meteorologist) from this data was: "Pollen counts are expected to remain high at level 6 over most of Scotland, and even level 7 in the south east. The only relief is in the Northern Isles and far northeast of mainland Scotland with medium levels of pollen count."
Comparing these two illustrates some of the choices that NLG systems must make ; these are further discussed below .
Stages The process to generate text can be as simple as keeping a list of canned text that is copied and pasted , possibly linked with some glue text .
The results may be satisfactory in simple domains such as horoscope machines or generators of personalised business letters .
However , a sophisticated NLG system needs to include stages of planning and merging of information to enable the generation of text that looks natural and does not become repetitive .
Typical stages are : Content determination : Deciding what information to mention in the text .
For instance , in the pollen example above , deciding whether to explicitly mention that pollen level is 7 in the south east .
Document structuring : Overall organization of the information to convey .
For example , deciding to describe the areas with high pollen levels first , instead of the areas with low pollen levels .
Aggregation : Merging of similar sentences to improve readability and naturalness .
For instance , merging the two sentences Grass pollen levels for Friday have increased from the moderate to high levels of yesterday and Grass pollen levels will be around 6 to 7 across most parts of the country into the single sentence Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country .
Lexical choice : Putting words to the concepts .
For example , deciding whether medium or moderate should be used when describing a pollen level of 4 .
Referring expression generation : Creating referring expressions that identify objects and regions .
For example , deciding to use in the Northern Isles and far northeast of mainland Scotland to refer to a certain region in Scotland .
This task also includes making decisions about pronouns and other types of anaphora .
Realisation : Creating the actual text , which should be correct according to the rules of syntax , morphology , and orthography .
For example , using will be for the future tense of to be .
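As a rough illustration of how the later stages above (document structuring, lexical choice, and realisation) might look for the pollen example, here is a template-based sketch; the thresholds and wording are assumptions, not the rules of any actual forecast generator.

```python
# A template-based sketch of the later NLG stages for the pollen example:
# document structuring (high-pollen regions first), lexical choice
# ("high" vs "moderate" vs "low"), and a simple realisation template.
def lexical_choice(level):
    if level >= 6:
        return "high"
    if level >= 4:
        return "moderate"
    return "low"

def realise(region, level):
    return (f"In {region}, grass pollen levels will be "
            f"{lexical_choice(level)} with values of around {level}.")

data = {"most parts of the country": 7, "Northern areas": 4}
for region, level in sorted(data.items(), key=lambda item: -item[1]):
    print(realise(region, level))
```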
Applications The popular media has been especially interested in NLG systems which generate jokes -LRB- see computational humor -RRB- .
But from a commercial perspective , the most successful NLG applications have been data-to-text systems which generate textual summaries of databases and data sets ; these systems usually perform data analysis as well as text generation .
In particular , several systems have been built that produce textual weather forecasts from weather data .
The earliest such system to be deployed was FoG , which was used by Environment Canada to generate weather forecasts in French and English in the early 1990s .
The success of FoG triggered other work , both research and commercial .
Recent research in this area includes an experiment which showed that users sometimes preferred computer-generated weather forecasts to human-written ones, in part because the computer forecasts used more consistent terminology, and a demonstration that statistical techniques could be used to generate high-quality weather forecasts.
Recent applications include the ARNS system used to summarise conditions in US ports .
In the 1990s there was considerable interest in using NLG to summarise financial and business data .
For example the SPOTLIGHT system developed at A.C. Nielsen automatically generated readable English text based on the analysis of large amounts of retail sales data .
More recently there is growing interest in using NLG to summarise electronic medical records .
Commercial applications in this area are starting to appear , and researchers have shown that NLG summaries of medical data can be effective decision-support aids for medical professionals .
There is also growing interest in using NLG to enhance accessibility , for example by describing graphs and data sets to blind people .
An example for a highly interactive use of NLG is the WYSIWYM framework .
It stands for What you see is what you meant and allows users to see and manipulate the continuously rendered view -LRB- NLG output -RRB- of an underlying formal language document -LRB- NLG input -RRB- , thereby editing the formal language without having to learn it .
Evaluation As in other scientific fields , NLG researchers need to be able to test how well their systems , modules , and algorithms work .
This is called evaluation .
There are three basic techniques for evaluating NLG systems : task-based -LRB- extrinsic -RRB- evaluation : give the generated text to a person , and assess how well it helps him perform a task -LRB- or otherwise achieves its communicative goal -RRB- .
For example, a system which generates summaries of medical data can be evaluated by giving these summaries to doctors and assessing whether the summaries help doctors make better decisions.
human ratings : give the generated text to a person , and ask him or her to rate the quality and usefulness of the text .
metrics : compare generated texts to texts written by people from the same input data , using an automatic metric such as BLEU .
Generally speaking , what we ultimately want to know is how useful NLG systems are at helping people , which is the first of the above techniques .
However , task-based evaluations are time-consuming and expensive , and can be difficult to carry out -LRB- especially if they require subjects with specialised expertise , such as doctors -RRB- .
Hence -LRB- as in other areas of NLP -RRB- task-based evaluations are the exception , not the norm .
In recent years researchers have started trying to assess how well human-ratings and metrics correlate with -LRB- predict -RRB- task-based evaluations .
Much of this work is being conducted in the context of Generation Challenges shared-task events .
Initial results suggest that human ratings are much better than metrics in this regard .
In other words , human ratings usually do predict task-effectiveness at least to some degree -LRB- although there are exceptions -RRB- , while ratings produced by metrics often do not predict task-effectiveness well .
These results are very preliminary; better data will hopefully be available soon. In any case, human ratings are currently the most popular evaluation technique in NLG; this is in contrast to machine translation, where metrics are very widely used.
Natural language understanding is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension .
The process of disassembling and parsing input is more complex than the reverse process of assembling output in natural language generation because of the occurrence of unknown and unexpected features in the input and the need to determine the appropriate syntactic and semantic schemes to apply to it , factors which are pre-determined when outputting language .
There is considerable commercial interest in the field because of its application to news-gathering , text categorization , voice-activation , archiving and large-scale content-analysis .
Eight years after John McCarthy coined the term artificial intelligence , Bobrow 's dissertation -LRB- titled Natural Language Input for a Computer Problem Solving System -RRB- showed how a computer can understand simple natural language input to solve algebra word problems .
A year later , in 1965 , Joseph Weizenbaum at MIT wrote ELIZA , an interactive program that carried on a dialogue in English on any topic , the most popular being psychotherapy .
ELIZA worked by simple parsing and substitution of key words into canned phrases and Weizenbaum sidestepped the problem of giving the program a database of real-world knowledge or a rich lexicon .
Yet ELIZA gained surprising popularity as a toy project and can be seen as a very early precursor to current commercial systems such as those used by Ask.com .
In 1969 Roger Schank at Stanford University introduced the conceptual dependency theory for natural language understanding .
This model , partially influenced by the work of Sydney Lamb , was extensively used by Schank 's students at Yale University , such as Robert Wilensky , Wendy Lehnert , and Janet Kolodner .
In 1970 , William A. Woods introduced the augmented transition network -LRB- ATN -RRB- to represent natural language input .
Instead of phrase structure rules ATNs used an equivalent set of finite state automata that were called recursively .
ATNs and their more general format called `` generalized ATNs '' continued to be used for a number of years .
In 1971 Terry Winograd finished writing SHRDLU for his PhD thesis at MIT .
SHRDLU could understand simple English sentences in a restricted world of children 's blocks to direct a robotic arm to move items .
The successful demonstration of SHRDLU provided significant momentum for continued research in the field .
Winograd continued to be a major influence in the field with the publication of his book Language as a Cognitive Process .
At Stanford , Winograd was later the adviser for Larry Page , who co-founded Google .
In the 1970s and 1980s the natural language processing group at SRI International continued research and development in the field .
A number of commercial efforts based on the research were undertaken , e.g. , in 1982 Gary Hendrix formed Symantec Corporation originally as a company for developing a natural language interface for database queries on personal computers .
However, with the advent of mouse-driven graphical user interfaces, Symantec changed direction.
A number of other commercial efforts were started around the same time , e.g. , Larry R. Harris at the Artificial Intelligence Corporation and Roger Schank and his students at Cognitive Systems corp. .
In 1983 , Michael Dyer developed the BORIS system at Yale which bore similarities to the work of Roger Schank and W. G. Lehnart .
Scope and context The umbrella term `` natural language understanding '' can be applied to a diverse set of computer applications , ranging from small , relatively simple tasks such as short commands issued to robots , to highly complex endeavors such as the full comprehension of newspaper articles or poetry passages .
Many real-world applications fall between the two extremes; for instance, text classification for the automatic analysis of emails and their routing to a suitable department in a corporation does not require in-depth understanding of the text, but is far more complex than the management of simple queries to database tables with fixed schemata.
Throughout the years various attempts at processing natural language or English-like sentences presented to computers have taken place at varying degrees of complexity .
Some attempts have not resulted in systems with deep understanding , but have helped overall system usability .
For example , Wayne Ratliff originally developed the Vulcan program with an English-like syntax to mimic the English speaking computer in Star Trek .
Vulcan later became the dBase system whose easy-to-use syntax effectively launched the personal computer database industry .
Systems with an easy to use or English like syntax are , however , quite distinct from systems that use a rich lexicon and include an internal representation -LRB- often as first order logic -RRB- of the semantics of natural language sentences .
Hence the breadth and depth of `` understanding '' aimed at by a system determine both the complexity of the system -LRB- and the implied challenges -RRB- and the types of applications it can deal with .
The `` breadth '' of a system is measured by the sizes of its vocabulary and grammar .
The `` depth '' is measured by the degree to which its understanding approximates that of a fluent native speaker .
At the narrowest and shallowest , English-like command interpreters require minimal complexity , but have a small range of applications .
Narrow but deep systems explore and model mechanisms of understanding , but they still have limited application .
Systems that attempt to understand the contents of a document such as a news release beyond simple keyword matching and to judge its suitability for a user are broader and require significant complexity , but they are still somewhat shallow .
Systems that are both very broad and very deep are beyond the current state of the art .
Components and architecture Regardless of the approach used , some common components can be identified in most natural language understanding systems .
The system needs a lexicon of the language and a parser and grammar rules to break sentences into an internal representation .
The construction of a rich lexicon with a suitable ontology requires significant effort , e.g. , the Wordnet lexicon required many person-years of effort .
The system also needs a semantic theory to guide the comprehension .
The interpretation capabilities of a language understanding system depend on the semantic theory it uses .
Competing semantic theories of language have specific trade offs in their suitability as the basis of computer automated semantic interpretation .
These range from naive semantics or stochastic semantic analysis to the use of pragmatics to derive meaning from context .
Advanced applications of natural language understanding also attempt to incorporate logical inference within their framework .
This is generally achieved by mapping the derived meaning into a set of assertions in predicate logic , then using logical deduction to arrive at conclusions .
Systems based on functional languages such as Lisp hence need to include a subsystem for the representation of logical assertions , while logic oriented systems such as those using the language Prolog generally rely on an extension of the built in logical representation framework .
The management of context in natural language understanding can present special challenges .
A large variety of examples and counter examples have resulted in multiple approaches to the formal modeling of context , each with specific strengths and weaknesses .
Optical character recognition , usually abbreviated to OCR , is the mechanical or electronic conversion of scanned images of handwritten , typewritten or printed text into machine-encoded text .
It is widely used as a form of data entry from some sort of original paper data source , whether documents , sales receipts , mail , or any number of printed records .
It is crucial to the computerization of printed texts so that they can be electronically searched , stored more compactly , displayed on-line , and used in machine processes such as machine translation , text-to-speech and text mining .
OCR is a field of research in pattern recognition , artificial intelligence and computer vision .
Early versions needed to be programmed with images of each character , and worked on one font at a time .
`` Intelligent '' systems with a high degree of recognition accuracy for most fonts are now common .
Some systems are capable of reproducing formatted output that closely approximates the original scanned page including images , columns and other non-textual components .
In 1914 , Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code .
Around the same time, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters.
Goldberg continued to develop OCR technology for data entry .
Later , he proposed photographing data records and then , using photocells , matching the photos against a template containing the desired identification pattern .
In 1929 Gustav Tauschek had similar ideas , and obtained a patent on OCR in Germany .
Paul W. Handel also obtained a US patent on such template-matching OCR technology in USA in 1933 -LRB- U.S. Patent 1,915,993 -RRB- .
In 1935 Tauschek was also granted a US patent on his method -LRB- U.S. Patent 2,026,329 -RRB- .
In 1949, RCA engineers worked on the first primitive computer-type OCR to help blind people for the US Veterans Administration; their device converted printed characters to machine language and then spoke the letters aloud, an early text-to-speech technology.
It proved far too expensive and was not pursued after testing .
In 1950 , David H. Shepard , a cryptanalyst at the Armed Forces Security Agency in the United States , addressed the problem of converting printed messages into machine language for computer processing and built a machine to do this , called `` Gismo . '' .
He received a patent for his `` Apparatus for Reading '' in 1953 U.S. Patent 2,663,758 .
`` Gismo '' could read 23 letters of the English alphabet , comprehend Morse Code , read musical notations , read aloud from printed pages , and duplicate typewritten pages .
Shepard went on to found Intelligent Machines Research Corporation -LRB- IMR -RRB- , which soon developed the world 's first commercial OCR systems .
In 1955 , the first commercial system was installed at the Reader 's Digest , which used OCR to input sales reports into a computer .
It converted the typewritten reports into punched cards for input into the computer in the magazine 's subscription department , for help in processing the shipment of 15-20 million books a year .
The second system was sold to the Standard Oil Company for reading credit card imprints for billing purposes .
Other systems sold by IMR during the late 1950s included a bill stub reader for the Ohio Bell Telephone Company and a page scanner for the United States Air Force that read typewritten messages and transmitted them by teletype.
IBM and others were later licensed on Shepard 's OCR patents .
In about 1965 , Reader 's Digest and RCA collaborated to build an OCR Document reader designed to digitize the serial numbers on Reader 's Digest coupons returned from advertisements .
The fonts used on the documents were printed by an RCA Drum printer using the OCR-A font .
The reader was connected directly to an RCA 301 computer -LRB- one of the first solid state computers -RRB- .
This reader was followed by a specialised document reader installed at TWA where the reader processed Airline Ticket stock .
The readers processed documents at a rate of 1,500 documents per minute, and checked each document, rejecting those they were not able to process correctly.
The product became part of the RCA product line as a reader designed to process `` Turn around Documents '' such as those utility and insurance bills returned with payments .
The United States Postal Service has been using OCR machines to sort mail since 1965 based on technology devised primarily by the prolific inventor Jacob Rabinow .
The first use of OCR in Europe was by the British General Post Office -LRB- GPO -RRB- .
In 1965 it began planning an entire banking system , the National Giro , using OCR technology , a process that revolutionized bill payment systems in the UK .
Canada Post has been using OCR systems since 1971.
OCR systems read the name and address of the addressee at the first mechanized sorting center , and print a routing bar code on the envelope based on the postal code .
To avoid confusion with the human-readable address field which can be located anywhere on the letter , special ink -LRB- orange in visible light -RRB- is used that is clearly visible under ultraviolet light .
Envelopes may then be processed with equipment based on simple bar code readers .
Importance of OCR to the Blind In 1974 Ray Kurzweil started the company Kurzweil Computer Products , Inc. and continued development of omni-font OCR , which could recognize text printed in virtually any font .
He decided that the best application of this technology would be to create a reading machine for the blind , which would allow blind people to have a computer read text to them out loud .
This device required the invention of two enabling technologies -- the CCD flatbed scanner and the text-to-speech synthesizer .
On January 13, 1976, the successful finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind.
In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program .
LexisNexis was one of the first customers , and bought the program to upload paper legal and news documents onto its nascent online databases .
Two years later , Kurzweil sold his company to Xerox , which had an interest in further commercializing paper-to-computer text conversion .
Xerox eventually spun it off as Scansoft, which merged with Nuance Communications.
OCR software Desktop & Server OCR Software OCR software and ICR software technology are analytical artificial intelligence systems that consider sequences of characters rather than whole words or phrases .
Based on the analysis of sequential lines and curves , OCR and ICR make ` best guesses ' at characters using database look-up tables to closely associate or match the strings of characters that form words .
WebOCR & OnlineOCR With the development of information technology, the platforms on which people use software have expanded from the single PC to multi-platform environments spanning the PC, the Web, cloud computing, and mobile devices. After some 30 years of development, OCR software has begun to adapt to these new application requirements. WebOCR, also known as OnlineOCR or Web-based OCR service, has emerged as a new trend to serve larger volumes and larger groups of users than the desktop OCR of the previous three decades. Internet and broadband technologies have made WebOCR and OnlineOCR practically available to both individual users and enterprise customers. Since 2000, some major OCR vendors have begun offering WebOCR and online software, and a number of new entrants have seized the opportunity to develop innovative Web-based OCR services, some of which are free of charge.
Application-Oriented OCR As OCR technology has been applied more and more widely in paper-intensive industries, it faces increasingly complex imaging conditions in the real world: complicated backgrounds, degraded images, heavy noise, paper skew, picture distortion, low resolution, interference from grids and lines, and text images containing special fonts, symbols, or glossary words. All of these factors affect the stability of OCR products' recognition accuracy.
In recent years , the major OCR technology providers began to develop dedicated OCR systems , each for special types of images .
They combine various optimization methods related to the special image , such as business rules , standard expression , glossary or dictionary and rich information contained in color images , to improve the recognition accuracy .
This strategy of customizing OCR technology is called "Application-Oriented OCR" or "Customized OCR", and it is widely used in fields such as business-card OCR, invoice OCR, screenshot OCR, ID-card OCR, driver's-license OCR, auto plant OCR, and so on.
See also: List of optical character recognition software
Current state of OCR technology Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission to foster the improvement of automated technologies for understanding machine-printed documents, and it conducted the most authoritative of the Annual Tests of OCR Accuracy for five consecutive years in the mid-1990s.
Recognition of Latin-script , typewritten text is still not 100 % accurate even where clear imaging is available .
One study based on recognition of 19th - and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 71 % to 98 % ; total accuracy can be achieved only by human review .
Other areas -- including recognition of hand printing , cursive handwriting , and printed text in other scripts -LRB- especially those East Asian language characters which have many strokes for a single character -RRB- -- are still the subject of active research .
Accuracy rates can be measured in several ways , and how they are measured can greatly affect the reported accuracy rate .
For example , if word context -LRB- basically a lexicon of words -RRB- is not used to correct software finding non-existent words , a character error rate of 1 % -LRB- 99 % accuracy -RRB- may result in an error rate of 5 % -LRB- 95 % accuracy -RRB- or worse if the measurement is based on whether each whole word was recognized with no incorrect letters .
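A rough back-of-the-envelope illustration of this effect, under the simplifying assumption that character errors are independent: a word is counted as correct only if every character in it is correct, so a 1 % character error rate already implies roughly a 5 % error rate on five-letter words.

```python
# Back-of-the-envelope: if character errors were independent, a word is read
# correctly only when every one of its characters is, so a 1 % character
# error rate implies a much larger word error rate.
char_accuracy = 0.99
for word_length in (5, 7, 10):
    word_accuracy = char_accuracy ** word_length
    print(f"{word_length}-letter words: ~{word_accuracy:.1%} word accuracy "
          f"({1 - word_accuracy:.1%} word error rate)")
```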
On-line character recognition is sometimes confused with Optical Character Recognition -LRB- see Handwriting recognition -RRB- .
OCR is an instance of off-line character recognition , where the system recognizes the fixed static shape of the character , while on-line character recognition instead recognizes the dynamic motion during handwriting .
For example , on-line recognition , such as that used for gestures in the Penpoint OS or the Tablet PC can tell whether a horizontal mark was drawn right-to-left , or left-to-right .
On-line character recognition is also referred to by other terms such as dynamic character recognition , real-time character recognition , and Intelligent Character Recognition or ICR .
On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years -LRB- see Tablet PC history -RRB- .
Among these are the input devices for personal digital assistants such as those running Palm OS .
The Apple Newton pioneered this product .
The algorithms used in these devices take advantage of the fact that the order , speed , and direction of individual line segments at input are known .
Also , the user can be retrained to use only specific letter shapes .
These methods can not be used in software that scans paper documents , so accurate recognition of hand-printed documents is still largely an open problem .
Accuracy rates of 80 % to 90 % on neat , clean hand-printed characters can be achieved , but that accuracy rate still translates to dozens of errors per page , making the technology useful only in very limited applications .
Recognition of cursive text is an active area of research , with recognition rates even lower than that of hand-printed text .
Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information .
For example , recognizing entire words from a dictionary is easier than trying to parse individual characters from script .
Reading the Amount line of a cheque -LRB- which is always a written-out number -RRB- is an example where using a smaller dictionary can increase recognition rates greatly .
Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun , for example , allowing greater accuracy .
The shapes of individual cursive characters themselves simply do not contain enough information to accurately -LRB- greater than 98 % -RRB- recognize all handwritten cursive script .
It is necessary to understand that OCR technology is a basic technology also used in advanced scanning applications .
Due to this , an advanced scanning solution can be unique and patented and not easily copied despite being based on this basic OCR technology .
For more complex recognition problems , intelligent character recognition systems are generally used , as artificial neural networks can be made indifferent to both affine and non-linear transformations .
A technique which has had considerable success in recognizing difficult words and character groups within documents generally amenable to computer OCR is to submit them automatically to humans in the reCAPTCHA system .
In corpus linguistics , part-of-speech tagging -LRB- POS tagging or POST -RRB- , also called grammatical tagging or word-category disambiguation , is the process of marking up a word in a text -LRB- corpus -RRB- as corresponding to a particular part of speech , based on both its definition , as well as its context -- i.e. relationship with adjacent and related words in a phrase , sentence , or paragraph .
A simplified form of this is commonly taught to school-age children , in the identification of words as nouns , verbs , adjectives , adverbs , etc. .
Once performed by hand , POS tagging is now done in the context of computational linguistics , using algorithms which associate discrete terms , as well as hidden parts of speech , in accordance with a set of descriptive tags .
POS-tagging algorithms fall into two distinct groups : rule-based and stochastic .
E. Brill 's tagger , one of the first and most widely used English POS-taggers , employs rule-based algorithms .
Part-of-speech ambiguity is not rare : in natural languages -LRB- as opposed to many artificial languages -RRB- , a large percentage of word-forms are ambiguous .
For example , even `` dogs '' , which is usually thought of as just a plural noun , can also be a verb : The sailor dogs the barmaid .
Performing grammatical tagging will indicate that `` dogs '' is a verb , and not the more common plural noun , since one of the words must be the main verb , and the noun reading is less likely following `` sailor '' -LRB- sailor !→ dogs -RRB- .
Semantic analysis can then extrapolate that `` sailor '' and `` barmaid '' implicate `` dogs '' as 1 -RRB- in the nautical context -LRB- sailor → <verb> ← barmaid -RRB- and 2 -RRB- an action applied to the object `` barmaid '' -LRB- -LRB- subject -RRB- dogs → barmaid -RRB- .
In this context , `` dogs '' is a nautical term meaning `` fastens -LRB- a watertight door -RRB- securely ; applies a dog to '' .
`` Dogged '' , on the other hand , can be either an adjective or a past-tense verb .
Just which parts of speech a word can represent varies greatly .
Trained linguists can identify the grammatical parts of speech to various fine degrees depending on the tagging system .
Schools commonly teach that there are 9 parts of speech in English : noun , verb , article , adjective , preposition , pronoun , adverb , conjunction , and interjection .
However , there are clearly many more categories and sub-categories .
For nouns , plural , possessive , and singular forms can be distinguished .
In many languages words are also marked for their `` case '' -LRB- role as subject , object , etc. -RRB- , grammatical gender , and so on ; while verbs are marked for tense , aspect , and other things .
In part-of-speech tagging by computer , it is typical to distinguish from 50 to 150 separate parts of speech for English , for example , NN for singular common nouns , NNS for plural common nouns , NP for singular proper nouns -LRB- see the POS tags used in the Brown Corpus -RRB- .
Work on stochastic methods for tagging Koine Greek -LRB- DeRose 1990 -RRB- has used over 1,000 parts of speech , and found that about as many words were ambiguous there as in English .
A morphosyntactic descriptor in the case of morphologically rich languages can be expressed like Ncmsan , which means Category = Noun , Type = common , Gender = masculine , Number = singular , Case = accusative , Animate = no .
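A minimal sketch of how such a positional descriptor can be decoded into named features ; the position and value tables here are invented for illustration and are not taken from any particular tagset specification .
    # Decode a positional morphosyntactic descriptor such as "Ncmsan".
    NOUN_POSITIONS = ["Type", "Gender", "Number", "Case", "Animate"]
    VALUES = {
        "Type": {"c": "common", "p": "proper"},
        "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
        "Number": {"s": "singular", "p": "plural"},
        "Case": {"n": "nominative", "a": "accusative", "d": "dative", "g": "genitive"},
        "Animate": {"y": "yes", "n": "no"},
    }

    def decode_msd(tag):
        assert tag[0] == "N", "this sketch only handles noun descriptors"
        features = {"Category": "Noun"}
        for name, code in zip(NOUN_POSITIONS, tag[1:]):
            features[name] = VALUES[name].get(code, code)
        return features

    print(decode_msd("Ncmsan"))
    # {'Category': 'Noun', 'Type': 'common', 'Gender': 'masculine',
    #  'Number': 'singular', 'Case': 'accusative', 'Animate': 'no'}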
History The Brown Corpus Research on part-of-speech tagging has been closely tied to corpus linguistics .
The first major corpus of English for computer analysis was the Brown Corpus developed at Brown University by Henry Kucera and Nelson Francis , in the mid-1960s .
It consists of about 1,000,000 words of running English prose text , made up of 500 samples from randomly chosen publications .
Each sample is 2,000 or more words -LRB- ending at the first sentence-end after 2,000 words , so that the corpus contains only complete sentences -RRB- .
The Brown Corpus was painstakingly `` tagged '' with part-of-speech markers over many years .
A first approximation was done with a program by Greene and Rubin , which consisted of a huge handmade list of what categories could co-occur at all .
For example , article then noun can occur , but article verb -LRB- arguably -RRB- can not .
The program got about 70 % correct .
Its results were repeatedly reviewed and corrected by hand , and later users sent in errata , so that by the late 70s the tagging was nearly perfect -LRB- allowing for some cases on which even human speakers might not agree -RRB- .
This corpus has been used for innumerable studies of word-frequency and of part-of-speech , and inspired the development of similar `` tagged '' corpora in many other languages .
Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems , such as CLAWS -LRB- linguistics -RRB- and VOLSUNGA .
However , by this time -LRB- 2005 -RRB- it had been superseded by larger corpora such as the 100-million-word British National Corpus .
For some time , part-of-speech tagging was considered an inseparable part of natural language processing , because there are certain cases where the correct part of speech can not be decided without understanding the semantics or even the pragmatics of the context .
This is extremely expensive , especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word .
Use of Hidden Markov Models In the mid 1980s , researchers in Europe began to use hidden Markov models -LRB- HMMs -RRB- to disambiguate parts of speech , when working to tag the Lancaster-Oslo-Bergen Corpus of British English .
HMMs involve counting cases -LRB- such as from the Brown Corpus -RRB- , and making a table of the probabilities of certain sequences .
For example , once you 've seen an article such as ` the ' , perhaps the next word is a noun 40 % of the time , an adjective 40 % , and a number 20 % .
Knowing this , a program can decide that `` can '' in `` the can '' is far more likely to be a noun than a verb or a modal .
The same method can of course be used to benefit from knowledge about following words .
More advanced -LRB- `` higher order '' -RRB- HMMs learn the probabilities not only of pairs , but triples or even larger sequences .
So , for example , if you 've just seen an article and a verb , the next item may be very likely a preposition , article , or noun , but much less likely another verb .
When several ambiguous words occur together , the possibilities multiply .
However , it is easy to enumerate every combination and to assign a relative probability to each one , by multiplying together the probabilities of each choice in turn .
The combination with highest probability is then chosen .
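The enumerate-and-multiply idea can be shown with a toy sketch ; the bigram probability table and candidate tags below are invented for illustration and do not come from any real corpus .
    from itertools import product

    # Score every combination of candidate tags for "the can holds" by
    # multiplying bigram (previous-tag, next-tag) probabilities together.
    BIGRAM = {
        ("DET", "NOUN"): 0.40, ("DET", "ADJ"): 0.40, ("DET", "NUM"): 0.20,
        ("DET", "VERB"): 0.01, ("DET", "MODAL"): 0.01,
        ("NOUN", "VERB"): 0.35, ("NOUN", "NOUN"): 0.10,
        ("VERB", "NOUN"): 0.20, ("VERB", "VERB"): 0.02,
        ("MODAL", "VERB"): 0.60, ("MODAL", "NOUN"): 0.02,
    }
    candidates = [["DET"], ["NOUN", "VERB", "MODAL"], ["VERB", "NOUN"]]

    def score(tags):
        p = 1.0
        for prev, nxt in zip(tags, tags[1:]):
            p *= BIGRAM.get((prev, nxt), 0.001)  # small default for unseen pairs
        return p

    best = max(product(*candidates), key=score)
    print(best, score(best))  # ('DET', 'NOUN', 'VERB') wins with 0.40 * 0.35 = 0.14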
The European group developed CLAWS , a tagging program that did exactly this , and achieved accuracy in the 93-95 % range .
It is worth remembering , as Eugene Charniak points out in Statistical techniques for natural language parsing , that merely assigning the most common tag to each known word and the tag `` proper noun '' to all unknowns , will approach 90 % accuracy because many words are unambiguous .
CLAWS pioneered the field of HMM-based part of speech tagging , but was quite expensive since it enumerated all possibilities .
It sometimes had to resort to backup methods when there were simply too many -LRB- the Brown Corpus contains a case with 17 ambiguous words in a row , and there are words such as `` still '' that can represent as many as 7 distinct parts of speech -RRB- .
HMMs underlie the functioning of stochastic taggers and are used in various algorithms , one of the most widely used being the bi-directional inference algorithm .
Dynamic Programming methods In 1987 , Steven DeRose and Ken Church independently developed dynamic programming algorithms to solve the same problem in vastly less time .
Their methods were similar to the Viterbi algorithm known for some time in other fields .
DeRose used a table of pairs , while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus -LRB- actual measurement of triple probabilities would require a much larger corpus -RRB- .
Both methods achieved accuracy over 95 % .
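A compact Viterbi-style sketch of the dynamic-programming idea follows ; it is not DeRose 's or Church 's implementation , and the tiny transition and emission tables are invented for illustration .
    import math

    TAGS = ["DET", "NOUN", "VERB"]
    START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
    TRANS = {("DET", "NOUN"): 0.7, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
             ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2, ("NOUN", "DET"): 0.1,
             ("VERB", "DET"): 0.4, ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05}
    EMIT = {("DET", "the"): 0.9,
            ("NOUN", "can"): 0.002, ("VERB", "can"): 0.001,
            ("NOUN", "rusts"): 0.0005, ("VERB", "rusts"): 0.001}

    def viterbi(words):
        # best[t][tag] = (log-probability of best path ending in tag, back-pointer)
        best = [{t: (math.log(START[t] * EMIT.get((t, words[0]), 1e-8)), None)
                 for t in TAGS}]
        for w in words[1:]:
            row = {}
            for t in TAGS:
                cand = [(best[-1][p][0]
                         + math.log(TRANS.get((p, t), 1e-8))
                         + math.log(EMIT.get((t, w), 1e-8)), p) for p in TAGS]
                row[t] = max(cand)
            best.append(row)
        # trace back the highest-scoring path
        tag = max(best[-1], key=lambda t: best[-1][t][0])
        path = [tag]
        for row in reversed(best[1:]):
            tag = row[tag][1]
            path.append(tag)
        return list(reversed(path))

    print(viterbi(["the", "can", "rusts"]))  # ['DET', 'NOUN', 'VERB']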
DeRose 's 1990 dissertation at Brown University included analyses of the specific error types , probabilities , and other related data , and replicated his work for Greek , where it proved similarly effective .
These findings were surprisingly disruptive to the field of natural language processing .
The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis : syntax , morphology , semantics , and so on .
CLAWS , DeRose 's and Church 's methods did fail for some of the known cases where semantics is required , but those proved negligibly rare .
This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing ; this in turn simplified the theory and practice of computerized language analysis , and encouraged researchers to find ways to separate out other pieces as well .
Markov Models are now the standard method for part-of-speech assignment .
Unsupervised taggers The methods already discussed involve working from a pre-existing corpus to learn tag probabilities .
It is , however , also possible to bootstrap using `` unsupervised '' tagging .
Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction .
That is , they observe patterns in word use , and derive part-of-speech categories themselves .
For example , statistics readily reveal that `` the '' , `` a '' , and `` an '' occur in similar contexts , while `` eat '' occurs in very different ones .
With sufficient iteration , similarity classes of words emerge that are remarkably similar to those human linguists would expect ; and the differences themselves sometimes suggest valuable new insights .
Both of these categories -LRB- supervised and unsupervised tagging -RRB- can be further subdivided into rule-based , stochastic , and neural approaches .
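A rough sketch of the induction idea : represent each word by the words that appear next to it and compare those context vectors . The corpus and the single-neighbour representation below are invented simplifications ; a real system would iterate and refine the emerging classes .
    from collections import Counter, defaultdict
    import math

    corpus = ("the cat ate the fish . a dog ate an apple . "
              "the dog saw a cat . an owl ate the mouse .").split()

    contexts = defaultdict(Counter)
    for i, w in enumerate(corpus):
        if i + 1 < len(corpus):
            contexts[w][corpus[i + 1]] += 1  # right neighbour only, for brevity

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a)
        norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm(a) * norm(b) or 1.0)

    print(cosine(contexts["the"], contexts["a"]))    # high: similar contexts
    print(cosine(contexts["the"], contexts["ate"]))  # low: different contexts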
Other taggers and methods Some current major algorithms for part-of-speech tagging include the Viterbi algorithm , Brill Tagger , Constraint Grammar , and the Baum-Welch algorithm -LRB- also known as the forward-backward algorithm -RRB- .
Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm .
The Brill tagger is unusual in that it learns a set of patterns , and then applies those patterns rather than optimizing a statistical quantity .
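The application step of such a transformation-based tagger can be sketched as follows ; the lexicon and the single rule here are hand-written toy examples -LRB- a real Brill tagger learns its rules from a corpus -RRB- .
    # Start from each word's most frequent tag, then apply
    # "change tag X to Y in context C" patterns.
    LEXICON = {"the": "DET", "sailor": "NOUN", "dogs": "NOUN", "barmaid": "NOUN"}

    # (from_tag, to_tag, condition on the previous tag)
    RULES = [("NOUN", "VERB", lambda prev: prev == "NOUN")]

    def brill_tag(words):
        tags = [LEXICON.get(w, "NOUN") for w in words]  # initial guess
        for from_tag, to_tag, cond in RULES:
            for i in range(1, len(tags)):
                if tags[i] == from_tag and cond(tags[i - 1]):
                    tags[i] = to_tag
        return list(zip(words, tags))

    print(brill_tag("the sailor dogs the barmaid".split()))
    # [('the', 'DET'), ('sailor', 'NOUN'), ('dogs', 'VERB'),
    #  ('the', 'DET'), ('barmaid', 'NOUN')]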
Many machine learning methods have also been applied to the problem of POS tagging .
Methods such as SVM , Maximum entropy classifier , Perceptron , and Nearest-neighbor have all been tried , and most can achieve accuracy above 95 % .
A direct comparison of several methods is reported -LRB- with references -RRB- at .
This comparison uses the Penn tag set on some of the Penn Treebank data , so the results are directly comparable .
However , many significant taggers are not included -LRB- perhaps because of the labor involved in reconfiguring them for this particular dataset -RRB- .
Thus , it should not be assumed that the results reported there are the best that can be achieved with a given approach ; nor even the best that have been achieved with a given approach .
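For concreteness , a hedged sketch of the machine-learning formulation mentioned above , using a maximum-entropy-style classifier -LRB- logistic regression -RRB- over simple per-word features ; the tiny training set is invented and scikit-learn is assumed to be available .
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def features(sentence, i):
        return {"word": sentence[i].lower(),
                "suffix3": sentence[i][-3:],
                "prev": sentence[i - 1].lower() if i else "<s>"}

    train = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
             ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
    sent = [w for w, _ in train]
    X = [features(sent, i) for i in range(len(sent))]
    y = [t for _, t in train]

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)

    test = ["the", "bird", "sings"]
    # Predictions on unseen words are only as good as the toy data allows.
    print(model.predict([features(test, i) for i in range(len(test))]))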
Issues While there is broad agreement about basic categories , a number of edge cases make it difficult to settle on a single `` correct '' set of tags , even in a single language such as English .
For example , it is hard to say whether `` fire '' is functioning as an adjective or a noun in `` the big green fire truck '' .
A second important example is the use\/mention distinction , as in the following example , where `` blue '' is clearly not functioning as an adjective -LRB- the Brown Corpus tag set appends the suffix `` - NC '' in such cases -RRB- : the word `` blue '' has 4 letters .
Words in a language other than that of the `` main '' text , are commonly tagged as `` foreign '' , usually in addition to a tag for the role the foreign word is actually playing in context .
There are also many cases where POS categories and `` words '' do not map one to one , for example : `` David 's '' , `` gonna '' , `` do n't '' , `` vice versa '' , `` first-cut '' , `` can not '' , `` pre - and post-secondary '' , and `` look -LRB- a word -RRB- up '' .
In the last example , `` look '' and `` up '' arguably function as a single verbal unit , despite the possibility of other words coming between them .
Some tag sets -LRB- such as Penn -RRB- break hyphenated words , contractions , and possessives into separate tokens , thus avoiding some but far from all such problems .
It is unclear whether it is best to treat words such as `` be '' , `` have '' , and `` do '' as categories in their own right -LRB- as in the Brown Corpus -RRB- , or as simply verbs -LRB- as in the LOB Corpus and the Penn Treebank -RRB- .
`` be '' has more forms than other English verbs , and occurs in quite different grammatical contexts , complicating the issue .
The most popular `` tag set '' for POS tagging for American English is probably the Penn tag set , developed in the Penn Treebank project .
It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets , though much smaller .
In Europe , tag sets from the Eagles Guidelines see wide use , and include versions for multiple languages .
POS tagging work has been done in a variety of languages , and the set of POS tags used varies greatly with language .
Tags usually are designed to include overt morphological distinctions -LRB- this makes the tag sets for heavily inflected languages such as Greek and Latin very large , and makes tagging words in agglutinative languages such as Inuit virtually impossible -RRB- .
However , Petrov , D. Das , and R. McDonald -LRB- `` A Universal Part-of-Speech Tagset '' http:\/\/arxiv.org\/abs\/1104.2086 -RRB- have proposed a `` universal '' tag set , with 12 categories -LRB- for example , no subtypes of nouns , verbs , punctuation , etc. ; no distinction of `` to '' as an infinitive marker vs. preposition , etc. -RRB- .
Whether a very small set of very broad tags , or a much larger set of more precise ones , is preferable , depends on the purpose at hand .
Automatic tagging is easier on smaller tag-sets .
A different issue is that some cases are in fact ambiguous .
Beatrice Santorini gives examples in `` Part-of-speech Tagging Guidelines for the Penn Treebank Project , '' -LRB- 3rd rev , June 1990 -RRB- , including the following -LRB- p. 32 -RRB- case in which entertaining can function either as an adjective or a verb , and there is no evident way to decide : The Duchess was entertaining last night .
In computer science and linguistics , parsing , or , more formally , syntactic analysis , is the process of analyzing a text , made of a sequence of tokens -LRB- for example , words -RRB- , to determine its grammatical structure with respect to a given -LRB- more or less -RRB- formal grammar .
Parsing can also be used as a linguistic term , for instance when discussing how phrases are divided up in garden path sentences .
Parsing is also an earlier term for the diagramming of sentences of natural languages , and is still used for the diagramming of inflected languages , such as the Romance languages or Latin .
The term parsing comes from Latin pars -LRB- ōrātiōnis -RRB- , meaning part -LRB- of speech -RRB- .
Parsing is a common term used in psycholinguistics when describing language comprehension .
In this context , parsing refers to the way that human beings , rather than computers , analyze a sentence or phrase -LRB- in spoken language or text -RRB- `` in terms of grammatical constituents , identifying the parts of speech , syntactic relations , etc. '' This term is especially common when discussing what linguistic cues help speakers to parse garden-path sentences .
The parser often uses a separate lexical analyser to create tokens from the sequence of input characters .
Parsers may be programmed by hand or may be -LRB- semi - -RRB- automatically generated -LRB- in some programming languages -RRB- by a tool .
Human languages See also : Category : Natural language parsing In some machine translation and natural language processing systems , human languages are parsed by computer programs .
Human sentences are not easily parsed by programs , as there is substantial ambiguity in the structure of human language , whose usage is to convey meaning -LRB- or semantics -RRB- among a potentially unlimited range of possibilities , only some of which are germane to the particular case .
So an utterance `` Man bites dog '' versus `` Dog bites man '' is definite on one detail but in another language might appear as `` Man dog bites '' with a reliance on the larger context to distinguish between those two possibilities , if indeed that difference was of concern .
It is difficult to prepare formal rules to describe informal behavior even though it is clear that some rules are being followed .
In order to parse natural language data , researchers must first agree on the grammar to be used .
The choice of syntax is affected by both linguistic and computational concerns ; for instance some parsing systems use lexical functional grammar , but in general , parsing for grammars of this type is known to be NP-complete .
Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community , but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank .
Shallow parsing aims to find only the boundaries of major constituents such as noun phrases .
Another popular strategy for avoiding linguistic controversy is dependency grammar parsing .
Most modern parsers are at least partly statistical ; that is , they rely on a corpus of training data which has already been annotated -LRB- parsed by hand -RRB- .
This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts .
-LRB- See machine learning . -RRB-
Approaches which have been used include straightforward PCFGs -LRB- probabilistic context-free grammars -RRB- , maximum entropy , and neural nets .
Most of the more successful systems use lexical statistics -LRB- that is , they consider the identities of the words involved , as well as their part of speech -RRB- .
However such systems are vulnerable to overfitting and require some kind of smoothing to be effective .
Parsing algorithms for natural language can not rely on the grammar having ` nice ' properties as with manually designed grammars for programming languages .
As mentioned earlier some grammar formalisms are very difficult to parse computationally ; in general , even if the desired structure is not context-free , some kind of context-free approximation to the grammar is used to perform a first pass .
Algorithms which use context-free grammars often rely on some variant of the CKY algorithm , usually with some heuristic to prune away unlikely analyses to save time .
-LRB- See chart parsing . -RRB-
However some systems trade speed for accuracy using , e.g. , linear-time versions of the shift-reduce algorithm .
A somewhat recent development has been parse reranking in which the parser proposes some large number of analyses , and a more complex system selects the best option .
Programming languages The most common use of a parser is as a component of a compiler or interpreter .
This parses the source code of a computer programming language to create some form of internal representation .
Programming languages tend to be specified in terms of a context-free grammar because fast and efficient parsers can be written for them .
Parsers are written by hand or generated by parser generators .
Context-free grammars are limited in the extent to which they can express all of the requirements of a language .
Informally , the reason is that the memory of such a language is limited .
The grammar can not remember the presence of a construct over an arbitrarily long input ; this is necessary for a language in which , for example , a name must be declared before it may be referenced .
More powerful grammars that can express this constraint , however , can not be parsed efficiently .
Thus , it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs -LRB- that is , it accepts some invalid constructs -RRB- ; later , the unwanted constructs can be filtered out .
Overview of process Flow of data in a typical parser The following example demonstrates the common case of parsing a computer language with two levels of grammar : lexical and syntactic .
The first stage is the token generation , or lexical analysis , by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions .
For example , a calculator program would look at an input such as `` 12 \* -LRB- 3 +4 -RRB- ^ 2 '' and split it into the tokens 12 , \* , -LRB- , 3 , + , 4 , -RRB- , ^ , 2 , each of which is a meaningful symbol in the context of an arithmetic expression .
The lexer would contain rules to tell it that the characters \* , + , ^ , -LRB- and -RRB- mark the start of a new token , so meaningless tokens like `` 12 \* '' or '' -LRB- 3 '' will not be generated .
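A minimal lexer sketch for this arithmetic example follows ; the regular expression and token representation are illustrative choices , and the input spacing is normalized for readability .
    import re

    # Split the character stream into number and operator tokens.
    TOKEN_RE = re.compile(r"\s*(?:(\d+)|([*+^()]))")

    def tokenize(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if not m:
                raise SyntaxError("unexpected character at position %d" % pos)
            number, operator = m.groups()
            tokens.append(int(number) if number else operator)
            pos = m.end()
        return tokens

    print(tokenize("12 * (3 + 4) ^ 2"))  # [12, '*', '(', 3, '+', 4, ')', '^', 2]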
The next stage is parsing or syntactic analysis , which is checking that the tokens form an allowable expression .
This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear .
However , not all rules defining programming languages can be expressed by context-free grammars alone , for example type validity and proper declaration of identifiers .
These rules can be formally expressed with attribute grammars .
The final phase is semantic parsing or analysis , which is working out the implications of the expression just validated and taking the appropriate action .
In the case of a calculator or interpreter , the action is to evaluate the expression or program ; a compiler , on the other hand , would generate some kind of code .
Attribute grammars can also be used to define these actions .
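The syntactic and semantic phases for the calculator example can be sketched together with a small recursive-descent parser ; the grammar -LRB- expr , term , factor , base -RRB- is an invented toy grammar , and each function mirrors one context-free rule while evaluating as it recognizes .
    def parse(tokens):
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def eat(expected=None):
            nonlocal pos
            tok = tokens[pos]
            if expected is not None and tok != expected:
                raise SyntaxError("expected %r, got %r" % (expected, tok))
            pos += 1
            return tok

        def expr():                  # expr -> term ('+' term)*
            value = term()
            while peek() == "+":
                eat("+")
                value += term()
            return value

        def term():                  # term -> factor ('*' factor)*
            value = factor()
            while peek() == "*":
                eat("*")
                value *= factor()
            return value

        def factor():                # factor -> base ('^' factor)?  right-assoc
            value = base()
            if peek() == "^":
                eat("^")
                value **= factor()
            return value

        def base():                  # base -> NUMBER | '(' expr ')'
            if peek() == "(":
                eat("(")
                value = expr()
                eat(")")
                return value
            return eat()             # a number token

        result = expr()
        if pos != len(tokens):
            raise SyntaxError("unexpected trailing tokens")
        return result

    print(parse([12, "*", "(", 3, "+", 4, ")", "^", 2]))  # 12 * 7 ** 2 = 588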
Types of parser The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar .
This can be done in essentially two ways : Top-down parsing - Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream by searching for parse trees using a top-down expansion of the given formal grammar rules .
Tokens are consumed from left to right .
Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand-sides of grammar rules .
Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol .
Intuitively , the parser attempts to locate the most basic elements , then the elements containing these , and so on .
LR parsers are examples of bottom-up parsers .
Another term used for this type of parser is Shift-Reduce parsing .
LL parsers and recursive-descent parsers are examples of top-down parsers which can not accommodate left-recursive productions .
Although it has been believed that simple implementations of top-down parsing can not accommodate direct and indirect left-recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars , more sophisticated algorithms for top-down parsing have been created by Frost , Hafiz , and Callaghan which accommodate ambiguity and left recursion in polynomial time and which generate polynomial-size representations of the potentially exponential number of parse trees .
Their algorithm is able to produce both left-most and right-most derivations of an input with regard to a given CFG -LRB- context-free grammar -RRB- .
An important distinction with regard to parsers is whether a parser generates a leftmost derivation or a rightmost derivation -LRB- see context-free grammar -RRB- .
LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation -LRB- although usually in reverse -RRB- .
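A tiny illustration of the leftmost\/rightmost distinction , using an invented toy grammar S -> A B , A -> a , B -> b .
    # Each list gives the sequence of sentential forms in the derivation.
    leftmost  = ["S", "A B", "a B", "a b"]   # always expand the leftmost nonterminal
    rightmost = ["S", "A B", "A b", "a b"]   # always expand the rightmost nonterminal
    print(" => ".join(leftmost))
    print(" => ".join(rightmost))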
In information retrieval and natural language processing -LRB- NLP -RRB- , question answering -LRB- QA -RRB- is the task of automatically answering a question posed in natural language .
To find the answer to a question , a QA computer program may use either a pre-structured database or a collection of natural language documents -LRB- a text corpus such as the World Wide Web or some local collection -RRB- .
QA research attempts to deal with a wide range of question types including : fact , list , definition , How , Why , hypothetical , semantically constrained , and cross-lingual questions .
Search collections vary from small local document collections , to internal organization documents , to compiled newswire reports , to the World Wide Web .
Closed-domain question answering deals with questions under a specific domain -LRB- for example , medicine or automotive maintenance -RRB- , and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies .
Alternatively , closed-domain might refer to a situation where only limited types of questions are accepted , such as questions asking for descriptive rather than procedural information .
Open-domain question answering deals with questions about nearly anything , and can only rely on general ontologies and world knowledge .
On the other hand , these systems usually have much more data available from which to extract the answer .
In contrast to earlier systems that relied on structured databases , current QA systems use text documents as their underlying knowledge source and combine various natural language processing techniques to search for the answers .
Current QA systems typically include a question classifier module that determines the type of question and the type of answer .
After the question is analyzed , the system typically uses several modules that apply increasingly complex NLP techniques on a gradually reduced amount of text .
Thus , a document retrieval module uses search engines to identify the documents or paragraphs in the document set that are likely to contain the answer .
Subsequently a filter preselects small text fragments that contain strings of the same type as the expected answer .
For example , if the question is `` Who invented Penicillin '' , the filter returns text that contains names of people .
Finally , an answer extraction module looks for further clues in the text to determine if the answer candidate can indeed answer the question .
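The pipeline just described -LRB- question classification , retrieval , filtering , answer extraction -RRB- can be sketched schematically as follows ; every component here is a deliberately crude stub , and a real system would plug in a search engine , named-entity recognition , and so on .
    import re

    def classify(question):
        # crude answer-type classifier
        return "PERSON" if question.lower().startswith("who") else "OTHER"

    def retrieve(question, documents):
        # keyword overlap stands in for a document-retrieval module
        terms = set(re.findall(r"\w+", question.lower()))
        return sorted(documents,
                      key=lambda d: -len(terms & set(re.findall(r"\w+", d.lower()))))[:3]

    def extract(answer_type, passages):
        # filter: keep capitalized name-like strings when a PERSON is expected
        if answer_type == "PERSON":
            for p in passages:
                m = re.search(r"[A-Z][a-z]+ [A-Z][a-z]+", p)
                if m:
                    return m.group()
        return None

    docs = ["Penicillin was discovered by Alexander Fleming in 1928 .",
            "The Apollo missions returned lunar rocks ."]
    question = "Who invented Penicillin ?"
    print(extract(classify(question), retrieve(question, docs)))  # Alexander Fleming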
Question answering methods QA is very dependent on a good search corpus - for without documents containing the answer , there is little any QA system can do .
It thus makes sense that larger collection sizes generally lend themselves well to better QA performance , unless the question domain is orthogonal to the collection .
The notion of data redundancy in massive collections , such as the web , means that nuggets of information are likely to be phrased in many different ways in differing contexts and documents , leading to two benefits : By having the right information appear in many forms , the burden on the QA system to perform complex NLP techniques to understand the text is lessened .
Correct answers can be filtered from false positives by relying on the correct answer to appear more times in the documents than instances of incorrect ones .
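This redundancy-based filtering amounts to simple voting over extracted candidates , as in the sketch below ; the candidate list is invented for illustration .
    from collections import Counter

    candidates = ["Alexander Fleming", "Howard Florey", "Alexander Fleming",
                  "Alexander Fleming", "Ernst Chain"]
    print(Counter(candidates).most_common(1)[0])  # ('Alexander Fleming', 3)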
Issues In 2002 a group of researchers wrote a roadmap of research in question answering .
The following issues were identified .
Question classes Different types of questions -LRB- e.g. , `` What is the capital of Lichtenstein ? ''
vs. `` Why does a rainbow form ? ''
vs. `` Did Marilyn Monroe and Cary Grant ever appear in a movie together ? '' -RRB-
require the use of different strategies to find the answer .
Question classes are arranged hierarchically in taxonomies .
Question processing The same information request can be expressed in various ways , some interrogative -LRB- `` Who is the president of the United States ? '' -RRB-
and some assertive -LRB- `` Tell me the name of the president of the United States . '' -RRB- .
A semantic model of question understanding and processing would recognize equivalent questions , regardless of how they are presented .
This model would enable the translation of a complex question into a series of simpler questions , would identify ambiguities and treat them in context or by interactive clarification .
Context and QA Questions are usually asked within a context and answers are provided within that specific context .
The context can be used to clarify a question , resolve ambiguities or keep track of an investigation performed through a series of questions .
-LRB- For example , the question , `` Why did Joe Biden visit Iraq in January 2010 ? ''
might be asking why Vice President Biden visited and not President Obama , why he went to Iraq and not Afghanistan or some other country , why he went in January 2010 and not before or after , or what Biden was hoping to accomplish with his visit .
If the question is one of a series of related questions , the previous questions and their answers might shed light on the questioner 's intent . -RRB-
Data sources for QA Before a question can be answered , it must be known what knowledge sources are available and relevant .
If the answer to a question is not present in the data sources , no matter how well the question processing , information retrieval and answer extraction is performed , a correct result will not be obtained .
Answer extraction Answer extraction depends on the complexity of the question , on the answer type provided by question processing , on the actual data where the answer is searched , on the search method and on the question focus and context .
Answer formulation The result of a QA system should be presented in a way as natural as possible .
In some cases , simple extraction is sufficient .
For example , when the question classification indicates that the answer type is a name -LRB- of a person , organization , shop or disease , etc. -RRB- , a quantity -LRB- monetary value , length , size , distance , etc. -RRB- or a date -LRB- e.g. the answer to the question , `` On what day did Christmas fall in 1989 ? '' -RRB-
the extraction of a single datum is sufficient .
For other cases , the presentation of the answer may require the use of fusion techniques that combine the partial answers from multiple documents .
Real time question answering There is a need to develop Q&A systems that are capable of extracting answers from large data sets in several seconds , regardless of the complexity of the question , the size and multitude of the data sources or the ambiguity of the question .
Multilingual -LRB- or cross-lingual -RRB- question answering The ability to answer a question posed in one language using an answer corpus in another language -LRB- or even several -RRB- .
This allows users to consult information that they can not use directly .
-LRB- See also Machine translation . -RRB-
Interactive QA It is often the case that the information need is not well captured by a QA system , as the question processing part may fail to classify properly the question or the information needed for extracting and generating the answer is not easily retrieved .
In such cases , the questioner might want not only to reformulate the question , but to have a dialogue with the system .
-LRB- For example , the system might ask for a clarification of what sense a word is being used , or what type of information is being asked for . -RRB-
Advanced reasoning for QA More sophisticated questioners expect answers that are outside the scope of written texts or structured databases .
To upgrade a QA system with such capabilities , it would be necessary to integrate reasoning components operating on a variety of knowledge bases , encoding world knowledge and common-sense reasoning mechanisms , as well as knowledge specific to a variety of domains .
User profiling for QA The user profile captures data about the questioner , comprising context data , domain of interest , reasoning schemes frequently used by the questioner , common ground established within different dialogues between the system and the user , and so forth .
The profile may be represented as a predefined template , where each template slot represents a different profile feature .
Profile templates may be nested one within another .
History Some of the early AI systems were question answering systems .
Two of the most famous QA systems of that time are BASEBALL and LUNAR , both of which were developed in the 1960s .
BASEBALL answered questions about the US baseball league over a period of one year .
LUNAR , in turn , answered questions about the geological analysis of rocks returned by the Apollo moon missions .
Both QA systems were very effective in their chosen domains .
In fact , LUNAR was demonstrated at a lunar science convention in 1971 and it was able to answer 90 % of the questions in its domain posed by people untrained on the system .
Further restricted-domain QA systems were developed in the following years .
The common feature of all these systems is that they had a core database or knowledge system that was hand-written by experts of the chosen domain .
Some of the early AI systems included question-answering abilities .
Two of the most famous early systems are SHRDLU and ELIZA .
SHRDLU simulated the operation of a robot in a toy world -LRB- the `` blocks world '' -RRB- , and it offered the possibility to ask the robot questions about the state of the world .
Again , the strength of this system was the choice of a very specific domain and a very simple world with rules of physics that were easy to encode in a computer program .
ELIZA , in contrast , simulated a conversation with a psychologist .
ELIZA was able to converse on any topic by resorting to very simple rules that detected important words in the person 's input .
It had a very rudimentary way to answer questions , and on its own it led to a series of chatterbots such as the ones that participate in the annual Loebner prize .
The 1970s and 1980s saw the development of comprehensive theories in computational linguistics , which led to the development of ambitious projects in text comprehension and question answering .
One example of such a system was the Unix Consultant -LRB- UC -RRB- , a system that answered questions pertaining to the Unix operating system .
The system had a comprehensive hand-crafted knowledge base of its domain , and it aimed at phrasing the answer to accommodate various types of users .
Another project was LILOG , a text-understanding system that operated on the domain of tourism information in a German city .
The systems developed in the UC and LILOG projects never went past the stage of simple demonstrations , but they helped the development of theories on computational linguistics and reasoning .
An increasing number of systems include the World Wide Web as one more corpus of text .
However , these tools mostly work by using shallow methods , as described above -- thus returning a list of documents , usually with an excerpt containing the probable answer highlighted , plus some context .
Furthermore , highly-specialized natural language question-answering engines , such as EAGLi for health and life scientists , have been made available .
The Future of Question Answering QA systems have been extended in recent years to explore critical new scientific and practical dimensions .
For example , systems have been developed to automatically answer temporal and geospatial questions , definitional questions , biographical questions , multilingual questions , and questions from multimedia -LRB- e.g. , audio , imagery , video -RRB- .
Additional aspects such as interactivity -LRB- often required for clarification of questions or answers -RRB- , answer reuse , and knowledge representation and reasoning to support question answering have been explored .
Future research may explore what kinds of questions can be asked and answered about social media , including sentiment analysis .
A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts , typically from text or XML documents .
The task is very similar to that of information extraction -LRB- IE -RRB- , but IE additionally requires the removal of repeated relations -LRB- disambiguation -RRB- and generally refers to the extraction of many different relationships .
Approaches One approach to this problem involves the use of domain ontologies .
Another approach involves visual detection of meaningful relationships in parametric values of objects listed on a data table that shift positions as the table is permuted automatically as controlled by the software user .
The poor coverage , rarity and development cost of structured resources such as semantic lexicons -LRB- e.g. WordNet , UMLS -RRB- and domain ontologies -LRB- e.g. the Gene Ontology -RRB- have given rise to new approaches based on broad , dynamic background knowledge on the Web .
For instance , the ARCHILES technique uses only Wikipedia and search engine page count for acquiring coarse-grained relations to construct lightweight ontologies .
The relationships can be represented using a variety of formalisms\/languages .
One such representation language for data on the Web is RDF .
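A minimal sketch of representing an extracted relationship as RDF triples , assuming the rdflib package is available ; the example namespace, entities and property names are invented .
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX["AlexanderFleming"], EX["discovered"], EX["Penicillin"]))
    g.add((EX["Penicillin"], EX["label"], Literal("penicillin")))

    for subj, pred, obj in g:
        print(subj, pred, obj)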
Sentence boundary disambiguation -LRB- SBD -RRB- , also known as sentence breaking , is the problem in natural language processing of deciding where sentences begin and end .
Often natural language processing tools require their input to be divided into sentences for a number of reasons .
However sentence boundary identification is challenging because punctuation marks are often ambiguous .
For example , a period may denote an abbreviation , decimal point , an ellipsis , or an email address - not the end of a sentence .
About 47 % of the periods in the Wall Street Journal corpus denote abbreviations .
As well , question marks and exclamation marks may appear in embedded quotations , emoticons , computer code , and slang .
Languages like Japanese and Chinese have unambiguous sentence-ending markers .
The standard ` vanilla ' approach to locating the end of a sentence is : -LRB- a -RRB- If it 's a period , it ends a sentence .
-LRB- b -RRB- If the preceding token is in a hand-compiled list of abbreviations , then it does n't end a sentence .
-LRB- c -RRB- If the next token is capitalized , then it ends a sentence .
This strategy gets about 95 % of sentences correct .
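The three hand-written rules above can be implemented roughly as follows ; the abbreviation list is a small invented sample , not an exhaustive one .
    ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e.", "p.m."}

    def split_sentences(text):
        tokens = text.split()
        sentences, current = [], []
        for i, tok in enumerate(tokens):
            current.append(tok)
            if tok.endswith((".", "!", "?")):
                nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
                if tok.lower() in ABBREVIATIONS:
                    continue                        # rule (b): abbreviation, no break
                if nxt and not nxt[0].isupper():
                    continue                        # rule (c): next token not capitalized
                sentences.append(" ".join(current)) # rule (a): default break at a period
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(split_sentences("Dr. Smith met Mr. Jones. He arrived at 5 p.m. yesterday. It was fun."))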
Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked .
Solutions have been based on a maximum entropy model .
The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5 % accuracy .
Sentiment analysis or opinion mining refers to the application of natural language processing , computational linguistics , and text analytics to identify and extract subjective information in source materials .
Generally speaking , sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document .
The attitude may be his or her judgement or evaluation -LRB- see appraisal theory -RRB- , affective state -LRB- that is to say , the emotional state of the author when writing -RRB- , or the intended emotional communication -LRB- that is to say , the emotional effect the author wishes to have on the reader -RRB- .
Advanced , `` beyond polarity '' sentiment classification looks , for instance , at emotional states such as `` angry , '' `` sad , '' and `` happy . ''
Early work in that area includes Turney and Pang who applied different methods for detecting the polarity of product reviews and movie reviews respectively .
This work is at the document level .
One can also classify a document 's polarity on a multi-way scale , which was attempted by Pang and Snyder -LRB- among others -RRB- : Pang expanded the basic task of classifying a movie review as either positive or negative to predicting star ratings on either a 3 - or a 4-star scale , while Snyder performed an in-depth analysis of restaurant reviews , predicting ratings for various aspects of the given restaurant , such as the food and atmosphere -LRB- on a five-star scale -RRB- .
A different method for determining sentiment is the use of a scaling system whereby words commonly associated with a negative , neutral or positive sentiment are given an associated number on a -5 to +5 scale -LRB- most negative up to most positive -RRB- ; when a piece of unstructured text is analyzed using natural language processing , the concepts it contains are analyzed for an understanding of these sentiment words and how they relate to each concept .
Each concept is then given a score based on the way sentiment words relate to the concept , and their associated score .
This allows movement to a more sophisticated understanding of sentiment based on an 11 point scale .
Alternatively , texts can be given a positive and negative sentiment strength score if the goal is to determine the sentiment in a text rather than the overall polarity and strength of the text .
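A sketch of the word-scoring approach described above : each sentiment word carries a weight on a scale such as -5 to +5 , and a text 's score aggregates those weights . The lexicon values and the crude negation rule are invented for illustration .
    LEXICON = {"good": 3, "great": 4, "excellent": 5,
               "bad": -3, "terrible": -5, "slow": -2}
    NEGATIONS = {"not", "never", "no"}

    def sentiment(text):
        words = text.lower().split()
        score = 0
        for i, w in enumerate(words):
            if w in LEXICON:
                weight = LEXICON[w]
                if i > 0 and words[i - 1] in NEGATIONS:
                    weight = -weight  # crude negation handling
                score += weight
        return score

    print(sentiment("the food was great but the service was not good"))  # 4 - 3 = 1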
Another research direction is subjectivity\/objectivity identification .
This task is commonly defined as classifying a given text -LRB- usually a sentence -RRB- into one of two classes : objective or subjective .
This problem can sometimes be more difficult than polarity classification : the subjectivity of words and phrases may depend on their context and an objective document may contain subjective sentences -LRB- e.g. , a news article quoting people 's opinions -RRB- .
Moreover , as mentioned by Su , results are largely dependent on the definition of subjectivity used when annotating texts .