May 5, 2023
* Release 0.6
* Change in generated orthography: adapted to the official orthography of the Academia Aragonesa de la Lengua (https://www.academiaaragonesadelalengua.org/sites/default/files/ficheros-pdf/Ortograf%C3%ADa%20de%20l%27aragon%C3%A9s_web_an.pdf)
* Keeps analysis compatibility with previous orthography (apertium-arg)
* New entries added.
* Minor fixes performed.
* Bilingual dictionary:
> 26437 entries (8015 proper nouns).
* Main pending issues: improve disambiguation with CG, add support for generating dialectal preferences
8 January 2021
* Release 0.5 (with change in the language codes apertium-es-an --> apertium-spa-arg)
* Added constraint grammar support (included in spa->arg direction, added support in arg->spa direction)
* Changed arg.prob file
* Adapted to changes in apertium-spa and apertium-arg
* Added support for new np labels and acronyms
* Definite articles generated according to "Gramatica Basica de l'Aragonés" neutral standard paradigm.
* Support for short ordinals and digital hour
* Improved apostrophation rules in post generation. Fix problems with "de + en (prep)".
* Added support for superlatives with "muito/muit/molt"
* Minor fixes performed
* Bilingual dictionary:
> 25040 entries (8038 proper nouns)
* Main pending Issues:
* Still room to improve on coverage (in apertium-arg).
* Still much to work on transfer.
* Improve morphological disambiguation, mainly in arg->spa direction (CG rules)
* Develop lexical selection rules.
* Release 0.4 (with change in the language codes apertium-es-an --> apertium-spa-arg)
* Main changes since version 0.3:
* Adoption of three-letter codes: spa (old es) and arg (old an).
* Use monolingual packages apertium-spa and apertium-arg (separated monolingual and bilingual data).
* Trimming of the monolingual dictionary (in both directions, although only the spa monodix is really trimmed since the arg monodix has grown in parallel to the bidix).
* Added lexical selection support (translate-to-default-equivalent.xsl not used anymore).
* Improvement in coverage and morphology of Aragonese monolingual dictionary.
> 552 paradigms,
> 21271 lemmae (26103 entries, of which, 8150 proper nouns), including 2518 multi-words.
Naïve coverage: from 89.8% in an.wikipedia to 92.4% in a corpus of narrative texts.
* Some improvements and fixes in verbal morphology.
* Several ungrammatical or rare combinations of enclitic pronouns are now ignored.
* Some changes in the preferred (generated) form: 'pa' (instead of 'ta') to translate 'para', 'cantatz' (instead of 'cantat') as p2.pl imperative, 'coixeyo' rather than 'coixeo'...
* Extended support for dialectal verbal morphology analysis ('anatz', 'veden', 'anirem', 'quisto', Benasquese strong infinitives)
* new paradigms ("coch/�n__n", "troz__n", "restaur/ant__n", "contpa__pr")
* duplicated paradigms eliminated (e.g. "tapi/z__n")
* Main pending Issues:
* Still room to improve on coverage.
* Still much to work on transfer.
* Develop lexical selection rules
Tue 29 Dec 2012 9:00:00 UTC
* Release 0.3
* Improvements since version 0.2:
* Improved tagger (PoS disambiguator) for Aragonese. Changes in the tagger
rules (...an.tsx) and unsupervised training in the Wikipedia corpus.
* Improvement in coverage and morphology of Aragonese monolingual dictionary.
> 546 paradigms,
> 21073 lemmae (of which, 8047 proper nouns), including 1921 multi-words.
~568 000 different analyzable surface forms,
~117 500 different generated surface forms.
Naïve coverage: from 88.4% in an.wikipedia to 94.0% in a corpus of narrative texts.
* Some paradigms fixed (e.g. seguir)
* Some rules fixed (e.g. prep+el+que --> prep+o+que)
* Some rules added (véase --> se veiga, léaselo --> se lo leiga, behaviour of
buen, mal before a noun)
* Improved distinction between a<pr> and a<det> (a2) and two meanings of "ta" so
that the postgenerator can apply different rules to them.
* Use of apocopated possessives mi, tu, su is dealt with in analysis.
* Periphrastic past dealt with in analysis (synthetic forms are used for
generation, although rules for using periphrastic instead are written and
commented).
* Main pending Issues:
* Still room to improve on coverage.
* Still much to work on transfer.
Tue 6 Sept 2011 9:00:00 UTC
* Release 0.2.1
* Solved postgeneration of a (preposition) + a (article f sg) using double
label <a/><a/> for a (article f sg).
* See release 0.2.0 for main pending issues.
Sun 10 Jul 2011 19:20:14 IST
* Release 0.2.0
* First operative bidirectional release.
* Improvements:
* Aragonese monodix and bidix in the initial release (0.1.0) were checked
and corrected.
* Huge improvement in the completeness of verbal and nominal morphology.
Wide paradigms analysing most dialectal forms and orthographic variations
(see Note on orthography below).
* Closed classes completed.
* Treatment of enclitics (up to combinations of two enclitics), including
dialectal non-standard combinations.
* Treatment of apostrophation (analysis and (post)-generation).
* Orthographic variation concentrated on monodix.
* Dialectal variation concentrated (when possible) on monodix.
* Possessives fixed.
* Impersonal haber-ie dealt with.
* Important increase in the dictionary size and coverage.
> 510 paradigms,
> 17500 lemmae (of which, 8025 proper nouns), including 1786 multi-words.
~500 000 analyzable surface forms,
~120 000 generated surface forms.
86.3% naïve coverage in Wikipedia.
* es monodix trimmed to bidix using ignore label (i="yes").
* Main pending Issues:
* Still room to improve on coverage.
* Still much to work on transfer (current version based on a reduced and
tuned subset of rules from es-ca).
* Current POS disambiguator is directly taken from es-ca.
Note: Orthography is taken from "Academia de l'Aragonés - Estudio de
Filolochía Aragonesa", a board created in the II Congreso de l'Aragonés.
http://www.academiadelaragones.org/biblio/EDACAR7_2.pdf .
It is the orthography used in Wikipedia in Aragonese
(http://an.wikipedia.org).
There is partial compatibility in analysis with other previously used
orthographies.
Sun Sep 26 15:42:44 IST 2010
* Initial release (0.1.0)
* Caveats:
- Functions only in an->es direction
- Several closed category words missing from an analyser
(including "ir")
- "Cowboys, Ted!"
This system has been put together in a very shoddy, MacGyver-ish
way:
The majority of the lexicon has been composed on the basis
of presumed cognates. For the most part, this has been
restricted to Latin derivatives, but on more than one occasion, I
simply went nuts and pulled in anything the Spanish analyser would
recognise.
The only bitexts available were the UN Declaration of Human Rights
and the welcome message for new users of the Aragonese Wikipedia.
Statistical methods were not widely employed.
To deal with the spelling variations, I abused the heck out of sed,
filtering unknowns repeatedly before passing the result through the
analyser, to pluck out the results. Many of the ~8000 words in the
bilingual lexicon are mere variations. (In a particularly ironic
twist, it has 3 variations of 'normalización'). These variants will
need to be sorted out to have es->an: the first translation made with
this system before release was of the document on an.wikipedia
describing the new spelling rules.
Although I got some notes from Juan Pablo Martínez on the equivalents
of ser and estar, I was not able to get further information. My
"solution" is to ignore the issue and come back to it later.
Also, Juan Pablo added some vocabulary to the analyser, most of which
I have not been able to use for lack of translations. Hopefully, we
can get these reinstated soon.
A tagger has yet to be trained for Aragonese; during development, I found
the Spanish tagger to be sufficient, and so have used that. This is a
temporary measure.