The vro tokeniser-disamb-gt-desc.pmhfst has problem with UTF-8 combination t AND U+0301 in lemma readout (Bugzilla Bug 2647) #1

Closed
albbas opened this issue Feb 21, 2020 · 2 comments
Labels: bug

Comments

albbas commented Feb 21, 2020

This issue was created automatically with bugzilla2github

Bugzilla Bug 2647

Date: 2020-02-21T16:31:08+01:00
From: Jack Rueter <<rueter.jack>>
To: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
CC: trond.trosterud

Last updated: 2020-02-21T16:31:08+01:00

albbas commented Feb 21, 2020

Comment 13845

Date: 2020-02-21 16:31:08 +0100
From: Jack Rueter <<rueter.jack>>

Created attachment 229
png of tokeniser output for vro text with lemma containing U+0301

cd main/langs/vro

head config.log
$ ./configure --with-hfst --without-xfst --enable-tokenisers --enable-reversed-intersect --enable-spellers --enable-alignment --enable-apertium --enable-dicts --enable-morpher --with-giella-shared=/Users/rueter/main/giella-shared --with-giella-core=/Users/rueter/main/giella-core GIELLA_CORE=/Users/rueter/main/giella-core/dir GTCORE=/Users/rueter/./main/giella-core GIELLA_SHARED=/Users/rueter/main/giella-shared/dir

echo 'mitte' | hfst-tokenise --giella-cg -W $GTHOME/langs/vro/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |less

"<mitte>"
"mit"t́ N Pl Gen
"mit"t́ N Pl Ill
"mit"t́ N Pl Par
"mi"t́"mä" V Act Ind Prt Sg3
:\n

In lemma-final position, the combination t + U+0301 is left outside the quoted lemma; see "mit"t́

In non-final position, the combination is again left outside the quotes, and the lemma material that follows it is quoted separately; see "mi"t́"mä"
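
For readers unfamiliar with the character sequence involved, here is a minimal Python sketch of what "t AND U+0301" looks like at the code point and byte level. It is only an illustration of the input, not a diagnosis of the tokeniser; the helper name `describe` is made up for this example.

```python
import unicodedata

# "t" followed by COMBINING ACUTE ACCENT (U+0301), as in the lemma from this report.
T_ACUTE = "t\u0301"


def describe(text):
    """Return a (code point, Unicode name) pair for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Two separate code points: the base letter and the combining mark.
print(describe(T_ACUTE))

# Unicode has no precomposed "t with acute", so NFC normalisation
# cannot merge the pair into a single code point.
nfc = unicodedata.normalize("NFC", T_ACUTE)
print(len(nfc))

# In UTF-8 the pair takes three bytes: one for "t", two for U+0301.
print(T_ACUTE.encode("utf-8"))
```

Any tool that treats the lemma one code point or one byte at a time therefore sees the accent as a separate unit from the t.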

Attached file: vro-tokeniser-problem-2020-02-22.png (image/png, 149077 bytes)
Description: png of tokeniser output for vro text with lemma containing U+0301

albbas transferred this issue from giellalt/bugzilla-dummy on Sep 3, 2024

flammie commented Sep 6, 2024

works today:

$ echo 'mitte' | hfst-tokenise --giella-cg -W tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<mitte>"
	"mitt́" N Pl Gen
	"mitt́" N Pl Ill
	"mitt́" N Pl Par
	"mit́mä" V Act Ind Prt Sg3
:\n
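
A check like this could be scripted as a regression test. The following Python sketch assumes a simplified shape for a reading line (one double-quoted lemma followed by space-separated tags that contain no quotes); the function name `is_well_formed_reading` and the pattern are hypothetical and not part of HFST or the Giella infrastructure.

```python
import re

# Assumed shape of a reading line: optional indentation, one quoted lemma,
# then whitespace-separated tags without any double quotes in them.
READING = re.compile(r'\s*"[^"]+"(?:\s+[^\s"]+)*\s*')


def is_well_formed_reading(line):
    """Return True if the lemma is one quoted unit with nothing glued to it."""
    return READING.fullmatch(line) is not None


# Readings as reported in 2020: the t + U+0301 pair falls outside the quotes.
broken = ['"mit"t\u0301 N Pl Gen', '"mi"t\u0301"mä" V Act Ind Prt Sg3']

# Readings from the 2024 output: the whole lemma is quoted.
fixed = ['\t"mitt\u0301" N Pl Gen', '\t"mit\u0301mä" V Act Ind Prt Sg3']

for line in broken + fixed:
    print(is_well_formed_reading(line), repr(line))
```

To use it, the output of the `hfst-tokenise` command above could be piped to a script that applies the function to each reading line, skipping the `"<...>"` wordform lines and the `:\n` separator.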

flammie closed this as completed on Sep 6, 2024