This project provides a script to convert HiTS (Historical Tagset) tags to STTS (Stuttgart-Tübingen Tagset) tags. The main script, hits_to_stts.py, reads input tags, applies the conversion rules, and outputs the mapped tags, highlighting any unknown tags encountered.
This project aims to facilitate the conversion of part-of-speech (POS) tags from the HiTS format, used for Middle High German texts, to the more modern STTS format. This is particularly useful for linguistic research and the development of language processing tools that need to work with historical texts.
- HiTS-tagset.txt: Contains the HiTS tags to be converted.
- Tiger_v8_tagset.txt: Contains the STTS tags for comparison and validation.
- hits_to_stts.py: The main script that handles the conversion of tags from HiTS to STTS.
To run the script, you need Python installed on your system. Clone this repository and navigate to the project directory.
git clone https://github.com/yourusername/hits-to-stts.git
cd hits-to-stts
Run the script using Python:
python hits_to_stts.py
The script reads the tags from HiTS-tagset.txt, applies the conversion rules, and outputs the mapped tags, along with any unknown tags encountered during the conversion.
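The read, convert, and report flow described above can be sketched as follows. This is a minimal illustration, not the code in hits_to_stts.py: the helper names `convert_lines` and `format_report`, the one-tag-per-line input format, the single-entry mapping, and the tag `FOO` are all assumptions made for the example.

```python
def convert_lines(lines, mapping):
    """Map each non-empty line to its target tag.

    Returns (pairs, unknown): pairs holds (source, target) tuples for tags
    found in the mapping; unknown holds unmapped tags in first-seen order.
    """
    pairs, unknown = [], []
    for line in lines:
        tag = line.strip()
        if not tag:
            continue  # skip blank lines
        if tag in mapping:
            pairs.append((tag, mapping[tag]))
        elif tag not in unknown:
            unknown.append(tag)
    return pairs, unknown


def format_report(pairs, unknown):
    """Render mapped pairs one per line, followed by any unknown tags."""
    out = [f"{src}\t{dst}" for src, dst in pairs]
    if unknown:
        out.append("Unknown tags: " + ", ".join(unknown))
    return "\n".join(out)


# Hypothetical data: a one-entry mapping and three input lines.
# In practice the lines would come from an open file such as HiTS-tagset.txt.
example_mapping = {"ADJN": "ADJA"}
example_lines = ["ADJN\n", "\n", "FOO\n"]

pairs, unknown = convert_lines(example_lines, example_mapping)
print(format_report(pairs, unknown))
```

Collecting unknown tags instead of raising an error lets a single run list every gap in the mapping at once.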
The script hits_to_stts.py implements a dictionary-based mapping of tags. Here’s a brief overview of the logic used:
- Dictionary Mapping: A dictionary hits_to_stts is used to map HiTS tags to STTS tags.
- POS Extraction: The function get_pos extracts the POS component from the tag.
- Conversion Function: The function convert uses the dictionary to convert HiTS tags to STTS tags. If a tag is not found in the dictionary, it is added to the unknown_pos set.
- Tag Mapping: The map_tag function handles more complex mappings, including multiple tag features and edge labels.
For example, the HiTS tag ADJN.Masc.Nom.Sg would be converted to the STTS tag ADJA.Masc.Nom.Sg based on the predefined rules in the dictionary.
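The dictionary lookup and the ADJN example can be illustrated with a short sketch. The names hits_to_stts, get_pos, convert, and unknown_pos come from the description above, but their signatures and bodies here are assumptions, as are the dot-separated feature format and the made-up tag `XYZ.Nom`. The real script may differ, and map_tag (features and edge labels) is not modelled.

```python
# Hypothetical excerpt of the mapping; only the ADJN entry is taken
# from the example in this README.
hits_to_stts = {"ADJN": "ADJA"}

# POS components that had no entry in the dictionary.
unknown_pos = set()


def get_pos(tag):
    """Return the POS component, assumed to be the part before the first '.'."""
    return tag.split(".", 1)[0]


def convert(tag):
    """Replace the POS component of a HiTS tag and keep its features.

    If the POS component is not in the dictionary, record it in
    unknown_pos and return the tag unchanged.
    """
    pos = get_pos(tag)
    if pos not in hits_to_stts:
        unknown_pos.add(pos)
        return tag
    features = tag[len(pos):]  # e.g. ".Masc.Nom.Sg", or "" for a bare tag
    return hits_to_stts[pos] + features


print(convert("ADJN.Masc.Nom.Sg"))  # ADJA.Masc.Nom.Sg
print(convert("XYZ.Nom"))           # XYZ.Nom (unmapped, returned as-is)
print(sorted(unknown_pos))          # ['XYZ']
```

Splitting on the first dot means only the POS component is looked up, so one dictionary entry covers every feature combination of that tag.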
- Enhancements in Tagger: Improve the accuracy of tagging historical texts by refining the tagger.
- Expanding the Tag Mapping: Address non-deterministic mappings and collisions more effectively.
- Integration with Modern Parsers: Explore further integration with modern parsing tools for better performance and accuracy.
This project is licensed under the MIT License. See the LICENSE file for more details.