A neural-symbolic dependency parser for Coptic texts, combining state-of-the-art neural dependency parsing with symbolic Prolog-based grammatical validation.
The Coptic Dependency Parser is a specialized NLP tool for analyzing Coptic texts using the Universal Dependencies formalism. It integrates:
- Neural Parsing: DiaParser (biaffine attention parser) for accurate dependency structure prediction
- Symbolic Validation: Prolog-based grammatical rules for Coptic-specific syntactic patterns
- Text Normalization: Automatic handling of combining diacritical marks to prevent unknown tokens
- Interactive Visualization: Graphical dependency tree display with transliteration
This hybrid neural-symbolic approach enhances parsing accuracy by leveraging both data-driven learning and explicit linguistic knowledge of Coptic grammar, including:
- VSO (Verb-Subject-Object) word order patterns
- Tripartite nominal sentences (Subject-Copula-Predicate)
- Coptic article and pronoun systems
- Morphological agreement rules
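To make the Subject-Copula-Predicate pattern concrete, here is a minimal Python sketch of how a tripartite nominal sentence might be spotted in a token list. It is an illustration only, not the project's Prolog rules: the function name `find_tripartite` and the simple "first copula splits the sentence" heuristic are assumptions, and only the three copula forms ⲡⲉ, ⲧⲉ, and ⲛⲉ are considered.

```python
# Hypothetical sketch; the real grammar rules live in the project's Prolog files.
COPULAS = {"ⲡⲉ", "ⲧⲉ", "ⲛⲉ"}  # masculine, feminine, and plural copula forms


def find_tripartite(tokens):
    """Return (subject, copula, predicate) if the tokens look like a
    Subject-Copula-Predicate sentence, otherwise None."""
    for i, tok in enumerate(tokens):
        # The copula needs at least one token on each side.
        if tok in COPULAS and 0 < i < len(tokens) - 1:
            return tokens[:i], tok, tokens[i + 1:]
    return None


print(find_tripartite("ⲁⲛⲟⲕ ⲡⲉ ⲡⲛⲟⲩⲧⲉ".split()))
```

A surface check like this ignores POS tags and dependency relations, which is why the parser pairs pattern recognition with a full dependency analysis.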
- 🔍 Dependency Parsing: Neural dependency parsing using DiaParser trained on Coptic Scriptorium data
- 📝 Interactive GUI: User-friendly interface with virtual Coptic keyboard
- 🌳 Tree Visualization: Graphical dependency trees with arc labels and POS tags
- 📊 Multiple Export Formats: HTML and PDF export of parsing results
- ⚙️ Text Normalization: Automatic preprocessing to handle combining diacritics
- ✅ Grammatical Validation: Prolog-based pattern matching and error detection
- Tripartite Pattern Recognition: Automatic detection of Coptic nominal sentences
- VSO Word Order Validation: Coptic-specific syntactic constraint checking
- Morphological Analysis: Article stripping and clitic identification
- POS Tagging: Stanza-based part-of-speech tagging for Coptic
- Lemmatization: Basic lemmatization support
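The text normalization feature listed above deals with combining diacritics that would otherwise produce unknown tokens. One common way to do this, shown here as a sketch under the assumption that the marks can simply be dropped, uses Python's standard `unicodedata` module. The function name `strip_combining_marks` is hypothetical, and the project's own `coptic_text_normalizer.py` may behave differently.

```python
import unicodedata


def strip_combining_marks(text):
    """Remove nonspacing combining marks (Unicode category Mn) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))


# A Coptic letter followed by U+0305 COMBINING OVERLINE (a supralinear stroke).
sample = "ⲙ\u0305ⲙⲟϥ"
print(strip_combining_marks(sample))
```

Dropping the marks trades some orthographic information for a smaller vocabulary, so a real pipeline might keep the original string alongside the normalized one.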
- Python 3.8 or higher
- SWI-Prolog (for Prolog integration)
# Ubuntu/Debian
sudo apt-get install swi-prolog
# macOS
brew install swi-prolog
# Windows: Download from https://www.swi-prolog.org/download/stable
- Clone the repository
  git clone https://github.com/Rogaton/coptic-dependency-parser.git
  cd coptic-dependency-parser
- Install Python dependencies
  pip install -r requirements.txt
- Download Stanza models for Coptic
  python3 -c "import stanza; stanza.download('cop')"
- Download or train the DiaParser model
  The parser can work with either:
  - Pre-trained Coptic model (place in models/cop.diaparser)
  - Stanza's built-in dependency parser (automatic fallback)
  See config.py for model path configuration.
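The choice between a local DiaParser model and the Stanza fallback can be pictured with the following sketch. It assumes the lookup order is "environment variable `COPTIC_DIAPARSER_MODEL`, then `models/cop.diaparser`, then Stanza"; the function name `choose_backend` and its return values are hypothetical, and the actual logic in config.py may differ.

```python
import os
from pathlib import Path

# Default location mentioned in the installation steps.
DEFAULT_MODEL = Path("models") / "cop.diaparser"


def choose_backend(env=os.environ):
    """Return ("diaparser", path) if a model file exists, else ("stanza", None)."""
    candidate = Path(env.get("COPTIC_DIAPARSER_MODEL", DEFAULT_MODEL))
    if candidate.is_file():
        return "diaparser", candidate
    # No usable model file: fall back to Stanza's built-in dependency parser.
    return "stanza", None


print(choose_backend())
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.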
python3 coptic-parser.py
- Input Text: Type or paste Coptic text in the input field, or use the virtual keyboard
- Parse: Click "Parse & Analyze Dependencies" to process the text
- View Results:
  - Parse Text Tab: See detailed token-level analysis
  - Dependency Graph Tab: Navigate through visual dependency trees
  - Dependency Table Tab: Export results to HTML or PDF
ⲁⲛⲟⲕ ⲡⲉ ⲡⲛⲟⲩⲧⲉ
This tripartite sentence ("I am God") will be analyzed with:
- Dependency structure showing subject-copula-predicate relations
- Automatic pattern recognition
- POS tagging and lemmatization
from coptic_prolog_rules import create_prolog_engine
import stanza

# Initialize NLP pipeline
nlp = stanza.Pipeline('cop', processors='tokenize,pos,lemma,depparse')

# Initialize Prolog validation
prolog = create_prolog_engine()

# Parse text
text = "ⲁⲛⲟⲕ ⲡⲉ ⲡⲛⲟⲩⲧⲉ"
doc = nlp(text)

# Validate each sentence with Prolog
for sentence in doc.sentences:
    words = [word.text for word in sentence.words]
    pos_tags = [word.upos for word in sentence.words]
    heads = [word.head for word in sentence.words]
    deprels = [word.deprel for word in sentence.words]
    validation = prolog.validate_parse_tree(words, pos_tags, heads, deprels)
    print(validation)

coptic-dependency-parser/
├── coptic-parser.py                 # Main GUI application
├── coptic_prolog_rules.py           # Prolog integration module
├── coptic_text_normalizer.py        # Text preprocessing
├── config.py                        # Configuration management
├── coptic_grammar.pl                # Prolog dependency grammar rules
├── coptic_lexicon.pl                # Coptic lexical database (6,842+ entries)
├── requirements.txt                 # Python dependencies
├── LICENSE                          # CC BY-NC-SA 4.0 License
├── README.md                        # This file
├── tools/                           # Evaluation and comparison tools
│   ├── parser_comparison_tool.py    # Compare with CopticScriptorium
│   └── evaluate_baseline.py         # Baseline performance evaluation
├── docs/                            # Documentation
│   └── evaluation/                  # Evaluation reports and documentation
│       ├── CORPUS_COMPARATIVE_ANALYSIS.md
│       └── COPTICSCRIPTORIUM_README.md
├── data/
│   ├── depparse/                    # Training/evaluation data (CoNLL-U format)
│   ├── lexicon/                     # Lexical resources
│   └── tokenize/                    # Tokenization data
└── models/
    └── cop.diaparser                # DiaParser model (download separately)
Model paths and settings can be configured in config.py:
# Set custom model path via environment variable
export COPTIC_DIAPARSER_MODEL=/path/to/your/cop.diaparser
# Or place model file in:
# ./models/cop.diaparser

This parser uses linguistic data from:
- Coptic Scriptorium - UD-annotated Coptic corpus for training
- Universal Dependencies - Dependency annotation scheme
- Comprehensive Coptic Lexicon - Extracted morphological information
Compare the dependency parser with CopticScriptorium's morpheme-level tagger to understand their complementary strengths:
# Run comparison on example texts
python3 tools/parser_comparison_tool.py
# Compare specific text
python3 tools/parser_comparison_tool.py "ⲁϥⲥⲱⲧⲙ ⲙⲙⲟϥ"

Key Differences:
- Dependency Parser: Word-level tokenization, syntactic structure, UD framework
- CS Tagger: Morpheme-level segmentation, TreeTagger format, corpus annotation
Both tools share underlying components (Till analyzers, normalization) but serve different research purposes. See COPTICSCRIPTORIUM_README.md for details.
Evaluated across diverse Coptic text genres:
| Corpus Type | Coverage | Characteristics |
|---|---|---|
| Documentary Papyri | 95.6% | Simple, formulaic syntax |
| Monastic Literature | 93.7% | Standardized prescriptive language |
| Biblical Texts | 93.1% | Translation Greek (Koine → Sahidic) |
| Literary Texts | 82.4% | Complex rhetorical structures |
The parser achieves 82-96% coverage using Till's grammar modules (§35-50 Articles, §292-304 Conjunctions, §309-319 Negations, §245-268 Morphology). See CORPUS_COMPARATIVE_ANALYSIS.md for detailed evaluation results.
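The README does not spell out how coverage is computed. As a rough illustration, the sketch below assumes coverage means the percentage of tokens that at least one grammar module recognizes. Both the function `coverage` and the toy `known` set are hypothetical and are not taken from tools/evaluate_baseline.py.

```python
def coverage(tokens, is_recognized):
    """Percentage of tokens for which is_recognized(token) is true."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if is_recognized(tok))
    return 100.0 * hits / len(tokens)


# Toy stand-in for a grammar module: a set of forms it can analyze.
known = {"ⲁⲛⲟⲕ", "ⲡⲉ"}
tokens = ["ⲁⲛⲟⲕ", "ⲡⲉ", "ⲡⲛⲟⲩⲧⲉ"]
print(round(coverage(tokens, lambda tok: tok in known), 1))  # 66.7
```

Under this reading, the figures in the table would be token-level percentages per corpus; CORPUS_COMPARATIVE_ANALYSIS.md remains the authoritative description.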
# Evaluate parser performance on test corpora
python3 tools/evaluate_baseline.py
# Evaluate on specific corpus files
python3 tools/evaluate_baseline.py corpus1.txt corpus2.txt

- Neural Parser: BiAffine attention mechanism (Dozat & Manning, 2017)
- POS Tagger: Stanza neural pipeline for Coptic
- Validation Engine: PySwip integration with SWI-Prolog
- Visualization: Matplotlib-based dependency graph rendering
- Handles multi-sentence documents
- Real-time validation with Prolog constraints
- Supports batch processing via command-line interface
- VSO transitive/intransitive sentences
- Tripartite nominal sentences (standard & converted)
- Determiner-noun phrases with gender agreement
- Adjective modification (post-nominal)
- Prepositional phrases
- Coordination structures
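Pattern checks like those listed above presuppose a well-formed dependency tree. The following sketch shows one way such a precondition could be tested in Python before handing a parse to the validator. It assumes the 1-based head convention of CoNLL-U, where 0 marks the root, which is also what Stanza's `word.head` uses. The function `check_heads` is hypothetical and is not part of the project's Prolog engine.

```python
def check_heads(heads):
    """Check a head list (1-based indices, 0 = root) for structural problems.

    Returns a list of problem descriptions; an empty list means the
    heads describe a single-rooted tree without cycles.
    """
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        return ["head index out of range"]
    problems = []
    roots = [i + 1 for i, h in enumerate(heads) if h == 0]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    for start in range(1, n + 1):
        seen = set()
        node = start
        # Walk up towards the root; revisiting a node means a cycle.
        while node != 0:
            if node in seen:
                problems.append(f"cycle reachable from token {start}")
                break
            seen.add(node)
            node = heads[node - 1]
    return problems


print(check_heads([0, 1, 1]))  # []
```

The `heads` list built in the Python API example has the right shape for this check, so it could run before `validate_parse_tree` is called.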
Original Work: Copyright (c) 2024-2025 André Linden
Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material
Under the following terms:
- Attribution — You must give appropriate credit
- NonCommercial — You may not use the material for commercial purposes
- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license
Third-Party Dependencies: Retain their respective original licenses (Apache 2.0, MIT, etc.). See LICENSES/ directory for details.
This parser integrates multiple open-source NLP tools and resources:
- Coptic Scriptorium
  - Coptic NLP models and annotated corpus
  - Citation: Zeldes, A., & Schroeder, C. T. (2016). "An NLP Pipeline for Coptic"
  - License: Creative Commons licenses (varies by component)
- Stanza
  - Tokenization, POS tagging, and lemmatization for Coptic
  - Citation: Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). "Stanza: A Python Natural Language Processing Toolkit for Many Human Languages"
  - License: Apache 2.0
- DiaParser - Biaffine Dependency Parser
  - Neural dependency parsing implementation
  - Citation: Attardi, G., et al. (2009)
  - License: Apache 2.0
- Stanford CoreNLP
  - Neural dependency parsing architecture
  - Citation: Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). "The Stanford CoreNLP Natural Language Processing Toolkit"
  - License: GPL v3+
- Universal Dependencies Project - Annotation scheme and guidelines
- Comprehensive Coptic Lexicon - Morphological database
- PySwip - SWI-Prolog Python interface
- Matplotlib - Visualization library
- WeasyPrint - PDF export functionality
- Experimental stage: Results should be manually validated
- Large model files (50+ MB) require Git LFS or separate download
- Prolog integration requires SWI-Prolog installation
- Some rare Coptic constructions may not be recognized
As this is an experimental research project, contributions and feedback are welcome:
- Report bugs or issues via GitHub Issues
- Suggest linguistic patterns or grammatical rules
- Contribute test cases or example texts
For questions, suggestions, or collaboration inquiries:
- Author: André Linden
- Email: relanir@bluewin.ch
- Project: Coptic NLP Research
If you use this parser in your research, please cite:
@software{coptic_dependency_parser,
author = {Linden, André},
title = {Coptic Dependency Parser: A Neural-Symbolic Approach},
year = {2024-2025},
url = {https://github.com/Rogaton/coptic-dependency-parser},
note = {Experimental version}
}

Note: This is research software under active development. Always verify parsing results manually for critical applications.