Data description:
- Training data: 8152 sentences
  - Training Data Package 1: all sentences (5069) from DP-2019
  - Training Data Package 2: 3083 sentences
- Viettreebank Testing data: 1123 sentences
  - 906 sentences from Viettreebank
  - 217 sentences from vnexpress.vn
Data annotation:
- Word segmentation: Review and correct all word segmentation errors in all data sets
- Part of speech tagging: Review and correct all POS errors in all data sets
- Dependency label set: 38 main labels and 47 sub-labels
Model | LAS | UAS | Method | Reference | Code |
---|---|---|---|---|---|
PhoBERT+ELMo / Biaffine | 76.27 | 84.65 | Doan VLSP'20 | | |
fastText Embed / Biaffine | 75.64 | 84.08 | Nguyen VLSP'20 | | |
Graph Neural Networks | 73.19 | 81.71 | Nguyen et al. VLSP'20 | | |

Model | LAS | UAS | Method | Reference | Code |
---|---|---|---|---|---|
PhoBERT+ELMo / Biaffine / VnCoreNLP | 67.32 | 76.12 | Doan VLSP'20 | | |
fastText Embed / Biaffine / VnCoreNLP | 65.3 | 74.47 | Nguyen VLSP'20 | | |
Graph Neural Networks | 64.35 | 72.85 | Nguyen et al. VLSP'20 | | |
The Vietnamese UD treebank is a conversion of the constituent treebank created in the VLSP project (https://vlsp.hpda.vn/).
Data description:
- 3000 sentences and 43754 tokens
Model | LAS | UAS | Method | Reference | Code |
---|---|---|---|---|---|
Trankit v0.3.1 | 64.76 | 70.96 | Nguyen et al. EACL-DEMO'21 | Official | |
Stanza v1.1.1 | 48.16 | 53.63 | Qi et al. ACL-SD'20 | Official | |
BKTreebank 1.0 contains 6,909 Vietnamese sentences annotated with POS tags and dependency relations. The treebank is divided into a training set of 5,639 sentences and a test set of 1,270 sentences for training and evaluating POS taggers and dependency parsers.
The Vietnamese dependency treebank VnDT contains 10,200 sentences. It uses the 10-column data format proposed by the CoNLL shared tasks on multilingual dependency parsing.
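As a rough illustration of that 10-column layout, the sketch below reads tab-separated token lines into sentences. The column names follow the CoNLL-X convention; the sample sentence, its POS tags, and its dependency labels are made up for illustration and are not taken from VnDT, and `parse_conll` and `Token` are hypothetical helper names.

```python
from typing import List, NamedTuple

# Column names of the CoNLL-X 10-column format.
CONLL_COLUMNS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


class Token(NamedTuple):
    id: int
    form: str
    postag: str
    head: int      # 0 means the token is attached to the artificial root
    deprel: str


def parse_conll(text: str) -> List[List[Token]]:
    """Split CoNLL text into sentences; blank lines separate sentences."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLL_COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        row = dict(zip(CONLL_COLUMNS, fields))
        current.append(
            Token(int(row["id"]), row["form"], row["postag"],
                  int(row["head"]), row["deprel"])
        )
    if current:
        sentences.append(current)
    return sentences


# Illustrative sentence only (assumed tags and labels, not VnDT data).
SAMPLE = "\n".join("\t".join(r) for r in [
    ["1", "Tôi", "_", "P", "P", "_", "2", "sub", "_", "_"],
    ["2", "yêu", "_", "V", "V", "_", "0", "root", "_", "_"],
    ["3", "Việt_Nam", "_", "Np", "Np", "_", "2", "dob", "_", "_"],
])

for token in parse_conll(SAMPLE)[0]:
    print(token.id, token.form, token.head, token.deprel)
```

Only the columns a parser typically needs (ID, FORM, POSTAG, HEAD, DEPREL) are kept; the remaining columns are read and discarded.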
Model | LAS | UAS | Method | Reference | Code |
---|---|---|---|---|---|
PhoBERT-base | 78.77 | 85.22 | Liu et al. '18 | Nguyen et al. '20 | Official |
PhoBERT-large | 77.85 | 84.32 | Liu et al. '18 | Nguyen et al. '20 | Official |
Biaffine | 74.99 | 81.19 | Dozat and Manning ICLR'17 | Nguyen '18 | |
JointWPD | 73.90 | 80.12 | Nguyen '18 | | |
jPTDP-v2 | 73.12 | 79.63 | Nguyen et al. CoNLL'18 | Nguyen '18 | Official |
VnCoreNLP (unsegmented) | 71.38 | 77.35 | Nguyen et al. NAACL'18 | Nguyen '18 | Official |

Model | LAS | UAS | Method | Reference | Code |
---|---|---|---|---|---|
VnCoreNLP | 73.39 | 79.02 | Nguyen et al. NAACL'18 | Official | |
Biaffine | 71.73 | 78.45 | Dozat and Manning ICLR'17 | Nguyen '18 | |
JointWPD | 70.50 | 77.04 | Nguyen '18 | | |
jPTDP-v2 | 69.81 | 76.60 | Nguyen et al. CoNLL'18 | Nguyen '18 | Official |
VnCoreNLP (unsegmented) | 67.79 | 74.24 | Nguyen et al. NAACL'18 | Nguyen '18 | Link |
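
The LAS and UAS columns in the tables above are the labeled and unlabeled attachment scores. A minimal sketch of how they are commonly computed is given below, assuming gold and predicted analyses are aligned token by token and every token is scored; individual benchmarks may treat punctuation or segmentation mismatches differently. The function name `attachment_scores` and the example arcs are hypothetical.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) as percentages over aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the dependency label match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * las_hits / n, 100.0 * uas_hits / n


# Made-up arcs for a four-token sentence.
gold = [(2, "sub"), (0, "root"), (2, "dob"), (3, "nmod")]
pred = [(2, "sub"), (0, "root"), (2, "vmod"), (2, "nmod")]

las, uas = attachment_scores(gold, pred)
print(las, uas)  # 50.0 75.0
```

Because a labeled match requires the head to be correct as well, LAS can never exceed UAS for the same parser output.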
📜 Papers
💫 Services: OpenFPT: Vitk (2017)
📁 Open sources
- nlp-uoregon/trankit (2021), Python
- VinAIResearch/PhoNLP (2021), Python
- datquocnguyen/jPTDP (2017), Java
- phuonglh/vn.vitk (2016), Java
- VnDP (2014), Java