Universal Conceptual Cognitive Annotation (UCCA) is a novel semantic approach to grammatical representation. It was developed in the Computational Linguistics Lab of the Hebrew University by Omri Abend and Ari Rappoport.
The central idea of the project is to analyze and annotate natural languages using purely semantic categories and structure (a graph). Syntactic categories and structure are not part of the manual annotation, and are ideally learned implicitly by the parsers. The basic set of semantic categories (the foundational layer) is inspired by work in linguistic typology, cognitive grammar, and neuroscience. The development of additional layers, such as semantic roles and super-senses (adapted from the CARMLS project), is underway.
The annotation has so far focused on argument-structure and linkage phenomena. We build primarily on Basic Linguistic Theory (R.M.W. Dixon, 2010a; 2010b; 2012), a widely used approach for language description. We acknowledge that there are many applicable analyses for a given sentence, but for practical reasons we select a small set of highly useful distinctions and apply them to provide one plausible annotation.
We have annotated 160K tokens from English Wikipedia with the UCCA scheme, as well as a 30K English-French parallel corpus based on Jules Verne's "20K Leagues Under The Sea", and a 120K-token corpus of the entire book in German. Pilot studies were conducted on several other languages as well.
This page contains links to all of UCCA's resources: corpora, annotation guidelines, parser and code. If you use these resources in your research, please cite the following or other relevant publications:
A tutorial on Cross-lingual Semantic Representation for NLP with UCCA was presented at COLING 2020. All presentations are available on GitHub.
UCCAApp is a web application for phrase-based annotation in general, and UCCA parsing in particular.
Formally, it supports DAG structures, discontiguous units and multiple categories.
The app supports configurable multi-layer annotation and task management, and is written in Django and AngularJS.
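To make the three formal properties above concrete (DAG structure, discontiguous units, multiple categories per edge), here is a minimal Python sketch. It is not UCCAApp's data model or the UCCA toolkit's API: the `Edge` and `AnnotationGraph` names, the unit identifiers, and the category labels on the toy sentence are all assumptions chosen for illustration, not a guideline-conformant annotation.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Edge:
    parent: str
    child: str
    categories: Tuple[str, ...]  # an edge may carry more than one category


class AnnotationGraph:
    """Hypothetical toy container for a phrase-based annotation graph."""

    def __init__(self, tokens: List[str]) -> None:
        self.tokens = list(tokens)
        self.edges: List[Edge] = []
        self.leaves: Dict[str, int] = {}  # leaf unit id -> token position

    def add_leaf(self, unit: str, position: int) -> None:
        self.leaves[unit] = position

    def add_edge(self, parent: str, child: str, *categories: str) -> None:
        self.edges.append(Edge(parent, child, tuple(categories)))

    def span(self, unit: str) -> Set[int]:
        """Token positions covered by a unit, collected over its descendants."""
        if unit in self.leaves:
            return {self.leaves[unit]}
        covered: Set[int] = set()
        for edge in self.edges:
            if edge.parent == unit:
                covered |= self.span(edge.child)
        return covered

    def is_discontiguous(self, unit: str) -> bool:
        """True if the unit's token positions contain a gap."""
        positions = self.span(unit)
        return bool(positions) and max(positions) - min(positions) + 1 != len(positions)

    def parents(self, unit: str) -> List[str]:
        return [edge.parent for edge in self.edges if edge.child == unit]


# Toy sentence; labels are illustrative placeholders only.
g = AnnotationGraph(["He", "gave", "everything", "up", "and", "left"])
for i in range(6):
    g.add_leaf(f"w{i}", i)

g.add_edge("verb", "w1", "C")
g.add_edge("verb", "w3", "C", "F")     # two categories on one edge
g.add_edge("scene1", "w0", "A")
g.add_edge("scene1", "verb", "P")
g.add_edge("scene1", "w2", "A")
g.add_edge("scene2", "w0", "A")        # "He" gets a second parent: DAG, not tree
g.add_edge("scene2", "w5", "P")
g.add_edge("root", "scene1", "H")
g.add_edge("root", "w4", "L")
g.add_edge("root", "scene2", "H")

print(g.is_discontiguous("verb"))   # "gave ... up" skips "everything"
print(g.parents("w0"))              # shared child of both scenes
```

Storing categories as a tuple on the edge, rather than a single label on the node, is what lets the same unit play different roles under different parents.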
UCCA-annotated corpora include the guidelines version they were compiled with in their repository.
The most up-to-date guidelines are available on GitHub (the most recent one is generally in draft mode, but see releases).
[v2 guidelines: pdf]
[latest guidelines: pdf]
All corpora are publicly available under a Creative Commons Attribution-ShareAlike 3.0 Unported license.
The guidelines with which each of them was annotated can be found in the repository.
Corpus | Link
English Wikipedia | [github]
English Web Treebank | [github]
English 20K Leagues Under The Sea | [github]
Excerpt of the PTB WSJ | [github]
German 20K Leagues Under The Sea | [github]
French 20K Leagues Under The Sea | [github]
German The Little Prince | [github]
Hebrew The Little Prince | [github]
Russian The Little Prince | [github]
English The Little Prince | [github]
Datasets produced by other labs:
Corpus | Link | Paper
Turkish 50 sentences from the METU-Sabanci Turkish Treebank | [github] | [paper]
TUPA is a transition-based parser for Universal Conceptual Cognitive Annotation (UCCA), developed by Daniel Hershcovich, Omri Abend and Ari Rappoport.
It can be installed by:
Python toolkit for reading and manipulating UCCA structures.
The code was written by Amit Beka and Daniel Hershcovich.
It can be installed by:
[Code: github]
UCCA was targeted in the following public parsing competitions, which accompanied top-tier NLP conferences:
SemEval 2019 Task 1
The task included open and closed tracks on English, French and German UCCA corpora from Wikipedia and Twenty Thousand Leagues Under the Sea.
Evaluation is done by labeled F1 on the graph edges, matched by child terminal yield.
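As a rough illustration of this metric, the sketch below computes labeled F1 over edges, where a predicted edge matches a gold edge if both carry the same label and their child nodes cover the same set of terminals. This is a simplified reading of the sentence above, not the official scorer: the edge-triple format, the function names, and the toy labels are assumptions, and details of the real evaluation (for example, how different edge types are handled) are not modeled.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy format: an edge is (parent, label, child); terminals are
# node ids that appear in the `positions` map (node id -> token position).
Edge = Tuple[str, str, str]


def edge_keys(edges: List[Edge], positions: Dict[str, int]) -> Counter:
    """Map each edge to (label, terminal yield of its child)."""
    children = defaultdict(list)
    for parent, _, child in edges:
        children[parent].append(child)

    def node_yield(node: str) -> FrozenSet[int]:
        if node in positions:
            return frozenset([positions[node]])
        covered = set()
        for child in children[node]:
            covered |= node_yield(child)
        return frozenset(covered)

    return Counter((label, node_yield(child)) for _, label, child in edges)


def labeled_f1(gold: List[Edge], pred: List[Edge], positions: Dict[str, int]) -> float:
    gold_keys = edge_keys(gold, positions)
    pred_keys = edge_keys(pred, positions)
    matched = sum((gold_keys & pred_keys).values())  # multiset intersection
    precision = matched / sum(pred_keys.values()) if pred_keys else 0.0
    recall = matched / sum(gold_keys.values()) if gold_keys else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy four-token sentence; labels are placeholders for illustration.
positions = {"t0": 0, "t1": 1, "t2": 2, "t3": 3}
gold = [
    ("root", "H", "scene"),
    ("scene", "A", "t0"),
    ("scene", "P", "t1"),
    ("scene", "A", "np"),
    ("np", "E", "t2"),
    ("np", "C", "t3"),
]
# Same structure, but the edge to "np" has the wrong label.
pred = [edge if edge != ("scene", "A", "np") else ("scene", "D", "np") for edge in gold]

print(round(labeled_f1(gold, pred, positions), 4))  # 5 of 6 edges match: 0.8333
```

Matching on terminal yields rather than node identities means the gold and predicted graphs do not need to share node ids, only the same tokenization.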
CoNLL 2019 MRP Shared Task
The task included parsing to AMR, UCCA, DM, PSD, and EDS.
The UCCA training data is freely available.
UCCA evaluation is done both by UCCA F1 (as in SemEval 2019) and by the MRP metric, which is similar to smatch. The training data contains 6,572 sentences from web reviews and Wikipedia. There are two evaluation sets: one with 1,131 sentences from the same domains (Full), and one with 87 sentences from The Little Prince (LPP). Note that, due to an error, 535 of the 1,131 Full evaluation sentences were included in the training data, so the Full evaluation scores are an overestimate. The LPP scores are unaffected.
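For readers who want to quantify or work around the overlap, one option is to split the Full evaluation set by whether a sentence also occurs in training. The sketch below uses synthetic sentence identifiers that only reproduce the counts stated above; the id scheme, the `split_by_overlap` helper, and the assumption that the 6,572 training sentences include the 535 overlapping ones are all hypothetical and not part of the released data.

```python
from typing import List, Sequence, Tuple


def split_by_overlap(eval_ids: Sequence[str], train_ids: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return (clean, leaked) evaluation ids, preserving evaluation order."""
    train = set(train_ids)
    leaked = [i for i in eval_ids if i in train]
    clean = [i for i in eval_ids if i not in train]
    return clean, leaked


# Synthetic ids that mirror the reported sizes (not real corpus ids).
eval_ids = [f"eval-{i}" for i in range(1131)]
train_ids = eval_ids[:535] + [f"train-{i}" for i in range(6572 - 535)]

clean, leaked = split_by_overlap(eval_ids, train_ids)
print(len(leaked), len(clean))                 # 535 596
print(round(len(leaked) / len(eval_ids), 3))   # 0.473
```

Under these counts, a bit under half of the Full evaluation set overlaps with training, which is why scores on the remaining sentences or on LPP are the safer comparison.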
MRP 2019: Cross-Framework Meaning Representation Parsing.
Stephan Oepen, Omri Abend, Jan Hajic, Daniel Hershcovich, Marco Kuhlmann, Tim O’Gorman, Nianwen Xue, Jayeol Chun, Milan Straka, Zdenka Uresova, CoNLL 2019 (shared task).
[Paper: pdf] [Website: link] [UCCA data: link] [Code: github]
CoNLL 2020 MRP Shared Task
The task included parsing to AMR, UCCA, PTG, DRG, and EDS, in multiple languages. For UCCA, the languages were English and German.
MRP 2020: The Second Shared Task on Cross-Framework and Cross-Lingual Meaning Representation Parsing.
Stephan Oepen, Omri Abend, Lasha Abzianidze, Johan Bos, Jan Hajic, Daniel Hershcovich, Bin Li, Tim O’Gorman, Nianwen Xue and Daniel Zeman, CoNLL 2020 (shared task).
[Paper: pdf] [Website: link] [Data: link]
Semantics-aware Attention Improves Neural Machine Translation.
Aviv Slobodkin, Leshem Choshen and Omri Abend.
[Paper: pdf]
Self-Attentive Constituency Parsing for UCCA-based Semantic Parsing.
Necva Bölücü and Burcu Can.
[Paper: pdf]
Subcategorizing Adverbials in Universal Conceptual Cognitive Annotation.
Zhuxin Wang, Jakob Prange and Nathan Schneider. LAW-DMR 2021.
[Paper: pdf]
RepGraph: Visualising and Analysing Meaning Representation Graphs.
Jaron Cohen, Roy Cohen, Edan Toledo and Jan Buys. EMNLP 2021 demo.
[Paper: pdf] [Website]
Data-Driven Annotation of Textual Process Descriptions Based on Formal Meaning Representations.
Lars Ackermann, Julian Neuberger and Stefan Jablonski. Lecture Notes in Computer Science.
[Paper: pdf]
Refining and Parsing Implicit Arguments in UCCA.
Ruixiang Cui, MSc Thesis, University of Copenhagen, 2020.
[Paper: pdf]
Universal Semantic Parsing with Neural Networks.
Daniel Hershcovich, PhD Thesis, The Hebrew University of Jerusalem, 2019.
[Paper: pdf]
Measuring Semantic Preservation in Machine Translation with HCOMET: Human Cognitive Metric for Evaluating Translation.
Pedro Marinotti, MSc Thesis, The University of Edinburgh, 2014.
[Paper: pdf]
Integration of a cognitive annotation into machine translation: Theoretical foundations and bilingual corpus analysis.
Elior Sulem, MSc Thesis, The Hebrew University of Jerusalem, 2014.
[Paper: pdf]
Semi-supervised identification of scene-evoking nouns in UCCA.
Amit Beka, MSc Thesis, The Hebrew University of Jerusalem, 2013.
[Paper: pdf]
Grammatical Annotation Founded on Semantics: A Cognitive Linguistics Approach to Grammatical Corpus Annotation.
Omri Abend, PhD Thesis, The Hebrew University of Jerusalem, 2013.
[Paper: pdf]
Distinguishing Human Translations and Machine Outputs with UCCA.
Michal Kessler, Lab Report, The Hebrew University of Jerusalem, 2019.
[Paper: pdf]
For any questions or feedback, please email Omri Abend at oabend@cs.huji.ac.il.