Add DOIs for AACL 2020, AACL 2022, and EMNLP 2024 (#4104)
Also added a script that generates all volumes associated with an event (given the XML file).
mjpost authored Nov 27, 2024
1 parent 5edcfdb commit b707527
Showing 39 changed files with 3,568 additions and 2 deletions.
5 changes: 5 additions & 0 deletions bin/add_dois.py
@@ -24,6 +24,11 @@
add_dois.py [list of volume IDs]
The best way to use it is with a helper script that lists all volumes associated
with an event, including main conference and workshop volumes:
add_dois.py $(get_volumes_for_event.py data/xml/2024.emnlp.xml)
e.g.,
python3 add_dois.py P19-1 P19-2 P19-3 P19-4 W19-32
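The body of add_dois.py is unchanged by this commit, so it doesn't appear in the hunk. Conceptually, for each volume ID the script computes the volume's DOI and the DOI of every paper in it, then inserts a <doi> element into the XML. A minimal sketch of that insertion step with lxml (an illustration using the DOI prefix visible in the data files below, not the script's actual code):

    import lxml.etree as ET

    DOI_PREFIX = "10.18653/v1/"

    def add_doi(node, anthology_id):
        # Append <doi>10.18653/v1/{anthology_id}</doi> unless one already exists.
        if node.find("doi") is None:
            ET.SubElement(node, "doi").text = DOI_PREFIX + anthology_id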
7 changes: 7 additions & 0 deletions bin/generate_crossref_doi_metadata.py
@@ -28,6 +28,13 @@
See also https://github.com/acl-org/acl-anthology/wiki/DOI
Usage: python3 generate_crossref_doi_metadata.py <list of volume IDs>
It's best to use this with the get_volumes_for_event.py script, which takes an
XML file and returns all volumes and colocated workshops found within it.
This is perfect for generating DOI metadata for an entire event:
python3 generate_crossref_doi_metadata.py $(bin/get_volumes_for_event.py data/xml/2024.emnlp.xml)
e.g.,
python3 generate_crossref_doi_metadata.py P19-1 P19-2 P19-3 P19-4 > acl2019_dois.xml
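The heart of the generated deposit is a Crossref <doi_data> record pairing each DOI with the page it should resolve to. A hedged sketch of that pairing (assuming the Anthology landing page as the resource URL; the real script wraps records like this in the full Crossref batch schema):

    import lxml.etree as ET

    def doi_data(anthology_id):
        # Crossref <doi_data>: the DOI plus the URL it should resolve to.
        node = ET.Element("doi_data")
        ET.SubElement(node, "doi").text = f"10.18653/v1/{anthology_id}"
        ET.SubElement(node, "resource").text = f"https://aclanthology.org/{anthology_id}"
        return node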
41 changes: 41 additions & 0 deletions bin/get_volumes_for_event.py
@@ -0,0 +1,41 @@
#!/usr/bin/env python3

"""
Takes an XML file and returns all volumes within it. It will also
return all colocated volumes. This is a convenient way to generate
a list of all volumes associated with an event.
"""

import lxml.etree as ET


def get_volumes(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()

    collection_id = root.attrib['id']

    volumes = []
    for volume in root.findall(".//volume"):
        volume_id_full = collection_id + "-" + volume.attrib['id']
        volumes.append(volume_id_full)

    # get the <colocated> node under <event>
    event_node = root.find(".//event")
    if event_node is not None:
        colocated = event_node.find("colocated")
        if colocated is not None:
            for volume in colocated.findall("volume-id"):
                volumes.append(volume.text)

    return volumes


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('xml_file', help='XML file to process')
    args = parser.parse_args()

    print(" ".join(get_volumes(args.xml_file)))
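The script prints the volume IDs space-separated, so it composes with the two scripts above via command substitution. For example (output illustrative; the actual list depends on the volumes and <colocated> entries in the file):

    python3 bin/get_volumes_for_event.py data/xml/2024.emnlp.xml
    2024.emnlp-main 2024.emnlp-demo 2024.emnlp-tutorials ...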
129 changes: 129 additions & 0 deletions data/xml/2020.aacl.xml

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions data/xml/2020.iwdp.xml
@@ -12,10 +12,12 @@
<month>December</month>
<year>2020</year>
<venue>iwdp</venue>
<doi>10.18653/v1/2020.iwdp-1</doi>
</meta>
<frontmatter>
<url hash="64d9f206">2020.iwdp-1.0</url>
<bibkey>iwdp-2020-international</bibkey>
<doi>10.18653/v1/2020.iwdp-1.0</doi>
</frontmatter>
<paper id="1">
<title>Research on Discourse Parsing: from the Dependency View</title>
@@ -24,6 +26,7 @@
<abstract>Discourse parsing aims to comprehensively acquire the logical structure of the whole text which may be helpful to some downstream applications such as summarization, reading comprehension, QA and so on. One important issue behind discourse parsing is the representation of discourse structure. Up to now, many discourse structures have been proposed, and the corresponding parsing methods are designed, promoting the development of discourse research. In this paper, we mainly introduce our recent discourse research and its preliminary application from the dependency view.</abstract>
<url hash="c64faad8">2020.iwdp-1.1</url>
<bibkey>li-2020-research</bibkey>
<doi>10.18653/v1/2020.iwdp-1.1</doi>
</paper>
<paper id="2">
<title>A Review of Discourse-level Machine Translation</title>
@@ -32,6 +35,7 @@
<abstract>Machine translation (MT) models usually translate a text at sentence level by considering isolated sentences, which is based on a strict assumption that the sentences in a text are independent of one another. However, the fact is that the texts at discourse level have properties going beyond individual sentences. These properties reveal texts in the frequency and distribution of words, word senses, referential forms and syntactic structures. Disregarding dependencies across sentences will harm translation quality especially in terms of coherence, cohesion, and consistency. To solve these problems, several approaches have previously been investigated for conventional statistical machine translation (SMT). With the fast growth of neural machine translation (NMT), discourse-level NMT has drawn increasing attention from researchers. In this work, we review major works on addressing discourse related problems for both SMT and NMT models with a survey of recent trends in the fields.</abstract>
<url hash="0b0dc392">2020.iwdp-1.2</url>
<bibkey>zhang-2020-review</bibkey>
<doi>10.18653/v1/2020.iwdp-1.2</doi>
</paper>
<paper id="3">
<title>A Test Suite for Evaluating Discourse Phenomena in Document-level Neural Machine Translation</title>
@@ -42,6 +46,7 @@
<url hash="a99692c6">2020.iwdp-1.3</url>
<bibkey>cai-xiong-2020-test</bibkey>
<pwcdataset url="https://paperswithcode.com/dataset/opensubtitles">OpenSubtitles</pwcdataset>
<doi>10.18653/v1/2020.iwdp-1.3</doi>
</paper>
<paper id="4">
<title>Comparison of the effects of attention mechanism on translation tasks of different lengths of ambiguous words</title>
@@ -54,6 +59,7 @@
<abstract>In recent years, attention mechanism has been widely used in various neural machine translation tasks based on encoder decoder. This paper focuses on the performance of encoder decoder attention mechanism in word sense disambiguation task with different text length, trying to find out the influence of context marker on attention mechanism in word sense disambiguation task. We hypothesize that attention mechanisms have similar performance when translating texts of different lengths. Our conclusion is that the alignment effect of attention mechanism is magnified in short text translation tasks with ambiguous nouns, while the effect of attention mechanism is far less than expected in long-text tasks, which means that attention mechanism is not the main mechanism for NMT model to feed WSD to integrate context information. This may mean that attention mechanism pays more attention to ambiguous nouns than context markers. The experimental results show that with the increase of text length, the performance of NMT model using attention mechanism will gradually decline.</abstract>
<url hash="5535965f">2020.iwdp-1.4</url>
<bibkey>hu-etal-2020-comparison</bibkey>
<doi>10.18653/v1/2020.iwdp-1.4</doi>
</paper>
<paper id="5">
<title>Context-Aware Word Segmentation for <fixed-case>C</fixed-case>hinese Real-World Discourse</title>
@@ -66,6 +72,7 @@
<url hash="061c4146">2020.iwdp-1.5</url>
<attachment type="Dataset" hash="ba4de41c">2020.iwdp-1.5.Dataset.rar</attachment>
<bibkey>huang-etal-2020-context</bibkey>
<doi>10.18653/v1/2020.iwdp-1.5</doi>
</paper>
<paper id="6">
<title>Neural Abstractive Multi-Document Summarization: Hierarchical or Flat Structure?</title>
@@ -76,6 +83,7 @@
<url hash="0b7d8d4d">2020.iwdp-1.6</url>
<bibkey>ma-zong-2020-neural</bibkey>
<pwcdataset url="https://paperswithcode.com/dataset/wikisum">WikiSum</pwcdataset>
<doi>10.18653/v1/2020.iwdp-1.6</doi>
</paper>
<paper id="7">
<title>Intent Segmentation of User Queries Via Discourse Parsing</title>
@@ -89,6 +97,7 @@
<abstract>In this paper, we explore a new approach based on discourse analysis for the task of intent segmentation. Our target texts are user queries from a real-world chatbot. Our results show the feasibility of our approach with an F1-score of 82.97 points, and some advantages and disadvantages compared to two machine learning baselines: BERT and LSTM+CRF.</abstract>
<url hash="e9ad8719">2020.iwdp-1.7</url>
<bibkey>sanchez-carmona-etal-2020-intent</bibkey>
<doi>10.18653/v1/2020.iwdp-1.7</doi>
</paper>
<paper id="8">
<title>Bridging Question Answering and Discourse The case of Multi-Sentence Questions</title>
@@ -97,6 +106,7 @@
<abstract>In human question-answering (QA), questions are often expressed in the form of multiple sentences. One can see this in both spoken QA interactions, when one person asks a question of another, and written QA, such as are found on-line in FAQs and in what are called ”Community Question-Answering Forums”. Computer-based QA has taken the challenge of these ”multi-sentence questions” to be that of breaking them into an appropriately ordered sequence of separate questions, with both the previous questions and their answers serving as context for the next question. This can be seen, for example, in two recent workshops at AAAI called ”Reasoning for Complex QA” [<url>https://rcqa-ws.github.io/program/</url>]. We claim that, while appropriate for some types of ”multi-sentence questions” (MSQs), it is not appropriate for all, because they are essentially different types of discourse. To support this claim, we need to provide evidence that: • different types of MSQs are answered differently in written or spoken QA between people; • people can (and do) distinguish these different types of MSQs; • systems can be made to both distinguish different types of MSQs and provide appropriate answers.</abstract>
<url hash="0c0d8441">2020.iwdp-1.8</url>
<bibkey>webber-2020-bridging</bibkey>
<doi>10.18653/v1/2020.iwdp-1.8</doi>
</paper>
<paper id="9">
<title>Component Sharing in <fixed-case>E</fixed-case>nglish and <fixed-case>C</fixed-case>hinese Clause Complex</title>
@@ -107,6 +117,7 @@
<abstract>NT Clause Complex Framework defines a clause complex as a combination of NT clauses through component sharing and logic-semantic relationship. This paper clarifies the existence of component sharing mechanism in both English and Chinese clause complexes, illustrates the differences in component sharing between the two languages, and introduces a formal annotation scheme to represent clause-complex level structural transformations. Under the guidance of the annotation scheme, the English-Chinese Clause Alignment Corpus is built. It is believed that this corpus will aid comparative linguistic studies, translation studies and machine translation studies by providing abundant formal and computable samples for English-Chinese structural transformations on the clause complex level.</abstract>
<url hash="fdc38338">2020.iwdp-1.9</url>
<bibkey>ge-etal-2020-component</bibkey>
<doi>10.18653/v1/2020.iwdp-1.9</doi>
</paper>
<paper id="10">
<title>Referential Cohesion A Challenge for Machine Translation Evaluation</title>
@@ -115,6 +126,7 @@
<abstract>Connected texts are characterised by the presence of linguistic elements relating to shared referents throughout the text. These elements together form a structure that lends cohesion to the text. The realisation of those cohesive structures is subject to different constraints and varying preferences in different languages. We regularly observe mismatches of cohesive structures across languages in parallel texts. This can be a result of either a divergence of language-internal constraints or of effects of the translation process. As fully automatic high-quality MT is starting to look achievable, the question arises how cohesive elements should be handled in MT evaluation, since the common assumption of 1:1 correspondence between referring expressions is a poor match for what we find in corpus data. Focusing on the translation of pronouns, I discuss different approaches to evaluating a particular type of cohesive elements in MT output and the trade-offs they make between evaluation cost, validity, specificity and coverage. I suggest that a meaningful evaluation of cohesive structures in translation is difficult to achieve simply by appealing to the intuition of human annotators, but requires a more structured approach that forces us to make up our minds about the standards we expect the translation output to adhere to.</abstract>
<url hash="45fc2672">2020.iwdp-1.10</url>
<bibkey>hardmeier-2020-referential</bibkey>
<doi>10.18653/v1/2020.iwdp-1.10</doi>
</paper>
</volume>
</collection>
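The added elements follow one pattern across all of the data files in this commit: a volume's DOI is the prefix 10.18653/v1/ plus the full volume ID, and a paper's DOI appends its paper ID (0 for the frontmatter). As a sketch (function names are illustrative, not from the repository):

    DOI_PREFIX = "10.18653/v1/"

    def volume_doi(volume_id):
        # "2020.iwdp-1" -> "10.18653/v1/2020.iwdp-1"
        return DOI_PREFIX + volume_id

    def paper_doi(volume_id, paper_id):
        # ("2020.iwdp-1", "3") -> "10.18653/v1/2020.iwdp-1.3"
        return f"{DOI_PREFIX}{volume_id}.{paper_id}"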
6 changes: 6 additions & 0 deletions data/xml/2020.knlp.xml
@@ -14,10 +14,12 @@
<month>December</month>
<year>2020</year>
<venue>knlp</venue>
<doi>10.18653/v1/2020.knlp-1</doi>
</meta>
<frontmatter>
<url hash="61ff76fc">2020.knlp-1.0</url>
<bibkey>knlp-2020-knowledgeable</bibkey>
<doi>10.18653/v1/2020.knlp-1.0</doi>
</frontmatter>
<paper id="1">
<title><fixed-case>COVID</fixed-case>-19 Knowledge Graph: Accelerating Information Retrieval and Discovery for Scientific Literature</title>
@@ -35,6 +37,7 @@
<url hash="fdf17423">2020.knlp-1.1</url>
<bibkey>wise-etal-2020-covid</bibkey>
<pwcdataset url="https://paperswithcode.com/dataset/cord-19">CORD-19</pwcdataset>
<doi>10.18653/v1/2020.knlp-1.1</doi>
</paper>
<paper id="2">
<title>Dialogue over Context and Structured Knowledge using a Neural Network Model with External Memories</title>
@@ -46,6 +49,7 @@
<url hash="5e6c0146">2020.knlp-1.2</url>
<bibkey>murayama-etal-2020-dialogue</bibkey>
<pwcdataset url="https://paperswithcode.com/dataset/csqa">CSQA</pwcdataset>
<doi>10.18653/v1/2020.knlp-1.2</doi>
</paper>
<paper id="3">
<title>Social Media Medical Concept Normalization using <fixed-case>R</fixed-case>o<fixed-case>BERT</fixed-case>a in Ontology Enriched Text Similarity Framework</title>
@@ -55,6 +59,7 @@
<abstract>Pattisapu et al. (2020) formulate medical concept normalization (MCN) as text similarity problem and propose a model based on RoBERTa and graph embedding based target concept vectors. However, graph embedding techniques ignore valuable information available in the clinical ontology like concept description and synonyms. In this work, we enhance the model of Pattisapu et al. (2020) with two novel changes. First, we use retrofitted target concept vectors instead of graph embedding based vectors. It is the first work to leverage both concept description and synonyms to represent concepts in the form of retrofitted target concept vectors in text similarity framework based social media MCN. Second, we generate both concept and concept mention vectors with same size which eliminates the need of dense layers to project concept mention vectors into the target concept embedding space. Our model outperforms existing methods with improvements up to 3.75% on two standard datasets. Further when trained only on mapping lexicon synonyms, our model outperforms existing methods with significant improvements up to 14.61%. We attribute these significant improvements to the two novel changes introduced.</abstract>
<url hash="ff2a5c1b">2020.knlp-1.3</url>
<bibkey>kalyan-sangeetha-2020-social</bibkey>
<doi>10.18653/v1/2020.knlp-1.3</doi>
</paper>
<paper id="4">
<title><fixed-case>BERTC</fixed-case>hem-<fixed-case>DDI</fixed-case> : Improved Drug-Drug Interaction Prediction from text using Chemical Structure Information</title>
@@ -63,6 +68,7 @@
<abstract>Traditional biomedical version of embeddings obtained from pre-trained language models have recently shown state-of-the-art results for relation extraction (RE) tasks in the medical domain. In this paper, we explore how to incorporate domain knowledge, available in the form of molecular structure of drugs, for predicting Drug-Drug Interaction from textual corpus. We propose a method, BERTChem-DDI, to efficiently combine drug embeddings obtained from the rich chemical structure of drugs (encoded in SMILES) along with off-the-shelf domain-specific BioBERT embedding-based RE architecture. Experiments conducted on the DDIExtraction 2013 corpus clearly indicate that this strategy improves other strong baselines architectures by 3.4% macro F1-score.</abstract>
<url hash="873fd0c4">2020.knlp-1.4</url>
<bibkey>mondal-2020-bertchem</bibkey>
<doi>10.18653/v1/2020.knlp-1.4</doi>
</paper>
</volume>
</collection>
5 changes: 5 additions & 0 deletions data/xml/2020.lifelongnlp.xml
@@ -16,10 +16,12 @@
<month>December</month>
<year>2020</year>
<venue>lifelongnlp</venue>
<doi>10.18653/v1/2020.lifelongnlp-1</doi>
</meta>
<frontmatter>
<url hash="407cd6f4">2020.lifelongnlp-1.0</url>
<bibkey>lifelongnlp-2020-life</bibkey>
<doi>10.18653/v1/2020.lifelongnlp-1.0</doi>
</frontmatter>
<paper id="1">
<title>Deep Active Learning for Sequence Labeling Based on Diversity and Uncertainty in Gradient</title>
@@ -30,6 +32,7 @@
<bibkey>kim-2020-deep</bibkey>
<pwcdataset url="https://paperswithcode.com/dataset/atis">ATIS</pwcdataset>
<pwcdataset url="https://paperswithcode.com/dataset/conll-2003">CoNLL 2003</pwcdataset>
<doi>10.18653/v1/2020.lifelongnlp-1.1</doi>
</paper>
<paper id="2">
<title>Supervised Adaptation of Sequence-to-Sequence Speech Recognition Systems using Batch-Weighting</title>
@@ -44,6 +47,7 @@
<url hash="e8e4ade1">2020.lifelongnlp-1.2</url>
<bibkey>huber-etal-2020-supervised</bibkey>
<pwcdataset url="https://paperswithcode.com/dataset/how2">How2</pwcdataset>
<doi>10.18653/v1/2020.lifelongnlp-1.2</doi>
</paper>
<paper id="3">
<title>Data Augmentation using Pre-trained Transformer Models</title>
@@ -58,6 +62,7 @@
<pwcdataset url="https://paperswithcode.com/dataset/snips">SNIPS</pwcdataset>
<pwcdataset url="https://paperswithcode.com/dataset/sst">SST</pwcdataset>
<pwcdataset url="https://paperswithcode.com/dataset/sst-2">SST-2</pwcdataset>
<doi>10.18653/v1/2020.lifelongnlp-1.3</doi>
</paper>
</volume>
</collection>