2021 (44 papers)

  1. Prefix-Tuning: Optimizing Continuous Prompts for Generation, Xiang Lisa Li,Percy Liang, 01-01-2021

    Categories

    Computation and Language

    Abstract

    Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens". We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.

    Bullet Points

    • The paper proposes prefix-tuning as a lightweight alternative to fine-tuning for natural language generation tasks; it keeps language model parameters frozen and optimizes only a small continuous task-specific vector called the prefix

    • It draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens"

    • Prefix-Tuning is applied to GPT-2 for table-to-text generation and BART for summarization

    • By learning only 0.1% of the parameters, it obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.
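
    The "virtual tokens" idea can be illustrated with a minimal sketch. It assumes the prefix is realized as extra key/value vectors placed ahead of the real tokens in a single attention step; the function names and toy numbers are hypothetical and not taken from the paper.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (plain Python lists)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]
    return output, weights

def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    """Prepend prefix vectors so the query attends to them like ordinary tokens.

    Assumption: only prefix_keys and prefix_values would be trained;
    everything derived from the language model stays frozen.
    """
    return attend(query, prefix_keys + keys, prefix_values + values)

# Toy values: two real tokens and one prefix position.
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 0.0], [0.0, 1.0]]
prefix_keys = [[2.0, 0.0]]
prefix_values = [[5.0, 5.0]]

output, weights = attend_with_prefix(
    [1.0, 0.0], token_keys, token_values, prefix_keys, prefix_values
)
print(weights)  # the first weight belongs to the prefix position
```

    In this toy setup the prefix position receives the largest attention weight, so it steers the output without any change to the token keys and values.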

  2. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm, Laria Reynolds,Kyle McDonell, 15-02-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Prevailing methods for mapping large generative language models to supervised tasks may fail to sufficiently probe models' novel capabilities. Using GPT-3 as a case study, we show that 0-shot prompts can significantly outperform few-shot prompts. We suggest that the function of few-shot examples in these cases is better described as locating an already learned task rather than meta-learning. This analysis motivates rethinking the role of prompts in controlling and evaluating powerful language models. In this work, we discuss methods of prompt programming, emphasizing the usefulness of considering prompts through the lens of natural language. We explore techniques for exploiting the capacity of narratives and cultural anchors to encode nuanced intentions and techniques for encouraging deconstruction of a problem into components before producing a verdict. Informed by this more encompassing theory of prompt programming, we also introduce the idea of a metaprompt that seeds the model to generate its own natural language prompts for a range of tasks. Finally, we discuss how these more general methods of interacting with language models can be incorporated into existing and future benchmarks and practical applications.

    Bullet Points

    • Prevailing methods for mapping large generative language models to supervised tasks may not sufficiently probe models' novel capabilities

    • Using GPT-3 as a case study, 0-shot prompts can significantly outperform few-shot prompts, suggesting that the function of few-shot examples is better described as locating an already learned task rather than meta-learning

    • This analysis motivates rethinking the role of prompts in controlling and evaluating powerful language models

    • We discuss methods of prompt programming, emphasizing the usefulness of considering prompts through the lens of natural language, and explore techniques that exploit narratives and cultural anchors to encode nuanced intentions and that encourage deconstructing a problem into components before producing a verdict

    • Informed by this more encompassing theory of prompt programming, we introduce a metaprompt that seeds the model to generate its own natural language prompts for a range of tasks

    • These more general methods of interacting with language models can be incorporated into existing and future benchmarks and practical applications.

  3. Calibrate Before Use: Improving Few-Shot Performance of Language Models, Tony Z. Zhao,Eric Wallace,Shi Feng,Dan Klein,Sameer Singh, 19-02-2021

    Categories

    Computation and Language, Machine Learning

    Abstract

    GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples. We show that this type of few-shot learning can be unstable: the choice of prompt format, training examples, and even the order of the training examples can cause accuracy to vary from near chance to near state-of-the-art. We demonstrate that this instability arises from the bias of language models towards predicting certain answers, e.g., those that are placed near the end of the prompt or are common in the pre-training data. To mitigate this, we first estimate the model's bias towards each answer by asking for its prediction when given the training prompt and a content-free test input such as "N/A". We then fit calibration parameters that cause the prediction for this input to be uniform across answers. On a diverse set of tasks, this contextual calibration procedure substantially improves GPT-3 and GPT-2's average accuracy (up to 30.0% absolute) and reduces variance across different choices of the prompt.

    Bullet Points

    • GPT-3 can perform many tasks when a natural language prompt contains a few training examples, but it can be unstable due to the bias of language models towards predicting certain answers

    • To mitigate this, we estimate the model's bias towards each answer by asking for its prediction when given the training prompt and a content-free test input such as "N/A"

    • We then fit calibration parameters that cause the prediction for this input to be uniform across answers

    • This contextual calibration procedure substantially improves GPT-3 and GPT-2's average accuracy (up to 30.0% absolute) and reduces variance across different choices of the prompt.
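
    The calibration step can be sketched as follows. The abstract only states that calibration parameters are fit so that the content-free prediction becomes uniform; the diagonal rescaling followed by renormalization used here is one simple way to satisfy that, and the function name and probabilities are hypothetical.

```python
def contextual_calibration(content_free_probs, test_probs):
    """Rescale label probabilities so the content-free input scores uniformly.

    Assumption: a diagonal correction (divide by the content-free
    probabilities) with no bias term, followed by renormalization.
    """
    total_cf = sum(content_free_probs)
    p_cf = [p / total_cf for p in content_free_probs]
    scaled = [p / q for p, q in zip(test_probs, p_cf)]
    z = sum(scaled)
    return [s / z for s in scaled]

# Hypothetical label probabilities for a two-class task.
p_content_free = [0.7, 0.3]  # model output for the prompt plus "N/A"
p_test = [0.6, 0.4]          # model output for a real test input

calibrated = contextual_calibration(p_content_free, p_test)
uniform_check = contextual_calibration(p_content_free, p_content_free)
print(calibrated)
print(uniform_check)
```

    With these toy numbers the uncalibrated prediction favors the first label, while the calibrated one favors the second, because the model's prior preference for the first label has been divided out.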

  4. PADA: Example-based Prompt Learning for on-the-fly Adaptation to Unseen Domains, Eyal Ben-David,Nadav Oved,Roi Reichart, 24-02-2021

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    Natural Language Processing algorithms have made incredible progress, but they still struggle when applied to out-of-distribution examples. We address a challenging and underexplored version of this domain adaptation problem, where an algorithm is trained on several source domains, and then applied to examples from unseen domains that are unknown at training time. Particularly, no examples, labeled or unlabeled, or any other knowledge about the target domain are available to the algorithm at training time. We present PADA: An example-based autoregressive Prompt learning algorithm for on-the-fly Any-Domain Adaptation, based on the T5 language model. Given a test example, PADA first generates a unique prompt for it and then, conditioned on this prompt, labels the example with respect to the NLP prediction task. PADA is trained to generate a prompt which is a token sequence of unrestricted length, consisting of Domain Related Features (DRFs) that characterize each of the source domains. Intuitively, the generated prompt is a unique signature that maps the test example to a semantic space spanned by the source domains. In experiments with 3 tasks (text classification and sequence tagging), for a total of 14 multi-source adaptation scenarios, PADA substantially outperforms strong baselines.

    Bullet Points

    • PADA is an example-based autoregressive prompt learning algorithm for on-the-fly Any-Domain Adaptation, based on the T5 language model

    • Given a test example, it first generates a prompt, a token sequence of unrestricted length consisting of Domain Related Features (DRFs) that characterize each of the source domains; the prompt maps the test example to a semantic space spanned by the source domains, and the example is then labeled conditioned on it

    • In experiments with 3 tasks (text classification and sequence tagging), PADA substantially outperforms strong baselines.

  5. Learning Transferable Visual Models From Natural Language Supervision, Alec Radford,Jong Wook Kim,Chris Hallacy,Aditya Ramesh,Gabriel Goh,Sandhini Agarwal,Girish Sastry,Amanda Askell,Pamela Mishkin,Jack Clark,Gretchen Krueger,Ilya Sutskever, 26-02-2021

    Categories

    Computer Vision, Machine Learning

    Abstract

    Bullet Points

  6. How Many Data Points is a Prompt Worth?, Teven Le Scao,Alexander M. Rush, 15-03-2021

    Categories

    Machine Learning

    Abstract

    When fine-tuning pretrained models for classification, researchers either use a generic model head or a task-specific prompt for prediction. Proponents of prompting have argued that prompts provide a method for injecting task-specific guidance, which is beneficial in low-data regimes. We aim to quantify this benefit through rigorous testing of prompts in a fair setting: comparing prompted and head-based fine-tuning in equal conditions across many tasks and data sizes. By controlling for many sources of advantage, we find that prompting does indeed provide a benefit, and that this benefit can be quantified per task. Results show that prompting is often worth 100s of data points on average across classification tasks.

    Bullet Points

    • Researchers use either a generic model head or a task-specific prompt for fine-tuning pretrained models for classification

    • Proponents of prompting argue that prompts provide a method for injecting task-related guidance, which is beneficial in low-data regimes

    • We aim to quantify this benefit through rigorous testing of prompts in equal conditions across many tasks and data sizes

    • By controlling for many sources of advantage, we find that prompting provides a benefit and that this benefit can be quantified per task

    • The results show that prompting is often worth 100s of data points on average across classification tasks.

  7. GPT Understands, Too, Xiao Liu,Yanan Zheng,Zhengxiao Du,Ming Ding,Yujie Qian,Zhilin Yang,Jie Tang, 18-03-2021

    Categories

    Computation and Language, Machine Learning

    Abstract

    Prompting a pretrained language model with natural language patterns has been proved effective for natural language understanding (NLU). However, our preliminary study reveals that manual discrete prompts often lead to unstable performance -- e.g., changing a single word in the prompt might result in substantial performance drop. We propose a novel method P-Tuning that employs trainable continuous prompt embeddings in concatenation with discrete prompts. Empirically, P-Tuning not only stabilizes training by minimizing the gap between various discrete prompts, but also improves performance by a sizeable margin on a wide range of NLU tasks including LAMA and SuperGLUE. P-Tuning is generally effective for both frozen and tuned language models, under both the fully-supervised and few-shot settings.

    Bullet Points

    • Prompting a pretrained language model with natural language patterns has been effective for NLU, but manual discrete prompts can lead to unstable performance

    • A new method, P-Tuning, employs trainable continuous prompt embeddings in concatenation with discrete prompts, which stabilizes training and improves performance by a sizeable margin on NLU tasks such as LAMA and SuperGLUE

    • It is generally effective for both frozen and tuned language models, under both fully-supervised and few-shot settings.

  8. Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections, Ruiqi Zhong,Kristy Lee,Zheng Zhang,Dan Klein, 10-04-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Large pre-trained language models (LMs) such as GPT-3 have acquired a surprising ability to perform zero-shot learning. For example, to classify sentiment without any training examples, we can "prompt" the LM with the review and the label description "Does the user like this movie?", and ask whether the next word is "yes" or "no". However, the next word prediction training objective is still misaligned with the target zero-shot learning objective. To address this weakness, we propose meta-tuning, which directly optimizes the zero-shot learning objective by fine-tuning pre-trained language models on a collection of datasets. We focus on classification tasks, and construct the meta-dataset by aggregating 43 existing datasets and annotating 441 label descriptions in a question-answering (QA) format. When evaluated on unseen tasks, meta-tuned models outperform a same-sized QA model and the previous SOTA zero-shot learning system based on natural language inference. Additionally, increasing parameter count from 220M to 770M improves AUC-ROC scores by 6.3%, and we forecast that even larger models would perform better. Therefore, measuring zero-shot learning performance on language models out-of-the-box might underestimate their true potential, and community-wide efforts on aggregating datasets and unifying their formats can help build models that answer prompts better.

    Bullet Points

    • Large pre-trained language models such as GPT-3 can perform zero-shot learning, but the next word prediction training objective is still misaligned with the target objective

    • To address this weakness, we propose meta-tuning, which directly optimizes the zero-shot learning objective by fine-tuning pre-trained language models on a collection of datasets

    • We aggregate 43 existing datasets and annotate 441 label descriptions in a question-answering (QA) format; increasing parameter count from 220M to 770M improves AUC-ROC scores by 6.3%, and we forecast that even larger models would perform better

    • Measuring zero-shot learning performance on language models out-of-the-box might therefore underestimate their true potential, and community-wide efforts on aggregating datasets and unifying their formats can help build models that answer prompts better.

  9. Learning How to Ask: Querying LMs with Mixtures of Soft Prompts, Guanghui Qin,Jason Eisner, 14-04-2021

    Categories

    Computation and Language, Machine Learning

    Abstract

    Natural-language prompts have recently been used to coax pretrained language models into performing other AI tasks, using a fill-in-the-blank paradigm (Petroni et al., 2019) or a few-shot extrapolation paradigm (Brown et al., 2020). For example, language models retain factual knowledge from their training corpora that can be extracted by asking them to "fill in the blank" in a sentential prompt. However, where does this prompt come from? We explore the idea of learning prompts by gradient descent -- either fine-tuning prompts taken from previous work, or starting from random initialization. Our prompts consist of "soft words," i.e., continuous vectors that are not necessarily word type embeddings from the language model. Furthermore, for each task, we optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them. Across multiple English LMs and tasks, our approach hugely outperforms previous methods, showing that the implicit factual knowledge in language models was previously underestimated. Moreover, this knowledge is cheap to elicit: random initialization is nearly as good as informed initialization.

    Bullet Points

    • Natural-language prompts have been used to coax pretrained language models into performing other AI tasks using fill-in-the-blank or few-shot extrapolation paradigms

    • Language models retain factual knowledge from their training corpora that can be extracted by asking them to "fill in the blank" in a sentential prompt, which raises the question of where the prompt comes from

    • We learn prompts by gradient descent, either fine-tuning prompts taken from previous work or starting from random initialization; the prompts consist of "soft words", i.e., continuous vectors that are not necessarily word type embeddings from the language model

    • Our approach hugely outperforms previous methods, showing that implicit factual knowledge in language models was previously underestimated

    • Random initialization is nearly as good as informed initialization, and this knowledge is cheap to elicit across multiple English LMs and tasks

    • The approach involves optimizing a mixture of prompts, learning which ones are most effective and how to ensemble them.
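
    The ensembling step can be sketched as a weighted average of per-prompt predictions. This is a minimal illustration that assumes the mixture weights come from a softmax over learned scores; the function names, scores, and candidate answers are hypothetical.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_predict(mixture_logits, per_prompt_probs):
    """Average the answer distributions of several prompts.

    mixture_logits: one learned score per prompt (assumed parameterization).
    per_prompt_probs: one dict per prompt mapping answer -> probability.
    """
    weights = softmax(mixture_logits)
    answers = per_prompt_probs[0].keys()
    return {
        answer: sum(w * probs[answer] for w, probs in zip(weights, per_prompt_probs))
        for answer in answers
    }

# Two hypothetical prompts scoring two candidate fill-in answers.
per_prompt = [
    {"Paris": 0.8, "Lyon": 0.2},
    {"Paris": 0.3, "Lyon": 0.7},
]
combined = mixture_predict([2.0, 0.0], per_prompt)
best = max(combined, key=combined.get)
print(combined, best)
```

    Because the first prompt carries a much larger mixture score, its prediction dominates the ensemble; training the scores is what lets the mixture learn which prompts are most effective.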

  10. Generating Datasets with Pretrained Language Models, Timo Schick,Hinrich Schütze, 15-04-2021

    Categories

    Computation and Language, Machine Learning

    Abstract

    To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, finetuning or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets.

    Bullet Points

    • To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs

    • The latter approach typically outperforms the former, but it requires great human effort to generate suitable datasets of sufficient size

    • We utilize the generative abilities of large and high-performing PLMs to generate entire datasets, which we then use for finetuning much smaller and more efficient models

    • Our fully unsupervised approach surpasses strong baselines on several semantic textual similarity datasets.

  11. Surface Form Competition: Why the Highest Probability Answer Isn't Always Right, Ari Holtzman,Peter West,Vered Shwartz,Yejin Choi,Luke Zettlemoyer, 16-04-2021

    Categories

    Computation and Language

    Abstract

    We introduce Domain Conditional Pointwise Mutual Information, an alternative scoring function that directly compensates for surface form competition by simply reweighing each option according to a term that is proportional to its a priori likelihood within the context of the specific zero-shot task. It achieves consistent gains in zero-shot performance over both calibrated (Zhao et al., 2021) and uncalibrated scoring functions on all GPT-2 and GPT-3 models over a variety of multiple choice datasets.

    Bullet Points

    • Domain Conditional Pointwise Mutual Information is an alternative scoring function that compensates for surface form competition by reweighing each option according to a term proportional to its a priori likelihood within the context of the specific zero-shot task

    • It achieves consistent gains in zero-shot performance over both calibrated (Zhao et al., 2021) and uncalibrated scoring functions on all GPT-2 and GPT-3 models over a variety of multiple choice datasets.
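
    The reweighing can be sketched as follows. It assumes the score of each option is its log-probability given the question minus its log-probability given only a generic domain context, which is one way to express dividing by the option's a priori likelihood; the function names and probabilities are hypothetical.

```python
import math

def pick_by_probability(cond_logprobs):
    """Baseline: choose the option with the highest conditional probability."""
    return max(cond_logprobs, key=cond_logprobs.get)

def pick_by_pmi(cond_logprobs, domain_logprobs):
    """Choose the option whose probability rises most relative to its prior.

    score(y) = log P(y | question) - log P(y | domain context only)
    """
    scores = {y: cond_logprobs[y] - domain_logprobs[y] for y in cond_logprobs}
    return max(scores, key=scores.get)

# Hypothetical probabilities: "puddle" is a common string, so it is likely a priori.
cond = {"puddle": math.log(0.03), "whirlpool bath": math.log(0.02)}
prior = {"puddle": math.log(0.01), "whirlpool bath": math.log(0.001)}

raw_choice = pick_by_probability(cond)
pmi_choice = pick_by_pmi(cond, prior)
print(raw_choice, pmi_choice)
```

    Here the raw probability favors the more frequent surface form, while the reweighed score favors the option whose likelihood increased most once the question was given.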

  12. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity, Yao Lu,Max Bartolo,Alastair Moore,Sebastian Riedel,Pontus Stenetorp, 18-04-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are "fantastic" and some not. We analyse this phenomenon in detail, establishing that: it is present across model sizes (even for the largest current models), it is not related to a specific subset of samples, and that a given good permutation for one model is not transferable to another. While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, we use the generative nature of language models to construct an artificial development set and based on entropy statistics of the candidate permutations on this set, we identify performant prompts. Our method yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.

    Bullet Points

    • Large, pretrained language models like GPT-3 have shown competitive results when primed with only a handful of training samples

    • The order in which the samples are provided can make the difference between near state-of-the-art and random guess performance

    • This phenomenon is present across model sizes, not related to a specific subset of samples, and a given good permutation for one model is not transferable to another

    • A development set can be used to determine which permutations are performant, but it would deviate from the true few-shot setting as it requires additional annotated data

    • Instead, we use the generative nature of language models to construct an artificial development set and identify performant prompts based on entropy statistics

    • Our method yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.
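
    The selection step can be sketched as ranking candidate orderings by an entropy statistic. The abstract does not spell out the statistic, so this sketch assumes one plausible choice: the entropy of the labels predicted on the artificial development set, where orderings that collapse onto a single label score lowest. All names and predictions are hypothetical.

```python
import math
from collections import Counter

def label_entropy(predicted_labels):
    """Entropy (in nats) of the empirical distribution of predicted labels."""
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def rank_permutations(predictions_by_permutation):
    """Sort orderings from highest to lowest predicted-label entropy."""
    return sorted(
        predictions_by_permutation,
        key=lambda perm: label_entropy(predictions_by_permutation[perm]),
        reverse=True,
    )

# Hypothetical predictions on a generated probing set, keyed by example order.
predictions = {
    (0, 1, 2): ["pos"] * 8,                # degenerate: always one label
    (1, 2, 0): ["pos"] * 6 + ["neg"] * 2,  # skewed
    (2, 0, 1): ["pos"] * 4 + ["neg"] * 4,  # balanced
}
ranking = rank_permutations(predictions)
print(ranking)
```

    The balanced ordering ranks first and the degenerate one last, and no annotated development data is needed to compute the ranking.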

  13. GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation, Kang Min Yoo,Dongju Park,Jaewook Kang,Sang-Woo Lee,Woomyeong Park, 18-04-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Large-scale language models such as GPT-3 are excellent few-shot learners, allowing them to be controlled via natural text prompts. Recent studies report that prompt-based direct classification eliminates the need for fine-tuning but lacks data and inference scalability. This paper proposes a novel data augmentation technique that leverages large-scale language models to generate realistic text samples from a mixture of real samples. We also propose utilizing soft-labels predicted by the language models, effectively distilling knowledge from the large-scale language models and creating textual perturbations simultaneously. We perform data augmentation experiments on diverse classification tasks and show that our method hugely outperforms existing text augmentation methods. Ablation studies and a qualitative analysis provide more insights into our approach.

    Bullet Points

    • The paper proposes a new data augmentation technique that utilizes large-scale language models to generate realistic text samples from a mixture of real samples, using soft-labels predicted by the language models

    • In data augmentation experiments on diverse classification tasks, this method hugely outperforms existing text augmentation methods

    • Ablation studies and a qualitative analysis provide more insights into the approach.

  14. The Power of Scale for Parameter-Efficient Prompt Tuning, Brian Lester,Rami Al-Rfou,Noah Constant, 18-04-2021

    Categories

    Computation and Language

    Abstract

    In this work, we explore "prompt tuning", a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's "few-shot" learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.

    Bullet Points

    • Prompt tuning is a simple and effective method for learning "soft prompts" to condition frozen language models to perform specific downstream tasks

    • Soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples

    • The end-to-end learned approach outperforms GPT-3's "few-shot" learning by a large margin, and through ablations on model size using T5, prompt tuning becomes more competitive with scale as models exceed billions of parameters

    • This finding is relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden

    • The method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021), and the paper compares it to this and other similar approaches

    • Finally, conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.
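
    The mechanism can be sketched as prepending a small trainable matrix to the frozen model's input embeddings. This is a minimal illustration; the helper names, sizes, and initialization range are hypothetical and not taken from the paper.

```python
import random

def init_soft_prompt(prompt_len, dim, seed=0):
    """Create the only trainable parameters: prompt_len vectors of size dim."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(prompt_len)]

def build_model_input(soft_prompt, token_embeddings):
    """Concatenate [soft prompt; token embeddings] as input to the frozen model."""
    return soft_prompt + token_embeddings

prompt_len, dim, num_tokens = 4, 8, 6
soft_prompt = init_soft_prompt(prompt_len, dim)

# Stand-in for the frozen embedding lookup of a 6-token input.
token_embeddings = [[0.0] * dim for _ in range(num_tokens)]

model_input = build_model_input(soft_prompt, token_embeddings)
trainable_params = prompt_len * dim
print(len(model_input), trainable_params)
```

    Only the soft prompt would receive gradient updates during training, so one copy of the frozen model can serve many tasks by swapping in a different prompt matrix per task.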

  15. PTR: Prompt Tuning with Rules for Text Classification, Xu Han,Weilin Zhao,Ning Ding,Zhiyuan Liu,Maosong Sun, 24-05-2021

    Categories

    Computation and Language

    Abstract

    Fine-tuned pre-trained language models (PLMs) have achieved awesome performance on almost all NLP tasks. By using additional prompts to fine-tune PLMs, we can further stimulate the rich knowledge distributed in PLMs to better serve downstream tasks. Prompt tuning has achieved promising results on some few-class classification tasks such as sentiment classification and natural language inference. However, manually designing lots of language prompts is cumbersome and fallible. For those auto-generated prompts, it is also expensive and time-consuming to verify their effectiveness in non-few-shot scenarios. Hence, it is still challenging for prompt tuning to address many-class classification tasks. To this end, we propose prompt tuning with rules (PTR) for many-class text classification and apply logic rules to construct prompts with several sub-prompts. In this way, PTR is able to encode prior knowledge of each class into prompt tuning. We conduct experiments on relation classification, a typical and complicated many-class classification task, and the results show that PTR can significantly and consistently outperform existing state-of-the-art baselines. This indicates that PTR is a promising approach to take advantage of both human prior knowledge and PLMs for those complicated classification tasks.

    Bullet Points

    • Fine-tuned pre-trained language models (PLMs) have achieved great performance on almost all NLP tasks

    • By using additional prompts, we can stimulate the rich knowledge distributed in PLMs to better serve downstream tasks

    • Prompt tuning has achieved promising results on some few-class classification tasks such as sentiment classification and natural language inference

    • However, manually designing lots of language prompts is cumbersome and fallible, and auto-generated prompts are expensive and time-consuming to verify their effectiveness in non-few-shot scenarios

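    • Prompt tuning therefore still struggles with many-class classification tasks; prompt tuning with rules (PTR) addresses this by applying logic rules to construct prompts from several sub-prompts, which encodes prior knowledge of each class into prompt tuning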

    • We conduct experiments on relation classification, and PTR can significantly outperform existing state-of-the-art baselines.

  16. True Few-Shot Learning with Language Models, Ethan Perez,Douwe Kiela,Kyunghyun Cho, 24-05-2021

    Categories

    Computation and Language, Machine Learning, Machine Learning

    Abstract

    Pretrained language models (LMs) perform well on many tasks even when learning from a few examples, but prior work uses many held-out examples to tune various aspects of learning, such as hyperparameters, training objectives, and natural language templates ("prompts"). Here, we evaluate the few-shot ability of LMs when such held-out examples are unavailable, a setting we call true few-shot learning. We test two model selection criteria, cross-validation and minimum description length, for choosing LM prompts and hyperparameters in the true few-shot setting. On average, both marginally outperform random selection and greatly underperform selection based on held-out examples. Moreover, selection criteria often prefer models that perform significantly worse than randomly-selected ones. We find similar results even when taking into account our uncertainty in a model's true performance during selection, as well as when varying the amount of computation and number of examples used for selection. Overall, our findings suggest that prior work significantly overestimated the true few-shot ability of LMs given the difficulty of few-shot model selection.

    Bullet Points

    • Pretrained language models (LMs) perform well even when learning from a few examples, but prior work uses many held-out examples to tune various aspects of learning, such as hyperparameters, training objectives, and natural language templates ("prompts")

    • To evaluate the few-shot ability of LMs when these held-out examples are unavailable (true few-shot learning), we test two model selection criteria, cross-validation and minimum description length, for choosing LM prompts and hyperparameters

    • On average, both criteria marginally outperform random selection and greatly underperform selection based on held-out examples

    • Selection criteria often prefer models that perform significantly worse than randomly-selected ones

    • Similar results are found even when taking into account our uncertainty in a model's true performance during selection, as well as when varying the amount of computation and number of examples used for selection.

  17. LoRA: Low-Rank Adaptation of Large Language Models, Edward J. Hu,Yelong Shen,Phillip Wallis,Zeyuan Allen-Zhu,Yuanzhi Li,Shean Wang,Lu Wang,Weizhu Chen, 17-06-2021

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    Bullet Points

  18. Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning, Colin Wei,Sang Michael Xie,Tengyu Ma, 17-06-2021

    Categories

    Machine Learning, Machine Learning

    Abstract

    Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However, theoretical analysis of these models is scarce and challenging since the pretraining and downstream tasks can be very different. We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downstream classifier must recover a function of the posterior distribution over the latent variables. We analyze head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning in this setting. The generative model in our analysis is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language. We show that 1) under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, 2) prompt tuning obtains downstream guarantees with weaker non-degeneracy conditions, and 3) our recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM because task-relevant information is easier to recover from the long-term memory. Experiments on synthetically generated data from HMMs back our theoretical findings.

    Bullet Points

    • Pretrained language models have achieved state-of-the-art performance when adapted to downstream NLP tasks

    • However, theoretical analysis of these models is scarce and challenging due to differences in pretraining and downstream tasks

    • An analysis framework is proposed that links the pretraining and downstream tasks through an underlying latent variable generative model of text, where the downstream classifier must recover a function of the posterior distribution over the latent variables

    • Head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning are analyzed in this setting

    • The generative model is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language

    • Under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, while prompt tuning obtains downstream guarantees with weaker conditions

    • Recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM because task-relevant information is easier to recover from the long-term memory

  19. Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models, Robert L. Logan IV,Ivana Balažević,Eric Wallace,Fabio Petroni,Sameer Singh,Sebastian Riedel, 24-06-2021

    Categories

    Computation and Language, Machine Learning

    Abstract

    Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve competitive accuracy to manually-tuned prompts across a wide range of tasks. While finetuning LMs does introduce new parameters for each downstream task, we show that this memory overhead can be substantially reduced: finetuning only the bias terms can achieve comparable or better accuracy than standard finetuning while only updating 0.1% of the parameters. All in all, we recommend finetuning LMs for few-shot learning as it is more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.

    Bullet Points

    • Finetuning language models (LMs) in the few-shot setting can considerably reduce the need for prompt engineering

    • Null prompts, which contain neither task-specific templates nor training examples, achieve accuracy competitive with manually tuned prompts across a wide range of tasks

    • The memory overhead of finetuning can be substantially reduced: finetuning only the bias terms updates just 0.1% of the parameters while achieving comparable or better accuracy than standard finetuning

    • Finetuning LMs for few-shot learning is more accurate, more robust to different prompts, and can be made nearly as efficient as using frozen LMs.
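
    To give a feel for why bias-only finetuning touches so few parameters, the sketch below counts the share of bias terms in a stack of linear layers. The layer shapes are hypothetical, and embeddings and normalization layers are ignored, so the result only illustrates the order of magnitude and is not the paper's own count.

```python
def bias_fraction(layer_shapes):
    """Return the share of parameters that are bias terms.

    layer_shapes: list of (input width, output width) pairs, one per linear layer.
    """
    weights = sum(d_in * d_out for d_in, d_out in layer_shapes)
    biases = sum(d_out for _, d_out in layer_shapes)
    return biases / (weights + biases)


# Hypothetical block: four square projections plus a two-layer feed-forward part.
shapes = [(1024, 1024)] * 4 + [(1024, 4096), (4096, 1024)]
frac = bias_fraction(shapes)
print(f"trainable share when only biases are updated: {frac:.4%}")
```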

  20. Deduplicating Training Data Makes Language Models Better, Katherine Lee,Daphne Ippolito,Andrew Nystrom,Chiyuan Zhang,Douglas Eck,Chris Callison-Burch,Nicholas Carlini, 14-07-2021

    Categories

    Computation and Language, Machine Learning

    Abstract

    Bullet Points

  21. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing, Pengfei Liu,Weizhe Yuan,Jinlan Fu,Zhengbao Jiang,Hiroaki Hayashi,Graham Neubig, 28-07-2021

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    including constantly-updated survey, and paperlist.

    Bullet Points


  22. Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification, Shengding Hu,Ning Ding,Huadong Wang,Zhiyuan Liu,Jingang Wang,Juanzi Li,Wei Wu,Maosong Sun, 04-08-2021

    Categories

    Computation and Language

    Abstract

    Tuning pre-trained language models (PLMs) with task-specific prompts has been a promising approach for text classification. Particularly, previous studies suggest that prompt-tuning has remarkable superiority in the low-data scenario over the generic fine-tuning methods with extra classifiers. The core idea of prompt-tuning is to insert text pieces, i.e., template, to the input and transform a classification problem into a masked language modeling problem, where a crucial step is to construct a projection, i.e., verbalizer, between a label space and a label word space. A verbalizer is usually handcrafted or searched by gradient descent, which may lack coverage and bring considerable bias and high variances to the results. In this work, we focus on incorporating external knowledge into the verbalizer, forming a knowledgeable prompt-tuning (KPT), to improve and stabilize prompt-tuning. Specifically, we expand the label word space of the verbalizer using external knowledge bases (KBs) and refine the expanded label word space with the PLM itself before predicting with the expanded label word space. Extensive experiments on zero and few-shot text classification tasks demonstrate the effectiveness of knowledgeable prompt-tuning.

    Bullet Points

    • Tuning pre-trained language models (PLMs) with task-specific prompts has been a promising approach for text classification

    • Previous studies suggest that prompt-tuning is markedly superior to generic fine-tuning methods with extra classifiers in the low-data scenario

    • The core idea is to insert text pieces (a template) into the input and transform a classification problem into a masked language modeling problem, where a crucial step is to construct a projection, i.e., a verbalizer, between a label space and a label word space

    • This work incorporates external knowledge into the verbalizer, forming knowledgeable prompt-tuning (KPT): the label word space is expanded using external knowledge bases (KBs) and then refined with the PLM itself before prediction

    • Extensive experiments on zero and few-shot text classification tasks demonstrate its effectiveness.
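
    As a rough illustration of the verbalizer idea described above, the sketch below maps masked-word probabilities to class labels. The probabilities, the label word lists, and the simple averaging rule are all assumptions made for this example; they are not taken from the paper.

```python
# Hypothetical masked-LM probabilities for candidate words at the [MASK] position.
MASK_PROBS = {"physics": 0.20, "chemistry": 0.15, "football": 0.05, "tennis": 0.02}

# Verbalizer: each label maps to a list of label words. In KPT these lists would
# be expanded from an external knowledge base; the entries here are made up.
VERBALIZER = {
    "science": ["physics", "chemistry"],
    "sports": ["football", "tennis"],
}


def label_scores(mask_probs, verbalizer):
    """Average the [MASK] probability of each label's words (assumed aggregation)."""
    return {
        label: sum(mask_probs.get(word, 0.0) for word in words) / len(words)
        for label, words in verbalizer.items()
    }


def classify(mask_probs, verbalizer):
    """Return the label whose label words receive the highest average probability."""
    scores = label_scores(mask_probs, verbalizer)
    return max(scores, key=scores.get)


predicted = classify(MASK_PROBS, VERBALIZER)
print(predicted)
```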

  23. Noisy Channel Language Model Prompting for Few-Shot Text Classification, Sewon Min,Mike Lewis,Hannaneh Hajishirzi,Luke Zettlemoyer, 09-08-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    We introduce a noisy channel approach for language model prompting in few-shot text classification. Instead of computing the likelihood of the label given the input (referred as direct models), channel models compute the conditional probability of the input given the label, and are thereby required to explain every word in the input. We use channel models for recently proposed few-shot learning methods with no or very limited updates to the language model parameters, via either in-context demonstration or prompt tuning. Our experiments show that, for both methods, channel models significantly outperform their direct counterparts, which we attribute to their stability, i.e., lower variance and higher worst-case accuracy. We also present extensive ablations that provide recommendations for when to use channel prompt tuning instead of other competitive methods (e.g., direct head tuning): channel prompt tuning is preferred when the number of training examples is small, labels in the training data are imbalanced, or generalization to unseen labels is required.

    Bullet Points

    • We introduce a noisy channel approach for language model prompting in few-shot text classification: instead of computing the likelihood of the label given the input (direct models), channel models compute the conditional probability of the input given the label and are thereby required to explain every word in the input

    • We use channel models for recently proposed few-shot learning methods with no or very limited updates to the language model parameters, via either in-context demonstration or prompt tuning

    • Our experiments show that channel models significantly outperform their direct counterparts, which is attributed to their stability, i.e., lower variance and higher worst-case accuracy

    • Additionally, ablations recommend channel prompt tuning over other competitive methods such as direct head tuning when the number of training examples is small, labels in the training data are imbalanced, or generalization to unseen labels is required.
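
    The contrast between direct and channel scoring can be sketched with a toy model. Here a hypothetical per-label word table stands in for a language model's probability of the input given the label, and the explicit label prior is an assumption of this example rather than a detail confirmed by the abstract.

```python
import math

# Hypothetical P(word | label) tables standing in for a language model.
WORD_PROBS = {
    "positive": {"great": 0.5, "movie": 0.4, "boring": 0.1},
    "negative": {"great": 0.1, "movie": 0.4, "boring": 0.5},
}


def channel_log_score(words, label, prior):
    """log P(label) + sum of log P(word | label) over every input word."""
    table = WORD_PROBS[label]
    return math.log(prior) + sum(math.log(table[w]) for w in words)


def channel_predict(words, priors):
    """Pick the label that best explains the whole input."""
    return max(priors, key=lambda y: channel_log_score(words, y, priors[y]))


priors = {"positive": 0.5, "negative": 0.5}
prediction = channel_predict(["boring", "movie"], priors)
print(prediction)
```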

  24. FedPara: Low-Rank Hadamard Product for Communication-Efficient Federated Learning, Nam Hyeon-Woo,Moon Ye-Bin,Tae-Hyun Oh, 13-08-2021

    Categories

    Machine Learning, Computer Vision

    Abstract

    In this work, we propose a communication-efficient parameterization, FedPara, for federated learning (FL) to overcome the burdens on frequent model uploads and downloads. Our method re-parameterizes weight parameters of layers using low-rank weights followed by the Hadamard product. Compared to the conventional low-rank parameterization, our FedPara method is not restricted to low-rank constraints, and thereby it has a far larger capacity. This property enables to achieve comparable performance while requiring 3 to 10 times lower communication costs than the model with the original layers, which is not achievable by the traditional low-rank methods. The efficiency of our method can be further improved by combining with other efficient FL optimizers. In addition, we extend our method to a personalized FL application, pFedPara, which separates parameters into global and local ones. We show that pFedPara outperforms competing personalized FL methods with more than three times fewer parameters.

    Bullet Points

    • The work proposes FedPara, a communication-efficient parameterization for federated learning (FL) that re-parameterizes the weight parameters of layers using low-rank weights followed by the Hadamard product

    • This allows for comparable performance while requiring 3 to 10 times lower communication costs than the model with the original layers

    • The efficiency of the method can be improved by combining with other efficient FL optimizers

    • Additionally, pFedPara outperforms competing personalized FL methods with more than three times fewer parameters.
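
    One reading of the low-rank Hadamard product is that a weight matrix is built as the element-wise product of two low-rank matrices, which lets its rank exceed that of either factor. The sketch below checks this on a toy 4x4 case with exact arithmetic. The factor values and helper names are hypothetical, and the sizes are chosen for readability rather than for parameter savings.

```python
from fractions import Fraction


def matmul_t(x, y):
    """Return x @ y^T for x of shape (m, r) and y of shape (n, r)."""
    return [[sum(a * b for a, b in zip(rx, ry)) for ry in y] for rx in x]


def hadamard(a, b):
    """Element-wise product of two equally shaped matrices."""
    return [[p * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def rank(matrix):
    """Matrix rank via exact Gaussian elimination over fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rnk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rnk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rnk], rows[pivot] = rows[pivot], rows[rnk]
        for i in range(rnk + 1, len(rows)):
            factor = rows[i][col] / rows[rnk][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rnk])]
        rnk += 1
    return rnk


# Hypothetical rank-2 factors (4 rows, inner dimension 2).
A = [[1, 0], [0, 1], [1, 1], [1, 2]]
B = [[1, 1], [1, -1], [1, 2], [2, 1]]

w1 = matmul_t(A, A)   # low-rank matrix, rank 2
w2 = matmul_t(B, B)   # low-rank matrix, rank 2
w = hadamard(w1, w2)  # element-wise product of the two

print(rank(w1), rank(w2), rank(w))
```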

  25. Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners, Ningyu Zhang,Luoqiu Li,Xiang Chen,Shumin Deng,Zhen Bi,Chuanqi Tan,Fei Huang,Huajun Chen, 30-08-2021

    Categories

    Computation and Language, Artificial Intelligence, Computer Vision, Information Retrieval, Machine Learning

    Abstract


    Bullet Points


  26. Want To Reduce Labeling Cost? GPT-3 Can Help, Shuohang Wang,Yang Liu,Yichong Xu,Chenguang Zhu,Michael Zeng, 30-08-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Data annotation is a time-consuming and labor-intensive process for many NLP tasks. Although there exist various methods to produce pseudo data labels, they are often task-specific and require a decent amount of labeled data to start with. Recently, the immense language model GPT-3 with 175 billion parameters has achieved tremendous improvement across many few-shot learning tasks. In this paper, we explore ways to leverage GPT-3 as a low-cost data labeler to train other models. We find that, to make the downstream model achieve the same performance on a variety of NLU and NLG tasks, it costs 50% to 96% less to use labels from GPT-3 than using labels from humans. Furthermore, we propose a novel framework of combining pseudo labels from GPT-3 with human labels, which leads to even better performance with limited labeling budget. These results present a cost-effective data labeling methodology that is generalizable to many practical applications.

    Bullet Points

    • The paper explores ways to leverage GPT-3 as a low-cost data labeler to train other models, finding that, for the downstream model to reach the same performance on a variety of NLU and NLG tasks, it costs 50% to 96% less to use labels from GPT-3 than labels from humans

    • Additionally, a novel framework of combining pseudo labels with human labels leads to even better performance with limited labeling budget

    • These results present a cost-effective data labeling methodology that is generalizable to many practical applications.

  27. Do Prompt-Based Models Really Understand the Meaning of their Prompts?, Albert Webson,Ellie Pavlick, 02-09-2021

    Categories

    Computation and Language

    Abstract

    Recently, a boom of papers has shown extraordinary progress in zero-shot and few-shot learning with various prompt-based models. It is commonly argued that prompts help models to learn faster in the same way that humans learn faster when provided with task instructions expressed in natural language. In this study, we experiment with over 30 prompt templates manually written for natural language inference (NLI). We find that models learn just as fast with many prompts that are intentionally irrelevant or even pathologically misleading as they do with instructively "good" prompts. Further, such patterns hold even for models as large as 175 billion parameters (Brown et al., 2020) as well as the recently proposed instruction-tuned models which are trained on hundreds of prompts (Sanh et al., 2022). That is, instruction-tuned models often produce good predictions with irrelevant and misleading prompts even at zero shots. In sum, notwithstanding prompt-based models' impressive improvement, we find evidence of serious limitations that question the degree to which such improvement is derived from models understanding task instructions in ways analogous to humans' use of task instructions.

    Bullet Points

    • A boom of recent papers shows extraordinary progress in zero-shot and few-shot learning with prompt-based models, and it is commonly argued that prompts help models learn faster in the same way that task instructions in natural language help humans

    • Experiments with over 30 manually written prompt templates for natural language inference (NLI) find that models learn just as fast with many intentionally irrelevant or even pathologically misleading prompts as they do with instructively "good" prompts

    • These patterns hold even for models as large as 175 billion parameters and the recently proposed instruction-tuned models which are trained on hundreds of prompts (Sanh et al., 2022)

    • Despite this impressive improvement, there are serious limitations that question the degree to which such improvement is derived from models understanding task instructions in ways analogous to humans' use of task instructions.

  28. Finetuned Language Models Are Zero-Shot Learners, Jason Wei,Maarten Bosma,Vincent Y. Zhao,Kelvin Guu,Adams Wei Yu,Brian Lester,Nan Du,Andrew M. Dai,Quoc V. Le, 03-09-2021

    Categories

    Computation and Language

    Abstract

    We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.

    Bullet Points

    • FLAN is a 137B parameter pretrained language model instruction-tuned on over 60 NLP tasks verbalized via natural language instruction templates; evaluated on unseen task types, it substantially improves on its unmodified counterpart

    • It surpasses zero-shot 175B GPT-3 on 20 of 25 tasks evaluated

    • FLAN outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze

    • Number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.

  29. General-Purpose Question-Answering with Macaw, Oyvind Tafjord,Peter Clark, 06-09-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

  30. Discrete and Soft Prompting for Multilingual Models, Mengjie Zhao,Hinrich Schütze, 08-09-2021

    Categories

    Computation and Language

    Abstract

    It has been shown for English that discrete and soft prompting perform strongly in few-shot learning with pretrained language models (PLMs). In this paper, we show that discrete and soft prompting perform better than finetuning in multilingual cases: Crosslingual transfer and in-language training of multilingual natural language inference. For example, with 48 English training examples, finetuning obtains 33.74% accuracy in crosslingual transfer, barely surpassing the majority baseline (33.33%). In contrast, discrete and soft prompting outperform finetuning, achieving 36.43% and 38.79%. We also demonstrate good performance of prompting with training data in multiple languages other than English.

    Bullet Points

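    • Discrete and soft prompting, which perform strongly in few-shot learning with pretrained language models (PLMs) for English, also outperform finetuning in multilingual cases: crosslingual transfer and in-language training of multilingual natural language inference

    • With 48 English training examples, finetuning reaches 33.74% accuracy in crosslingual transfer, barely above the majority baseline (33.33%), while discrete and soft prompting reach 36.43% and 38.79%

    • The paper also shows good performance of prompting with training data in multiple languages other than English.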

  31. Open Aspect Target Sentiment Classification with Natural Language Prompts, Ronald Seoh,Ian Birle,Mrinal Tak,Haw-Shiuan Chang,Brian Pinette,Alfred Hough, 08-09-2021

    Categories

    Computation and Language, Machine Learning

    Abstract

    For many business applications, we often seek to analyze sentiments associated with any arbitrary aspects of commercial products, despite having a very limited amount of labels or even without any labels at all. However, existing aspect target sentiment classification (ATSC) models are not trainable if annotated datasets are not available. Even with labeled data, they fall short of reaching satisfactory performance. To address this, we propose simple approaches that better solve ATSC with natural language prompts, enabling the task under zero-shot cases and enhancing supervised settings, especially for few-shot cases. Under the few-shot setting for SemEval 2014 Task 4 laptop domain, our method of reformulating ATSC as an NLI task outperforms supervised SOTA approaches by up to 24.13 accuracy points and 33.14 macro F1 points. Moreover, we demonstrate that our prompts could handle implicitly stated aspects as well: our models reach about 77% accuracy on detecting sentiments for aspect categories (e.g., food), which do not necessarily appear within the text, even though we trained the models only with explicitly mentioned aspect terms (e.g., fajitas) from just 16 reviews - while the accuracy of the no-prompt baseline is only around 65%.

    Bullet Points

    • Simple approaches based on natural language prompts are proposed for aspect target sentiment classification (ATSC), enabling the task in zero-shot cases and enhancing supervised settings, especially few-shot ones

    • Under the few-shot setting for the SemEval 2014 Task 4 laptop domain, reformulating ATSC as an NLI task outperforms supervised SOTA approaches by up to 24.13 accuracy points and 33.14 macro F1 points

    • Additionally, the prompts can handle implicitly stated aspects

    • The models reach about 77% accuracy on detecting sentiments for aspect categories (e.g., food), which do not necessarily appear within the text, despite training only with explicitly mentioned aspect terms (e.g., fajitas) from just 16 reviews, while the no-prompt baseline reaches only around 65%.

  32. Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning, Prasetya Ajie Utama,Nafise Sadat Moosavi,Victor Sanh,Iryna Gurevych, 09-09-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Recent prompt-based approaches allow pretrained language models to achieve strong performances on few-shot finetuning by reformulating downstream tasks as a language modeling problem. In this work, we demonstrate that, despite its advantages on low data regimes, finetuned prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap, e.g., models incorrectly assuming a sentence pair is of the same meaning because they consist of the same set of words. Interestingly, we find that this particular inference heuristic is significantly less present in the zero-shot evaluation of the prompt-based model, indicating how finetuning can be destructive to useful knowledge learned during the pretraining. We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning. Our evaluation on three datasets demonstrates promising improvements on the three corresponding challenge datasets used to diagnose the inference heuristics.

    Bullet Points

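    • Prompt-based approaches allow pretrained language models to achieve strong few-shot finetuning performance by reformulating downstream tasks as a language modeling problem

    • However, finetuned prompt-based models for sentence pair classification still adopt inference heuristics based on lexical overlap, e.g., assuming that a sentence pair has the same meaning because both sentences consist of the same set of words

    • This heuristic is significantly less present in zero-shot evaluation of the prompt-based model, indicating that finetuning can be destructive to useful knowledge learned during pretraining

    • Adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning

    • Evaluation on three datasets shows promising improvements on the three corresponding challenge datasets used to diagnose the inference heuristics.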

  33. PPT: Pre-trained Prompt Tuning for Few-shot Learning, Yuxian Gu,Xu Han,Zhiyuan Liu,Minlie Huang, 09-09-2021

    Categories

    Computation and Language

    Abstract

    Prompts for pre-trained language models (PLMs) have shown remarkable performance by bridging the gap between pre-training tasks and various downstream tasks. Among these methods, prompt tuning, which freezes PLMs and only tunes soft prompts, provides an efficient and effective solution for adapting large-scale PLMs to downstream tasks. However, prompt tuning is yet to be fully explored. In our pilot experiments, we find that prompt tuning performs comparably with conventional full-model fine-tuning when downstream data are sufficient, whereas it performs much worse under few-shot learning settings, which may hinder the application of prompt tuning in practice. We attribute this low performance to the manner of initializing soft prompts. Therefore, in this work, we propose to pre-train prompts by adding soft prompts into the pre-training stage to obtain a better initialization. We name this Pre-trained Prompt Tuning framework "PPT". To ensure the generalization of PPT, we formulate similar classification tasks into a unified task form and pre-train soft prompts for this unified task. Extensive experiments show that tuning pre-trained prompts for downstream tasks can reach or even outperform full-model fine-tuning under both full-data and few-shot settings. Our approach is effective and efficient for using large-scale PLMs in practice.

    Bullet Points

    • Prompts for pre-trained language models (PLMs) have shown remarkable performance by bridging the gap between pre-training tasks and downstream tasks

    • However, prompt tuning is yet to be fully explored

    • Prompt tuning freezes the PLM and tunes only soft prompts; in pilot experiments it performs comparably with conventional full-model fine-tuning when downstream data are sufficient, whereas it performs much worse under few-shot learning settings, which may hinder its application in practice

    • The low few-shot performance is attributed to the way soft prompts are initialized, so the paper proposes Pre-trained Prompt Tuning (PPT): adding soft prompts into the pre-training stage to obtain a better initialization, with similar classification tasks formulated into a unified task form so that the pre-trained prompts generalize

    • Tuning pre-trained prompts for downstream tasks can reach or even outperform full-model fine-tuning under both full-data and few-shot settings

    • The approach is effective and efficient for using large-scale PLMs in practice.

  34. CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems, Fei Mi,Yitong Li,Yasheng Wang,Xin Jiang,Qun Liu, 10-09-2021

    Categories

    Computation and Language, Machine Learning

    Abstract

    As labeling cost for different modules in task-oriented dialog (ToD) systems is high, a major challenge in practice is to learn different tasks with the least amount of labeled data. Recently, prompting methods over pre-trained language models (PLMs) have shown promising results for few-shot learning in ToD. To better utilize the power of PLMs, this paper proposes Comprehensive Instruction (CINS) that exploits PLMs with extra task-specific instructions. We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD, i.e. intent classification, dialog state tracking, and natural language generation. A sequence-to-sequence model (T5) is adopted to solve these three tasks in a unified framework. Extensive experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data. Empirical results demonstrate that the proposed CINS approach consistently improves techniques that finetune PLMs with raw input or short prompts.

    Bullet Points

    • The paper proposes Comprehensive Instruction (CINS), which exploits PLMs with extra task-specific instructions, using a schema (definition, constraint, prompt) of instructions and customized realizations for three important downstream tasks in task-oriented dialog (ToD): intent classification, dialog state tracking, and natural language generation

    • A sequence-to-sequence model (T5) is adopted to solve these three tasks in a unified framework

    • Extensive experiments are conducted on these tasks in realistic few-shot learning scenarios with small validation data

    • Empirical results demonstrate that the proposed CINS approach consistently improves on techniques that finetune PLMs with raw input or short prompts.

  35. PoKE: A Prompt-based Knowledge Eliciting Approach for Event Argument Extraction, Jiaju Lin,Qin Chen, 11-09-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Eliciting knowledge from pre-trained language models via prompt-based learning has shown great potential in many natural language processing tasks. Whereas, the applications for more complex tasks such as event extraction are less studied since the design of prompt is not straightforward for the structured event containing various triggers and arguments. Meanwhile, current conditional generation methods employ large encoder-decoder models, which are costly to train and serve. In this paper, we present a novel prompt-based approach, which elicits both the independent and joint knowledge about different events for event argument extraction. The experimental results on the benchmark ACE2005 dataset show the great advantages of our proposed approach. In particular, our approach is superior to the recent advanced methods in both fully-supervised and low-resource scenarios.

    Bullet Points

    • Prompt-based learning has shown great potential in many natural language processing tasks, but more complex tasks such as event extraction are less studied because prompt design is not straightforward for structured events containing various triggers and arguments

    • Current conditional generation methods employ large encoder-decoder models, which are costly to train and serve

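    • The proposed prompt-based approach elicits both independent and joint knowledge about different events for event argument extraction and, on the ACE2005 benchmark, is superior to recent advanced methods in both fully-supervised and low-resource scenarios.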

  36. Exploring Prompt-based Few-shot Learning for Grounded Dialog Generation, Chujie Zheng,Minlie Huang, 14-09-2021

    Categories

    Computation and Language

    Abstract

    Dialog models can be greatly strengthened through grounding on various external information, but grounded dialog corpora are usually not naturally accessible. In this work, we focus on the few-shot learning for grounded dialog generation (GDG). We first propose a simple prompting method for GDG tasks, where different constructs of model input, such as the grounding source and the conversation context, are distinguished through continuous or discrete prompts. On three typical GDG tasks, we empirically demonstrate and analyze in-depth the effectiveness of our method. We then conduct extensive experiments to thoroughly investigate how our prompting method works with different pre-trained models. We show that prompted language models perform superiorly to conversational models, and further analyze various factors that influence the effects of prompting. Overall, our work introduces a prompt-based perspective to the few-shot learning for GDG tasks, and provides valuable findings and insights for future research.

    Bullet Points

    • The work focuses on few-shot learning for grounded dialog generation (GDG) and proposes a simple prompting method for GDG tasks where different constructs of model input are distinguished through continuous or discrete prompts

    • The method's effectiveness is demonstrated and analyzed on three typical GDG tasks, and extensive experiments investigate how it works with different pre-trained models

    • Prompted language models perform better than conversational models, and the paper further analyzes various factors that influence the effects of prompting

    • This approach provides valuable insights for future research.

  37. Can Language Models be Biomedical Knowledge Bases?, Mujeen Sung,Jinhyuk Lee,Sean Yi,Minji Jeon,Sungdong Kim,Jaewoo Kang, 15-09-2021

    Categories

    Computation and Language

    Abstract

    Pre-trained language models (LMs) have become ubiquitous in solving various natural language processing (NLP) tasks. There has been increasing interest in what knowledge these LMs contain and how we can extract that knowledge, treating LMs as knowledge bases (KBs). While there has been much work on probing LMs in the general domain, there has been little attention to whether these powerful LMs can be used as domain-specific KBs. To this end, we create the BioLAMA benchmark, which is comprised of 49K biomedical factual knowledge triples for probing biomedical LMs. We find that biomedical LMs with recently proposed probing methods can achieve up to 18.51% Acc@5 on retrieving biomedical knowledge. Although this seems promising given the task difficulty, our detailed analyses reveal that most predictions are highly correlated with prompt templates without any subjects, hence producing similar results on each relation and hindering their capabilities to be used as domain-specific KBs. We hope that BioLAMA can serve as a challenging benchmark for biomedical factual probing.

    Bullet Points

    • BioLAMA is a benchmark of 49K biomedical factual knowledge triples for probing biomedical LMs; with recently proposed probing methods, these LMs achieve up to 18.51% Acc@5 on retrieving biomedical knowledge

    • However, most predictions are highly correlated with prompt templates without any subjects, producing similar results on each relation and hindering their capabilities to be used as domain-specific KBs.

  38. Language Models are Few-shot Multilingual Learners, Genta Indra Winata,Andrea Madotto,Zhaojiang Lin,Rosanne Liu,Jason Yosinski,Pascale Fung, 16-09-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    General-purpose language models have demonstrated impressive capabilities, performing on par with state-of-the-art approaches on a range of downstream natural language processing (NLP) tasks and benchmarks when inferring instructions from very few examples. Here, we evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages without any parameter updates. We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones. Finally, we find the in-context few-shot cross-lingual prediction results of language models are significantly better than random prediction, and they are competitive compared to the existing state-of-the-art cross-lingual models.

    Bullet Points

    • The paper evaluates the multilingual skills of GPT and T5 models on multi-class classification in non-English languages without any parameter updates

    • Given a few English examples as context, pre-trained language models can predict both English and non-English test samples

    • In-context few-shot cross-lingual prediction results are significantly better than random prediction and are competitive with existing state-of-the-art cross-lingual models.
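
    The cross-lingual setup described above can be sketched as a prompt made of English demonstrations followed by a non-English query. The prompt template, the helper name `build_few_shot_prompt`, and the example texts are assumptions for illustration only; the abstract does not specify the format the authors used.

```python
def build_few_shot_prompt(examples, query, labels):
    """Build an in-context classification prompt.

    examples: list of (text, label) pairs used as demonstrations.
    query: the text to classify (may be in a different language).
    labels: the allowed label names.
    """
    lines = ["Classify the text into one of: " + ", ".join(labels) + ".", ""]
    for text, label in examples:
        lines.append("Text: " + text)
        lines.append("Label: " + label)
        lines.append("")
    # The query is left unlabeled so the model completes the final line.
    lines.append("Text: " + query)
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical English demonstrations and a Spanish test sample.
english_examples = [
    ("Book a table for two tonight.", "restaurant"),
    ("What is the weather like tomorrow?", "weather"),
]
prompt = build_few_shot_prompt(
    english_examples,
    "¿Va a llover este fin de semana?",
    labels=["restaurant", "weather"],
)
```

    The resulting string would be sent to a frozen language model, whose continuation after the final `Label:` is read as the prediction.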

  39. Reframing Instructional Prompts to GPTk's Language, Swaroop Mishra,Daniel Khashabi,Chitta Baral,Yejin Choi,Hannaneh Hajishirzi, 16-09-2021

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    What kinds of instructional prompts are easier to follow for Language Models (LMs)? We study this question by conducting extensive empirical analysis that shed light on important features of successful instructional prompts. Specifically, we study several classes of reframing techniques for manual reformulation of prompts into more effective ones. Some examples include decomposing a complex task instruction into multiple simpler tasks or itemizing instructions into sequential steps. Our experiments compare the zero-shot and few-shot performance of LMs prompted with reframed instructions on 12 NLP tasks across 6 categories. Compared with original instructions, our reframed instructions lead to significant improvements across LMs with different sizes. For example, the same reframed prompts boost few-shot performance of GPT3-series and GPT2-series by 12.5% and 6.7% respectively averaged over all tasks. Furthermore, reframed instructions reduce the number of examples required to prompt LMs in the few-shot setting. We hope these empirically-driven techniques will pave the way towards more effective future prompting algorithms.

    Bullet Points

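    • The paper studies several classes of reframing techniques for manually reformulating instructional prompts into ones that Language Models (LMs) follow more easily, such as decomposing a complex task instruction into multiple simpler tasks or itemizing instructions into sequential steps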

    • Reframed instructions lead to significant improvements across LMs with different sizes, boosting few-shot performance of GPT3-series and GPT2-series by 12.5% and 6.7% respectively averaged over all tasks

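    • Reframed instructions also reduce the number of examples required in the few-shot setting, and the authors hope these techniques will pave the way towards more effective future prompting algorithms.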

  40. SentiPrompt: Sentiment Knowledge Enhanced Prompt-Tuning for Aspect-Based Sentiment Analysis, Chengxi Li,Feiyu Gao,Jiajun Bu,Lu Xu,Xiang Chen,Yu Gu,Zirui Shao,Qi Zheng,Ningyu Zhang,Yongpan Wang,Zhi Yu, 17-09-2021

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Aspect-based sentiment analysis (ABSA) is an emerging fine-grained sentiment analysis task that aims to extract aspects, classify corresponding sentiment polarities and find opinions as the causes of sentiment. The latest research tends to solve the ABSA task in a unified way with end-to-end frameworks. Yet, these frameworks get fine-tuned from downstream tasks without any task-adaptive modification. Specifically, they do not use task-related knowledge well or explicitly model relations between aspect and opinion terms, hindering them from better performance. In this paper, we propose SentiPrompt to use sentiment knowledge enhanced prompts to tune the language model in the unified framework. We inject sentiment knowledge regarding aspects, opinions, and polarities into prompt and explicitly model term relations via constructing consistency and polarity judgment templates from the ground truth triplets. Experimental results demonstrate that our approach can outperform strong baselines on Triplet Extraction, Pair Extraction, and Aspect Term Extraction with Sentiment Classification by a notable margin.

    Bullet Points

    • SentiPrompt uses sentiment knowledge enhanced prompts to tune the language model in a unified framework for aspect-based sentiment analysis (ABSA), addressing existing end-to-end frameworks that are fine-tuned without any task-adaptive modification

    • It injects sentiment knowledge regarding aspects, opinions, and polarities into prompts and explicitly models term relations via consistency and polarity judgment templates constructed from the ground truth triplets

    • This approach can outperform strong baselines on Triplet Extraction, Pair Extraction and Aspect Term Extraction with Sentiment Classification by a notable margin.

  41. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks, Xiao Liu,Kaixuan Ji,Yicheng Fu,Weng Lam Tam,Zhengxiao Du,Zhilin Yang,Jie Tang, 14-10-2021

    Categories

    Computation and Language

    Abstract

    Bullet Points

  42. Generated Knowledge Prompting for Commonsense Reasoning, Jiacheng Liu,Alisa Liu,Ximing Lu,Sean Welleck,Peter West,Ronan Le Bras,Yejin Choi,Hannaneh Hajishirzi, 15-10-2021

    Categories

    Computation and Language

    Abstract

  43. Multitask Prompted Training Enables Zero-Shot Task Generalization, Victor Sanh,Albert Webson,Colin Raffel,Stephen H. Bach,Lintang Sutawika,Zaid Alyafeai,Antoine Chaffin,Arnaud Stiegler,Teven Le Scao,Arun Raja,Manan Dey,M Saiful Bari,Canwen Xu,Urmish Thakker,Shanya Sharma Sharma,Eliza Szczechla,Taewoon Kim,Gunjan Chhablani,Nihal Nayak,Debajyoti Datta,Jonathan Chang,Mike Tian-Jian Jiang,Han Wang,Matteo Manica,Sheng Shen,Zheng Xin Yong,Harshit Pandey,Rachel Bawden,Thomas Wang,Trishala Neeraj,Jos Rozen,Abheesht Sharma,Andrea Santilli,Thibault Fevry,Jason Alan Fries,Ryan Teehan,Tali Bers,Stella Biderman,Leo Gao,Thomas Wolf,Alexander M. Rush, 15-10-2021

    Categories

    Machine Learning, Computation and Language

    Abstract

    Bullet Points

  44. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction, Keshav Santhanam,Omar Khattab,Jon Saad-Falcon,Christopher Potts,Matei Zaharia, 02-12-2021

    Categories

    Information Retrieval, Computation and Language

    Abstract

    Neural information retrieval (IR) has greatly advanced search and other knowledge-intensive language tasks. While many neural IR methods encode queries and documents into single-vector representations, late interaction models produce multi-vector representations at the granularity of each token and decompose relevance modeling into scalable token-level computations. This decomposition has been shown to make late interaction more effective, but it inflates the space footprint of these models by an order of magnitude. In this work, we introduce ColBERTv2, a retriever that couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction. We evaluate ColBERTv2 across a wide range of benchmarks, establishing state-of-the-art quality within and outside the training domain while reducing the space footprint of late interaction models by 6–10×.

    Bullet Points

    • ColBERTv2 is a neural retriever that couples an aggressive residual compression mechanism with a denoised supervision strategy to improve the quality of late interaction while reducing its space footprint by 6–10×

    • It is evaluated across a wide range of benchmarks and establishes state-of-the-art quality within and outside the training domain.
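
    The token-level scoring that late interaction relies on can be sketched as follows. This is a minimal pure-Python illustration of the MaxSim operator used by ColBERT-style models (for each query token, take the maximum similarity to any document token, then sum); the abstract itself does not spell out the operator, and the 2-dimensional toy vectors are hypothetical. It does not show ColBERTv2's residual compression or denoised supervision.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best match among document tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical token embeddings: one vector per token.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
doc_b = [[1.0, 0.0], [0.6, 0.8]]              # matches the second token only partly

score_a = late_interaction_score(query_vecs, doc_a)
score_b = late_interaction_score(query_vecs, doc_b)
```

    Here `doc_a` scores 2.0 and `doc_b` scores 1.8, so `doc_a` would rank first. Because every document token keeps its own vector, the index grows with document length, which is the space cost the abstract says ColBERTv2 reduces.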

  45. WebGPT: Browser-assisted question-answering with human feedback, Reiichiro Nakano,Jacob Hilton,Suchir Balaji,Jeff Wu,Long Ouyang,Christina Kim,Christopher Hesse,Shantanu Jain,Vineet Kosaraju,William Saunders,Xu Jiang,Karl Cobbe,Tyna Eloundou,Gretchen Krueger,Kevin Button,Matthew Knight,Benjamin Chess,John Schulman, 17-12-2021

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.

    Bullet Points

    • Fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment

    • Train models using imitation learning and optimize answer quality with human feedback

    • Require models to collect references while browsing in support of their answers, which makes human evaluation of factual accuracy easier

    • Train and evaluate models on ELI5, a dataset of questions asked by Reddit users

    • Best model obtained using behavior cloning and rejection sampling against a reward model trained to predict human preferences

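    • The best model's answers are preferred by humans 56% of the time to those of the human demonstrators, and 69% of the time to the highest-voted answer from Reddit.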
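
    Rejection sampling against a reward model, as mentioned above, can be sketched as best-of-n selection: draw several candidate answers and keep the one the reward model scores highest. The sampler and reward function below are toy stand-ins (hypothetical, not the WebGPT policy or reward model), and the value of `n` is arbitrary.

```python
import itertools


def best_of_n(question, sample_answer, reward_model, n=4):
    """Sample n candidate answers and return the highest-scoring one."""
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_model(question, answer))


# Toy sampler: cycles through canned answers instead of calling a model.
_canned = itertools.cycle([
    "It just does.",
    "Water expands when it freezes [1].",
    "Water expands when it freezes [1], so ice is less dense than liquid water [2].",
])


def toy_sampler(question):
    return next(_canned)


def toy_reward(question, answer):
    # Toy proxy for human preference: count bracketed reference markers.
    return answer.count("[")


best = best_of_n("Why does ice float?", toy_sampler, toy_reward, n=3)
```

    In this toy run the answer with two reference markers is selected. In the setting the abstract describes, the reward model is instead trained to predict human preferences.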