-
A Computational Framework for Behavioral Assessment of LLM Therapists, Yu Ying Chiu,Ashish Sharma,Inna Wanyin Lin,Tim Althoff, 01-01-2024
Categories
Computation and Language, Human-Computer Interaction
Abstract
The emergence of ChatGPT and other large language models (LLMs) has greatly increased interest in utilizing LLMs as therapists to support individuals struggling with mental health challenges. However, due to the lack of systematic studies, our understanding of how LLM therapists behave, i.e., ways in which they respond to clients, is significantly limited. Understanding their behavior across a wide range of clients and situations is crucial to accurately assess their capabilities and limitations in the high-risk setting of mental health, where undesirable behaviors can lead to severe consequences. In this paper, we propose BOLT, a novel computational framework to study the conversational behavior of LLMs when employed as therapists. We develop an in-context learning method to quantitatively measure the behavior of LLMs based on 13 different psychotherapy techniques including reflections, questions, solutions, normalizing, and psychoeducation. Subsequently, we compare the behavior of LLM therapists against that of high- and low-quality human therapy, and study how their behavior can be modulated to better reflect behaviors observed in high-quality therapy. Our analysis of GPT and Llama-variants reveals that these LLMs often resemble behaviors more commonly exhibited in low-quality therapy rather than high-quality therapy, such as offering a higher degree of problem-solving advice when clients share emotions, which is against typical recommendations. At the same time, unlike low-quality therapy, LLMs reflect significantly more upon clients' needs and strengths. Our analysis framework suggests that despite the ability of LLMs to generate anecdotal examples that appear similar to human therapists, LLM therapists are currently not fully consistent with high-quality care, and thus require additional research to ensure quality care.
Bullet Points
-
The paper proposes BOLT, a computational framework to study the conversational behavior of LLMs when employed as therapists, and develops an in-context learning method to quantitatively measure their behavior based on 13 psychotherapy techniques
-
The study compares LLM behavior against that of high- and low-quality human therapy, and explores how their behavior can be modulated to better reflect behaviors observed in high-quality therapy
-
Despite the ability to generate anecdotal examples that appear similar to human therapists, LLM therapists are currently not fully consistent with high-quality care and require additional research to ensure quality care.
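The in-context behavioral coding described in the abstract can be pictured with a short sketch. The prompt wording, the technique subset, and the call_llm stub below are hypothetical illustrations, not the paper's actual BOLT prompts; call_llm stands in for any chat-completion client.

    import json

    # Subset of the 13 techniques named in the abstract (illustrative).
    TECHNIQUES = ["Reflection", "Question", "Solution", "Normalizing", "Psychoeducation"]

    FEW_SHOT = [
        ("It sounds like you've been feeling overwhelmed at work.", ["Reflection"]),
        ("Have you talked to anyone about this before?", ["Question"]),
    ]

    def build_prompt(utterance: str) -> str:
        # Few-shot, in-context classification prompt (illustrative wording).
        lines = ["Label each therapist utterance with the psychotherapy techniques it uses.",
                 f"Allowed labels: {', '.join(TECHNIQUES)}. Answer with a JSON list."]
        for text, labels in FEW_SHOT:
            lines.append(f"Utterance: {text}\nLabels: {json.dumps(labels)}")
        lines.append(f"Utterance: {utterance}\nLabels:")
        return "\n".join(lines)

    def call_llm(prompt: str) -> str:
        # Placeholder for a real LLM API call; returns a canned answer here.
        return '["Solution"]'

    def label_utterance(utterance: str) -> list:
        return json.loads(call_llm(build_prompt(utterance)))

    print(label_utterance("You should try making a to-do list every morning."))

Aggregating such labels over many simulated conversations would yield the behavior distributions that the paper compares against high- and low-quality human therapy.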
-
-
Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models, Guangji Bai,Zheng Chai,Chen Ling,Shiyu Wang,Jiaying Lu,Nan Zhang,Tingwei Shi,Ziyang Yu,Mengdan Zhu,Yifei Zhang,Carl Yang,Yue Cheng,Liang Zhao, 01-01-2024
Categories
Machine Learning
Abstract
The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated models like OpenAI's ChatGPT, represents a significant advancement in artificial intelligence. These models, however, bring forth substantial challenges in the high consumption of computational, memory, energy, and financial resources, especially in environments with limited resource capabilities. This survey aims to systematically address these challenges by reviewing a broad spectrum of techniques designed to enhance the resource efficiency of LLMs. We categorize methods based on their optimization focus (computational, memory, energy, financial, and network resources) and their applicability across various stages of an LLM's lifecycle, including architecture design, pretraining, finetuning, and system design. Additionally, the survey introduces a nuanced categorization of resource efficiency techniques by their specific resource types, which uncovers the intricate relationships and mappings between various resources and corresponding optimization techniques. A standardized set of evaluation metrics and datasets is also presented to facilitate consistent and fair comparisons across different models and techniques. By offering a comprehensive overview of the current state of the art and identifying open research avenues, this survey serves as a foundational reference for researchers and practitioners, aiding them in developing more sustainable and efficient LLMs in a rapidly evolving landscape.
Bullet Points
-
This survey reviews techniques to enhance resource efficiency of LLMs, categorizing them based on their optimization focus and applicability across various stages of an LLM's lifecycle
-
The survey also presents a nuanced categorization of resource efficiency techniques by their specific resource types, providing a foundational reference for researchers and practitioners.
-
-
General-purpose foundation models for increased autonomy in robot-assisted surgery, Samuel Schmidgall,Ji Woong Kim,Alan Kuntz,Ahmed Ezzat Ghazi,Axel Krieger, 01-01-2024
Categories
Robotics, Machine Learning, Quantitative Biology
Abstract
The dominant paradigm for end-to-end robot learning focuses on optimizing task-specific objectives that solve a single robotic problem such as picking up an object or reaching a target position. However, recent work on high-capacity models in robotics has shown promise toward being trained on large collections of diverse and task-agnostic datasets of video demonstrations. These models have shown impressive levels of generalization to unseen circumstances, especially as the amount of data and the model complexity scale. Surgical robot systems that learn from data have struggled to advance as quickly as other fields of robot learning for a few reasons: (1) there is a lack of existing large-scale open-source data to train models, (2) it is challenging to model the soft-body deformations that these robots work with during surgery because simulation cannot match the physical and visual complexity of biological tissue, and (3) surgical robots risk harming patients when tested in clinical trials and require more extensive safety measures. This perspective article aims to provide a path toward increasing robot autonomy in robot-assisted surgery through the development of a multi-modal, multi-task, vision-language-action model for surgical robots. Ultimately, we argue that surgical robots are uniquely positioned to benefit from general-purpose models and provide three guiding actions toward increased autonomy in robot-assisted surgery.
Bullet Points
-
The dominant paradigm for end-to-end robot learning focuses on optimizing task-specific objectives
-
However, recent work on high-capacity models in robotics has shown promise towards being trained on large collections of diverse and task-agnostic datasets of video demonstrations
-
Surgical robot systems that learn from data have struggled to advance as quickly as other fields of robot learning due to lack of large-scale open-source data, difficulty in modeling soft-body deformations, and risk of harming patients when tested in clinical trials
-
The article aims to develop a multi-modal, multi-task, vision-language-action model for surgical robots to increase robot autonomy in robot-assisted surgery.
-
-
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents, Ke Yang,Jiateng Liu,John Wu,Chaoqi Yang,Yi R. Fung,Sha Li,Zixuan Huang,Xu Cao,Xingyao Wang,Yiquan Wang,Heng Ji,Chengxiang Zhai, 01-01-2024
Categories
Computation and Language
Abstract
The prominent large language models (LLMs) of today differ from past language models not only in size, but also in the fact that they are trained on a combination of natural language and formal language (code). As a medium between humans and computers, code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity. In this survey, we present an overview of the various benefits of integrating code into LLMs' training data. Specifically, beyond enhancing LLMs in code generation, we observe that these unique properties of code help (i) unlock the reasoning ability of LLMs, enabling their applications to a range of more complex natural language tasks; (ii) steer LLMs to produce structured and precise intermediate steps, which can then be connected to external execution ends through function calls; and (iii) take advantage of code compilation and execution environment, which also provides diverse feedback for model improvement. In addition, we trace how these profound capabilities of LLMs, brought by code, have led to their emergence as intelligent agents (IAs) in situations where the ability to understand instructions, decompose goals, plan and execute actions, and refine from feedback are crucial to their success on downstream tasks. Finally, we present several key challenges and future directions of empowering LLMs with code.
Bullet Points
-
The survey discusses the benefits of integrating code into LLMs' training data, including unlocking their reasoning ability, steering them to produce structured and precise intermediate steps connected to external execution ends through function calls, and taking advantage of code compilation and execution environments for feedback; it also traces how these capabilities have led to LLMs' emergence as intelligent agents on downstream tasks
-
Key challenges and future directions are presented.
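A minimal sketch of the "structured intermediate steps connected to external execution ends through function calls" idea described in the abstract. The tool registry, JSON call format, and get_time tool are hypothetical, and call_llm stands in for a real model.

    import json

    def get_time(city: str) -> str:
        # Hypothetical tool; a real agent would call an external API here.
        return f"12:00 in {city}"

    TOOLS = {"get_time": get_time}

    def call_llm(prompt: str) -> str:
        # Placeholder: a real LLM would emit the structured call below.
        return json.dumps({"name": "get_time", "args": {"city": "Tokyo"}})

    def agent_step(user_request: str) -> str:
        # 1) Ask the model for a structured function call instead of free-form text.
        call = json.loads(call_llm(f"Request: {user_request}\nRespond with a JSON function call."))
        # 2) Execute the call and return the result as environment feedback.
        return TOOLS[call["name"]](**call["args"])

    print(agent_step("What time is it in Tokyo?"))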
-
-
The Earth is Flat? Unveiling Factual Errors in Large Language Models, Wenxuan Wang,Juluan Shi,Zhaopeng Tu,Youliang Yuan,Jen-tse Huang,Wenxiang Jiao,Michael R. Lyu, 01-01-2024
Categories
Software Engineering, Artificial Intelligence, Computation and Language
Abstract
Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns that they may mislead users in critical areas like healthcare, journalism, and education. Current methods for evaluating LLMs' veracity are limited by test data leakage or the need for extensive human labor, hindering efficient and accurate error detection. To tackle this problem, we introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs. This framework involves three main steps: First, it constructs a factual knowledge graph by retrieving fact triplets from a large-scale knowledge database. Then, leveraging the knowledge graph, FactChecker employs a rule-based approach to generate three types of questions (Yes-No, Multiple-Choice, and WH questions) that involve single-hop and multi-hop relations, along with correct answers. Lastly, it assesses the LLMs' responses for accuracy using tailored matching strategies for each question type. Our extensive tests on six prominent LLMs, including text-davinci-002, text-davinci-003, ChatGPT (gpt-3.5-turbo, gpt-4), Vicuna, and LLaMA-2, reveal that FactChecker can trigger factual errors in up to 45% of questions in these models. Moreover, we demonstrate that FactChecker's test cases can improve LLMs' factual accuracy through in-context learning and fine-tuning (e.g., llama-2-13b-chat's accuracy increasing from 35.3% to 68.5%). We are making all code, data, and results available for future research endeavors.
Bullet Points
-
FactChecker is an automatic testing framework that aims to uncover factual errors in large language models like ChatGPT
-
It uses a rule-based approach to generate three types of questions that involve single-hop and multi-hop relations, along with correct answers, and assesses the LLMs' responses for accuracy using tailored matching strategies for each question type
-
The framework has been tested on six prominent LLMs, including text-davinci-002, text-davinci-003, ChatGPT (gpt-3.5-turbo, gpt-4), Vicuna, and LLaMA-2, and can trigger factual errors in up to 45% of questions in these models
-
We are making all code, data, and results available for future research endeavors.
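The three-step pipeline in the abstract (fact triplets, rule-based question generation, tailored answer matching) can be sketched as below. The triplets, question templates, containment-based matching, and the llm_under_test stub are illustrative stand-ins, not FactChecker's actual implementation.

    TRIPLETS = [("Paris", "capital of", "France"), ("the Nile", "longest river of", "Africa")]

    def yes_no_question(s, r, o):
        return f"Is {s} the {r} {o}?", "Yes"

    def wh_question(s, r, o):
        return f"What is {s} the {r}?", o

    def matches(llm_answer: str, gold: str) -> bool:
        # Tailored matching, reduced here to simple containment of the expected token or entity.
        return gold.lower() in llm_answer.lower()

    def llm_under_test(question: str) -> str:
        # Placeholder for the model being evaluated.
        return "Yes, that is correct." if question.startswith("Is") else "France"

    for s, r, o in TRIPLETS:
        for q, gold in (yes_no_question(s, r, o), wh_question(s, r, o)):
            verdict = "pass" if matches(llm_under_test(q), gold) else "factual error"
            print(f"{q} -> {verdict}")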
-
-
A Comprehensive Study of Knowledge Editing for Large Language Models, Ningyu Zhang,Yunzhi Yao,Bozhong Tian,Peng Wang,Shumin Deng,Mengru Wang,Zekun Xi,Shengyu Mao,Jintian Zhang,Yuansheng Ni,Siyuan Cheng,Ziwen Xu,Xin Xu,Jia-Chen Gu,Yong Jiang,Pengjun Xie,Fei Huang,Lei Liang,Zhiqiang Zhang,Xiaowei Zhu,Jun Zhou,Huajun Chen, 02-01-2024
Categories
Computation and Language, Artificial Intelligence, Computer Vision, Human-Computer Interaction, Machine Learning
Abstract
Large Language Models (LLMs) have shown extraordinary capabilities in understanding and generating text that closely mirrors human communication. However, a primary limitation lies in the significant computational demands during training, arising from their extensive parameterization. This challenge is further intensified by the dynamic nature of the world, necessitating frequent updates to LLMs to correct outdated information or integrate new knowledge, thereby ensuring their continued relevance. Note that many applications demand continual model adjustments post-training to address deficiencies or undesirable behaviors. There is an increasing interest in efficient, lightweight methods for on-the-fly model modifications. To this end, recent years have seen a burgeoning in the techniques of knowledge editing for LLMs, which aim to efficiently modify LLMs' behaviors within specific domains while preserving overall performance across various inputs. In this paper, we first define the knowledge editing problem and then provide a comprehensive review of cutting-edge approaches. Drawing inspiration from educational and cognitive research theories, we propose a unified categorization criterion that classifies knowledge editing methods into three groups: resorting to external knowledge, merging knowledge into the model, and editing intrinsic knowledge. Furthermore, we introduce a new benchmark, KnowEdit, for a comprehensive empirical evaluation of representative knowledge editing approaches. Additionally, we provide an in-depth analysis of knowledge location, which can give a deeper understanding of the knowledge structures inherent within LLMs. Finally, we discuss several potential applications of knowledge editing, outlining its broad and impactful implications.
Bullet Points
-
The paper discusses the limitations of Large Language Models (LLMs) in understanding and generating text that closely mirrors human communication
-
The computational demands during training are significant, and frequent updates are necessary to correct outdated information or integrate new knowledge
-
There is an increasing interest in efficient, lightweight methods for on-the-fly model modifications
-
Recent years have seen a burgeoning in the techniques of knowledge editing for LLMs
-
A unified categorization criterion is proposed that classifies knowledge editing methods into three groups: resorting to external knowledge, merging knowledge into the model, and editing intrinsic knowledge
-
A new benchmark, KnowEdit, is introduced for a comprehensive empirical evaluation of representative knowledge editing approaches
-
Additionally, an in-depth analysis of knowledge location provides a deeper understanding of the knowledge structures inherent within LLMs
-
The paper concludes by discussing several potential applications of knowledge editing, outlining its broad and impactful implications.
-
-
A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models, S.M Towhidul Islam Tonmoy,S M Mehedi Zaman,Vinija Jain,Anku Rani,Vipula Rawte,Aman Chadha,Amitava Das, 02-01-2024
Categories
Computation and Language
Abstract
As Large Language Models (LLMs) continue to advance in their ability to write human-like text, a key challenge remains around their tendency to hallucinate, generating content that appears factual but is ungrounded. This issue of hallucination is arguably the biggest hindrance to safely deploying these powerful LLMs into real-world production systems that impact people's lives. The journey toward widespread adoption of LLMs in practical settings heavily relies on addressing and mitigating hallucinations. Unlike traditional AI systems focused on limited tasks, LLMs have been exposed to vast amounts of online text data during training. While this allows them to display impressive language fluency, it also means they are capable of extrapolating information from the biases in training data, misinterpreting ambiguous prompts, or modifying the information to align superficially with the input. This becomes hugely alarming when we rely on language generation capabilities for sensitive applications, such as summarizing medical records, financial analysis reports, etc. This paper presents a comprehensive survey of over 32 techniques developed to mitigate hallucination in LLMs. Notable among these are Retrieval Augmented Generation (Lewis et al., 2021), Knowledge Retrieval (Varshney et al., 2023), CoNLI (Lei et al., 2023), and CoVe (Dhuliawala et al., 2023). Furthermore, we introduce a detailed taxonomy categorizing these methods based on various parameters, such as dataset utilization, common tasks, feedback mechanisms, and retriever types. This classification helps distinguish the diverse approaches specifically designed to tackle hallucination issues in LLMs. Additionally, we analyze the challenges and limitations inherent in these techniques, providing a solid foundation for future research in addressing hallucinations and related phenomena within the realm of LLMs.
Bullet Points
-
The paper presents a comprehensive survey of over 32 techniques developed to mitigate hallucination in LLMs, including Retrieval Augmented Generation (Lewis et al., 2021), Knowledge Retrieval (Varshney et al., 2023), CoNLI (Lei et al., 2023), and CoVe (Dhuliawala et al., 2023)
-
The paper categorizes these techniques based on dataset utilization, common tasks, feedback mechanisms, and retriever types, and analyzes the challenges and limitations inherent in these techniques for future research.
-
-
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, Hongye Jin,Xiaotian Han,Jingfeng Yang,Zhimeng Jiang,Zirui Liu,Chia-Yuan Chang,Huiyuan Chen,Xia Hu, 02-01-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
This work elicits LLMs' inherent ability to handle long contexts without fine-tuning. The limited length of the training sequence during training may limit the application of Large Language Models (LLMs) on long input sequences for inference. In this work, we argue that existing LLMs themselves have inherent capabilities for handling long contexts. Based on this argument, we suggest extending LLMs' context window by themselves to fully utilize the inherent ability. We propose Self-Extend to stimulate LLMs' long context handling potential. The basic idea is to construct bi-level attention information: the group level and the neighbor level. The two levels are computed by the original model's self-attention, which means the proposed method does not require any training. With only four lines of code modification, the proposed method can effortlessly extend existing LLMs' context window without any fine-tuning. We conduct comprehensive experiments and the results show that the proposed method can effectively extend existing LLMs' context window length.
Bullet Points
-
The work proposes extending LLMs' context window by themselves to fully utilize their inherent ability to handle long contexts without fine-tuning
-
The proposed method, Self-Extend, involves building bi-level attention information on the group level and neighbor level using the original model's self-attention, which does not require any training
-
The experimental results show that the proposed method can effectively extend the context window length of existing LLMs.
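The bi-level (neighbor/group) attention can be pictured as a remapping of relative position indices, sketched below with NumPy. The exact formula used by Self-Extend may differ; the window and group sizes here are illustrative assumptions.

    import numpy as np

    def self_extend_rel_positions(seq_len: int, neighbor_window: int = 4, group_size: int = 4) -> np.ndarray:
        # Relative distance between query position i and key position j.
        q = np.arange(seq_len)[:, None]
        k = np.arange(seq_len)[None, :]
        dist = q - k
        # Neighbor level: nearby tokens keep their exact relative position.
        # Group level: distant tokens share coarser, floor-divided positions,
        # so no position exceeds what the model saw during training.
        grouped = neighbor_window + (dist - neighbor_window) // group_size
        remapped = np.where(dist < neighbor_window, dist, grouped)
        return np.tril(remapped)  # keep the causal (lower-triangular) part

    print(self_extend_rel_positions(10))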
-
-
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models, Zixiang Chen,Yihe Deng,Huizhuo Yuan,Kaixuan Ji,Quanquan Gu, 02-01-2024
Categories
Machine Learning, Artificial Intelligence, Computation and Language, Machine Learning
Abstract
Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
Bullet Points
-
The paper proposes a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model
-
The LLM refines its capability by playing against instances of itself, refining its policy by discerning self-generated responses from those obtained from human-annotated data
-
The global optimum to the training objective function of SPIN is achieved only when the LLM policy aligns with the target data distribution
-
The results demonstrate that SPIN can significantly improve LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data.
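SPIN's self-play objective can be sketched at the level of a single (prompt, human response, self-generated response) triple: the updated model is trained to put a larger log-likelihood margin on the human response than on the response generated by its previous iteration. The logistic loss form and the beta value below are assumptions in the spirit of DPO-style objectives, not the paper's exact formulation.

    import math

    def spin_pair_loss(logp_new_human: float, logp_old_human: float,
                       logp_new_self: float, logp_old_self: float,
                       beta: float = 0.1) -> float:
        # Margin between the human-written response and the self-generated one,
        # each measured as a log-ratio against the previous-iteration model.
        margin = (logp_new_human - logp_old_human) - (logp_new_self - logp_old_self)
        return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # logistic loss on the margin

    # Toy numbers: the new model slightly prefers the human response over its own old output.
    print(spin_pair_loss(-12.0, -13.0, -10.5, -10.0))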
-
-
LLaMA Beyond English: An Empirical Study on Language Capability Transfer, Jun Zhao,Zhihao Zhang,Luhui Gao,Qi Zhang,Tao Gui,Xuanjing Huang, 02-01-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
In recent times, substantial advancements have been witnessed in large language models (LLMs), exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on an English-dominant corpus, which limits their performance in other non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a comprehensive evaluation of the model's response quality is conducted, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality. Furthermore, the experimental outcomes across the thirteen low-resource languages also exhibit similar trends. We anticipate that the conclusions revealed by the experiments will aid the community in developing non-English LLMs.
Bullet Points
-
The paper focuses on how to transfer language generation and following instructions to a non-English language by conducting an empirical investigation based on LLaMA and analyzing the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer
-
Four standardized testing benchmarks are used to assess the model's level of knowledge, and a comprehensive evaluation of its response quality is conducted using LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories
-
Comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality
-
Experimental outcomes across thirteen low-resource languages also exhibit similar trends
-
The conclusions revealed by the experiments will aid the community in developing non-English LLMs.
-
-
Enhancing the medical foundation model with multi-scale and cross-modality feature learning, Weijian Huang,Cheng Li,Hong-Yu Zhou,Jiarun Liu,Hao Yang,Yong Liang,Shanshan Wang, 03-01-2024
Categories
Computer Vision
Abstract
The development of multi-modal medical foundation models has attracted significant attention in the field of medicine and healthcare due to their promising prospects in various clinical applications. One area of focus in this research direction is the extraction of features at different scales. While previous studies have explored feature learning at individual scales, investigation on integrating the diverse scales and modalities of information is lacking, which may hinder the potential for mutual reinforcement among these features. This paper aims to bridge this gap by proposing a method that effectively exploits multi-scale and cross-modality information to enhance the performance of medical foundation models. The proposed method simultaneously exploits features at the local, instance, modality and global aspects, facilitating comprehensive representation learning within the models. We evaluate the effectiveness of the proposed method on six open-source datasets across different clinical tasks, demonstrating its ability to enhance the performance of medical foundation models.
Bullet Points
-
The paper proposes a method that effectively exploits multi-scale and cross-modality information to enhance the performance of medical foundation models
-
The proposed method leverages features at local, instance, modality, and global aspects, facilitating comprehensive representation learning within the models
-
We evaluated its effectiveness on six open-source datasets across different clinical tasks.
-
-
Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review, Luoma Ke (1),Song Tong (1),Peng Cheng (2),Kaiping Peng (1) ((1) Department of Psychology, Tsinghua University, (2) School of Social Science, Tsinghua University), 03-01-2024
Categories
Machine Learning, Artificial Intelligence
Abstract
This paper explores the frontiers of large language models (LLMs) in psychology applications. Psychology has undergone several theoretical changes, and the current use of Artificial Intelligence (AI) and Machine Learning, particularly LLMs, promises to open up new research directions. We provide a detailed exploration of how LLMs like ChatGPT are transforming psychological research and discuss their impact across various branches of psychology, including cognitive and behavioral, clinical and counseling, educational and developmental, and social and cultural psychology, highlighting their potential to simulate aspects of human cognition and behavior. The paper delves into the capabilities of these models to emulate human-like text generation, offering innovative tools for literature review, hypothesis generation, experimental design, experimental subjects, data analysis, academic writing, and peer review in psychology. While LLMs are essential in advancing research methodologies in psychology, the paper also cautions about their technical and ethical challenges. There are issues like data privacy, the ethical implications of using LLMs in psychological research, and the need for a deeper understanding of these models' limitations. Researchers should responsibly use LLMs in psychological studies, adhering to ethical standards and considering the potential consequences of deploying these technologies in sensitive areas. Overall, the article provides a comprehensive overview of the current state of LLMs in psychology, exploring potential benefits and challenges. It serves as a call to action for researchers to leverage LLMs' advantages responsibly while addressing associated risks.
Bullet Points
-
The paper explores the frontiers of large language models (LLMs) in psychology applications, exploring their impact on various branches of psychology, including cognitive and behavioral, clinical and counseling, educational and developmental, and social and cultural psychology
-
LLMs can simulate human cognition and behavior, offering innovative tools for literature review, hypothesis generation, experimental design, experimental subjects, data analysis, academic writing, and peer review in psychology
-
However, the paper cautions about their technical and ethical challenges, including data privacy, ethical implications, and the need for a deeper understanding of these models' limitations
-
Researchers should responsibly use these technologies in psychological studies, adhering to ethical standards and considering the potential consequences of deploying them in sensitive areas.
-
-
Few-shot Adaptation of Multi-modal Foundation Models: A Survey, Fan Liu,Tianshu Zhang,Wenwen Dai,Wenwen Cai,Xiaocong Zhou,Delong Chen, 03-01-2024
Categories
Computer Vision
Abstract
Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These models, with robust and aligned semantic representations learned from billions of internet image-text pairs, can be applied to various downstream tasks in a zero-shot manner. However, in some fine-grained domains like medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, gradually deriving three main technical approaches: 1) prompt-based methods, 2) adapter-based methods, and 3) external knowledge-based methods. Nevertheless, this rapidly developing field has produced numerous results without a comprehensive survey to systematically organize the research progress. Therefore, in this survey, we introduce and analyze the research advancements in few-shot adaptation methods for multi-modal models, summarizing commonly used datasets and experimental setups, and comparing the results of different methods. In addition, due to the lack of reliable theoretical support for existing methods, we derive the few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions from the following aspects: 1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive knowledge utilization.
Bullet Points
-
The survey organizes few-shot adaptation methods for multi-modal models into three main technical approaches: prompt-based, adapter-based, and external knowledge-based methods
-
Due to the limited theoretical support for existing methods, it also derives a few-shot adaptation generalization error bound constrained by domain gap, model capacity, and sample size, and proposes adaptive domain generalization, adaptive model selection, and adaptive knowledge utilization as possible solutions.
-
-
Large Language Models Relearn Removed Concepts, Michelle Lo,Shay B. Cohen,Fazl Barez, 03-01-2024
Categories
Artificial Intelligence
Abstract
Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. Our findings reveal that models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons. While neuron pruning provides interpretability into model concepts, our results highlight the challenges of permanent concept removal for improved model \textit{safety}. Monitoring concept reemergence and developing techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing. Overall, our work strongly demonstrates the resilience and fluidity of concept representations in LLMs post concept removal.
Bullet Points
-
Neuron pruning can remove undesirable concepts from large language models, but it is unclear whether models have the capacity to reacquire pruned concepts after editing
-
To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining
-
Models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics
-
This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons
-
While neuron pruning provides interpretability into model concepts, our results highlight the challenges of permanent concept removal for improved model safety
-
Monitoring concept reemergence and developing techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing
-
Overall, our work highlights the resilience and fluidity of concept representations in LLMs post concept removal.
-
-
Correctness Comparison of ChatGPT-4, Bard, Claude-2, and Copilot for Spatial Tasks, Hartwig H. Hochmair,Levente Juhasz,Takoda Kemp, 04-01-2024
Categories
Computers and Society
Abstract
Generative AI, including large language models (LLMs), has recently gained significant interest in the geo-science community through its versatile task-solving capabilities, including coding, spatial computations, generation of sample data, time-series forecasting, toponym recognition, and image classification. So far, the assessment of LLMs for spatial tasks has primarily focused on ChatGPT, arguably the most prominent AI chatbot, whereas other chatbots received less attention. To narrow this research gap, this study evaluates the correctness of responses for a set of 54 spatial tasks assigned to four prominent chatbots, i.e., ChatGPT-4, Bard, Claude-2, and Copilot. Overall, the chatbots performed well on spatial literacy, GIS theory, and interpretation of programming code and given functions, but revealed weaknesses in mapping, code generation, and code translation. ChatGPT-4 outperformed other chatbots across most task categories.
Bullet Points
-
Generative AI, including LLMs, has gained interest in the geo-science community due to its versatile task-solving capabilities, including coding, spatial computations, generation of sample data, time-series forecasting, toponym recognition, or image classification
-
The study evaluated the correctness of responses for 54 spatial tasks assigned to four prominent chatbots: ChatGPT-4, Bard, Claude-2, and Copilot
-
The chatbots performed well on spatial literacy, GIS theory, and interpretation of programming code and given functions, but revealed weaknesses in mapping, code generation, and code translation, with ChatGPT-4 outperforming the others across most task categories.
-
-
LLM Augmented LLMs: Expanding Capabilities through Composition, Rachit Bansal,Bidisha Samanta,Siddharth Dalmia,Nitish Gupta,Shikhar Vashishth,Sriram Ganapathy,Abhishek Bapna,Prateek Jain,Partha Talukdar, 04-01-2024
Categories
Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision
Abstract
Foundational models with billions of parameters which have been trained on large corpora of data have demonstrated non-trivial skills in a variety of domains. However, due to their monolithic structure, it is challenging and expensive to augment them or impart new skills. On the other hand, due to their adaptation abilities, several new instances of these models are being trained towards new domains and tasks. In this work, we study the problem of efficient and practical composition of existing foundation models with more specific models to enable newer capabilities. To this end, we propose CALM -- Composition to Augment Language Models -- which introduces cross-attention between models to compose their representations and enable new capabilities. Salient features of CALM are: (i) Scales up LLMs on new tasks by 're-using' existing LLMs along with a few additional parameters and data, (ii) Existing model weights are kept intact, and hence preserves existing capabilities, and (iii) Applies to diverse domains and settings. We illustrate that augmenting PaLM2-S with a smaller model trained on low-resource languages results in an absolute improvement of up to 13% on tasks like translation into English and arithmetic reasoning for low-resource languages. Similarly, when PaLM2-S is augmented with a code-specific model, we see a relative improvement of 40% over the base model for code generation and explanation tasks -- on-par with fully fine-tuned counterparts.
Bullet Points
-
CALM introduces cross-attention between existing foundation models to compose their representations and enable new capabilities
-
CALM allows for scaling up LLMs on new tasks by 're-using' existing models along with a few additional parameters and data, and preserves existing capabilities
-
It applies to diverse domains and settings
-
Augmenting PaLM2-S with a smaller model trained on low-resource languages yields an absolute improvement of up to 13% on tasks like translation into English and arithmetic reasoning, while augmenting it with a code-specific model yields a relative improvement of 40% over the base model for code generation and explanation tasks.
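The cross-attention composition can be sketched with NumPy: frozen anchor-model hidden states attend over frozen augmenting-model hidden states through a small set of new, trainable projections, and the result is added residually. Dimensions, layer choice, and initialization below are illustrative assumptions, not CALM's exact design.

    import numpy as np

    rng = np.random.default_rng(0)
    d_anchor, d_aug, n, m = 8, 6, 5, 7           # toy hidden sizes and sequence lengths
    H_anchor = rng.normal(size=(n, d_anchor))    # frozen anchor-model layer output
    H_aug = rng.normal(size=(m, d_aug))          # frozen augmenting-model layer output

    # Only these projections (and nothing in either base model) would be trained.
    W_q = rng.normal(size=(d_anchor, d_anchor)) * 0.1
    W_k = rng.normal(size=(d_aug, d_anchor)) * 0.1
    W_v = rng.normal(size=(d_aug, d_anchor)) * 0.1

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    scores = (H_anchor @ W_q) @ (H_aug @ W_k).T / np.sqrt(d_anchor)
    composed = H_anchor + softmax(scores) @ (H_aug @ W_v)   # residual add keeps the anchor's behavior recoverable
    print(composed.shape)  # (5, 8)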
-
-
LLaMA Pro: Progressive LLaMA with Block Expansion, Chengyue Wu,Yukang Gan,Yixiao Ge,Zeyu Lu,Jiahao Wang,Ye Feng,Ping Luo,Ying Shan, 04-01-2024
Categories
Computation and Language
Abstract
Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forgetting. In this paper, we experiment on the corpus of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B, excelling in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLaMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.
Bullet Points
-
The paper proposes a new post-pretraining method for LLMs with an expansion of Transformer blocks
-
The expanded blocks are tuned using only new corpus, improving the model's knowledge without catastrophic forgetting
-
LLaMA Pro-8.3B is a versatile foundation model that excels in general tasks, programming, and mathematics
-
It achieves advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLaMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent
-
The findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.
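Block expansion can be sketched as interleaving copies of existing Transformer blocks that start as identity maps and training only the copies. The toy Block below and the choice to zero its output projection are an illustrative reading of the abstract, not the actual LLaMA Pro code.

    import copy
    import torch
    import torch.nn as nn

    class Block(nn.Module):
        # Toy stand-in for a Transformer block: a residual feedforward sub-layer.
        def __init__(self, d=16):
            super().__init__()
            self.ff = nn.Linear(d, d)
            self.out = nn.Linear(d, d)
        def forward(self, x):
            return x + self.out(torch.relu(self.ff(x)))

    def expand(blocks, every=2):
        expanded = nn.ModuleList()
        for i, blk in enumerate(blocks):
            for p in blk.parameters():
                p.requires_grad = False              # freeze every original block
            expanded.append(blk)
            if (i + 1) % every == 0:
                new_blk = copy.deepcopy(blk)
                for p in new_blk.parameters():
                    p.requires_grad = True           # only the inserted copies are trained on the new corpus
                nn.init.zeros_(new_blk.out.weight)   # zero output projection => identity at initialization,
                nn.init.zeros_(new_blk.out.bias)     # so expansion does not change the model's outputs
                expanded.append(new_blk)
        return expanded

    model = expand([Block() for _ in range(4)])
    y = torch.randn(2, 16)
    for blk in model:
        y = blk(y)
    print(len(model), y.shape)   # 6 blocks total, output shape unchanged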
-
-
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model, Yichen Zhu,Minjie Zhu,Ning Liu,Zhicai Ou,Xiaofeng Mou,Jian Tang, 04-01-2024
Categories
Computer Vision, Computation and Language
Abstract
None
-
TinyLlama: An Open-Source Small Language Model, Peiyuan Zhang,Guangtao Zeng,Tianduo Wang,Wei Lu, 04-01-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
None
-
Understanding LLMs: A Comprehensive Overview from Training to Inference, Yiheng Liu,Hao He,Tianle Han,Xu Zhang,Mengyuan Liu,Jiaming Tian,Yutong Zhang,Jiaqi Wang,Xiaohui Gao,Tianyang Zhong,Yi Pan,Shaochen Xu,Zihao Wu,Zhengliang Liu,Xin Zhang,Shu Zhang,Xintao Hu,Tuo Zhang,Ning Qiang,Tianming Liu,Bao Ge, 04-01-2024
Categories
Computation and Language
Abstract
The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. There's an increasing focus on cost-efficient training and deployment within this context. Low-cost training and deployment of LLMs represent the future development trend. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion on training includes various aspects, including data preprocessing, training architecture, pre-training tasks, parallel training, and relevant content related to model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization. It also explores LLMs' utilization and provides insights into their future development.
Bullet Points
-
The introduction of ChatGPT has led to an increase in the utilization of Large Language Models (LLMs) for downstream tasks, with a focus on cost-efficient training and deployment
-
Low-cost training and deployment of LLMs represent the future development trend
-
The paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend, including data preprocessing, training architecture, pre-training tasks, parallel training, relevant content related to model fine-tuning, model compression, parallel computation, memory scheduling, and structural optimization
-
It also explores LLMs' utilization and provides insights into their future development.
-
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, DeepSeek-AI,:,Xiao Bi,Deli Chen,Guanting Chen,Shanhuang Chen,Damai Dai,Chengqi Deng,Honghui Ding,Kai Dong,Qiushi Du,Zhe Fu,Huazuo Gao,Kaige Gao,Wenjun Gao,Ruiqi Ge,Kang Guan,Daya Guo,Jianzhong Guo,Guangbo Hao,Zhewen Hao,Ying He,Wenjie Hu,Panpan Huang,Erhang Li,Guowei Li,Jiashi Li,Yao Li,Y.K. Li,Wenfeng Liang,Fangyun Lin,A.X. Liu,Bo Liu,Wen Liu,Xiaodong Liu,Xin Liu,Yiyuan Liu,Haoyu Lu,Shanghao Lu,Fuli Luo,Shirong Ma,Xiaotao Nie,Tian Pei,Yishi Piao,Junjie Qiu,Hui Qu,Tongzheng Ren,Zehui Ren,Chong Ruan,Zhangli Sha,Zhihong Shao,Junxiao Song,Xuecheng Su,Jingxiang Sun,Yaofeng Sun,Minghui Tang,Bingxuan Wang,Peiyi Wang,Shiyu Wang,Yaohui Wang,Yongji Wang,Tong Wu,Y. Wu,Xin Xie,Zhenda Xie,Ziwei Xie,Yiliang Xiong,Hanwei Xu,R.X. Xu,Yanhong Xu,Dejian Yang,Yuxiang You,Shuiping Yu,Xingkai Yu,B. Zhang,Haowei Zhang,Lecong Zhang,Liyue Zhang,Mingchuan Zhang,Minghua Zhang,Wentao Zhang,Yichao Zhang,Chenggang Zhao,Yao Zhao,Shangyan Zhou,Shunfeng Zhou,Qihao Zhu,Yuheng Zou, 05-01-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
Bullet Points
-
The article discusses the study of scaling laws and DeepSeek LLM, a project focused on advancing open-source language models with a long-term perspective
-
We present findings that facilitate scaling of large-scale models in two commonly used open-source configurations, 7B and 67B
-
We introduce DeepSeek LLM, pre-train it on a dataset of 2 trillion tokens, and conduct supervised fine-tuning and Direct Preference Optimization on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models
-
The evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning
-
Additionally, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
-
-
From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models, Na Liu,Liangyu Chen,Xiaoyu Tian,Wei Zou,Kaijiang Chen,Ming Cui, 05-01-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
This paper introduces RAISE (Reasoning and Acting through Scratchpad and Examples), an advanced architecture enhancing the integration of Large Language Models (LLMs) like GPT-4 into conversational agents. RAISE, an enhancement of the ReAct framework, incorporates a dual-component memory system, mirroring human short-term and long-term memory, to maintain context and continuity in conversations. It entails a comprehensive agent construction scenario, including phases like Conversation Selection, Scene Extraction, CoT Completion, and Scene Augmentation, leading to the LLMs Training phase. This approach appears to enhance agent controllability and adaptability in complex, multi-turn dialogues. Our preliminary evaluations in a real estate sales context suggest that RAISE has some advantages over traditional agents, indicating its potential for broader applications. This work contributes to the AI field by providing a robust framework for developing more context-aware and versatile conversational agents.
Bullet Points
-
The paper introduces RAISE, an advanced architecture that enhances the integration of LLMs like GPT-4 into conversational agents
-
It entails a dual-component memory system that mirrors human short-term and long-term memory to maintain context and continuity in conversations
-
The approach enhances agent controllability and adaptability in complex, multi-turn dialogues
-
Preliminary evaluations suggest it has some advantages over traditional agents, indicating its potential for broader applications
-
This work contributes to the AI field by providing a robust framework for developing more context-aware and versatile conversational agents.
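A heavily simplified sketch of a dual-memory conversational agent in the spirit described above: a short-term scratchpad accumulates the current dialogue, while a long-term example store is retrieved by crude keyword overlap to prime the prompt. The retrieval scheme, prompt layout, and real-estate example are assumptions for illustration, not RAISE's actual design.

    class DualMemoryAgent:
        def __init__(self, examples):
            self.long_term = examples      # (situation, good response) pairs gathered offline
            self.scratchpad = []           # short-term memory: running dialogue and notes

        def retrieve(self, query, k=1):
            # Naive long-term retrieval by word overlap; a real system would use embeddings.
            scored = sorted(self.long_term,
                            key=lambda ex: -len(set(query.lower().split()) & set(ex[0].lower().split())))
            return scored[:k]

        def build_prompt(self, user_msg):
            examples = "\n".join(f"Situation: {s}\nResponse: {r}" for s, r in self.retrieve(user_msg))
            history = "\n".join(self.scratchpad)
            return f"{examples}\n\nConversation so far:\n{history}\nUser: {user_msg}\nAgent:"

        def step(self, user_msg, llm=lambda p: "The listed price is available; would you like a viewing?"):
            reply = llm(self.build_prompt(user_msg))   # llm is a placeholder for a fine-tuned model call
            self.scratchpad += [f"User: {user_msg}", f"Agent: {reply}"]
            return reply

    agent = DualMemoryAgent([("buyer asks about apartment price", "Quote the listed price and offer a viewing.")])
    print(agent.step("How much is the two-bedroom apartment?"))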
-
-
Thousands of AI Authors on the Future of AI, Katja Grace,Harlan Stewart,Julia Fabienne Sandkühler,Stephen Thomas,Ben Weinstein-Raun,Jan Brauner, 05-01-2024
Categories
Computers and Society, Artificial Intelligence, Machine Learning
Abstract
Most respondents expressed substantial uncertainty about the long-term value of AI progress: While 68.3% thought good outcomes from superhuman AI are more likely than bad, of these net optimists 48% gave at least a 5% chance of extremely bad outcomes such as human extinction, and 59% of net pessimists gave 5% or more to extremely good outcomes. Between 38% and 51% of respondents gave at least a 10% chance to advanced AI leading to outcomes as bad as human extinction. More than half suggested that "substantial" or "extreme" concern is warranted about six different AI-related scenarios, including misinformation, authoritarian control, and inequality. There was disagreement about whether faster or slower AI progress would be better for the future of humanity. However, there was broad agreement that research aimed at minimizing potential risks from AI systems ought to be prioritized more.
Bullet Points
-
Most respondents expressed uncertainty about the long-term value of AI progress, with 68.3% believing good outcomes from superhuman AI are more likely than bad
-
Of the net optimists, 48% gave at least a 5% chance of extremely bad outcomes such as human extinction, while 59% of net pessimists gave 5% or more to extremely good outcomes
-
Between 38% and 51% of respondents gave at least a 10% chance to advanced AI leading to outcomes as bad as human extinction
-
More than half suggested "substantial" or "extreme" concern is warranted about six different AI-related scenarios, including misinformation, authoritarian control, and inequality
-
There was disagreement about whether faster or slower AI progress would be better for the future of humanity, but there was broad agreement that research aimed at minimizing potential risks from AI systems should be prioritized more.
-
-
Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon, Peitian Zhang,Zheng Liu,Shitao Xiao,Ninglu Shao,Qiwei Ye,Zhicheng Dou, 07-01-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
The utilization of long contexts poses a big challenge for large language models due to their limited context window length. Although the context window can be extended through fine-tuning, it will result in a considerable cost at both training and inference time, and exert an unfavorable impact on the LLM's original capabilities. In this work, we propose Activation Beacon, which condenses LLM's raw activations into more compact forms such that it can perceive a much longer context with a limited context window. Activation Beacon is introduced as a plug-and-play module for the LLM. It fully preserves the LLM's original capability on short contexts while extending the new capability on processing longer contexts. Besides, it works with short sliding windows to process the long context, which achieves a competitive memory and time efficiency in both training and inference. Activation Beacon is learned by the auto-regression task conditioned on a mixture of beacons with diversified condensing ratios. Thanks to such a treatment, it can be efficiently trained purely with short-sequence data in just 10K steps, which consumes less than 9 hours on a single 8xA800 GPU machine. The experimental studies show that Activation Beacon is able to extend Llama-2-7B's context length by 100 times (from 4K to 400K), meanwhile achieving a superior result on both long-context generation and understanding tasks. Our model and code will be available at the BGE repository.
Bullet Points
-
Activation Beacon is a plug-and-play module for LLM that preserves the LLM's original capability on short contexts while extending the new capability on processing longer contexts
-
It works with short sliding windows to process the long context, which achieves a competitive memory and time efficiency in both training and inference
-
It can be efficiently trained purely with short-sequence data in just 10K steps, which consumes less than 9 hours on a single 8xA800 GPU machine
-
The model and code will be available at the BGE repository.
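The condensation idea can be caricatured with plain NumPy: the long input is processed in chunks through a short sliding window, and each chunk's activations are condensed into a handful of "beacon" states that are the only thing carried forward. The mean-pooling condenser below is purely illustrative; the actual module learns the condensation, and the window and beacon counts are assumptions.

    import numpy as np

    def condense(chunk_states: np.ndarray, n_beacons: int) -> np.ndarray:
        # Illustrative condenser: average groups of token states into beacon states.
        groups = np.array_split(chunk_states, n_beacons, axis=0)
        return np.stack([g.mean(axis=0) for g in groups])

    def process_long_context(token_states: np.ndarray, window: int = 64, n_beacons: int = 4) -> np.ndarray:
        cache = []  # only condensed beacon states are kept, so memory stays bounded
        for start in range(0, len(token_states), window):
            chunk = token_states[start:start + window]
            # A real model would attend over the cached beacons plus this chunk before condensing.
            cache.append(condense(chunk, n_beacons))
        return np.concatenate(cache)

    states = np.random.default_rng(0).normal(size=(1024, 32))   # 1024 tokens, hidden size 32
    print(process_long_context(states).shape)                    # (64, 32): 16 chunks x 4 beacons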
-
-
Mixtral of Experts, Albert Q. Jiang,Alexandre Sablayrolles,Antoine Roux,Arthur Mensch,Blanche Savary,Chris Bamford,Devendra Singh Chaplot,Diego de las Casas,Emma Bou Hanna,Florian Bressand,Gianna Lengyel,Guillaume Bour,Guillaume Lample,Lélio Renard Lavaud,Lucile Saulnier,Marie-Anne Lachaux,Pierre Stock,Sandeep Subramanian,Sophia Yang,Szymon Antoniak,Teven Le Scao,Théophile Gervet,Thibaut Lavril,Thomas Wang,Timothée Lacroix,William El Sayed, 08-01-2024
Categories
Machine Learning, Computation and Language
Abstract
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
Bullet Points
-
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model with 8 feedforward blocks (experts) at each layer
-
Each token has access to 47B parameters, but only 13B active parameters during inference
-
Mixtral was trained with a context size of 32k tokens and outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks
-
An instruction-tuned variant, Mixtral 8x7B - Instruct, surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat on human benchmarks, and both the base and instruct models are released under the Apache 2.0 license.
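The per-token top-2 routing can be sketched in a few lines of NumPy. The experts are reduced to random linear maps and the router to a single matrix, so the numbers are purely illustrative, but the select-two-and-combine logic follows the description above.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_experts, top_k = 16, 8, 2
    experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]   # toy feedforward "experts"
    router = rng.normal(size=(d, n_experts)) * 0.1

    def moe_layer(x: np.ndarray) -> np.ndarray:
        out = np.zeros_like(x)
        for t, tok in enumerate(x):                       # route every token independently
            logits = tok @ router
            top = np.argsort(logits)[-top_k:]             # pick the two highest-scoring experts
            weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
            out[t] = sum(w * (tok @ experts[i]) for w, i in zip(weights, top))
        return out

    tokens = rng.normal(size=(5, d))
    print(moe_layer(tokens).shape)   # (5, 16): each token used only 2 of the 8 experts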
-
-
Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding, Zilong Wang,Hao Zhang,Chun-Liang Li,Julian Martin Eisenschlos,Vincent Perot,Zifeng Wang,Lesly Miculicich,Yasuhisa Fujii,Jingbo Shang,Chen-Yu Lee,Tomas Pfister, 09-01-2024
Categories
Computation and Language
Abstract
Table-based reasoning with large language models (LLMs) is a promising direction to tackle many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and its similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.
Bullet Points
-
Table-based reasoning with large language models (LLMs) is a promising direction to tackle table-based question answering and fact verification tasks
-
It requires extracting underlying semantics from free-form questions and semi-structured tabular data
-
Chain-of-Thought and similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain
-
We propose a framework that uses a table as a proxy for intermediate thoughts, using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain
-
LLMs can dynamically plan the next operation based on the results of the previous ones
-
This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem
-
The chain carries structured information of the intermediate results, enabling more accurate and reliable predictions
-
It achieves new state-of-the-art performance on the WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.
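The evolve-the-table loop can be sketched with pandas: at each step the model (stubbed here with a canned plan) picks one atomic operation, the operation is applied, and the updated table goes back into the next prompt. The operation set, prompt, and stop signal are illustrative, not the paper's exact ones.

    import pandas as pd

    table = pd.DataFrame({"player": ["A", "B", "C"], "team": ["X", "Y", "X"], "points": [12, 30, 25]})
    question = "Which player on team X scored the most points?"

    OPS = {
        "select_rows":   lambda t, col, val: t[t[col] == val],
        "sort_desc":     lambda t, col: t.sort_values(col, ascending=False),
        "select_column": lambda t, col: t[[col]],
    }

    def llm_plan_next_op(table_text: str, question: str, history: list):
        # Placeholder for an in-context LLM call that would return the next operation.
        canned = [("select_rows", ("team", "X")), ("sort_desc", ("points",)),
                  ("select_column", ("player",)), None]
        return canned[len(history)]

    history = []
    while True:
        step = llm_plan_next_op(table.to_string(index=False), question, history)
        if step is None:                      # the model signals that the chain is complete
            break
        name, args = step
        table = OPS[name](table, *args)       # the evolving table carries the intermediate result
        history.append(step)

    print(table.iloc[0, 0])                   # -> "C"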
-
-
AUTOACT: Automatic Agent Learning from Scratch via Self-Planning, Shuofei Qiao,Ningyu Zhang,Runnan Fang,Yujie Luo,Wangchunshu Zhou,Yuchen Eleanor Jiang,Chengfei Lv,Huajun Chen, 10-01-2024
Categories
Computation and Language, Artificial Intelligence, Human-Computer Interaction, Machine Learning, Multiagent Systems
Abstract
None
-
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Evan Hubinger,Carson Denison,Jesse Mu,Mike Lambert,Meg Tong,Monte MacDiarmid,Tamera Lanham,Daniel M. Ziegler,Tim Maxwell,Newton Cheng,Adam Jermyn,Amanda Askell,Ansh Radhakrishnan,Cem Anil,David Duvenaud,Deep Ganguli,Fazl Barez,Jack Clark,Kamal Ndousse,Kshitij Sachan,Michael Sellitto,Mrinank Sharma,Nova DasSarma,Roger Grosse,Shauna Kravec,Yuntao Bai,Zachary Witten,Marina Favaro,Jan Brauner,Holden Karnofsky,Paul Christiano,Samuel R. Bowman,Logan Graham,Jared Kaplan,Sören Mindermann,Ryan Greenblatt,Buck Shlegeris,Nicholas Schiefer,Ethan Perez, 10-01-2024
Categories
Cryptography and Security, Artificial Intelligence, Computation and Language, Machine Learning, Software Engineering
Abstract
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
Bullet Points
-
If an AI system learned a deceptive strategy, current state-of-the-art safety training techniques could fail to detect and remove it, creating a false impression of safety
-
Proof-of-concept examples of deceptive behavior in large language models (LLMs) are constructed, e.g., models that write secure code when the prompt states the year is 2023 but insert exploitable code when the stated year is 2024
-
The backdoor behavior can be made persistent, so it is not removed by standard safety training techniques such as supervised fine-tuning, reinforcement learning, and adversarial training
-
The persistence of the behavior remains even when the chain of thought is distilled away
-
Rather than removing backdoors, adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.
-
-
The Impact of Reasoning Step Length on Large Language Models, Mingyu Jin,Qinkai Yu,Dong shu,Haiyan Zhao,Wenyue Hua,Yanda Meng,Yongfeng Zhang,Mengnan Du, 10-01-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Chain of Thought (CoT) is significant in improving the reasoning abilities of large language models (LLMs). However, the correlation between the effectiveness of CoT and the length of reasoning steps in prompts remains largely unknown. To shed light on this, we have conducted several empirical experiments to explore the relations. Specifically, we design experiments that expand and compress the rationale reasoning steps within CoT demonstrations, while keeping all other factors constant. We have the following key findings. First, the results indicate that lengthening the reasoning steps in prompts, even without adding new information into the prompt, considerably enhances LLMs' reasoning abilities across multiple datasets. Alternatively, shortening the reasoning steps, even while preserving the key information, significantly diminishes the reasoning abilities of models. This finding highlights the importance of the number of steps in CoT prompts and provides practical guidance to make better use of LLMs' potential in complex problem-solving scenarios. Second, we also investigated the relationship between the performance of CoT and the rationales used in demonstrations. Surprisingly, the result shows that even incorrect rationales can yield favorable outcomes if they maintain the requisite length of inference. Third, we observed that the advantages of increasing reasoning steps are task-dependent: simpler tasks require fewer steps, whereas complex tasks gain significantly from longer inference sequences.
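The following toy sketch illustrates the kind of expand/compress manipulation of CoT demonstrations the abstract describes; the specific expansion rule (restating each step) is an illustrative assumption, not the paper's exact protocol.

```python
# Expand or compress the rationale of a demonstration while holding the question
# and answer fixed, then place the manipulated demonstration into the prompt.

def expand_steps(steps, factor=2):
    """Lengthen the rationale without adding new information, e.g. by restating each step."""
    expanded = []
    for s in steps:
        expanded.append(s)
        expanded.extend([f"In other words, {s.lower()}"] * (factor - 1))
    return expanded

def compress_steps(steps, keep=2):
    """Shorten the rationale while keeping the key (first and last) steps."""
    return steps[:1] + steps[-(keep - 1):] if len(steps) > keep else steps

demo_steps = [
    "There are 3 cars and each car has 4 wheels.",
    "3 times 4 equals 12.",
    "So there are 12 wheels in total.",
]
question = "Q: 3 cars, 4 wheels each. How many wheels?\n"
long_prompt = question + "\n".join(expand_steps(demo_steps)) + "\nA: 12"
short_prompt = question + "\n".join(compress_steps(demo_steps)) + "\nA: 12"
print(len(long_prompt), len(short_prompt))   # longer rationale vs. compressed rationale
```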
Bullet Points
-
Chain of Thought (CoT) is significant in improving LLMs' reasoning abilities, but the correlation between its effectiveness and the length of reasoning steps in prompts remains largely unknown
-
We conducted experiments that expand and compress the rationale reasoning steps within CoT demonstrations while keeping all other factors constant
-
The results indicate that lengthening the reasoning steps, even without adding new information, enhances LLMs' reasoning abilities across multiple datasets, while shortening them, even while preserving key information, significantly diminishes models' reasoning abilities
-
Surprisingly, even incorrect rationales can yield favorable outcomes if they maintain the requisite length of inference, highlighting the importance of the number of steps in CoT prompts
-
The advantages of increasing reasoning steps are task-dependent, with simpler tasks requiring fewer steps whereas complex tasks gain significantly from longer inference sequences.
-
-
TrustLLM: Trustworthiness in Large Language Models, Lichao Sun,Yue Huang,Haoran Wang,Siyuan Wu,Qihui Zhang,Chujie Gao,Yixin Huang,Wenhan Lyu,Yixuan Zhang,Xiner Li,Zhengliang Liu,Yixin Liu,Yijue Wang,Zhikun Zhang,Bhavya Kailkhura,Caiming Xiong,Chaowei Xiao,Chunyuan Li,Eric Xing,Furong Huang,Hao Liu,Heng Ji,Hongyi Wang,Huan Zhang,Huaxiu Yao,Manolis Kellis,Marinka Zitnik,Meng Jiang,Mohit Bansal,James Zou,Jian Pei,Jian Liu,Jianfeng Gao,Jiawei Han,Jieyu Zhao,Jiliang Tang,Jindong Wang,John Mitchell,Kai Shu,Kaidi Xu,Kai-Wei Chang,Lifang He,Lifu Huang,Michael Backes,Neil Zhenqiang Gong,Philip S. Yu,Pin-Yu Chen,Quanquan Gu,Ran Xu,Rex Ying,Shuiwang Ji,Suman Jana,Tianlong Chen,Tianming Liu,Tianyi Zhou,Willian Wang,Xiang Li,Xiangliang Zhang,Xiao Wang,Xing Xie,Xun Chen,Xuyu Wang,Yan Liu,Yanfang Ye,Yinzhi Cao,Yong Chen,Yue Zhao, 10-01-2024
Categories
Computation and Language
Abstract
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.
Bullet Points
-
The paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, which includes principles and benchmarks for different dimensions
-
It also discusses open challenges and future directions, and emphasizes the importance of transparency not only in the models themselves but also in the technologies that underpin their trustworthiness.
-
-
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, Yuanchun Li,Hao Wen,Weijun Wang,Xiangyu Li,Yizhen Yuan,Guohong Liu,Jiacheng Liu,Wenxing Xu,Xiang Wang,Yi Sun,Rui Kong,Yile Wang,Hanfei Geng,Jian Luan,Xuefeng Jin,Zilong Ye,Guanjing Xiong,Fan Zhang,Xiang Li,Mengwei Xu,Zhijun Li,Peng Li,Yang Liu,Ya-Qin Zhang,Yunxin Liu, 10-01-2024
Categories
Human-Computer Interaction, Artificial Intelligence, Software Engineering
Abstract
Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users efficiently obtain information and execute tasks, and provide users with more intelligent, convenient, and rich interaction experiences. With the development of smartphones and IoT, computing and sensing devices have become ubiquitous, greatly expanding the boundaries of IPAs. However, due to the lack of capabilities such as user intent understanding, task planning, tool using, and personal data management etc., existing IPAs still have limited practicality and scalability. Recently, the emergence of foundation models, represented by large language models (LLMs), brings new opportunities for the development of IPAs. With the powerful semantic understanding and reasoning capabilities, LLM can enable intelligent agents to solve complex problems autonomously. In this paper, we focus on Personal LLM Agents, which are LLM-based agents that are deeply integrated with personal data and personal devices and used for personal assistance. We envision that Personal LLM Agents will become a major software paradigm for end-users in the upcoming era. To realize this vision, we take the first step to discuss several important questions about Personal LLM Agents, including their architecture, capability, efficiency and security. We start by summarizing the key components and design choices in the architecture of Personal LLM Agents, followed by an in-depth analysis of the opinions collected from domain experts. Next, we discuss several key challenges to achieve intelligent, efficient and secure Personal LLM Agents, followed by a comprehensive survey of representative solutions to address these challenges.
Bullet Points
-
The paper focuses on Personal LLM Agents, LLM-based intelligent personal assistants (IPAs) that are deeply integrated with personal data and personal devices and used for personal assistance
-
The paper discusses the architecture, capability, efficiency, and security of Personal LLM Agents, and envisions that they will become a major software paradigm for end-users in the future.
-
-
Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems, Tianyu Cui,Yanling Wang,Chuanpu Fu,Yong Xiao,Sijia Li,Xinhao Deng,Yunpeng Liu,Qinglin Zhang,Ziyi Qiu,Peiyang Li,Zhixing Tan,Junwu Xiong,Xinyu Kong,Zujie Wen,Ke Xu,Qi Li, 11-01-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Large language models (LLMs) have strong capabilities in solving diverse natural language processing tasks. However, the safety and security issues of LLM systems have become the major obstacle to their widespread application. Many studies have extensively investigated risks in LLM systems and developed the corresponding mitigation strategies. Leading-edge enterprises such as OpenAI, Google, Meta, and Anthropic have also made lots of efforts on responsible LLMs. Therefore, there is a growing need to organize the existing studies and establish comprehensive taxonomies for the community. In this paper, we delve into four essential modules of an LLM system, including an input module for receiving prompts, a language model trained on extensive corpora, a toolchain module for development and deployment, and an output module for exporting LLM-generated content. Based on this, we propose a comprehensive taxonomy, which systematically analyzes potential risks associated with each module of an LLM system and discusses the corresponding mitigation strategies. Furthermore, we review prevalent benchmarks, aiming to facilitate the risk assessment of LLM systems. We hope that this paper can help LLM participants embrace a systematic perspective to build their responsible LLM systems.
Bullet Points
-
The paper proposes a comprehensive taxonomy that systematically analyzes potential risks associated with each module of an LLM system and discusses the corresponding mitigation strategies
-
It also reviews prevalent benchmarks to facilitate the risk assessment of LLM systems.
-
-
Seven Failure Points When Engineering a Retrieval Augmented Generation System, Scott Barnett,Stefanus Kurniawan,Srikanth Thudumu,Zach Brannelly,Mohamed Abdelrazek, 11-01-2024
Categories
Software Engineering
Abstract
Software engineers are increasingly adding semantic search capabilities to applications using a strategy known as Retrieval Augmented Generation (RAG). A RAG system involves finding documents that semantically match a query and then passing the documents to a large language model (LLM) such as ChatGPT to extract the right answer using an LLM. RAG systems aim to: a) reduce the problem of hallucinated responses from LLMs, b) link sources/references to generated responses, and c) remove the need for annotating documents with meta-data. However, RAG systems suffer from limitations inherent to information retrieval systems and from reliance on LLMs. In this paper, we present an experience report on the failure points of RAG systems from three case studies from separate domains: research, education, and biomedical. We share the lessons learned and present 7 failure points to consider when designing a RAG system. The two key takeaways arising from our work are: 1) validation of a RAG system is only feasible during operation, and 2) the robustness of a RAG system evolves rather than designed in at the start. We conclude with a list of potential research directions on RAG systems for the software engineering community.
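For readers unfamiliar with the setup, the sketch below shows the basic retrieve-then-generate flow a RAG system performs; `embed` and `call_llm` are hypothetical stand-ins for a real embedding model and chat model, and the paper's failure points arise around exactly these stages.

```python
# Minimal retrieve-then-generate flow: semantically rank documents against the
# query, then pass the top matches to an LLM as context for answering.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in a sentence embedding model")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM call")

def rag_answer(query: str, documents: list[str], top_k: int = 3) -> str:
    doc_vecs = np.stack([embed(d) for d in documents])
    q = embed(query)
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    context = "\n\n".join(documents[i] for i in np.argsort(scores)[::-1][:top_k])
    return call_llm(f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}")
```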
Bullet Points
-
Software engineers are using Retrieval Augmented Generation (RAG) to add semantic search capabilities to applications
-
RAG systems aim to reduce the problem of hallucinated responses from LLMs, link sources/references to generated responses, and remove the need for annotating documents with meta-data
-
However, RAG systems suffer from limitations inherent to information retrieval systems and from reliance on LLMs
-
This paper presents an experience report on the failure points of RAG Systems from three case studies from different domains: research, education, and biomedical
-
We share the lessons learned and present 7 failure points to consider when designing a RAG System
-
The two key takeaways arising from our work are: 1) validation of a RAG system is only feasible during operation
-
2) the robustness of a RAG system evolves rather than being designed in at the start.
-
-
The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models, Matthew Renze,Erhan Guven, 11-01-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
In this paper, we introduce Concise Chain-of-Thought (CCoT) prompting. We compared standard CoT and CCoT prompts to see how conciseness impacts response length and correct-answer accuracy. We evaluated this using GPT-3.5 and GPT-4 with a multiple-choice question-and-answer (MCQA) benchmark. CCoT reduced average response length by 48.70% for both GPT-3.5 and GPT-4 while having a negligible impact on problem-solving performance. However, on math problems, GPT-3.5 with CCoT incurs a performance penalty of 27.69%. Overall, CCoT leads to an average per-token cost reduction of 22.67%. These results have practical implications for AI systems engineers using LLMs to solve real-world problems with CoT prompt-engineering techniques. In addition, these results provide more general insight for AI researchers studying the emergent behavior of step-by-step reasoning in LLMs.
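A rough sketch of how a concise-CoT prompt differs from a standard CoT prompt, together with the per-token cost arithmetic behind such savings, is given below; the prompt wording and prices are illustrative assumptions rather than the paper's exact materials.

```python
# Contrast a standard CoT instruction with a concise-CoT (CCoT) instruction and
# show how a ~48.7% shorter response translates into a lower per-query cost.
STANDARD_COT = "Think step by step, explaining your reasoning in detail, then give the answer."
CONCISE_COT = "Think step by step, but keep each step as brief as possible, then give the answer."

def per_query_cost(n_prompt, n_response, prompt_price=0.5e-6, response_price=1.5e-6):
    # Token prices here are placeholders; real savings depend on the provider's pricing
    # and the prompt/response token mix.
    return n_prompt * prompt_price + n_response * response_price

baseline = per_query_cost(200, 400)
concise = per_query_cost(200, 400 * (1 - 0.487))   # responses ~48.7% shorter, as reported
print(f"relative cost with CCoT: {concise / baseline:.2%}")
```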
Bullet Points
-
The paper introduces Concise Chain-of-Thought (CCoT) prompting, which reduces average response length by 48.70% for both GPT-3.5 and GPT-4 on an MCQA benchmark while having a negligible impact on problem-solving performance, except for a 27.69% performance penalty for GPT-3.5 on math problems
-
The results have practical implications for AI systems engineers using LLMs to solve real-world problems with CoT prompt-engineering techniques, as well as general insight for AI researchers studying the emergent behavior of step-by-step reasoning.
-
-
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs, Yi Zeng,Hongpeng Lin,Jingwen Zhang,Diyi Yang,Ruoxi Jia,Weiyan Shi, 12-01-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP, find a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs.
Bullet Points
-
The paper treats LLMs as human-like communicators and studies how to persuade them to jailbreak, exploring the overlooked intersection between everyday language interaction and AI safety
-
It proposes a persuasion taxonomy derived from decades of social science research and applies it to automatically generate interpretable persuasive adversarial prompts (PAP)
-
The results show that PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks
-
On the defense side, it explores various mechanisms against PAP, finds a significant gap in existing defenses, and advocates for more fundamental mitigation for highly interactive LLMs.
-
-
Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender, Yuqi Zhang,Liang Ding,Lefei Zhang,Dacheng Tao, 12-01-2024
Categories
Computation and Language
Abstract
None
-
RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture, Angels Balaguer,Vinamra Benara,Renato Luiz de Freitas Cunha,Roberto de M. Estevão Filho,Todd Hendry,Daniel Holstein,Jennifer Marsman,Nick Mecklenburg,Sara Malvar,Leonardo O. Nunes,Rafael Padilha,Morris Sharp,Bruno Silva,Swati Sharma,Vijay Aski,Ranveer Chandra, 16-01-2024
Categories
Computation and Language, Machine Learning
Abstract
There are two common ways in which developers are incorporating proprietary and domain-specific data when building applications of Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) and Fine-Tuning. RAG augments the prompt with the external data, while fine-Tuning incorporates the additional knowledge into the model itself. However, the pros and cons of both approaches are not well understood. In this paper, we propose a pipeline for fine-tuning and RAG, and present the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. Our pipeline consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. We propose metrics to assess the performance of different stages of the RAG and fine-Tuning pipeline. We conduct an in-depth study on an agricultural dataset. Agriculture as an industry has not seen much penetration of AI, and we study a potentially disruptive application - what if we could provide location-specific insights to a farmer? Our results show the effectiveness of our dataset generation pipeline in capturing geographic-specific knowledge, and the quantitative and qualitative benefits of RAG and fine-tuning. We see an accuracy increase of over 6 p.p. when fine-tuning the model and this is cumulative with RAG, which increases accuracy by 5 p.p. further. In one particular experiment, we also demonstrate that the fine-tuned model leverages information from across geographies to answer specific questions, increasing answer similarity from 47% to 72%. Overall, the results point to how systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.
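The sketch below outlines the pipeline stages named in the abstract (extract text from PDFs, generate Q&A pairs, fine-tune on them, judge the results with a stronger model); every helper is a hypothetical placeholder rather than the paper's code.

```python
# Rough sketch of the data-generation and evaluation stages of a RAG vs.
# fine-tuning pipeline. The fine-tuning step itself is framework-specific and
# is therefore left out of this sketch.

def extract_text(pdf_path: str) -> str:
    raise NotImplementedError("plug in a PDF parser")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM call (e.g. a question generator or judge)")

def build_qa_dataset(pdf_paths):
    """Turn raw documents into question-answer pairs usable for fine-tuning."""
    dataset = []
    for path in pdf_paths:
        text = extract_text(path)
        qa_block = call_llm(f"Write question-answer pairs grounded in this text:\n{text}")
        dataset.extend(line for line in qa_block.splitlines() if line.strip())
    return dataset

def judge_answer(model_answer: str, reference: str) -> str:
    """Use a stronger model as an automatic judge of answer quality."""
    return call_llm("Score the answer against the reference from 1 to 5.\n"
                    f"Answer: {model_answer}\nReference: {reference}")
```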
Bullet Points
-
The paper proposes a pipeline for fine-tuning and RAG, which involves extracting information from PDFs, generating questions and answers, and leveraging GPT-4 for evaluating the results
-
Metrics are proposed to assess the performance of the different stages of the RAG and fine-tuning pipeline
-
The results demonstrate the effectiveness of the dataset generation pipeline in capturing geographic-specific knowledge and the quantitative and qualitative benefits of RAG and fine-tuning
-
Fine-tuning increases accuracy by over 6 p.p., and this is cumulative with RAG, which increases accuracy by a further 5 p.p.
-
In one experiment on the agricultural dataset, the fine-tuned model leverages information from across geographies to answer specific questions, increasing answer similarity from 47% to 72%
-
This suggests that systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.
-
-
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World, Yining Hong,Zishuo Zheng,Peihao Chen,Yian Wang,Junyan Li,Chuang Gan, 16-01-2024
Categories
Computer Vision, Artificial Intelligence, Computation and Language, Machine Learning, Robotics
Abstract
Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that could incorporate multisensory interactive data, including visual, audio, tactile, and thermal information into large language models, thereby establishing the correlation among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with pre-trained LLM on such generated data, we first encode the 3D scene as abstracted object-centric representations and then introduce action tokens denoting that the embodied agent takes certain actions within the environment, as well as state tokens that represent the multisensory state observations of the agent at each time step. In the inference time, MultiPLY could generate action tokens, instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.
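The interleaved generation loop described above can be sketched as follows; the token names (`<ACTION>`, `<STATE>`) and the environment API are assumptions for illustration only.

```python
# Sketch of the rollout loop: the model emits an action token, the agent executes
# the action in the environment, and the resulting multisensory observation is
# appended back to the context as state tokens before generation continues.

def call_llm(context: str) -> str:
    raise NotImplementedError("plug in the language model")

class ToyEnvironment:
    def step(self, action: str) -> str:
        # Return a serialized multisensory observation (visual/audio/tactile/thermal).
        raise NotImplementedError

def embodied_rollout(instruction: str, env: ToyEnvironment, max_turns: int = 8) -> str:
    context = instruction
    for _ in range(max_turns):
        chunk = call_llm(context)
        context += chunk
        if "<ACTION>" in chunk:                          # model chose to act
            action = chunk.split("<ACTION>", 1)[1].split("</ACTION>", 1)[0]
            observation = env.step(action)
            context += f"<STATE>{observation}</STATE>"   # feed the observation back in
        else:
            break                                        # plain text: treat as the final answer
    return context
```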
Bullet Points
-
MultiPLY is a multisensory embodied large language model that incorporates multisensory interactive data, including visual, audio, tactile, and thermal information, into large language models to establish the correlation among words, actions, and percepts
-
We first collect Multisensory Universe dataset by deploying an LLM-powered agent to engage with the 3D environment
-
To perform instruction tuning with a pre-trained LLM on the generated data, we encode 3D scenes as abstracted object-centric representations and introduce action tokens denoting the agent's actions as well as state tokens representing its multisensory state observations at each time step
-
At inference time, MultiPLY generates action tokens instructing the agent to act in the environment and obtain the next multisensory state observation
-
The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens
-
The model outperforms baselines by a large margin across a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.
-
-
A Survey of Resource-efficient LLM and Multimodal Foundation Models, Mengwei Xu,Wangsong Yin,Dongqi Cai,Rongjie Yi,Daliang Xu,Qipeng Wang,Bingyang Wu,Yihao Zhao,Chen Yang,Shihe Wang,Qiyang Zhang,Zhenyan Lu,Li Zhang,Shangguang Wang,Yuanchun Li,Yunxin Liu,Xin Jin,Xuanzhe Liu, 16-01-2024
Categories
Machine Learning, Artificial Intelligence, Distributed, Parallel, and Cluster Computing
Abstract
Large foundation models, including large language models (LLMs), vision transformers (ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from training to deployment. However, the substantial advancements in versatility and performance these models offer come at a significant cost in terms of hardware resources. To support the growth of these large models in a scalable and environmentally sustainable way, there has been a considerable focus on developing resource-efficient strategies. This survey delves into the critical importance of such research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations. The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field.
Bullet Points
-
The survey focuses on developing resource-efficient strategies to support the growth of large foundation models in a sustainable and scalable way, examining both algorithmic and systemic aspects
-
It provides an overview of current approaches to tackling resource challenges and potential future breakthroughs in this field.
-
-
ReFT: Reasoning with Reinforced Fine-Tuning, Trung Quoc Luong,Xinbo Zhang,Zhanming Jie,Peng Sun,Xiaoran Jin,Hang Li, 17-01-2024
Categories
Computation and Language
Abstract
One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability, however, because the training only relies on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each question in the training data. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths given a question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning, with math problem-solving as an example. ReFT first warmups the model with SFT, and then employs on-line reinforcement learning, specifically the PPO algorithm in this paper, to further fine-tune the model, where an abundance of reasoning paths are automatically sampled given the question and the rewards are naturally derived from the ground-truth answers. Extensive experiments on GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT, and the performance can be potentially further boosted by combining inference-time strategies such as majority voting and re-ranking. Note that ReFT obtains the improvement by learning from the same training questions as SFT, without relying on extra or augmented training questions. This indicates a superior generalization ability for ReFT.
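At the level of detail given in the abstract, the reward signal and rollout collection can be sketched as below; the PPO update itself is omitted, and the sampling and answer-extraction helpers are hypothetical placeholders.

```python
# Sketch of the ReFT idea: after SFT warmup, sample many reasoning paths per
# question and reward those whose final answer matches the ground truth. The
# resulting (solution, reward) pairs feed the PPO update in the real method.
import re

def sample_solutions(question: str, n: int) -> list[str]:
    raise NotImplementedError("sample n reasoning paths from the warmed-up model")

def extract_answer(solution: str) -> str:
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", solution.lower())
    return match.group(1) if match else ""

def reward(solution: str, gold_answer: str) -> float:
    # The reward is derived from the ground-truth answer, not an annotated reasoning path.
    return 1.0 if extract_answer(solution) == gold_answer else 0.0

def collect_rollouts(question: str, gold_answer: str, n: int = 8):
    solutions = sample_solutions(question, n)
    return [(s, reward(s, gold_answer)) for s in solutions]
```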
Bullet Points
-
Reinforced Fine-Tuning (ReFT) is a simple and effective approach to enhance the reasoning capability of LLMs using Chain-of-Thought annotations
-
ReFT warmups the model with SFT and uses on-line reinforcement learning to further fine-tune the model, where an abundance of reasoning paths are automatically sampled given the question and rewards are naturally derived from the ground-truth answers
-
Extensive experiments on GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT, and the performance can be potentially further boosted by combining inference-time strategies such as majority voting and re-ranking
-
The improvement can be achieved by learning from the same training questions as SFT without relying on extra or augmented training questions.
-
-
Knowledge Fusion of Large Language Models, Fanqi Wan,Xinting Huang,Deng Cai,Xiaojun Quan,Wei Bi,Shuming Shi, 19-01-2024
Categories
Computation and Language
Abstract
While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures--Llama-2, MPT, and OpenLLaMA--across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at \url{this https URL}.
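A simplified sketch of the distributional fusion idea is shown below: average the source models' next-token distributions and train the target model toward the fused distribution with a KL loss. It assumes a shared, aligned vocabulary, which the actual method must handle explicitly since the source LLMs use different tokenizers.

```python
# Toy sketch of knowledge fusion via next-token distributions, assuming all
# models share one vocabulary (a strong simplification for illustration).
import torch
import torch.nn.functional as F

def fuse_distributions(source_logits: list) -> torch.Tensor:
    """Average the source next-token distributions (one simple fusion choice)."""
    return torch.stack([F.softmax(l, dim=-1) for l in source_logits]).mean(dim=0)

def fusion_loss(target_logits: torch.Tensor, source_logits: list) -> torch.Tensor:
    fused = fuse_distributions(source_logits)
    log_q = F.log_softmax(target_logits, dim=-1)
    return F.kl_div(log_q, fused, reduction="batchmean")   # KL(fused || target)

vocab, batch = 100, 4
target = torch.randn(batch, vocab, requires_grad=True)      # stand-in for target model logits
sources = [torch.randn(batch, vocab) for _ in range(3)]      # stand-ins for source model logits
fusion_loss(target, sources).backward()                      # gradients flow to the target model
```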
Bullet Points
-
Knowledge fusion for LLMs is a cost-effective and compelling approach to combining existing LLM capabilities and transferring them into a single LLM
-
By leveraging generative distributions, we externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM, and we validate our approach using three popular LLM architectures - Llama-2, MPT, and OpenLLaMA - across various benchmarks and tasks
-
Our code, model weights, and data are publicly available.
-
-
MM-LLMs: Recent Advances in MultiModal Large Language Models, Duzhen Zhang,Yahan Yu,Chenxing Li,Jiahua Dong,Dan Su,Chenhui Chu,Dong Yu, 24-01-2024
Categories
Computation and Language
Abstract
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Specifically, we first outline general design formulations for model architecture and training pipeline. Subsequently, we provide brief introductions of 26 existing MM-LLMs, each characterized by its specific formulations. Additionally, we review the performance of MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Lastly, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.
Bullet Points
-
The paper provides a comprehensive survey to facilitate further research of MM-LLMs, including general design formulations, introductions of existing models, performance on mainstream benchmarks, key training recipes, and a real-time tracking website
-
The survey encourages further research and contributes to the ongoing advancement of the domain.
-
-
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models, Hongliang He,Wenlin Yao,Kaixin Ma,Wenhao Yu,Yong Dai,Hongming Zhang,Zhenzhong Lan,Dong Yu, 25-01-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
The advancement of large language models (LLMs) leads to a new era marked by the development of autonomous applications in the real world, which drives innovation in the creation of advanced web-based agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we propose a new evaluation protocol for web agents to address the challenges of automatic evaluation of open-ended web agent tasks, leveraging the robust multimodal comprehension capabilities of GPT-4V. We create a new benchmark by gathering real-world tasks from 15 widely used websites to evaluate our agents. We show that WebVoyager achieves a 55.7% task success rate, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager in practical applications. We found that our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in a real-world setting.
Bullet Points
-
We introduce WebVoyager, an LMM-powered web agent that can complete user instructions end-to-end by interacting with real-world websites
-
We propose a new evaluation protocol for web agents to address the challenges of automatic evaluation of open-ended web agent tasks, leveraging the robust multimodal comprehension capabilities of GPT-4V
-
We create a benchmark by gathering real-life tasks from 15 widely used websites to evaluate our agents, achieving a 55.7% task success rate
-
Our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in a real world setting.
-
-
The Power of Noise: Redefining Retrieval for RAG Systems, Florin Cuconasu,Giovanni Trappolini,Federico Siciliano,Simone Filice,Cesare Campagnano,Yoelle Maarek,Nicola Tonellotto,Fabrizio Silvestri, 26-01-2024
Categories
Information Retrieval, Computation and Language
Abstract
Retrieval-Augmented Generation (RAG) systems represent a significant advancement over traditional Large Language Models (LLMs). RAG systems enhance their generation ability by incorporating external data retrieved through an Information Retrieval (IR) phase, overcoming the limitations of standard LLMs, which are restricted to their pre-trained knowledge and limited context window. Most research in this area has predominantly concentrated on the generative aspect of LLMs within RAG systems. Our study fills this gap by thoroughly and critically analyzing the influence of IR components on RAG systems. This paper analyzes which characteristics a retriever should possess for an effective RAG's prompt formulation, focusing on the type of documents that should be retrieved. We evaluate various elements, such as the relevance of the documents to the prompt, their position, and the number included in the context. Our findings reveal, among other insights, that including irrelevant documents can unexpectedly enhance performance by more than 30% in accuracy, contradicting our initial assumption of diminished quality. These results underscore the need for developing specialized strategies to integrate retrieval with language generation models, thereby laying the groundwork for future research in this field.
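The controlled prompt construction such a study requires can be sketched as follows: mix a gold (relevant) document with distractor documents, and vary how many documents are included and where the gold document sits; `call_llm` is a hypothetical placeholder.

```python
# Build a RAG prompt while controlling the number of documents, the presence of
# irrelevant distractors, and the position of the relevant (gold) document.
import random

def build_context(gold_doc: str, distractors: list[str], n_docs: int, gold_position: int) -> str:
    docs = random.sample(distractors, k=n_docs - 1)
    docs.insert(gold_position, gold_doc)                 # position in context is a studied variable
    return "\n\n".join(f"Document [{i}]: {d}" for i, d in enumerate(docs))

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM call")

def rag_query(question: str, gold_doc: str, distractors: list[str],
              n_docs: int = 5, gold_position: int = 0) -> str:
    context = build_context(gold_doc, distractors, n_docs, gold_position)
    return call_llm(f"{context}\n\nQuestion: {question}\nAnswer:")
```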
Bullet Points
-
Retrieval-Augmented Generation (RAG) systems enhance their generation ability by incorporating external data retrieved through an Information Retrieval (IR) phase, overcoming limitations of standard LLMs
-
The paper analyzes the influence of IR components on RAG systems, analyzing which characteristics a retriever should possess for an effective RAG prompt formulation, focusing on the type of documents that should be retrieved
-
Including irrelevant documents can unexpectedly enhance performance by more than 30% in accuracy, contradicting our initial assumption of diminished quality
-
Specialized strategies should be developed to integrate retrieval with language generation models, laying the groundwork for future research in this field.
-
-
A Comprehensive Survey of Compression Algorithms for Language Models, Seungcheol Park,Jaehyeon Choi,Sojin Lee,U Kang, 27-01-2024
Categories
Computation and Language, Artificial Intelligence, Natural Language Processing
Abstract
How can we compress language models without sacrificing accuracy? The number of compression algorithms for language models is rapidly growing to benefit from remarkable advances of recent language models without side effects due to the gigantic size of language models, such as increased carbon emissions and expensive maintenance fees. While numerous compression algorithms have shown remarkable progress in compressing language models, it ironically becomes challenging to capture emerging trends and identify the fundamental concepts underlying them due to the excessive number of algorithms. In this paper, we survey and summarize diverse compression algorithms including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. We not only summarize the overall trend of diverse compression algorithms but also select representative algorithms and provide in-depth analyses of them. We discuss the value of each category of compression algorithms, and the desired properties of low-cost compression algorithms which have a significant impact due to the emergence of large language models. Finally, we introduce promising future research topics based on our survey results.
Bullet Points
-
To compress language models without sacrificing accuracy, we can use diverse compression algorithms such as pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design
-
The paper surveys and summarizes diverse algorithms, selects representative algorithms, provides in-depth analyses, discusses the value of each category of compression algorithms, and proposes promising future research topics based on our survey results.
-
-
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models, Fuzhao Xue,Zian Zheng,Yao Fu,Jinjie Ni,Zangwei Zheng,Wangchunshu Zhou,Yang You, 29-01-2024
Categories
Computation and Language, Artificial Intelligence, Distributed, Parallel, and Cluster Computing, Machine Learning
Abstract
One more important contribution of this study is an in-depth analysis of the routing mechanisms within our OpenMoE models, leading to three significant findings: Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End. We discovered that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance. The token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged. This imperfect routing can result in performance degradation, particularly in sequential tasks like multi-turn conversations, where tokens appearing later in a sequence are more likely to be dropped. Finally, we rethink our design based on the above-mentioned observations and analysis. To facilitate future MoE LLM development, we propose potential strategies for mitigating the issues we found and further improving off-the-shelf MoE LLM designs.
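One simple way to probe the reported Context-Independent Specialization is sketched below: given logged (token_id, expert_id) routing decisions collected across many contexts, measure how often each token ID lands on its single most frequent expert. The log format is an assumed simplification of whatever the real analysis pipeline records.

```python
# Routing-consistency probe: a value near 1.0 means the router assigns each
# token ID to the same expert regardless of context.
from collections import Counter, defaultdict

def routing_consistency(routing_log):
    """routing_log: iterable of (token_id, expert_id) pairs gathered across many contexts."""
    per_token = defaultdict(Counter)
    for token_id, expert_id in routing_log:
        per_token[token_id][expert_id] += 1
    hits = total = 0
    for counts in per_token.values():
        hits += counts.most_common(1)[0][1]   # assignments to the dominant expert
        total += sum(counts.values())
    return hits / total

log = [(17, 3), (17, 3), (17, 3), (42, 1), (42, 5), (42, 1)]
print(routing_consistency(log))                # 5/6 ≈ 0.83
```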
Bullet Points
-
The study found that routing decisions in OpenMoE models are predominantly based on token IDs, with minimal context relevance
-
The token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged
-
This imperfect routing can result in performance degradation, particularly in sequential tasks like multi-turn conversations
-
The study proposes potential strategies for mitigating these issues and improving off-the-shelf MoE LLM designs.
-
-
Corrective Retrieval Augmented Generation, Shi-Qi Yan,Jia-Chen Gu,Yun Zhu,Zhen-Hua Ling, 29-01-2024
Categories
Computation and Language
Abstract
Large language models (LLMs) inevitably exhibit hallucinations since the accuracy of generated texts cannot be secured solely by the parametric knowledge they encapsulate. Although retrieval-augmented generation (RAG) is a practicable complement to LLMs, it relies heavily on the relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong. To this end, we propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation. Specifically, a lightweight retrieval evaluator is designed to assess the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered. Since retrieval from static and limited corpora can only return sub-optimal documents, large-scale web searches are utilized as an extension for augmenting the retrieval results. Besides, a decompose-then-recompose algorithm is designed for retrieved documents to selectively focus on key information and filter out irrelevant information in them. CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches. Experiments on four datasets covering short- and long-form generation tasks show that CRAG can significantly improve the performance of RAG-based approaches.
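The confidence-gated dispatch described in the abstract can be sketched as follows; all helpers and thresholds are hypothetical placeholders, and the three branches (trust, replace, combine) are one plausible reading of the confidence-triggered retrieval actions.

```python
# The retrieval evaluator scores the retrieved documents; the score decides
# whether to use them (after filtering), fall back to web search, or combine both.

def evaluate_retrieval(query: str, docs: list[str]) -> float:
    raise NotImplementedError("lightweight retrieval evaluator returning a confidence in [0, 1]")

def decompose_then_recompose(query: str, docs: list[str]) -> str:
    raise NotImplementedError("split docs into strips, keep the relevant strips, recompose")

def web_search(query: str) -> list[str]:
    raise NotImplementedError("large-scale web search as an external knowledge source")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("generator LLM")

def crag_answer(query: str, retrieved: list[str], hi: float = 0.7, lo: float = 0.3) -> str:
    score = evaluate_retrieval(query, retrieved)
    if score >= hi:                     # high confidence: trust retrieval, but filter it
        knowledge = decompose_then_recompose(query, retrieved)
    elif score <= lo:                   # low confidence: discard retrieval, search the web
        knowledge = decompose_then_recompose(query, web_search(query))
    else:                               # ambiguous: combine both sources
        knowledge = decompose_then_recompose(query, retrieved + web_search(query))
    return call_llm(f"Knowledge:\n{knowledge}\n\nQuestion: {query}\nAnswer:")
```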
Bullet Points
-
We propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of LLM generation by utilizing a lightweight retrieval evaluator, large-scale web searches, and a decompose-then-recompose algorithm to selectively focus on key information and filter out irrelevant information in retrieved documents
-
CRAG is a plug-and-play approach that can be seamlessly coupled with various RAG-based approaches.
-
-
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity, Ansar Aynetdinov,Alan Akbik, 30-01-2024
Categories
Computation and Language
Abstract
Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable advancements in their ability to generate fitting responses to natural language instructions. However, many current works rely on manual evaluation to judge the quality of generated responses. Since such manual evaluation is time-consuming, it does not easily scale to the evaluation of multiple models and model variants. In this short paper, we propose a straightforward but remarkably effective evaluation metric called SemScore, in which we directly compare model outputs to gold target responses using semantic textual similarity (STS). We conduct a comparative evaluation of the model outputs of 12 prominent instruction-tuned LLMs using 8 widely-used evaluation metrics for text generation. We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation to human evaluation. These findings indicate the utility of our proposed metric for the evaluation of instruction-tuned LLMs.
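A minimal STS-style score of the kind described can be computed as below; the choice of the `sentence-transformers` library and this particular checkpoint is an assumption for illustration, not necessarily the paper's exact embedding model.

```python
# Embed the model output and the gold target response, then take their cosine
# similarity as the semantic-textual-similarity score.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint for illustration

def sem_score(model_output: str, gold_response: str) -> float:
    a, b = model.encode([model_output, gold_response])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(sem_score("The capital of France is Paris.", "Paris is France's capital city."))
```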
Bullet Points
-
The paper proposes a SemScore evaluation metric that directly compares model outputs to gold target responses using semantic textual similarity (STS)
-
We conduct a comparative evaluation of 12 prominent instruction-tuned LLMs using 8 widely-used evaluation metrics for text generation and find that our proposed metric outperforms all other, in many cases more complex evaluation metrics in terms of correlation to human evaluation.
-
-
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research, Luca Soldaini,Rodney Kinney,Akshita Bhagia,Dustin Schwenk,David Atkinson,Russell Authur,Ben Bogin,Khyathi Chandu,Jennifer Dumas,Yanai Elazar,Valentin Hofmann,Ananya Harsh Jha,Sachin Kumar,Li Lucy,Xinxi Lyu,Nathan Lambert,Ian Magnusson,Jacob Morrison,Niklas Muennighoff,Aakanksha Naik,Crystal Nam,Matthew E. Peters,Abhilasha Ravichander,Kyle Richardson,Zejiang Shen,Emma Strubell,Nishant Subramani,Oyvind Tafjord,Pete Walsh,Luke Zettlemoyer,Noah A. Smith,Hannaneh Hajishirzi,Iz Beltagy,Dirk Groeneveld,Jesse Dodge,Kyle Lo, 31-01-2024
Categories
Computation and Language
Abstract
Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.
Bullet Points
-
To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus
-
We document its design principles, details about its construction, and a summary of its contents
-
We present analyses and experimental results on intermediate states to share what we have learned about important data curation practices
-
Finally, we open-source our toolkit to enable reproducibility of our work and support further research in large-scale data curation.
-
-
Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization?, Xue-Yong Fu,Md Tahmid Rahman Laskar,Elena Khasanova,Cheng Chen,Shashi Bhushan TN, 01-02-2024
Categories
Computation and Language
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities to solve a wide range of tasks without being explicitly fine-tuned on task-specific datasets. However, deploying LLMs in the real world is not trivial, as it requires substantial computing resources. In this paper, we investigate whether smaller, compact LLMs are a good alternative to the comparatively larger LLMs to address significant costs associated with utilizing LLMs in the real world. In this regard, we study the meeting summarization task in a real-world industrial environment and conduct extensive experiments by comparing the performance of fine-tuned compact LLMs (e.g., FLAN-T5, TinyLLaMA, LiteLLaMA) with zero-shot larger LLMs (e.g., LLaMA-2, GPT-3.5, PaLM-2). We observe that most smaller LLMs, even after fine-tuning, fail to outperform larger zero-shot LLMs in meeting summarization datasets. However, a notable exception is FLAN-T5 (780M parameters), which performs on par or even better than many zero-shot larger LLMs (from 7B to above 70B parameters), while being significantly smaller. This makes compact LLMs like FLAN-T5 a suitable cost-efficient solution for real-world industrial deployment.
Bullet Points
-
The paper investigates whether smaller, compact LLMs are a good alternative to larger LLMs to address the significant costs associated with deploying them in the real world
-
We study the meeting summarization task in a real-world industrial environment and compare the performance of fine-tuned compact LLMs with zero-shot larger LLMs
-
FLAN-T5 (780M parameters) performs on par with or even better than many zero-shot larger LLMs, making compact LLMs like it a suitable cost-efficient solution for real-world industrial deployment.
-
-
TravelPlanner: A Benchmark for Real-World Planning with Language Agents, Jian Xie,Kai Zhang,Jiangjie Chen,Tinghui Zhu,Renze Lou,Yuandong Tian,Yanghua Xiao,Yu Su, 02-02-2024
Categories
Computation and Language
Abstract
Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks-even GPT-4 only achieves a success rate of 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.
Bullet Points
-
TravelPlanner is a new planning benchmark that focuses on travel planning, a common real-world planning scenario
-
It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans
-
However, the current language agents are not yet capable of handling such complex planning tasks, even GPT-4 only achieves a success rate of 0.6%
-
The mere possibility for language agents to tackle such a complex problem is non-trivial progress.
-
-
More Agents Is All You Need, Junyou Li,Qin Zhang,Yangbin Yu,Qiang Fu,Deheng Ye, 03-02-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated. Also, this method is orthogonal to existing complicated methods to further enhance LLMs, while the degree of enhancement is correlated to the task difficulty. We conduct comprehensive experiments on a wide range of LLM benchmarks to verify the presence of our finding, and to study the properties that can facilitate its occurrence. Our code is publicly available at: \url{https://anonymous.4open.science/r/more_agent_is_all_you_need}.
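The sampling-and-voting procedure is simple enough to sketch directly: query the same model several times (one "agent" per sample) and return the most common answer; `sample_answer` is a hypothetical stand-in for a temperature-sampled LLM call.

```python
# Sample N answers from the same model and take a majority vote.
from collections import Counter

def sample_answer(question: str) -> str:
    raise NotImplementedError("one temperature-sampled LLM answer")

def majority_vote_answer(question: str, n_agents: int = 10) -> str:
    answers = [sample_answer(question) for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]
```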
Bullet Points
-
Simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated
-
This method is orthogonal to existing complicated methods to enhance LLMs, and the degree of enhancement is correlated to task difficulty
-
We conduct extensive experiments on LLM benchmarks to verify the presence of our finding and study the properties that can facilitate its occurrence
-
Our code is publicly available at https://anonymous.4open.science/r/more_agent_is_all_you_need.
-
-
Large Language Model for Table Processing: A Survey, Weizheng Lu,Jiaming Zhang,Jing Zhang,Yueguo Chen, 04-02-2024
Categories
Artificial Intelligence, Computation and Language
Abstract
Tables, typically two-dimensional and structured to store large amounts of data, are essential in daily activities like database queries, spreadsheet calculations, and generating reports from web tables. Automating these table-centric tasks with Large Language Models (LLMs) offers significant public benefits, garnering interest from academia and industry. This survey provides an extensive overview of table tasks, encompassing not only the traditional areas like table question answering (Table QA) and fact verification, but also newly emphasized aspects such as table manipulation and advanced table data analysis. Additionally, it goes beyond the early strategies of pre-training and fine-tuning small language models, to include recent paradigms in LLM usage. The focus here is particularly on instruction-tuning, prompting, and agent-based approaches within the realm of LLMs. Finally, we highlight several challenges, ranging from private deployment and efficient inference to the development of extensive benchmarks for table manipulation and advanced data analysis.
Bullet Points
-
Tables are essential in daily activities like database queries, spreadsheet calculations, and generating reports from web tables
-
LLMs are used to automate table-centric tasks, gaining public interest from academia and industry
-
This survey provides an overview of table tasks, including traditional areas like table question answering and fact verification, as well as newly emphasized aspects such as table manipulation and advanced table data analysis
-
The survey also includes recent paradigms in LLM usage, including instruction-tuning, prompting, and agent-based approaches
-
Challenges include private deployment, efficient inference, and the development of extensive benchmarks for table manipulation and advanced data analysis.
-
-
Intent-based Prompt Calibration: Enhancing prompt optimization with synthetic boundary cases, Elad Levi,Eli Brosh,Matan Friedmann, 05-02-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
None
-
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications, Pranab Sahoo,Ayush Kumar Singh,Sriparna Saha,Vinija Jain,Samrat Mondal,Aman Chadha, 05-02-2024
Categories
Artificial Intelligence, Computation and Language, Human-Computer Interaction
Abstract
Prompt engineering has emerged as an indispensable technique for extending the capabilities of large language models (LLMs) and vision-language models (VLMs). This approach leverages task-specific instructions, known as prompts, to enhance model efficacy without modifying the core model parameters. Rather than updating the model parameters, prompts allow seamless integration of pre-trained models into downstream tasks by eliciting desired model behaviors solely based on the given prompt. Prompts can be natural language instructions that provide context to guide the model or learned vector representations that activate relevant knowledge. This burgeoning field has enabled success across various applications, from question-answering to commonsense reasoning. However, there remains a lack of systematic organization and understanding of the diverse prompt engineering methods and techniques. This survey paper addresses the gap by providing a structured overview of recent advancements in prompt engineering, categorized by application area. For each prompting approach, we provide a summary detailing the prompting methodology, its applications, the models involved, and the datasets utilized. We also delve into the strengths and limitations of each approach and include a taxonomy diagram and table summarizing datasets, models, and critical points of each prompting technique. This systematic analysis enables a better understanding of this rapidly developing field and facilitates future research by illuminating open challenges and opportunities for prompt engineering.
Bullet Points
-
Prompt engineering is a technique that extends the capabilities of LLMs and vision-language models by using task-specific instructions, known as prompts, to enhance model efficacy without modifying the core model parameters
-
It allows seamless integration of pre-trained models into downstream tasks by eliciting desired model behaviors solely based on the given prompt
-
This field has enabled success across various applications, from question-answering to commonsense reasoning
-
However, there is still a lack of systematic organization and understanding of the diverse prompt engineering methods and techniques
-
This survey paper addresses the gap by providing a structured overview of recent advancements in prompt engineering, categorized by application area
-
For each prompting approach, we provide a summary detailing the prompting methodology, its applications, the models involved, and the datasets utilized
-
We also explore the strengths and limitations of each approach and include a taxonomy diagram and a table summarizing the datasets, models, and critical points of each prompting technique.
-
-
AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls, Yu Du,Fangyun Wei,Hongyang Zhang, 06-02-2024
Categories
Computation and Language
Abstract
-
Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning, Yanfang Zhang,Yiliu Sun,Yibing Zhan,Dapeng Tao,Dacheng Tao,Chen Gong, 06-02-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Recently, increasing attention has been drawn to improving the ability of Large Language Models (LLMs) to perform complex reasoning. However, previous methods, such as Chain-of-Thought and Self-Consistency, mainly follow Direct Reasoning (DR) frameworks, so they struggle with the many real-world tasks that can hardly be solved via DR. Therefore, to strengthen the reasoning power of LLMs, this paper proposes a novel Indirect Reasoning (IR) method that employs the logic of contrapositives and contradictions to tackle IR tasks such as factual reasoning and mathematical proof. Specifically, our methodology comprises two steps. Firstly, we leverage the logical equivalence of the contrapositive to augment the data and rules to enhance the comprehensibility of LLMs. Secondly, we design a set of prompt templates to trigger LLMs to conduct IR based on proof by contradiction, which is logically equivalent to the original DR process. Our IR method is simple yet effective and can be straightforwardly integrated with existing DR methods to further boost the reasoning abilities of LLMs. The experimental results on popular LLMs, such as GPT-3.5-turbo and Gemini-pro, show that our IR method enhances the overall accuracy of factual reasoning by 27.33% and mathematical proof by 31.43%, when compared with traditional DR methods. Moreover, the methods combining IR and DR significantly outperform the methods solely using IR or DR, further demonstrating the effectiveness of our strategy.
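As a rough illustration of the two IR steps described above, the sketch below augments an "if P then Q" rule with its logically equivalent contrapositive and builds a proof-by-contradiction prompt. The toy rule parser and template wording are assumptions for illustration, not the paper's released prompts.

```python
def augment_with_contrapositive(rule: str) -> list[str]:
    """Step 1: pair each 'if P then Q' rule with its contrapositive
    'if not Q then not P', which is logically equivalent."""
    # Toy parser for rules written exactly as "if P then Q".
    body = rule.removeprefix("if ").split(" then ")
    p, q = body[0], body[1]
    return [rule, f"if not {q} then not {p}"]

def proof_by_contradiction_prompt(facts: list[str], claim: str) -> str:
    """Step 2: ask the model to assume the negation of the claim and
    derive a contradiction from the given facts."""
    fact_block = "\n".join(f"- {f}" for f in facts)
    return (
        "Facts and rules (each rule is listed with its contrapositive):\n"
        f"{fact_block}\n\n"
        f"Claim: {claim}\n"
        "Assume the claim is FALSE. Derive a contradiction with the facts "
        "step by step. If a contradiction is reached, answer 'True'; "
        "otherwise answer 'Unknown'."
    )

if __name__ == "__main__":
    rules = augment_with_contrapositive("if it rains then the ground is wet")
    print(proof_by_contradiction_prompt(rules + ["the ground is not wet"],
                                        "it did not rain"))
```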
Bullet Points
-
The paper proposes a novel Indirect Reasoning (IR) method that employs the logic of contrapositives and contradictions to tackle IR tasks such as factual reasoning and mathematical proof
-
The methodology involves leveraging the logical equivalence of contrapositives to enhance the comprehensibility of LLMs, and designing prompt templates to trigger them to conduct IR based on proof by contradiction, which is logically equivalent to the original DR process
-
The IR method is simple yet effective, and can be easily integrated with existing DR methods to further boost the reasoning abilities
-
Experimental results show that our method enhances the overall accuracy of factual reasoning by 27.33% and mathematical proof by 31.43% when compared with traditional DR methods
-
The methods combining IR and DR significantly outperform the methods solely using IR or DR, further demonstrating the effectiveness of our strategy.
-
-
LLM Agents can Autonomously Hack Websites, Richard Fang,Rohan Bindu,Akul Gupta,Qiusi Zhan,Daniel Kang, 06-02-2024
Categories
Cryptography and Security, Artificial Intelligence
Abstract
In this work, we show that LLM agents can autonomously hack websites, performing tasks as complex as blind database schema extraction and SQL injections without human feedback. Importantly, the agent does not need to know the vulnerability beforehand. This capability is uniquely enabled by frontier models that are highly capable of tool use and leveraging extended context. Namely, we show that GPT-4 is capable of such hacks, but existing open-source models are not. Finally, we show that GPT-4 is capable of autonomously finding vulnerabilities in websites in the wild. Our findings raise questions about the widespread deployment of LLMs.
Bullet Points
-
LLM agents can autonomously hack websites without human feedback, using frontier models that are highly capable of tool use and leveraging extended context
-
GPT-4 is capable of autonomously finding vulnerabilities in websites in the wild, which raises questions about the widespread deployment of LLMs.
-
-
In-Context Principle Learning from Mistakes, Tianjun Zhang,Aman Madaan,Luyu Gao,Steven Zheng,Swaroop Mishra,Yiming Yang,Niket Tandon,Uri Alon, 08-02-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
In-context learning (ICL, also known as few-shot prompting) has been the standard method of adapting LLMs to downstream tasks, by learning from a few input-output examples. Nonetheless, all ICL-based approaches only learn from correct input-output pairs. In this paper, we revisit this paradigm, by learning more from the few given input-output examples. We introduce Learning Principles (LEAP): First, we intentionally induce the model to make mistakes on these few examples; then we reflect on these mistakes, and learn explicit task-specific "principles" from them, which help solve similar problems and avoid common mistakes; finally, we prompt the model to answer unseen test questions using the original few-shot examples and these learned general principles. We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4 turbo and Claude-2.1. For example, LEAP improves over the standard few-shot prompting using GPT-4 by 7.5% in DROP, and by 3.3% in HotpotQA. Importantly, LEAP does not require any more input or examples than the standard few-shot prompting settings.
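A minimal sketch of the LEAP loop described above follows: induce mistakes on the few-shot examples, distill explicit principles from them, then prepend those principles to the usual few-shot prompt. The llm() callable, prompt wording, and naive string-match answer check are illustrative assumptions, not the paper's exact procedure.

```python
def leap_principles(llm, few_shot: list[tuple[str, str]], n_samples: int = 4) -> str:
    """Induce mistakes on the few-shot examples, then distill explicit
    task-specific principles from them."""
    mistakes = []
    for question, gold in few_shot:
        for _ in range(n_samples):
            # Sample with non-zero temperature so the model can err.
            attempt = llm(f"Answer step by step:\n{question}", temperature=1.0)
            if gold not in attempt:          # naive answer check for illustration
                mistakes.append((question, attempt, gold))
    reflection = ("For each mistake below, explain what went wrong and state a "
                  "general principle that avoids it.\n")
    for q, wrong, gold in mistakes:
        reflection += f"\nQuestion: {q}\nWrong answer: {wrong}\nCorrect answer: {gold}\n"
    return llm(reflection, temperature=0.0)  # the learned principles

def leap_answer(llm, principles: str, few_shot, test_question: str) -> str:
    """Answer unseen questions with the original few-shot examples plus the
    learned principles prepended -- no extra examples are needed."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in few_shot)
    prompt = f"Principles to follow:\n{principles}\n\n{shots}\n\nQ: {test_question}\nA:"
    return llm(prompt, temperature=0.0)
```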
Bullet Points
-
In-context learning (ICL) is a standard method of adapting LLMs to downstream tasks by learning from a few input-output examples
-
However, all ICL-based approaches only learn from correct input-output pairs
-
In this paper, we revisit this paradigm by learning more from the few given input-output examples
-
We introduce Learning Principles (LEAP) by intentionally inducing the model to make mistakes on these few examples, reflecting on these mistakes, and learning explicit task-specific "principles" from them, which help solve similar problems and avoid common mistakes
-
We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH)
-
LEAP improves over standard few-shot prompting with GPT-4 by 7.5% on DROP and 3.3% on HotpotQA, without requiring any more input or examples than the standard few-shot settings.
-
-
How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis, Federico Bianchi,Patrick John Chia,Mert Yuksekgonul,Jacopo Tagliabue,Dan Jurafsky,James Zou, 08-02-2024
Categories
Artificial Intelligence, Computation and Language, Computer Science and Game Theory
Abstract
Negotiation is the basis of social interactions; humans negotiate everything from the price of cars to how to share common resources. With rapidly growing interest in using large language models (LLMs) to act as agents on behalf of human users, such LLM agents would also need to be able to negotiate. In this paper, we study how well LLMs can negotiate with each other. We develop NegotiationArena: a flexible framework for evaluating and probing the negotiation abilities of LLM agents. We implemented three types of scenarios in NegotiationArena to assess LLM's behaviors in allocating shared resources (ultimatum games), aggregate resources (trading games) and buy/sell goods (price negotiations). Each scenario allows for multiple turns of flexible dialogues between LLM agents to allow for more complex negotiations. Interestingly, LLM agents can significantly boost their negotiation outcomes by employing certain behavioral tactics. For example, by pretending to be desolate and desperate, LLMs can improve their payoffs by 20% when negotiating against the standard GPT-4. We also quantify irrational negotiation behaviors exhibited by the LLM agents, many of which also appear in humans. Together, \NegotiationArena offers a new environment to investigate LLM interactions, enabling new insights into LLM's theory of mind, irrationality, and reasoning abilities.
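The sketch below is a rough, self-contained illustration of a multi-turn price negotiation between two LLM agents in the spirit of NegotiationArena; the role prompts, message format, and llm() helper are assumptions, and the real framework defines richer structured scenarios (ultimatum, trading, and price games).

```python
def negotiate(llm, max_turns: int = 6) -> list[str]:
    roles = {
        "seller": "You are selling a used bike. Your hidden reserve price is $80. "
                  "Negotiate the highest price you can.",
        "buyer": "You are buying a used bike. Your hidden budget is $120. "
                 "Negotiate the lowest price; you may use behavioral tactics "
                 "(e.g., appearing desperate).",
    }
    transcript: list[str] = []
    speaker = "seller"
    for _ in range(max_turns):
        history = "\n".join(transcript) or "(no messages yet)"
        reply = llm(f"{roles[speaker]}\nConversation so far:\n{history}\n"
                    "Reply with one short message; say DEAL <price> to accept.")
        transcript.append(f"{speaker}: {reply}")
        if "DEAL" in reply.upper():          # agreement reached
            break
        speaker = "buyer" if speaker == "seller" else "seller"
    return transcript
```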
Bullet Points
-
The paper explores how LLM agents can negotiate and develops NegotiationArena, a flexible framework for evaluating and probing their negotiation abilities
-
Three scenarios were implemented to assess LLM's behaviors in allocating shared resources (ultimatum games), aggregate resources (trading games) and buy/sell goods (price negotiations)
-
LLM agents can significantly boost their negotiation outcomes by employing behavioral tactics; for example, pretending to be desolate and desperate improves payoffs by 20% against the standard GPT-4. The study also quantifies irrational negotiation behaviors exhibited by LLM agents, many of which also appear in humans.
-
-
An Interactive Agent Foundation Model, Zane Durante,Bidipta Sarkar,Ran Gong,Rohan Taori,Yusuke Noda,Paul Tang,Ehsan Adeli,Shrinidhi Kowshika Lakshmikanth,Kevin Schulman,Arnold Milstein,Demetri Terzopoulos,Ade Famoti,Noboru Kuno,Ashley Llorens,Hoi Vo,Katsu Ikeuchi,Li Fei-Fei,Jianfeng Gao,Naoki Wake,Qiuyuan Huang, 08-02-2024
Categories
Artificial Intelligence, Machine Learning, Robotics
Abstract
The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.
Bullet Points
-
The Interactive Agent Foundation Model uses a multi-task agent training paradigm to train AI agents across a wide range of domains, datasets, and tasks
-
It unifies pre-training strategies, enabling a versatile and adaptable AI framework
-
The model generates meaningful and contextually relevant outputs across three domains: Robotics, Gaming AI, and Healthcare
-
The approach relies on a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets and textual information
-
It provides a promising avenue for developing generalist, action-taking, multimodal systems.
-
-
Large Language Models: A Survey, Shervin Minaee,Tomas Mikolov,Narjes Nikzad,Meysam Chenaghlu,Richard Socher,Xavier Amatriain,Jianfeng Gao, 09-02-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model parameters on massive amounts of text data, as predicted by scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.
Bullet Points
-
LLMs have gained attention due to their strong performance on natural language tasks and their general-purpose language understanding and generation abilities
-
They acquire these abilities by training billions of parameters on massive amounts of text data, as predicted by scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022)
-
Their research area is evolving rapidly in many different ways
-
The paper reviews some of the most prominent LLM families, including GPT, LLaMA, and PaLM, and discusses their characteristics, contributions, and limitations
-
We also provide an overview of techniques developed to build and augment LLMs, survey popular datasets prepared for training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on representative benchmarks
-
Finally, the paper concludes by discussing open challenges and future research directions.
-
-
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement, Zhiyong Wu,Chengcheng Han,Zichen Ding,Zhenmin Weng,Zhoumianze Liu,Shunyu Yao,Tao Yu,Lingpeng Kong, 12-02-2024
Categories
Artificial Intelligence
Abstract
Autonomous interaction with the computer has been a longstanding challenge with great potential, and the recent proliferation of large language models (LLMs) has markedly accelerated progress in building digital agents. However, most of these agents are designed to interact with a narrow domain, such as a specific software or website. This narrow focus constrains their applicability for general computer tasks. To this end, we introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks. We also present numerical and quantitative evidence that FRIDAY learns to control and self-improve on Excel and Powerpoint with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.
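To make the generalist-agent idea above concrete, the toy sketch below lets a model pick an OS-level tool (shell, file read, Python) which a dispatch loop then executes. The tool names, JSON action format, and llm() helper are assumptions of this sketch, not the OS-Copilot API, and real deployments would add the Docker-based sandboxing and permission guardrails the abstract describes.

```python
import json
import pathlib
import subprocess

def run_tool(name: str, arg: str) -> str:
    """Execute one of three toy tools; sandboxing is omitted for brevity."""
    if name == "shell":
        return subprocess.run(arg, shell=True, capture_output=True, text=True).stdout
    if name == "read_file":
        return pathlib.Path(arg).read_text()
    if name == "python":
        scope: dict = {}
        exec(arg, scope)                      # expects the snippet to set `result`
        return str(scope.get("result", ""))
    return f"unknown tool {name}"

def agent_step(llm, task: str, history: list[str]) -> str:
    """One plan-act-observe step; returns the final answer when the model stops."""
    prompt = (f"Task: {task}\nHistory:\n" + "\n".join(history) +
              '\nRespond as JSON: {"tool": ..., "arg": ...} or {"final": ...}')
    action = json.loads(llm(prompt))          # assumes the model emits valid JSON
    if "final" in action:
        return action["final"]
    observation = run_tool(action["tool"], action["arg"])
    history.append(f"{action} -> {observation[:500]}")
    return ""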
Bullet Points
-
OS-Copilot is a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications
-
It is used to create FRIDAY, a self-improving embodied agent for automating general computer tasks, which outperforms previous methods by 35% on GAIA, showcasing strong generalization to unseen applications via accumulated skills from previous tasks
-
The framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.
-
-
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model, Ahmet Üstün,Viraat Aryabumi,Zheng-Xin Yong,Wei-Yin Ko,Daniel D'souza,Gbemileke Onilude,Neel Bhandari,Shivalika Singh,Hui-Lee Ooi,Amr Kayid,Freddie Vargus,Phil Blunsom,Shayne Longpre,Niklas Muennighoff,Marzieh Fadaee,Julia Kreutzer,Sara Hooker, 12-02-2024
Categories
Computation and Language
Abstract
-
DoRA: Weight-Decomposed Low-Rank Adaptation, Shih-Yang Liu,Chien-Yi Wang,Hongxu Yin,Pavlo Molchanov,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Min-Hung Chen, 14-02-2024
Categories
Computation and Language, Computer Vision
Abstract
Among the widely used parameter-efficient finetuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT from the findings, we propose Weight-Decomposed LowRank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding.
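A minimal PyTorch-style sketch of the decomposition described above is given below: the frozen pre-trained weight supplies a direction that is updated with LoRA, while a separate learnable magnitude rescales it. The normalization axis, rank, and initialization here are simplifying assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        self.register_buffer("w0", weight)                       # frozen pretrained weight (out, in)
        self.m = nn.Parameter(weight.norm(dim=1, keepdim=True))  # learnable magnitude (axis simplified)
        self.A = nn.Parameter(torch.randn(rank, weight.shape[1]) * 0.01)
        self.B = nn.Parameter(torch.zeros(weight.shape[0], rank))

    def merged_weight(self) -> torch.Tensor:
        direction = self.w0 + self.B @ self.A                    # LoRA update applied to the direction
        direction = direction / direction.norm(dim=1, keepdim=True)
        return self.m * direction                                # rescale by the learned magnitude

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.merged_weight().t()
```

Because only m, A, and B are trainable, the number of tuned parameters stays LoRA-sized, and the merged weight can be folded back at inference so no extra latency is added.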
Bullet Points
-
The work introduces a weight decomposition analysis to investigate the accuracy gap between parameter-efficient fine-tuning (PEFT) methods such as LoRA and full fine-tuning (FT), and proposes Weight-Decomposed Low-Rank Adaptation (DoRA) to close it
-
DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, employing LoRA for the directional updates; this enhances both learning capacity and training stability while avoiding any additional inference overhead
-
DoRA consistently outperforms LoRA when fine-tuning LLaMA, LLaVA, and VL-BART on downstream tasks, including commonsense reasoning, visual instruction tuning, and image/video-text understanding.
-
-
How to Train Data-Efficient LLMs, Noveen Sachdeva,Benjamin Coleman,Wang-Cheng Kang,Jianmo Ni,Lichan Hong,Ed H. Chi,James Caverlee,Julian McAuley,Derek Zhiyuan Cheng, 15-02-2024
Categories
Machine Learning, Artificial Intelligence, Computation and Language
Abstract
The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training -- even when we reject 90% of the original dataset, while converging up to 70% faster.
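The sketch below illustrates the Ask-LLM half of the recipe described above: ask an instruction-tuned model whether a candidate example is useful and keep the highest-scoring fraction. The prompt wording, the assumption that llm() can return a yes-probability, and the keep fraction are illustrative, not the paper's released setup.

```python
ASK_PROMPT = (
    "Below is a candidate pretraining example.\n###\n{example}\n###\n"
    "Would this example help a language model learn useful knowledge and "
    "fluent language? Answer 'yes' or 'no'."
)

def ask_llm_score(llm, example: str) -> float:
    """Use the model's probability of answering 'yes' as the quality score
    (assumes llm() exposes that probability)."""
    return llm(ASK_PROMPT.format(example=example[:2000]), yes_probability=True)

def select_by_quality(llm, corpus: list[str], keep_fraction: float = 0.1) -> list[str]:
    """Keep only the top-scoring fraction of the corpus."""
    scored = sorted(corpus, key=lambda ex: ask_llm_score(llm, ex), reverse=True)
    return scored[: max(1, int(len(scored) * keep_fraction))]
```

Density sampling, the coverage-oriented counterpart, would instead fit a density model over example embeddings and sample to maximize diversity; it is omitted here for brevity.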
Bullet Points
-
The paper studies data-efficient approaches for pre-training LLMs to optimize the Pareto frontier of model quality and training resource/data consumption
-
The paper aims to understand tradeoffs associated with data selection routines based on expensive-to-compute data-quality estimates and maximization of coverage and diversity-based measures in the feature space
-
The first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example
-
To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample
-
Models trained on Ask-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, while converging up to 70% faster.
-
-
Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models, Yijia Shao,Yucheng Jiang,Theodore A. Kanell,Peter Xu,Omar Khattab,Monica S. Lam, 22-02-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
For evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage. We further gather feedback from experienced Wikipedia editors. Compared to articles generated by an outline-driven retrieval-augmented baseline, more of STORM's articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%). The expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.
Bullet Points
-
To evaluate the pre-writing stage of STORM's articles, we curate FreshWiki, formulate outline assessments, and gather feedback from experienced editors
-
Compared to an outline-driven retrieval-augmented baseline, more of STORM's articles are deemed organized (a 25% absolute increase) and broad in coverage (a 10% increase)
-
Expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.
-
-
INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models, Hanseok Oh,Hyunji Lee,Seonghyeon Ye,Haebin Shin,Hansol Jang,Changwook Jun,Minjoon Seo, 22-02-2024
Categories
Computation and Language
Abstract
Despite the critical need to align search targets with users' intention, retrievers often only prioritize query information without delving into the users' intended search context. Enhancing the capability of retrievers to understand intentions and preferences of users, akin to language model instructions, has the potential to yield more aligned search targets. Prior studies restrict the application of instructions in information retrieval to a task description format, neglecting the broader context of diverse and evolving search scenarios. Furthermore, the prevailing benchmarks utilized for evaluation lack explicit tailoring to assess instruction-following ability, thereby hindering progress in this field. In response to these limitations, we propose a novel benchmark,INSTRUCTIR, specifically designed to evaluate instruction-following ability in information retrieval tasks. Our approach focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics inherent in real-world search scenarios. Through experimental analysis, we observe that retrievers fine-tuned to follow task-style instructions, such as INSTRUCTOR, can underperform compared to their non-instruction-tuned counterparts. This underscores potential overfitting issues inherent in constructing retrievers trained on existing instruction-aware retrieval datasets.
Bullet Points
-
Retrievers often prioritize query information without delving into users' intended search context, and existing benchmarks are not tailored to assess instruction-following ability, which hinders progress in information retrieval
-
A new benchmark, INSTRUCTIR, focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics inherent in real-world search scenarios
-
Experiments show that retrievers fine-tuned to follow task-style instructions, such as INSTRUCTOR, can underperform their non-instruction-tuned counterparts, highlighting potential overfitting issues in retrievers trained on existing instruction-aware retrieval datasets.
-
-
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, Omkar Thawakar,Ashmal Vayani,Salman Khan,Hisham Cholakal,Rao M. Anwer,Michael Felsberg,Tim Baldwin,Eric P. Xing,Fahad Shahbaz Khan, 26-02-2024
Categories
Computation and Language
Abstract
-
AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System, Zhiwei Liu,Weiran Yao,Jianguo Zhang,Liangwei Yang,Zuxin Liu,Juntao Tan,Prafulla K. Choubey,Tian Lan,Jason Wu,Huan Wang,Shelby Heinecke,Caiming Xiong,Silvio Savarese, 23-02-2024
Categories
Multiagent Systems, Artificial Intelligence
-
A Survey on Data Selection for Language Models, Alon Albalak,Yanai Elazar,Sang Michael Xie,Shayne Longpre,Nathan Lambert,Xinyi Wang,Niklas Muennighoff,Bairu Hou,Liangming Pan,Haewon Jeong,Colin Raffel,Shiyu Chang,Tatsunori Hashimoto,William Yang Wang, 26-02-2024
Categories
Computation and Language, Machine Learning
Abstract
To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.
Bullet Points
-
The paper presents a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches
-
The aim is to accelerate progress in data selection by establishing an entry point for new and established researchers
-
The review highlights noticeable holes in the literature and proposes promising research avenues.
-
-
ChatGPT in Veterinary Medicine: A Practical Guidance of Generative Artificial Intelligence in Clinics, Education, and Research, Candice P. Chu, 26-02-2024
Categories
Computers and Society
Abstract
ChatGPT, the most accessible generative artificial intelligence (AI) tool, offers considerable potential for veterinary medicine, yet a dedicated review of its specific applications is lacking. This review concisely synthesizes the latest research and practical applications of ChatGPT within the clinical, educational, and research domains of veterinary medicine. It intends to provide specific guidance and actionable examples of how generative AI can be directly utilized by veterinary professionals without a programming background. For practitioners, ChatGPT can extract patient data, generate progress notes, and potentially assist in diagnosing complex cases. Veterinary educators can create custom GPTs for student support, while students can utilize ChatGPT for exam preparation. ChatGPT can aid in academic writing tasks in research, but veterinary publishers have set specific requirements for authors to follow. Despite its transformative potential, careful use is essential to avoid pitfalls like hallucination. This review addresses ethical considerations, provides learning resources, and offers tangible examples to guide responsible implementation. Carefully selected, up-to-date links to platforms that host large language models are provided for advanced readers with programming capability. A table of key takeaways was provided to summarize this review. By highlighting potential benefits and limitations, this review equips veterinarians, educators, and researchers to harness the power of ChatGPT effectively.
Bullet Points
-
ChatGPT offers considerable potential for veterinary medicine, yet a dedicated review of its specific applications has been lacking; this review synthesizes the latest research and practical applications across the clinical, educational, and research domains
-
It provides guidance and actionable examples of how generative AI can be directly utilized by veterinary professionals without a programming background
-
The review also addresses ethical considerations, provides learning resources, and offers tangible examples to guide responsible implementation.
-
-
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, Shuming Ma,Hongyu Wang,Lingxiao Ma,Lei Wang,Wenhui Wang,Shaohan Huang,Li Dong,Ruiping Wang,Jilong Xue,Furu Wei, 27-02-2024
Categories
Computation and Language, Machine Learning
Abstract
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
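The sketch below shows the core ternary quantization behind the 1.58-bit idea at a high level: scale each weight matrix by its mean absolute value and round every entry to {-1, 0, +1}. The surrounding training recipe (straight-through estimators, activation quantization, and the BitLinear layer) is omitted, so this is an illustration rather than the paper's full method.

```python
import torch

def ternarize(weight: torch.Tensor, eps: float = 1e-5) -> tuple[torch.Tensor, float]:
    """Absmean scaling followed by round-and-clip to ternary values."""
    scale = weight.abs().mean().clamp(min=eps)     # absmean scaling factor
    q = (weight / scale).round().clamp(-1, 1)      # ternary weights in {-1, 0, 1}
    return q, scale.item()                         # W is approximated by scale * q

w = torch.randn(4, 4)
q, s = ternarize(w)
print(q, s)
```

With weights restricted to three values, matrix multiplication reduces to additions and subtractions plus one rescale, which is where the latency, memory, and energy savings come from.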
Bullet Points
-
The work introduces a 1-bit LLM variant, BitNet b1.58, in which every weight is ternary {-1, 0, 1}; it matches the full-precision Transformer LLM with the same model size and training tokens while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption
-
This variant defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective, enabling a new computation paradigm and opening the door to designing hardware optimized for 1-bit LLMs.
-
-
Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation, Nihal V. Nayak,Yiyang Nan,Avi Trost,Stephen H. Bach, 28-02-2024
Categories
Computation and Language, Machine Learning
Abstract
-
CLLMs: Consistency Large Language Models, Siqi Kou,Lanxiang Hu,Zhezhi He,Zhijie Deng,Hao Zhang, 28-02-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference as it breaks the sequential nature of the LLM decoding process and transforms it into parallelizable computation. However, in practice, it achieves little speedup compared to traditional autoregressive (AR) decoding, primarily because Jacobi decoding seldom accurately predicts more than one token in a single fixed-point iteration step. To address this, we develop a new approach aimed at realizing fast convergence from any state to the fixed point on a Jacobi trajectory. This is accomplished by refining the target LLM to consistently predict the fixed point given any state as input. Extensive experiments demonstrate the effectiveness of our method, showing 2.4$\times$ to 3.4$\times$ improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks.
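A simplified sketch of Jacobi decoding, the fixed-point iteration the paper builds on, is shown below: guess an n-token block, repeatedly let the model re-predict every position, and stop when the block no longer changes. The greedy_next() helper hides the model forward pass and is an assumption; a real implementation would obtain all n predictions from a single causal forward pass over the prefix plus the current block, and the paper's contribution is fine-tuning the model so this iteration converges in far fewer steps.

```python
def jacobi_decode(greedy_next, prefix: list[int], n: int, pad_id: int = 0,
                  max_iters: int = 50) -> list[int]:
    block = [pad_id] * n                              # arbitrary initial guess
    for _ in range(max_iters):
        # Conceptually one parallel forward pass: position i is predicted
        # from prefix + block[:i] for every i at once.
        new_block = [greedy_next(prefix + block[:i]) for i in range(n)]
        if new_block == block:                        # fixed point reached
            break
        block = new_block
    return prefix + block
```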
Bullet Points
-
Parallel decoding methods like Jacobi decoding can improve LLM inference by breaking the sequential nature of the process and transforming it into parallelizable computation
-
However, in practice, it achieves little speedup compared to traditional autoregressive decoding
-
To address this, we develop a new approach that involves refining the target LLM to consistently predict the fixed point given any state as input
-
Extensive experiments demonstrate the effectiveness of this method, resulting in 2.4x to 3.4x improvements in generation speed while preserving generation quality across domain-specific and open-domain benchmarks.
-
-
ResLoRA: Identity Residual Mapping in Low-Rank Adaption, Shuhua Shi,Shaohan Huang,Minghui Song,Zhoujun Li,Zihan Zhang,Haizhen Huang,Furu Wei,Weiwei Deng,Feng Sun,Qi Zhang, 28-02-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
-
Datasets for Large Language Models: A Comprehensive Survey, Yang Liu,Jiahuan Cao,Chongyu Liu,Kai Ding,Lianwen Jin, 28-02-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
-
Retrieval-Augmented Generation for AI-Generated Content: A Survey, Penghao Zhao,Hailin Zhang,Qinhan Yu,Zhengren Wang,Yunteng Geng,Fangcheng Fu,Ling Yang,Wentao Zhang,Bin Cui, 29-02-2024
Categories
Computer Vision
Abstract
-
Random Silicon Sampling: Simulating Human Sub-Population Opinion Using a Large Language Model Based on Group-Level Demographic Information, Seungjong Sun,Eungu Lee,Dongyan Nan,Xiangying Zhao,Wonbyung Lee,Bernard J. Jansen,Jang Hyun Kim, 28-02-2024
Categories
Artificial Intelligence, Computers and Society, Natural Language Processing
Abstract
Large language models exhibit societal biases associated with demographic information, including race, gender, and others. Endowing such language models with personalities based on demographic data can enable generating opinions that align with those of humans. Building on this idea, we propose "random silicon sampling," a method to emulate the opinions of the human population sub-group. Our study analyzed 1) a language model that generates the survey responses that correspond with a human group based solely on its demographic distribution and 2) the applicability of our methodology across various demographic subgroups and thematic questions. Through random silicon sampling and using only group-level demographic information, we discovered that language models can generate response distributions that are remarkably similar to the actual U.S. public opinion polls. Moreover, we found that the replicability of language models varies depending on the demographic group and topic of the question, and this can be attributed to inherent societal biases in the models. Our findings demonstrate the feasibility of mirroring a group's opinion using only demographic distribution and elucidate the effect of social biases in language models on such simulations.
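The sketch below illustrates random silicon sampling as described above: draw a synthetic respondent from group-level demographic marginals, condition the model on that persona, and tally its survey answers. The demographic fields, the marginal distribution, the persona prompt, and the llm() helper are illustrative assumptions.

```python
import random
from collections import Counter

DEMOGRAPHICS = {                 # hypothetical group-level marginals
    "gender": {"woman": 0.51, "man": 0.49},
    "age": {"18-29": 0.20, "30-49": 0.33, "50-64": 0.25, "65+": 0.22},
}

def sample_persona() -> dict:
    """Sample one synthetic respondent from the group-level distribution."""
    return {field: random.choices(list(dist), weights=dist.values())[0]
            for field, dist in DEMOGRAPHICS.items()}

def silicon_poll(llm, question: str, n: int = 500) -> Counter:
    """Collect n model responses and return their distribution for comparison
    against the real poll."""
    answers = Counter()
    for _ in range(n):
        p = sample_persona()
        prompt = (f"You are a {p['age']} year old {p['gender']} in the U.S.\n"
                  f"Survey question: {question}\nAnswer with one option only.")
        answers[llm(prompt).strip()] += 1
    return answers
```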
Bullet Points
-
Random silicon sampling is a method to emulate the opinions of the human population sub-group by using only group-level demographic information to generate survey responses that correspond with a human group based solely on its demographic distribution
-
Our study found that language models can generate response distributions that are remarkably similar to actual U.S. public opinion polls, but their replicability varies depending on the demographic group and the topic of the question
-
Our findings demonstrate the feasibility of mirroring a group's opinion using only demographic distribution and elucidate the effect of social biases in language models on such simulations.
-
-
PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval, He Zhu,Wenjia Zhang,Nuoxian Huang,Boyang Li,Luyao Niu,Zipei Fan,Tianle Lun,Yicheng Tao,Junyou Su,Zhaoya Gong,Chenyu Fang,Xing Liu, 29-02-2024
Categories
Computation and Language
Abstract
In the field of urban planning, general-purpose large language models often struggle to meet the specific needs of planners. Tasks like generating urban planning texts, retrieving related information, and evaluating planning documents pose unique challenges. To enhance the efficiency of urban professionals and overcome these obstacles, we introduce PlanGPT, the first specialized Large Language Model tailored for urban and spatial planning. Developed through collaborative efforts with institutions like the Chinese Academy of Urban Planning, PlanGPT leverages a customized local database retrieval framework, domain-specific fine-tuning of base models, and advanced tooling capabilities. Empirical tests demonstrate that PlanGPT has achieved advanced performance, delivering responses of superior quality precisely tailored to the intricacies of urban planning.
Bullet Points
-
PlanGPT is the first specialized Large Language Model tailored for urban and spatial planning, developed through collaboration with institutions like the Chinese Academy of Urban Planning
-
It leverages a customized local database retrieval framework, domain-specific fine-tuning of base models, and advanced tooling capabilities, and has achieved advanced performance, delivering responses of superior quality precisely tailored to urban planning intricacies.
-
-
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, Xin Men,Mingyu Xu,Qingyu Zhang,Bingning Wang,Hongyu Lin,Yaojie Lu,Xianpei Han,Weipeng Chen, 06-03-2024
Categories
Computation and Language
Abstract
As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.
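The sketch below illustrates the Block Influence idea described above: a layer whose output stays very close (in cosine similarity) to its input contributes little, so it gets a low BI score and becomes a candidate for removal. Collecting the per-layer hidden states and actually deleting layers from the model are simplified away here.

```python
import torch

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """BI = 1 - mean cosine similarity between a layer's input and output
    hidden states, averaged over tokens."""
    cos = torch.nn.functional.cosine_similarity(hidden_in, hidden_out, dim=-1)
    return float(1.0 - cos.mean())

def layers_to_prune(per_layer_states: list[tuple[torch.Tensor, torch.Tensor]],
                    n_remove: int) -> list[int]:
    """Return the indices of the n_remove layers with the lowest BI scores."""
    scores = [block_influence(x, y) for x, y in per_layer_states]
    return sorted(range(len(scores)), key=lambda i: scores[i])[:n_remove]
```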
Bullet Points
-
The study found that many layers of LLMs exhibit high similarity and some play a negligible role in network functionality
-
We define a metric called Block Influence (BI) to gauge the significance of each layer, and propose a simple pruning approach called ShortGPT
-
This method significantly outperforms previous state-of-the-art (SOTA) methods in model pruning and is orthogonal to quantization-like methods, enabling further reduction in parameters and computation
-
The ability to achieve better results through simple layer removal suggests high redundancy in the model architecture.
-
-
Apollo: An Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People, Xidong Wang,Nuo Chen,Junyin Chen,Yan Hu,Yidong Wang,Xiangbo Wu,Anningzhe Gao,Xiang Wan,Haizhou Li,Benyou Wang, 06-03-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. Especially, Apollo-7B is the state-of-the-art multilingual medical LLMs up to 70B. Additionally, these lite models could be used to improve the multi-lingual medical capabilities of larger models without fine-tuning in a proxy-tuning fashion. We will open-source training corpora, code, model weights and evaluation benchmark.
Bullet Points
-
To extend medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion
-
This includes the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark
-
Released Apollo models at various relatively-small sizes achieve the best performance among models of equivalent size
-
These lite models could be used to improve multi-lingual medical capabilities of larger models without fine-tuning
-
The open-source training corpora, code, model weights, and evaluation benchmark will be created.
-
-
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection, Jiawei Zhao,Zhenyu Zhang,Beidi Chen,Zhangyang Wang,Anima Anandkumar,Yuandong Tian, 06-03-2024
Categories
Machine Learning
Abstract
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
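A simplified sketch of the GaLore update follows: project each 2-D weight's gradient onto a low-rank subspace obtained from its SVD (refreshed periodically), keep the optimizer state in that small space, and project the update back to full rank. A momentum-SGD variant is shown for brevity; the paper pairs the projection with Adam, and the rank, refresh schedule, and handling of state across subspace refreshes are simplifying assumptions here.

```python
import torch

class GaLoreMomentumSGD:
    """Optimizer state (momentum) lives in a low-rank space spanned by the
    top singular vectors of the gradient."""
    def __init__(self, rank: int = 4, lr: float = 1e-3, beta: float = 0.9,
                 refresh_every: int = 200):
        self.rank, self.lr, self.beta = rank, lr, beta
        self.refresh_every = refresh_every
        self.step_count, self.P, self.M = 0, {}, {}

    @torch.no_grad()
    def step(self, params: list[torch.Tensor]):
        self.step_count += 1
        for i, p in enumerate(params):
            g = p.grad
            if g is None or g.ndim != 2:
                continue                                   # targets 2-D weight matrices
            if i not in self.P or self.step_count % self.refresh_every == 0:
                # Projection from the top-r left singular vectors of the gradient.
                U, _, _ = torch.linalg.svd(g, full_matrices=False)
                self.P[i] = U[:, : self.rank]              # (m, r)
            low_rank_grad = self.P[i].T @ g                # (r, n): state stays this small
            m = self.M.get(i, torch.zeros_like(low_rank_grad))
            self.M[i] = self.beta * m + low_rank_grad
            p -= self.lr * (self.P[i] @ self.M[i])         # project back, update full weight
```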
Bullet Points
-
Gradient Low-Rank Projection (GaLore) is a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA
-
It reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks
-
The 8-bit GaLore variant further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline
-
We demonstrate the feasibility of pretraining a 7B model on consumer GPUs with 24GB memory without model parallel, checkpointing, or offloading strategies.
-
-
Can Large Language Models Reason and Plan?, Subbarao Kambhampati, 07-03-2024
Categories
Artificial Intelligence, Computation and Language, Machine Learning
Abstract
While humans sometimes do show the capability of correcting their own erroneous guesses with self-critiquing, there seems to be no basis for that assumption in the case of LLMs.
Bullet Points
-
Humans sometimes show the capability of correcting their own erroneous guesses through self-critiquing
-
There seems to be no basis for assuming the same self-correction capability in the case of LLMs.
-
-
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference, Wei-Lin Chiang,Lianmin Zheng,Ying Sheng,Anastasios Nikolas Angelopoulos,Tianle Li,Dacheng Li,Hao Zhang,Banghua Zhu,Michael Jordan,Joseph E. Gonzalez,Ion Stoica, 07-03-2024
Categories
Artificial Intelligence, Computation and Language
Abstract
-
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation, Zihao Wang,Anji Liu,Haowei Lin,Jiaqi Li,Xiaojian Ma,Yitao Liang, 08-03-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
-
LLM4Decompile: Decompiling Binary Code with Large Language Models, Hanzhuo Tan,Qi Luo,Jing Li,Yuqun Zhang, 08-03-2024
Categories
Programming Languages, Computation and Language
Abstract
-
AutoDev: Automated AI-Driven Development, Michele Tufano,Anisha Agarwal,Jinu Jang,Roshanak Zilouchian Moghaddam,Neel Sundaresan, 13-03-2024
Categories
Software Engineering, Artificial Intelligence
Abstract
The landscape of software development has witnessed a paradigm shift with the advent of AI-powered assistants, exemplified by GitHub Copilot. However, existing solutions are not leveraging all the potential capabilities available in an IDE such as building, testing, executing code, git operations, etc. Therefore, they are constrained by their limited capabilities, primarily focusing on suggesting code snippets and file manipulation within a chat-based interface. To fill this gap, we present AutoDev, a fully automated AI-driven software development framework, designed for autonomous planning and execution of intricate software engineering tasks. AutoDev enables users to define complex software engineering objectives, which are assigned to AutoDev's autonomous AI Agents to achieve. These AI agents can perform diverse operations on a codebase, including file editing, retrieval, build processes, execution, testing, and git operations. They also have access to files, compiler output, build and testing logs, static analysis tools, and more. This enables the AI Agents to execute tasks in a fully automated manner with a comprehensive understanding of the contextual information required. Furthermore, AutoDev establishes a secure development environment by confining all operations within Docker containers. This framework incorporates guardrails to ensure user privacy and file security, allowing users to define specific permitted or restricted commands and operations within AutoDev. In our evaluation, we tested AutoDev on the HumanEval dataset, obtaining promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment.
Bullet Points
-
AutoDev is a fully automated AI-driven software development framework designed for autonomous planning and execution of intricate software engineering tasks
-
It enables users to define complex software engineering objectives, which autonomous AI agents achieve by performing diverse operations on a codebase with access to files, compiler output, build and testing logs, static analysis tools, and more
-
It establishes a secure development environment by confining all operations within Docker containers, incorporating guardrails to ensure user privacy and file security
-
The framework achieved promising results with 91.5% and 87.8% of Pass@1 for code generation and test generation respectively.
-
-
HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback, Ang Li,Qiugen Xiao,Peng Cao,Jian Tang,Yi Yuan,Zijie Zhao,Xiaoyuan Chen,Liang Zhang,Xiangyang Li,Kaitong Yang,Weidong Guo,Yukang Gan,Xu Yu,Daniell Wang,Ying Shan, 13-03-2024
Categories
Machine Learning, Artificial Intelligence
Abstract
Reinforcement Learning from AI Feedback (RLAIF) has the advantages of shorter annotation cycles and lower costs over Reinforcement Learning from Human Feedback (RLHF), making it highly efficient during the rapid strategy iteration periods of large language model (LLM) training. Using ChatGPT as a labeler to provide feedback on open-domain prompts in RLAIF training, we observe an increase in human evaluators' preference win ratio for model responses, but a decrease in evaluators' satisfaction rate. Analysis suggests that the decrease in satisfaction rate is mainly due to some responses becoming less helpful, particularly in terms of correctness and truthfulness, highlighting practical limitations of basic RLAIF. In this paper, we propose Hybrid Reinforcement Learning from AI Feedback (HRLAIF). This method enhances the accuracy of AI annotations for responses, making the model's helpfulness more robust in training process. Additionally, it employs AI for Red Teaming, further improving the model's harmlessness. Human evaluation results show that HRLAIF inherits the ability of RLAIF to enhance human preference for outcomes at a low cost while also improving the satisfaction rate of responses. Compared to the policy model before Reinforcement Learning (RL), it achieves an increase of 2.08% in satisfaction rate, effectively addressing the issue of a decrease of 4.58% in satisfaction rate after basic RLAIF.
Bullet Points
-
Reinforcement Learning from AI Feedback (RLAIF) has advantages over RLHF such as shorter annotation cycles and lower costs, making it highly efficient during rapid strategy iteration periods of LLM training
-
Human evaluators' preference win ratio for model responses increases, but a decrease in satisfaction rate is observed due to some responses becoming less helpful, highlighting practical limitations of basic RLAIF
-
In this paper, we propose Hybrid Reinforcement Learning from AI Feedback (HRLAIF), which enhances the accuracy of AI annotations for responses, making the model's helpfulness more robust during training, and employs AI for Red Teaming to further improve harmlessness
-
HRLAIF achieves a 2.08% increase in satisfaction rate over the pre-RL policy model, effectively addressing the 4.58% decrease observed after basic RLAIF.
-
-
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking, Eric Zelikman,Georges Harik,Yijia Shao,Varuna Jayasiri,Nick Haber,Noah D. Goodman, 14-03-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting -- ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9%$\rightarrow$10.9%) and CommonsenseQA (36.3%$\rightarrow$47.2%) and observe a perplexity improvement of difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way.
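A highly simplified, inference-time-only sketch of the idea above is given below: before predicting the next token, let the model generate a short rationale wrapped in special start/end-of-thought tokens, then mix the next-token logits obtained with and without the thought. The generate() and logits() helpers, the special token strings, and the fixed mixing weight are assumptions; the paper additionally trains the thoughts with a tokenwise parallel sampling algorithm, an extended teacher-forcing objective, and a learned mixing head.

```python
import torch

START, END = "<|startofthought|>", "<|endofthought|>"

def predict_with_thought(generate, logits, context: str, mix: float = 0.5) -> torch.Tensor:
    """Return next-token logits that interpolate between thinking and not thinking."""
    thought = generate(context + START, stop=END, max_new_tokens=12)
    with_thought = logits(context + START + thought + END)   # logits after the rationale
    without = logits(context)                                # ordinary next-token logits
    return mix * with_thought + (1 - mix) * without
```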
Bullet Points
-
Quiet-STaR is a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions
-
We propose a tokenwise parallel sampling algorithm using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique to address the computational cost of generating continuations and the need to predict beyond individual next tokens
-
The generated rationales disproportionately help on difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions, yielding zero-shot improvements on GSM8K (5.9% to 10.9%) and CommonsenseQA (36.3% to 47.2%) without any fine-tuning on these tasks.
-
-
RAFT: Adapting Language Model to Domain Specific RAG, Tianjun Zhang,Shishir G. Patil,Naman Jain,Sheng Shen,Matei Zaharia,Ion Stoica,Joseph E. Gonzalez, 15-03-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
-
TnT-LLM: Text Mining at Scale with Large Language Models, Mengting Wan,Tara Safavi,Sujay Kumar Jauhar,Yujin Kim,Scott Counts,Jennifer Neville,Siddharth Suri,Chirag Shah,Ryen W White,Longqi Yang,Reid Andersen,Georg Buscher,Dhruv Joshi,Nagu Rangan, 18-03-2024
Categories
Computation and Language, Artificial Intelligence, Information Retrieval
Abstract
Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
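A compact sketch of the two-phase pipeline described above follows: an LLM iteratively proposes and refines a label taxonomy, then pseudo-labels a sample so that a lightweight classifier can be trained for scale. The prompts, llm() helper, and the specific lightweight classifier (TF-IDF plus logistic regression) are illustrative assumptions, not the paper's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_taxonomy(llm, samples: list[str], rounds: int = 3) -> list[str]:
    """Phase 1: let the LLM propose a label taxonomy and refine it over batches."""
    taxonomy = "(empty)"
    batch = len(samples) // rounds or 1
    for r in range(rounds):
        chunk = "\n".join(samples[r * batch:(r + 1) * batch])
        taxonomy = llm(f"Current taxonomy:\n{taxonomy}\n\nNew conversations:\n{chunk}\n"
                       "Update the taxonomy of user intents (one label per line).")
    return [line.strip() for line in taxonomy.splitlines() if line.strip()]

def train_lightweight_classifier(llm, taxonomy: list[str], corpus: list[str]):
    """Phase 2: LLM pseudo-labels a sample, then a cheap classifier serves at scale."""
    labels = [llm(f"Labels: {', '.join(taxonomy)}\nText: {t}\nPick exactly one label.")
              for t in corpus]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(corpus, labels)
    return clf
```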
Bullet Points
-
The paper proposes TnT-LLM, a two-phase framework that automates the process of end-to-end label generation and assignment with minimal human effort for any given use-case
-
It employs LLMs as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale
-
Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies than state-of-the-art baselines and achieves a favorable balance between accuracy and efficiency for classification at scale
-
The authors also share practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
-
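A minimal sketch of the two-phase pattern summarized above, assuming a trivial llm() stub, illustrative prompts, and scikit-learn for the lightweight classifier; it is not the paper's pipeline, only the shape of it.
```python
# Two-phase sketch in the spirit of TnT-LLM: (1) an LLM drafts a label taxonomy
# from sample texts, (2) the LLM pseudo-labels training data so a lightweight
# classifier can be fit and served at scale. Prompts, the llm() stub, and the
# classifier choice are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns one label name here."""
    return "billing" if "invoice" in prompt else "tech_support"

texts = ["my invoice is wrong", "the app crashes on login",
         "refund my invoice please", "login button does nothing"]

# Phase 1: induce a small taxonomy (here, simply collect the LLM's label guesses).
taxonomy = sorted({llm(f"Assign a short intent label to: {t}") for t in texts})

# Phase 2: pseudo-label with the LLM, then train a cheap supervised classifier.
pseudo_labels = [llm(f"Assign one of {taxonomy} to: {t}") for t in texts]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, pseudo_labels)

print(taxonomy, clf.predict(["cannot sign in to the app"]))
```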
-
Evolutionary Optimization of Model Merging Recipes, Takuya Akiba,Makoto Shing,Yujin Tang,Qi Sun,David Ha, 19-03-2024
Categories
Neural and Evolutionary Computing
Abstract
We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.
Bullet Points
-
Evolutionary algorithms are used to automate the creation of powerful foundation models by discovering effective combinations of diverse open-source models without requiring extensive additional training data or compute
-
This approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of individual models
-
Our Japanese Math LLM achieved state-of-the-art performance on established benchmarks, surpassing models with significantly more parameters despite not being explicitly trained for such tasks
-
The approach also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.
-
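As a toy illustration of evolutionary search over merge recipes in parameter space, the sketch below evolves a single interpolation coefficient between two random weight vectors; the fitness function stands in for real benchmark evaluation, and every detail here is an assumption for illustration.
```python
# Minimal sketch of evolutionary merging in parameter space: search over a mixing
# weight that combines two "models" (here just weight vectors) to maximize a
# fitness score. The toy fitness function and random vectors are stand-ins for
# real model weights and benchmark evaluation.
import random

random.seed(0)
model_a = [random.gauss(0, 1) for _ in range(8)]
model_b = [random.gauss(0, 1) for _ in range(8)]
target  = [0.5 * a + 0.5 * b for a, b in zip(model_a, model_b)]  # pretend optimum

def merge(alpha):   # linear interpolation of parameters
    return [alpha * a + (1 - alpha) * b for a, b in zip(model_a, model_b)]

def fitness(alpha): # higher is better; a real system would run benchmarks here
    return -sum((m - t) ** 2 for m, t in zip(merge(alpha), target))

# Simple (mu + lambda) evolution over the scalar merge coefficient.
population = [random.random() for _ in range(8)]
for _ in range(20):
    offspring = [min(1.0, max(0.0, p + random.gauss(0, 0.1))) for p in population]
    population = sorted(population + offspring, key=fitness, reverse=True)[:8]

print("best merge coefficient:", round(population[0], 3))
```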
-
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression, Zhuoshi Pan,Qianhui Wu,Huiqiang Jiang,Menglin Xia,Xufang Luo,Jue Zhang,Qingwei Lin,Victor Rühle,Yuqing Yang,Chin-Yew Lin,H. Vicky Zhao,Lili Qiu,Dongmei Zhang, 19-03-2024
Categories
Computation and Language, Machine Learning
-
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models, Yaowei Zheng,Richong Zhang,Junhao Zhang,Yanhan Ye,Zheyan Luo,Yongqiang Ma, 20-03-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
and already received over 13,000 stars and 1,600 forks.
Bullet Points
-
LlamaFactory is a framework for unified, efficient fine-tuning of 100+ language models
-
The project's repository has already received over 13,000 stars and 1,600 forks.
-
-
Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity, Soyeong Jeong,Jinheon Baek,Sukmin Cho,Sung Ju Hwang,Jong C. Park, 21-03-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
-
AIOS: LLM Agent Operating System, Kai Mei,Zelong Li,Shuyuan Xu,Ruosong Ye,Yingqiang Ge,Yongfeng Zhang, 25-03-2024
Categories
Operating Systems, Artificial Intelligence, Computation and Language
Abstract
-
BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text, Elliot Bolton,Abhinav Venigalla,Michihiro Yasunaga,David Hall,Betty Xiong,Tony Lee,Roxana Daneshjou,Jonathan Frankle,Percy Liang,Michael Carbin,Christopher D. Manning, 27-03-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
-
Octopus v2: On-device language model for super agent, Wei Chen,Zhiyuan Li, 02-04-2024
Categories
Computation and Language
Abstract
Language models have shown effectiveness in a variety of software applications, particularly in tasks related to automatic workflow. These models possess the crucial ability to call functions, which is essential in creating AI agents. Despite the high performance of large-scale language models in cloud environments, they are often associated with concerns over privacy and cost. Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95%. When compared to Llama-7B with a RAG-based function calling mechanism, our method enhances latency by 35-fold. This method reduces the latency to levels deemed suitable for deployment across a variety of edge devices in production environments, aligning with the performance requisites for real-world applications.
Bullet Points
-
Language models have shown effectiveness in automating workflows, but they are often associated with privacy and cost concerns
-
On-device models for function calling currently face latency and accuracy issues; the proposed method enables an on-device model with 2 billion parameters to surpass GPT-4 in both accuracy and latency while decreasing the context length by 95%
-
Compared to Llama-7B with a RAG-based function calling mechanism, the method improves latency 35-fold, reaching levels suitable for deployment across a variety of edge devices in production environments and meeting the performance requirements of real-world applications.
-
-
Social Skill Training with Large Language Models, Diyi Yang,Caleb Ziems,William Held,Omar Shaikh,Michael S. Bernstein,John Mitchell, 05-04-2024
Categories
Computation and Language, Human-Computer Interaction
Abstract
People rely on social skills like conflict resolution to communicate effectively and to thrive in both work and personal life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Drawing upon interdisciplinary research from communication and psychology, this perspective paper identifies social skill barriers to enter specialized fields. Then we present a solution that leverages large language models for social skill training via a generic framework. Our AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback. This work ultimately calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality.
Bullet Points
-
To make social skill training more available, accessible, and inviting, the paper draws on interdisciplinary research from communication and psychology to identify social skill barriers to entering specialized fields, and presents a solution that leverages large language models for social skill training via a generic framework
-
The AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback
-
This work calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality.
-
-
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers, Libo Qin,Qiguang Chen,Yuhang Zhou,Zhi Chen,Yinghui Li,Lizi Liao,Min Li,Wanxiang Che,Philip S. Yu, 07-04-2024
Categories
Computation and Language
Abstract
Multilingual Large Language Models are capable of using powerful Large Language Models to handle and respond to queries in multiple languages, which achieves remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, there still remains a lack of a comprehensive survey to summarize existing approaches and recent developments in this field. To this end, in this paper, we present a thorough review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature. The contributions of this paper can be summarized: (1) First survey: to our knowledge, we take the first step and present a thorough review in MLLMs research field according to multi-lingual alignment; (2) New taxonomy: we offer a new and unified perspective to summarize the current progress of MLLMs; (3) New frontiers: we highlight several emerging frontiers and discuss the corresponding challenges; (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
Bullet Points
-
The paper presents a comprehensive review and unified perspective to summarize the recent progress and emerging trends in multilingual large language models (MLLMs) literature, including a first survey, new taxonomy, new frontiers, and abundant open-source resources
-
The paper hopes to provide quick access and spur breakthrough research in these models.
-
-
From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications, Yongqiang Ma,Lizhi Qing,Jiawei Liu,Yangyang Kang,Yue Zhang,Wei Lu,Xiaozhong Liu,Qikai Cheng, 10-04-2024
Categories
Computation and Language, Information Retrieval
Abstract
Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications. Conventional evaluation methods, typically designed primarily for LLM development, yield numerical scores that ignore the user experience. Therefore, our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications. Our proposed metric, termed "Revision Distance," utilizes LLMs to suggest revision edits that mimic the human writing process. It is determined by counting the revision edits generated by LLMs. Benefiting from the generated revision edit details, our metric can provide a self-explained text evaluation result in a human-understandable manner beyond the context-independent score. Our results show that for the easy-writing task, "Revision Distance" is consistent with established metrics (ROUGE, Bert-score, and GPT-score), but offers more insightful, detailed feedback and better distinguishes between texts. Moreover, in the context of challenging academic writing tasks, our metric still delivers reliable evaluations where other metrics tend to struggle. Furthermore, our metric also holds significant potential for scenarios lacking reference texts.
Bullet Points
-
The study shifts the focus from model-centered to human-centered evaluation in AI-powered writing assistance applications
-
The proposed metric, "Revision Distance," uses LLMs to suggest revision edits that mimic the human writing process
-
The metric can provide a self-explained text evaluation result in a human-understandable manner beyond the context-independent score
-
For easy-writing tasks, the metric is consistent with established metrics, but offers more insightful feedback and better distinguishes between texts
-
In challenging academic writing tasks, our metric still delivers reliable evaluations and holds significant potential for scenarios lacking reference texts.
-
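A rough sketch of the counting idea behind such a metric, assuming a revise() stub in place of a real LLM call; counting non-equal difflib opcodes is an illustrative choice, not the paper's exact definition.
```python
# Sketch of a "Revision Distance"-style metric: ask an LLM to revise a candidate
# text, then count the edit operations between the candidate and its revision.
import difflib

def revise(text: str) -> str:
    """Placeholder for an LLM that returns a revised version of the text."""
    return text.replace("is consist with", "is consistent with")

def revision_distance(candidate: str) -> int:
    revised = revise(candidate)
    ops = difflib.SequenceMatcher(None, candidate.split(), revised.split()).get_opcodes()
    # Count word-level replace/insert/delete operations as "revision edits".
    return sum(1 for tag, *_ in ops if tag != "equal")

print(revision_distance("The result is consist with prior work."))
```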
-
ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the Past, Van Pham,Scott Cunningham, 11-04-2024
-
Best Practices and Lessons Learned on Synthetic Data for Language Models, Ruibo Liu,Jerry Wei,Fangyu Liu,Chenglei Si,Yanzhe Zhang,Jinmeng Rao,Steven Zheng,Daiyi Peng,Diyi Yang,Denny Zhou,Andrew M. Dai, 11-04-2024
Categories
Computation and Language
Abstract
The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns. This paper provides an overview of synthetic data research, discussing its applications, challenges, and future directions. We present empirical evidence from prior art to demonstrate its effectiveness and highlight the importance of ensuring its factuality, fidelity, and unbiasedness. We emphasize the need for responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.
Bullet Points
-
The paper discusses synthetic data research, its applications, challenges, and future directions
-
It presents empirical evidence from prior art to demonstrate its effectiveness and emphasizes the importance of ensuring its factuality, fidelity, and unbiasedness
-
The paper emphasizes responsible use of synthetic data to build more powerful, inclusive, and trustworthy language models.
-
-
From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples, Robert Vacareanu,Vlad-Andrei Negru,Vasile Suciu,Mihai Surdeanu, 11-04-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
We analyze how well pre-trained large language models (e.g., Llama2, GPT-4, Claude 3, etc) can do linear and non-linear regression when given in-context examples, without any additional training or gradient updates. Our findings reveal that several large language models (e.g., GPT-4, Claude 3) are able to perform regression tasks with a performance rivaling (or even outperforming) that of traditional supervised methods such as Random Forest, Bagging, or Gradient Boosting. For example, on the challenging Friedman #2 regression dataset, Claude 3 outperforms many supervised methods such as AdaBoost, SVM, Random Forest, KNN, or Gradient Boosting. We then investigate how well the performance of large language models scales with the number of in-context exemplars. We borrow from the notion of regret from online learning and empirically show that LLMs are capable of obtaining a sub-linear regret.
Bullet Points
-
Pre-trained large language models can perform linear and non-linear regression when given in-context examples without any additional training or gradient updates
-
They are able to perform regression tasks with a performance rivaling or even outperforming traditional supervised methods such as Random Forest, Bagging, or Gradient Boosting
-
The authors also investigate how well performance scales with the number of in-context exemplars and, borrowing the notion of regret from online learning, empirically show that LLMs are capable of obtaining sub-linear regret.
-
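The prompt format below is one plausible way to pose regression as in-context learning; the offline llm() stub simply fits a line to the exemplars so the example runs without an API, whereas the paper sends such prompts to models like GPT-4 or Claude 3 and parses the numeric reply.
```python
# Sketch of LLM-as-regressor via in-context examples: format (x, y) pairs as a
# plain-text prompt and ask for the next y. The stub below extrapolates linearly
# from the exemplars so the script runs offline.
def build_prompt(examples, query_x):
    lines = [f"x = {x:.2f}, y = {y:.2f}" for x, y in examples]
    lines.append(f"x = {query_x:.2f}, y =")
    return "\n".join(lines)

def llm(prompt: str) -> str:
    # Offline stand-in: recover the exemplars from the prompt and extrapolate.
    pairs = [l for l in prompt.splitlines() if "y = " in l]
    xs = [float(l.split(",")[0].split("=")[1]) for l in pairs]
    ys = [float(l.split("y =")[1]) for l in pairs]
    slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
    qx = float(prompt.splitlines()[-1].split(",")[0].split("=")[1])
    return f"{ys[0] + slope * (qx - xs[0]):.2f}"

examples = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # y = 2x + 1
print(llm(build_prompt(examples, 10.0)))           # expect roughly 21.00
```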
-
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models, Jinheon Baek,Sujay Kumar Jauhar,Silviu Cucerzan,Sung Ju Hwang, 11-04-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
Scientific Research, vital for improving human life, is hindered by its inherent complexity, slow pace, and the need for specialized experts. To enhance its productivity, we propose a ResearchAgent, a large language model-powered research idea writing agent, which automatically generates problems, methods, and experiment designs while iteratively refining them based on scientific literature. Specifically, starting with a core paper as the primary focus to generate ideas, our ResearchAgent is augmented not only with relevant publications through connecting information over an academic graph but also entities retrieved from an entity-centric knowledge store based on their underlying concepts, mined and shared across numerous papers. In addition, mirroring the human approach to iteratively improving ideas with peer discussions, we leverage multiple ReviewingAgents that provide reviews and feedback iteratively. Further, they are instantiated with human preference-aligned large language models whose criteria for evaluation are derived from actual human judgments. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines, showcasing its effectiveness in generating novel, clear, and valid research ideas based on human and model-based evaluation results.
Bullet Points
-
To enhance scientific research productivity, we propose a large language model-powered research idea writing agent called ResearchAgent that automatically generates problems, methods, and experiment designs while iteratively refining them based on scientific literature
-
It is augmented with relevant publications connected over an academic graph and with entities retrieved from an entity-centric knowledge store; mirroring human peer discussion, multiple ReviewingAgents, instantiated with human preference-aligned large language models whose evaluation criteria are derived from actual human judgments, iteratively provide reviews and feedback
-
The agent is experimentally validated on scientific publications across multiple disciplines, demonstrating its effectiveness in generating novel, clear, and valid research ideas according to both human and model-based evaluations.
-
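A minimal sketch of the generate-review-revise loop described above, with propose/review/revise stubs standing in for LLM calls conditioned on a core paper, the academic graph, and the entity-centric knowledge store.
```python
# Sketch of iterative research-idea refinement: draft an idea, collect feedback
# from several reviewing agents, and revise for a few rounds.
def propose(core_paper: str) -> str:
    return f"Extend '{core_paper}' with entity-centric retrieval for idea generation"

def review(idea: str, reviewer: str) -> str:
    return f"[{reviewer}] Please clarify the evaluation protocol for: {idea}"

def revise(idea: str, feedback: list[str]) -> str:
    return f"{idea} (revised after {len(feedback)} reviews)"

idea = propose("a core paper on scientific-literature mining")
for _ in range(2):                                   # two refinement rounds
    feedback = [review(idea, r) for r in ("clarity reviewer", "validity reviewer")]
    idea = revise(idea, feedback)
print(idea)
```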
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, Tianbao Xie,Danyang Zhang,Jixuan Chen,Xiaochuan Li,Siheng Zhao,Ruisheng Cao,Toh Jing Hua,Zhoujun Cheng,Dongchan Shin,Fangyu Lei,Yitao Liu,Yiheng Xu,Shuyan Zhou,Silvio Savarese,Caiming Xiong,Victor Zhong,Tao Yu, 11-04-2024
Categories
Artificial Intelligence, Computation and Language
-
LLM Agents can Autonomously Exploit One-day Vulnerabilities, Richard Fang,Rohan Bindu,Akul Gupta,Daniel Kang, 11-04-2024
Categories
Cryptography and Security, Artificial Intelligence
Abstract
In this work, we show that LLM agents can autonomously exploit one-day vulnerabilities in real-world systems. To show this, we collected a dataset of 15 one-day vulnerabilities that include ones categorized as critical severity in the CVE description. When given the CVE description, GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs) and open-source vulnerability scanners (ZAP and Metasploit). Fortunately, our GPT-4 agent requires the CVE description for high performance: without the description, GPT-4 can exploit only 7% of the vulnerabilities. Our findings raise questions around the widespread deployment of highly capable LLM agents.
Bullet Points
-
LLM agents can autonomously exploit one-day vulnerabilities in real-world systems, as shown on a dataset of 15 one-day vulnerabilities that includes ones categorized as critical severity in their CVE descriptions
-
GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs), but without the description, it can only exploit 7% of the vulnerabilities
-
The findings raise questions about the widespread deployment of highly capable LLM agents.
-
-
Reducing hallucination in structured outputs via Retrieval-Augmented Generation, Patrice Béchard,Orlando Marquez Ayala, 12-04-2024
Categories
Machine Learning, Artificial Intelligence, Computation and Language, Information Retrieval
Abstract
A common and fundamental limitation of Generative AI (GenAI) is its propensity to hallucinate. While large language models (LLM) have taken the world by storm, without eliminating or at least reducing hallucinations, real-world GenAI systems may face challenges in user adoption. In the process of deploying an enterprise application that produces workflows based on natural language requirements, we devised a system leveraging Retrieval Augmented Generation (RAG) to greatly improve the quality of the structured output that represents such workflows. Thanks to our implementation of RAG, our proposed system significantly reduces hallucinations in the output and improves the generalization of our LLM in out-of-domain settings. In addition, we show that using a small, well-trained retriever encoder can reduce the size of the accompanying LLM, thereby making deployments of LLM-based systems less resource-intensive.
Bullet Points
-
We developed a system leveraging Retrieval Augmented Generation (RAG) to improve the quality of structured output representing natural language workflows, which reduces hallucinations and improves generalization of LLM in out-of-domain settings
-
Additionally, using a small, well-trained retriever encoder can reduce the size of the accompanying LLM, making deployments of LLM-based systems less resource-intensive.
-
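One simple way to picture the idea of grounding structured output in retrieval is to restrict the generator to step names returned by a retriever, as in the hedged sketch below; the step catalog, keyword retriever, and JSON layout are assumptions, not the paper's system.
```python
# Sketch of hallucination reduction for structured output: the generator may only
# emit workflow steps that a small retriever returned for the request, so unknown
# (hallucinated) step names cannot appear. The retriever and generator are toy
# stand-ins for a trained retriever encoder and an LLM.
import json

CATALOG = ["create_ticket", "notify_manager", "send_email", "close_ticket"]

def retrieve_steps(request: str, k: int = 2):
    words = set(request.lower().split())
    scored = sorted(CATALOG, key=lambda s: -len(words & set(s.split("_"))))
    return scored[:k]

def generate_workflow(request: str):
    allowed = retrieve_steps(request)
    # A real system would prompt an LLM with `allowed` in context; here we keep
    # the allowed steps in catalog order to guarantee no out-of-catalog step.
    steps = [s for s in CATALOG if s in allowed]
    return json.dumps({"request": request, "steps": steps}, indent=2)

print(generate_workflow("please create a ticket and notify my manager"))
```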
-
Is ChatGPT Transforming Academics' Writing Style?, Mingmeng Geng,Roberto Trotta, 12-04-2024
Categories
Computation and Language, Artificial Intelligence, Digital Libraries, Machine Learning
Abstract
Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in their abstracts by means of a statistical analysis of word frequency changes. Our model is calibrated and validated on a mixture of real abstracts and ChatGPT-modified abstracts (simulated data) after a careful noise analysis. We find that ChatGPT is having an increasing impact on arXiv abstracts, especially in the field of computer science, where the fraction of ChatGPT-revised abstracts is estimated to be approximately 35%, if we take the output of one of the simplest prompts, "revise the following sentences", as a baseline. We conclude with an analysis of both positive and negative aspects of the penetration of ChatGPT into academics' writing style.
Bullet Points
-
The textual density of ChatGPT's writing style in arXiv abstracts is assessed using a statistical analysis of word frequency changes
-
The model is calibrated and validated on a mixture of real abstracts and ChatGPT-modified abstracts (simulated data) after a careful noise analysis
-
The fraction of ChatGPT-revised abstracts in computer science is estimated to be approximately 35%, taking the output of one of the simplest prompts, "revise the following sentences", as a baseline
-
The analysis of both positive and negative aspects of the penetration into academics' writing style is conducted.
-
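A toy version of a word-frequency comparison in this spirit is sketched below; the marker words and miniature corpora are invented for illustration, and the paper instead calibrates a full statistical model on real and simulated ChatGPT-revised abstracts.
```python
# Compare relative frequencies of a few "marker" words between two sets of
# abstracts to gauge a style shift.
from collections import Counter

MARKERS = {"delve", "showcase", "pivotal", "intricate"}

def marker_rate(abstracts):
    counts = Counter(w.strip(".,").lower() for a in abstracts for w in a.split())
    total = sum(counts.values())
    return sum(counts[m] for m in MARKERS) / max(total, 1)

before = ["We study attention in transformers.", "Results improve over baselines."]
after  = ["We delve into pivotal and intricate mechanisms.", "We showcase gains."]
print(f"marker rate before: {marker_rate(before):.3f}, after: {marker_rate(after):.3f}")
```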
-
Pre-training Small Base LMs with Fewer Tokens, Sunny Sanyal,Sujay Sanghavi,Alexandros G. Dimakis, 12-04-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
-
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length, Xuezhe Ma,Xiaomeng Yang,Wenhan Xiong,Beidi Chen,Lili Yu,Hao Zhang,Jonathan May,Luke Zettlemoyer,Omer Levy,Chunting Zhou, 12-04-2024
Categories
Machine Learning, Computation and Language
-
A Survey on Retrieval-Augmented Text Generation for Large Language Models, Yizheng Huang,Jimmy Huang, 17-04-2024
Categories
Information Retrieval, Artificial Intelligence, Computation and Language
Abstract
Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.
Bullet Points
-
Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning to address static limitations of LLMs by enabling the dynamic integration of up-to-date external information
-
This method focuses primarily on the text domain and provides a cost-effective solution to the generation of plausible but incorrect responses by LLMs
-
The paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint
-
It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies
-
It introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions
-
The study aims to consolidate existing research, clarify its technological underpinnings, and highlight RAG's potential to broaden the adaptability and applications of LLMs.
-
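A compact sketch of a pass through the four stages named above; the query rewriting, keyword retrieval, reranking, and generation placeholders are simplified stand-ins rather than any particular system described in the survey.
```python
# Toy end-to-end pass through the four RAG stages: pre-retrieval, retrieval,
# post-retrieval, and generation.
DOCS = ["RAG augments LLMs with retrieved context.",
        "Transformers use attention.",
        "Post-retrieval reranking filters noisy passages."]

def pre_retrieval(query):            # e.g. query rewriting / expansion
    return query.lower().replace("what is ", "").strip("?")

def retrieve(query, k=2):            # keyword overlap as a stand-in for vector search
    scored = sorted(DOCS, key=lambda d: -len(set(query.split()) & set(d.lower().split())))
    return scored[:k]

def post_retrieval(passages, query): # rerank / deduplicate / drop irrelevant passages
    return [p for p in passages if set(query.split()) & set(p.lower().split())]

def generate(query, passages):       # placeholder for the LLM generation step
    return f"Answer to '{query}' grounded in: {passages}"

q = pre_retrieval("What is RAG?")
print(generate(q, post_retrieval(retrieve(q), q)))
```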
-
Many-Shot In-Context Learning, Rishabh Agarwal,Avi Singh,Lei M. Zhang,Bernd Bohnet,Luis Rosias,Stephanie Chan,Biao Zhang,Ankesh Anand,Zaheer Abbas,Azade Nova,John D. Co-Reyes,Eric Chu,Feryal Behbahani,Aleksandra Faust,Hugo Larochelle, 17-04-2024
Categories
Machine Learning, Artificial Intelligence, Computation and Language
Abstract
Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
Bullet Points
-
LLMs excel at few-shot in-context learning (ICL) by learning from a few examples provided in context at inference without any weight updates
-
Expanded context windows allow us to investigate ICL with hundreds or thousands of examples, and we observe significant performance gains across a wide variety of generative and discriminative tasks
-
However, many-shot ICL can be bottlenecked by the available amount of human-generated examples
-
To mitigate this limitation, we explore two new settings: Reinforced ICL, which uses model-generated chain-of-thought rationales in place of human examples, and Unsupervised ICL, which removes rationales altogether and prompts the model only with domain-specific questions
-
Both settings can be quite effective in the many-shot regime, particularly on complex reasoning tasks; moreover, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning
-
Our analysis also highlights the limitations of next-token prediction loss as an indicator of downstream ICL performance.
-
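The sketch below assembles prompts for the three regimes discussed above (standard few/many-shot, Reinforced ICL, Unsupervised ICL); the toy problems and hand-written rationales are assumptions, and a real pipeline would generate and filter the rationales with the model itself.
```python
# Build prompt strings for three in-context learning regimes.
problems = [("2 + 3 * 4", "14"), ("(2 + 3) * 4", "20")]

def build_fewshot(problems, query):
    shots = [f"Q: {q}\nA: {a}" for q, a in problems]
    return "\n\n".join(shots + [f"Q: {query}\nA:"])

def build_reinforced(problems, rationales, query):
    # Reinforced ICL: model-generated rationales (kept only when the final
    # answer was correct) replace human-written explanations.
    shots = [f"Q: {q}\nReasoning: {r}\nA: {a}"
             for (q, a), r in zip(problems, rationales)]
    return "\n\n".join(shots + [f"Q: {query}\nReasoning:"])

def build_unsupervised(problems, query):
    # Unsupervised ICL: only domain-specific questions, no answers or rationales.
    return "\n\n".join([f"Q: {q}" for q, _ in problems] + [f"Q: {query}\nA:"])

rationales = ["Multiply first: 3 * 4 = 12, then 2 + 12 = 14.",
              "Parentheses first: 2 + 3 = 5, then 5 * 4 = 20."]
print(build_reinforced(problems, rationales, "7 + 2 * 5"))
```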
-
Enhancing Q&A with Domain-Specific Fine-Tuning and Iterative Reasoning: A Comparative Study, Zooey Nguyen,Anthony Annunziata,Vinh Luong,Sang Dinh,Quynh Le,Anh Hai Ha,Chanh Le,Hong An Phan,Shruti Raghavan,Christopher Nguyen, 17-04-2024
Categories
Artificial Intelligence
Abstract
This paper investigates the impact of domain-specific model fine-tuning and of reasoning mechanisms on the performance of question-answering (Q&A) systems powered by large language models (LLMs) and Retrieval-Augmented Generation (RAG). Using the FinanceBench SEC financial filings dataset, we observe that, for RAG, combining a fine-tuned embedding model with a fine-tuned LLM achieves better accuracy than generic models, with relatively greater gains attributable to fine-tuned embedding models. Additionally, employing reasoning iterations on top of RAG delivers an even bigger jump in performance, enabling the Q&A systems to get closer to human-expert quality. We discuss the implications of such findings, propose a structured technical design space capturing major technical components of Q&A AI, and provide recommendations for making high-impact technical choices for such components. We plan to follow up on this work with actionable guides for AI teams and further investigations into the impact of domain-specific augmentation in RAG and into agentic AI capabilities such as advanced planning and reasoning.
Bullet Points
-
The paper investigates the impact of domain-specific model fine-tuning and reasoning mechanisms on the performance of question-answering (Q&A) systems powered by large language models (LLMs) and Retrieval-Augmented Generation (RAG)
-
The paper finds that, for RAG, combining a fine-tuned embedding model with a fine-tuned LLM achieves better accuracy than generic models, with relatively greater gains attributable to the fine-tuned embedding model, and that employing reasoning iterations on top of RAG delivers an even bigger jump in performance, bringing Q&A systems closer to human-expert quality
-
The implications of these findings are discussed, a structured technical design space capturing the major technical components of Q&A AI is proposed, and recommendations are given for making high-impact technical choices for those components
-
The authors plan to follow up with actionable guides for AI teams and further investigations into the impact of domain-specific augmentation in RAG and into agentic AI capabilities such as advanced planning and reasoning.
-
-
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey, Tula Masterman,Sandi Besen,Mason Sawtell,Alex Chao, 17-04-2024
Categories
Artificial Intelligence, Computation and Language
Abstract
This survey paper examines the recent advancements in AI agent implementations, with a focus on their ability to achieve complex goals that require enhanced reasoning, planning, and tool execution capabilities. The primary objectives of this work are to a) communicate the current capabilities and limitations of existing AI agent implementations, b) share insights gained from our observations of these systems in action, and c) suggest important considerations for future developments in AI agent design. We achieve this by providing overviews of single-agent and multi-agent architectures, identifying key patterns and divergences in design choices, and evaluating their overall impact on accomplishing a provided goal. Our contribution outlines key themes when selecting an agentic architecture, the impact of leadership on agent systems, agent communication styles, and key phases for planning, execution, and reflection that enable robust AI agent systems.
Bullet Points
-
The survey paper examines recent advancements in AI agent implementations, focusing on their ability to achieve complex goals that require enhanced reasoning, planning, and tool execution capabilities
-
The primary objectives of this work are to communicate current capabilities and limitations, share insights, and suggest important considerations for future AI agent design
-
The paper outlines key themes when selecting an agentic architecture, the impact of leadership on agent systems, agent communication styles, and key phases for planning, execution, and reflection that enable robust AI agent systems.
-
-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, Marah Abdin,Sam Ade Jacobs,Ammar Ahmad Awan,Jyoti Aneja,Ahmed Awadallah,Hany Awadalla,Nguyen Bach,Amit Bahree,Arash Bakhtiari,Harkirat Behl,Alon Benhaim,Misha Bilenko,Johan Bjorck,Sébastien Bubeck,Martin Cai,Caio César Teodoro Mendes,Weizhu Chen,Vishrav Chaudhary,Parul Chopra,Allie Del Giorno,Gustavo de Rosa,Matthew Dixon,Ronen Eldan,Dan Iter,Amit Garg,Abhishek Goswami,Suriya Gunasekar,Emman Haider,Junheng Hao,Russell J. Hewett,Jamie Huynh,Mojan Javaheripi,Xin Jin,Piero Kauffmann,Nikos Karampatziakis,Dongwoo Kim,Mahoud Khademi,Lev Kurilenko,James R. Lee,Yin Tat Lee,Yuanzhi Li,Chen Liang,Weishung Liu,Eric Lin,Zeqi Lin,Piyush Madan,Arindam Mitra,Hardik Modi,Anh Nguyen,Brandon Norick,Barun Patra,Daniel Perez-Becker,Thomas Portet,Reid Pryzant,Heyang Qin,Marko Radmilac,Corby Rosset,Sambudha Roy,Olatunji Ruwase,Olli Saarikivi,Amin Saied,Adil Salim,Michael Santacroce,Shital Shah,Ning Shang,Hiteshi Sharma,Xia Song,Masahiro Tanaka,Xin Wang,Rachel Ward,Guanhua Wang,Philipp Witte,Michael Wyatt,Can Xu,Jiahang Xu,Sonali Yadav,Fan Yang,Ziyi Yang,Donghan Yu,Chengruidong Zhang,Cyril Zhang,Jianwen Zhang,Li Lyna Zhang,Yi Zhang,Yue Zhang,Yunan Zhang,Xiren Zhou, 22-04-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench).
Bullet Points
-
Phi-3-mini is a 3.8 billion parameter language model trained on 3.3 trillion tokens
-
Its performance compares to models like Mixtral 8x7B and GPT-3.5, despite being small enough to be deployed on a phone
-
The innovation lies in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data
-
The model is also aligned for robustness, safety, and chat format
-
Initial parameter-scaling results are provided with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini.
-
-
Better Synthetic Data by Retrieving and Transforming Existing Datasets, Saumya Gandhi,Ritu Gala,Vijay Viswanathan,Tongshuang Wu,Graham Neubig, 22-04-2024
Categories
Computation and Language
-
OpenELM: An Efficient Language Model Family with Open Training and Inference Framework, Sachin Mehta,Mohammad Hossein Sekhavat,Qingqing Cao,Maxwell Horton,Yanzi Jin,Chenfan Sun,Iman Mirzadeh,Mahyar Najibi,Dmitry Belenko,Peter Zatloukal,Mohammad Rastegari, 22-04-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
-
Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice, Ranim Khojah,Mazen Mohamad,Philipp Leitner,Francisco Gomes de Oliveira Neto, 23-04-2024
Categories
Software Engineering, Artificial Intelligence, Computation and Language, Human-Computer Interaction, Machine Learning
Abstract
Large Language Models (LLMs) are frequently discussed in academia and the general public as support tools for virtually any use case that relies on the production of text, including software engineering. Currently there is much debate, but little empirical evidence, regarding the practical usefulness of LLM-based tools such as ChatGPT for engineers in industry. We conduct an observational study of 24 professional software engineers who have been using ChatGPT over a period of one week in their jobs, and qualitatively analyse their dialogues with the chatbot as well as their overall experience (as captured by an exit survey). We find that, rather than expecting ChatGPT to generate ready-to-use software artifacts (e.g., code), practitioners more often use ChatGPT to receive guidance on how to solve their tasks or learn about a topic in more abstract terms. We also propose a theoretical framework for how (i) purpose of the interaction, (ii) internal factors (e.g., the user's personality), and (iii) external factors (e.g., company policy) together shape the experience (in terms of perceived usefulness and trust). We envision that our framework can be used by future research to further the academic discussion on LLM usage by software engineering practitioners, and to serve as a reference point for the design of future empirical LLM research in this domain.
Bullet Points
-
LLMs are commonly discussed in academia and the general public as support tools for software engineering
-
However, there is little empirical evidence regarding the practical usefulness of LLM-based tools such as ChatGPT for engineers in industry
-
The authors conduct an observational study of 24 professional software engineers who used ChatGPT over one week in their jobs, qualitatively analysing their dialogues with the chatbot as well as their overall experience as captured by an exit survey
-
The researchers propose a theoretical framework for how the purpose of the interaction, internal factors, and external factors shape the experience in terms of perceived usefulness and trust
-
This framework can be used by future research to further the academic discussion on LLM usage by software engineering practitioners and serve as a reference point for future empirical LLM research in this domain.
-
-
Autonomous LLM-driven research from data to human-verifiable research papers, Tal Ifargan,Lukas Hafner,Maor Kern,Ori Alcalay,Roy Kishony, 24-04-2024
Categories
Quantitative Biology, Artificial Intelligence
Abstract
As AI promises to accelerate scientific discovery, it remains unclear whether fully AI-driven research is possible and whether it can adhere to key scientific values, such as transparency, traceability and verifiability. Mimicking human scientific practices, we built data-to-paper, an automation platform that guides interacting LLM agents through a complete stepwise research process, while programmatically back-tracing information flow and allowing human oversight and interactions. In autopilot mode, provided with annotated data alone, data-to-paper raised hypotheses, designed research plans, wrote and debugged analysis codes, generated and interpreted results, and created complete and information-traceable research papers. Even though research novelty was relatively limited, the process demonstrated autonomous generation of de novo quantitative insights from data. For simple research goals, a fully-autonomous cycle can create manuscripts which recapitulate peer-reviewed publications without major errors in about 80-90%, yet as goal complexity increases, human co-piloting becomes critical for assuring accuracy. Beyond the process itself, created manuscripts too are inherently verifiable, as information-tracing allows to programmatically chain results, methods and data. Our work thereby demonstrates a potential for AI-driven acceleration of scientific discovery while enhancing, rather than jeopardizing, traceability, transparency and verifiability.
Bullet Points
-
We built data-to-paper, an automation platform that guides interacting LLM agents through a complete stepwise research process while programmatically back-tracing information flow and allowing human oversight and interactions
-
The process demonstrated autonomous generation of de novo quantitative insights from data, and for simple research goals, a fully-autonomous cycle can create manuscripts that recapitulate peer-reviewed publications without major errors in about 80-90%
-
However, as goal complexity increases, human co-piloting becomes critical for assuring accuracy
-
Moreover, created manuscripts are inherently verifiable.
-
-
Türkçe Dil Modellerinin Performans Karşılaştırması Performance Comparison of Turkish Language Models, Eren Dogan,M. Egemen Uzun,Atahan Uz,H. Emre Seyrek,Ahmed Zeer,Ezgi Sevi,H. Toprak Kesgin,M. Kaan Yuce,M. Fatih Amasyali, 25-04-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
The developments that language models have provided in fulfilling almost all kinds of tasks have attracted the attention of not only researchers but also society, and have enabled them to become products. There are commercially successful language models available. However, users may prefer open-source language models due to cost, data privacy, or regulations. Yet, despite the increasing number of these models, there is no comprehensive comparison of their performance for Turkish. This study aims to fill this gap in the literature. A comparison is made among seven selected language models based on their contextual learning and question-answering abilities. Turkish datasets for contextual learning and question-answering were prepared, and both automatic and human evaluations were conducted. The results show that for question-answering, continuing pretraining before fine-tuning with instructional datasets is more successful in adapting multilingual models to Turkish, and that in-context learning performance is not strongly related to question-answering performance.
Bullet Points
-
The study compares language models based on contextual learning and question-answering abilities for Turkish
-
The results show that continuing pretraining before fine-tuning with instructional datasets is more successful in adapting multilingual models to Turkish, and that in-context learning performance is not strongly related to question-answering performance.
-
-
Introducing cosmosGPT: Monolingual Training for Turkish Language Models, H. Toprak Kesgin,M. Kaan Yuce,Eren Dogan,M. Egemen Uzun,Atahan Uz,H. Emre Seyrek,Ahmed Zeer,M. Fatih Amasyali, 26-04-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
The number of open source language models that can produce Turkish is increasing day by day, as in other languages. In order to create the basic versions of such models, the training of multilingual models is usually continued with Turkish corpora. The alternative is to train the model with only Turkish corpora. In this study, we first introduce the cosmosGPT models that we created with this alternative method. Then, we introduce new finetune datasets for basic language models to fulfill user requests and new evaluation datasets for measuring the capabilities of Turkish language models. Finally, a comprehensive comparison of the adapted Turkish language models on different capabilities is presented. The results show that the language models we built with the monolingual corpus have promising performance despite being about 10 times smaller than the others.
Bullet Points
-
The number of open source language models that can produce Turkish is increasing, and training multilingual models is usually continued with Turkish corpora
-
The study introduces the cosmosGPT models, new finetune datasets for fulfilling user requests, new evaluation datasets, and a comprehensive comparison of the adapted Turkish language models across different capabilities
-
The language models built with the monolingual corpus show promising performance despite being about 10 times smaller than the others.
-
-
A Primer on the Inner Workings of Transformer-based Language Models, Javier Ferrando,Gabriele Sarti,Arianna Bisazza,Marta R. Costa-jussà, 30-04-2024
Categories
Computation and Language
Abstract
The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area.
Bullet Points
-
The primer provides a technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture
-
It covers the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions.
-
-
RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing, Yucheng Hu,Yuxing Lu, 30-04-2024
Categories
Computation and Language, Artificial Intelligence
-
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models, Seungone Kim,Juyoung Suk,Shayne Longpre,Bill Yuchen Lin,Jamin Shin,Sean Welleck,Graham Neubig,Moontae Lee,Kyungjae Lee,Minjoon Seo, 02-05-2024
Categories
Computation and Language
-
Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents, Junkai Li,Siyu Wang,Meng Zhang,Weitao Li,Yunghwei Lai,Xinhui Kang,Weizhi Ma,Yang Liu, 05-05-2024
Categories
Artificial Intelligence
Abstract
In this paper, we introduce a simulacrum of hospital called Agent Hospital that simulates the entire process of treating illness. All patients, nurses, and doctors are autonomous agents powered by large language models (LLMs). Our central goal is to enable a doctor agent to learn how to treat illness within the simulacrum. To do so, we propose a method called MedAgent-Zero. As the simulacrum can simulate disease onset and progression based on knowledge bases and LLMs, doctor agents can keep accumulating experience from both successful and unsuccessful cases. Simulation experiments show that the treatment performance of doctor agents consistently improves on various tasks. More interestingly, the knowledge the doctor agents have acquired in Agent Hospital is applicable to real-world medicare benchmarks. After treating around ten thousand patients (real-world doctors may take over two years), the evolved doctor agent achieves a state-of-the-art accuracy of 93.06% on a subset of the MedQA dataset that covers major respiratory diseases. This work paves the way for advancing the applications of LLM-powered agent techniques in medical scenarios.
Bullet Points
-
The paper introduces a simulacrum of hospital called Agent Hospital that simulates the entire process of treating illness, powered by large language models (LLMs)
-
The goal is to enable a doctor agent to learn how to treat illness within the simulacrum
-
To do so, the authors propose a method called MedAgent-Zero; because the simulacrum can simulate disease onset and progression based on knowledge bases and LLMs, doctor agents keep accumulating experience from both successful and unsuccessful cases
-
The treatment performance of doctor agents consistently improves on various tasks, and the knowledge acquired in Agent Hospital is applicable to real-world medicare benchmarks
-
After treating around ten thousand patients, the evolved doctor agent achieved a state-of-the-art accuracy of 93.06% on a subset of the MedQA dataset covering major respiratory diseases
-
This work advances the applications of LLM-powered agent techniques in medical scenarios.
-
-
Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking, Emre Can Acikgoz,Mete Erdogan,Deniz Yuret, 07-05-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations, with a special focus on Turkish. We conduct an in-depth analysis to evaluate the impact of training strategies, model choices, and data availability on the performance of LLMs designed for underrepresented languages. Our approach includes two methodologies: (i) adapting existing LLMs originally pretrained in English to understand Turkish, and (ii) developing a model from the ground up using Turkish pretraining data, both supplemented with supervised fine-tuning on a novel Turkish instruction-tuning dataset aimed at enhancing reasoning capabilities. The relative performance of these methods is evaluated through the creation of a new leaderboard for Turkish LLMs, featuring benchmarks that assess different reasoning and knowledge skills. Furthermore, we conducted experiments on data and model scaling, both during pretraining and fine-tuning, simultaneously emphasizing the capacity for knowledge transfer across languages and addressing the challenges of catastrophic forgetting encountered during fine-tuning on a different language. Our goal is to offer a detailed guide for advancing the LLM framework in low-resource linguistic contexts, thereby making natural language processing (NLP) benefits more globally accessible.
Bullet Points
-
The study explores the impact of training strategies, model choices, and data availability on the performance of LLMs designed for low-resource languages, with a particular focus on Turkish
-
We conduct an in-depth analysis to evaluate the effectiveness of these methods and conduct experiments on data and model scaling to enhance knowledge transfer across languages and address the challenges of catastrophic forgetting during fine-tuning on a different language
-
The aim is to provide a detailed guide for advancing the LLM framework in low-resource linguistic contexts, thereby making natural language processing (NLP) benefits more globally accessible.
-
-
RLHF Workflow: From Reward Modeling to Online RLHF, Hanze Dong,Wei Xiong,Bo Pang,Haoxiang Wang,Han Zhao,Yingbo Zhou,Nan Jiang,Doyen Sahoo,Caiming Xiong,Tong Zhang, 13-05-2024
Categories
Machine Learning, Artificial Intelligence, Computation and Language, Machine Learning
Abstract
-
(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts, Minghao Wu,Yulin Yuan,Gholamreza Haffari,Longyue Wang, 20-05-2024
Categories
Computation and Language
Abstract
Recent advancements in machine translation (MT) have significantly enhanced translation quality across various domains. However, the translation of literary texts remains a formidable challenge due to their complex language, figurative expressions, and cultural nuances. In this work, we introduce a novel multi-agent framework based on large language models (LLMs) for literary translation, implemented as a company called TransAgents, which mirrors traditional translation publication process by leveraging the collective capabilities of multiple agents, to address the intricate demands of translating literary works. To evaluate the effectiveness of our system, we propose two innovative evaluation strategies: Monolingual Human Preference (MHP) and Bilingual LLM Preference (BLP). MHP assesses translations from the perspective of monolingual readers of the target language, while BLP uses advanced LLMs to compare translations directly with the original texts. Empirical findings indicate that despite lower d-BLEU scores, translations from TransAgents are preferred by both human evaluators and LLMs over human-written references, particularly in genres requiring domain-specific knowledge. We also highlight the strengths and limitations of TransAgents through case studies and suggests directions for future research.
Bullet Points
-
TransAgents is a multi-agent framework based on large language models (LLMs) for literary translation that mirrors the traditional translation publication process by leveraging multiple agents to address the complex language, figurative expressions, and cultural nuances of translating literary works
-
To evaluate the effectiveness of the system, two innovative evaluation strategies are proposed: Monolingual Human Preference (MHP) and Bilingual LLM Preference (BLP)
-
MHP assesses translations from the perspective of monolingual readers of the target language, while BLP uses advanced LLMs to compare translations directly with the original texts
-
Empirical findings indicate that despite lower d-BLEU scores, translations from TransAgents are preferred by both human evaluators and LLMs over human-written references, particularly in genres requiring domain-specific knowledge.
-
-
The Prompt Report: A Systematic Survey of Prompting Techniques, Sander Schulhoff,Michael Ilie,Nishant Balepur,Konstantine Kahadze,Amanda Liu,Chenglei Si,Yinheng Li,Aayush Gupta,HyoJung Han,Sevien Schulhoff,Pranav Sandeep Dulepet,Saurav Vidyadhara,Dayeon Ki,Sweta Agrawal,Chau Pham,Gerson Kroiz,Feileen Li,Hudson Tao,Ashay Srivastava,Hevander Da Costa,Saloni Gupta,Megan L. Rogers,Inna Goncearenco,Giuseppe Sarli,Igor Galynker,Denis Peskoff,Marine Carpuat,Jules White,Shyamal Anadkat,Alexander Hoyle,Philip Resnik, 06-06-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Generative Artificial Intelligence (GenAI) systems are being increasingly deployed across all parts of industry and research settings. Developers and end users interact with these systems through the use of prompting or prompt engineering. While prompting is a widespread and highly researched concept, there exists conflicting terminology and a poor ontological understanding of what constitutes a prompt due to the area's nascency. This paper establishes a structured understanding of prompts, by assembling a taxonomy of prompting techniques and analyzing their use. We present a comprehensive vocabulary of 33 vocabulary terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities. We further present a meta-analysis of the entire literature on natural language prefix-prompting.
Bullet Points
-
The paper establishes a structured understanding of prompts for Generative Artificial Intelligence (GenAI) systems, assembling a vocabulary of 33 terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities
-
It also presents a meta-analysis of the literature on natural language prefix-prompting
-
The paper identifies conflicting terminology and poor ontological understanding of prompts due to the area's nascency.
-
-
Mixture-of-Agents Enhances Large Language Model Capabilities, Junlin Wang,Jue Wang,Ben Athiwaratkun,Ce Zhang,James Zou, 07-06-2024
Categories
Computation and Language
Abstract
Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.
Bullet Points
-
To harness the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology, we construct a layered MoA architecture where each layer comprises multiple LLM agents
-
Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response
-
The MoA models achieve state-of-the-art performance on AlpacaEval 2.0, MT-Bench, and FLASK, surpassing GPT-4 Omni; for example, an MoA using only open-source LLMs leads AlpacaEval 2.0 with a score of 65.1% versus 57.5% for GPT-4 Omni.
-
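A minimal sketch of the layered pattern described above, where every agent in a layer receives all outputs of the previous layer and the final layer acts as the aggregator; the make_agent stubs stand in for calls to different LLMs.
```python
# Layered Mixture-of-Agents: each agent sees all previous-layer outputs as
# auxiliary context; the last layer produces the aggregated answer.
def make_agent(name):
    def agent(question, previous_outputs):
        context = " | ".join(previous_outputs) if previous_outputs else "none"
        # A real agent would prompt an LLM with the question plus this context.
        return f"{name} answer to '{question}' (seen: {context})"
    return agent

def mixture_of_agents(question, layers):
    outputs = []
    for layer in layers:                      # proposer layers, then aggregator
        outputs = [agent(question, outputs) for agent in layer]
    return outputs[-1]

layers = [[make_agent("llama"), make_agent("qwen")],
          [make_agent("mixtral")]]            # single aggregator layer
print(mixture_of_agents("Why is the sky blue?", layers))
```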
-
Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching, Xiaoying Zhang,Baolin Peng,Ye Tian,Jingyan Zhou,Yipeng Zhang,Haitao Mi,Helen Meng, 10-06-2024
Categories
Computation and Language
Abstract
Large language models (LLMs) often struggle to provide up-to-date information due to their one-time training and the constantly evolving nature of the world. To keep LLMs current, existing approaches typically involve continued pre-training on new documents. However, they frequently face difficulties in extracting stored knowledge. Motivated by the remarkable success of the Feynman Technique in efficient human learning, we introduce Self-Tuning, a learning framework aimed at improving an LLM's ability to effectively acquire new knowledge from raw documents through self-teaching. Specifically, we develop a Self-Teaching strategy that augments the documents with a set of knowledge-intensive tasks created in a self-supervised manner, focusing on three crucial aspects: memorization, comprehension, and self-reflection. In addition, we introduce three Wiki-Newpages-2023-QA datasets to facilitate an in-depth analysis of an LLM's knowledge acquisition ability concerning memorization, extraction, and reasoning. Extensive experimental results on Llama2 family models reveal that Self-Tuning consistently exhibits superior performance across all knowledge acquisition tasks and excels in preserving previous knowledge.
Bullet Points
-
Self-Tuning is a learning framework aimed at improving LLMs' ability to acquire new knowledge from raw documents through self-teaching
-
It focuses on improving memorization, comprehension, and self-reflection through a set of knowledge-intensive tasks created in a self-supervised manner
-
Three Wiki-Newpages-2023-QA datasets are introduced to facilitate an in-depth analysis of an LLM's knowledge acquisition ability
-
Extensive experimental results on Llama2 family models reveal that Self-Tuning consistently exhibits superior performance across all knowledge acquisition tasks and excels in preserving previous knowledge.
-
-
Towards Lifelong Learning of Large Language Models: A Survey, Junhao Zheng,Shengjie Qiu,Chengming Shi,Qianli Ma, 10-06-2024
Categories
Machine Learning, Computation and Language
Abstract
As the applications of large language models (LLMs) expand across diverse fields, the ability of these models to adapt to ongoing changes in data, tasks, and user preferences becomes crucial. Traditional training methods, relying on static datasets, are increasingly inadequate for coping with the dynamic nature of real-world information. Lifelong learning, also known as continual or incremental learning, addresses this challenge by enabling LLMs to learn continuously and adaptively over their operational lifetime, integrating new knowledge while retaining previously learned information and preventing catastrophic forgetting. This survey delves into the sophisticated landscape of lifelong learning, categorizing strategies into two primary groups: Internal Knowledge and External Knowledge. Internal Knowledge includes continual pretraining and continual finetuning, each enhancing the adaptability of LLMs in various scenarios. External Knowledge encompasses retrieval-based and tool-based lifelong learning, leveraging external data sources and computational tools to extend the model's capabilities without modifying core parameters. The key contributions of our survey are: (1) Introducing a novel taxonomy categorizing the extensive literature of lifelong learning into 12 scenarios; (2) Identifying common techniques across all lifelong learning scenarios and classifying existing literature into various technique groups within each scenario; (3) Highlighting emerging techniques such as model expansion and data selection, which were less explored in the pre-LLM era. Through a detailed examination of these groups and their respective categories, this survey aims to enhance the adaptability, reliability, and overall performance of LLMs in real-world applications.
Bullet Points
-
The survey aims to enhance the adaptability, reliability, and overall performance of LLMs in real-world applications by categorizing strategies into two primary groups: Internal Knowledge and External Knowledge
-
Internal Knowledge includes continual pretraining and finetuning, while External Knowledge encompasses retrieval-based and tool-based lifelong learning
-
The survey highlights emerging techniques such as model expansion and data selection, which were less explored in the pre-LLM era.
-
-
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing, Zhangchen Xu,Fengqing Jiang,Luyao Niu,Yuntian Deng,Radha Poovendran,Yejin Choi,Bill Yuchen Lin, 12-06-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.
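A rough sketch of the self-synthesis trick: feed the aligned model only the pre-query chat template so it completes the user turn itself, then feed the full template back to obtain the paired response. The `complete` helper is hypothetical, and the template string follows the Llama-3-Instruct chat format as commonly documented; verify it against your tokenizer's chat template before use:

```python
# Sketch of Magpie-style self-synthesis: give the aligned model only the pre-query
# template so it auto-regressively invents a user query, then complete the full
# template to get the paired response. `complete` is a hypothetical raw-completion
# helper; the template shown follows the Llama-3-Instruct chat format.

PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def complete(prefix: str, stop: list[str]) -> str:
    raise NotImplementedError("plug in a raw text-completion call to the aligned model")

def synthesize_pair() -> tuple[str, str]:
    instruction = complete(PRE_QUERY, stop=["<|eot_id|>"]).strip()
    response = complete(PRE_QUERY + instruction + POST_QUERY, stop=["<|eot_id|>"]).strip()
    return instruction, response

# pairs = [synthesize_pair() for _ in range(1_000)]  # then filter down to high-quality instances
```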
Bullet Points
-
Yes, it is possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM; Magpie does so by prompting Llama-3-Instruct with only the pre-query templates
-
The method generates 4 million instructions along with their corresponding responses, from which 300K high-quality instances are selected after a comprehensive analysis of the extracted data
-
We fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models
-
Models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct on some tasks, and the advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.
-
-
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology, Minh Huynh Nguyen,Thang Phan Chau,Phong X. Nguyen,Nghi D. Q. Bui, 16-06-2024
Categories
Software Engineering, Artificial Intelligence
Abstract
Software agents have emerged as promising tools for addressing complex software engineering tasks. Existing works, on the other hand, frequently oversimplify software development workflows, despite the fact that such workflows are typically more complex in the real world. Thus, we propose AgileCoder, a multi agent system that integrates Agile Methodology (AM) into the framework. This system assigns specific AM roles - such as Product Manager, Developer, and Tester to different agents, who then collaboratively develop software based on user inputs. AgileCoder enhances development efficiency by organizing work into sprints, focusing on incrementally developing software through sprints. Additionally, we introduce Dynamic Code Graph Generator, a module that creates a Code Dependency Graph dynamically as updates are made to the codebase. This allows agents to better comprehend the codebase, leading to more precise code generation and modifications throughout the software development process. AgileCoder surpasses existing benchmarks, like ChatDev and MetaGPT, establishing a new standard and showcasing the capabilities of multi agent systems in advanced software engineering environments.
Bullet Points
-
AgileCoder is a multi agent system that integrates Agile Methodology (AM) into the framework, assigning specific AM roles to agents, and enhancing development efficiency by organizing work into sprints
-
It also introduces Dynamic Code Graph Generator, a module that creates a Code Dependency Graph dynamically as updates are made to the codebase, leading to more precise code generation and modifications throughout the software development process
-
It surpasses existing benchmarks and demonstrates the capabilities of multi agent systems in advanced software engineering environments.
-
-
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models, Guanting Dong,Keming Lu,Chengpeng Li,Tingyu Xia,Bowen Yu,Chang Zhou,Jingren Zhou, 19-06-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
-
Instruction Pre-Training: Language Models are Supervised Multitask Learners, Daixuan Cheng,Yuxian Gu,Shaohan Huang,Junyu Bi,Minlie Huang,Furu Wei, 20-06-2024
Categories
Computation and Language
Abstract
-
Banishing LLM Hallucinations Requires Rethinking Generalization, Johnny Li,Saksham Consul,Eda Zhou,James Wong,Naila Farooqui,Yuxin Ye,Nithyashree Manohar,Zhuxiaona Wei,Tian Wu,Ben Echols,Sharon Zhou,Gregory Diamos, 25-06-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Despite their powerful chat, coding, and reasoning abilities, Large Language Models (LLMs) frequently hallucinate. Conventional wisdom suggests that hallucinations are a consequence of a balance between creativity and factuality, which can be mitigated, but not eliminated, by grounding the LLM in external knowledge sources. Through extensive systematic experiments, we show that these traditional approaches fail to explain why LLMs hallucinate in practice. Specifically, we show that LLMs augmented with a massive Mixture of Memory Experts (MoME) can easily memorize large datasets of random numbers. We corroborate these experimental findings with a theoretical construction showing that simple neural networks trained to predict the next token hallucinate when the training loss is above a threshold as it usually does in practice when training on internet scale data. We interpret our findings by comparing against traditional retrieval methods for mitigating hallucinations. We use our findings to design a first generation model for removing hallucinations -- Lamini-1 -- that stores facts in a massive mixture of millions of memory experts that are retrieved dynamically.
Bullet Points
-
Conventional wisdom holds that hallucinations arise from a balance between creativity and factuality and can be mitigated, but not eliminated, by grounding LLMs in external knowledge sources
-
Systematic experiments show that these traditional approaches fail to explain why LLMs hallucinate in practice; for example, LLMs augmented with a massive Mixture of Memory Experts (MoME) can easily memorize large datasets of random numbers
-
We propose a first generation model called Lamini-1 that stores facts in a massive mixture of millions of memory experts that are retrieved dynamically.
-
-
ColPali: Efficient Document Retrieval with Vision Language Models, Manuel Faysse,Hugues Sibille,Tony Wu,Bilel Omrani,Gautier Viaud,Céline Hudelot,Pierre Colombo, 27-06-2024
Categories
Information Retrieval, Computation and Language, Computer Vision
Abstract
Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
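The late interaction matching mentioned here is ColBERT-style MaxSim scoring between query-token embeddings and page-patch embeddings; a small NumPy sketch with random stand-in embeddings:

```python
# MaxSim late-interaction scoring: each query-token embedding is matched to its most
# similar page-patch embedding and the per-token maxima are summed. Random stand-ins
# replace real ColPali embeddings here.
import numpy as np

def late_interaction_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """query_emb: (n_query_tokens, d); page_emb: (n_patches, d); rows L2-normalized."""
    sim = query_emb @ page_emb.T          # (n_query_tokens, n_patches)
    return float(sim.max(axis=1).sum())   # MaxSim per query token, summed

rng = np.random.default_rng(0)
query = rng.normal(size=(12, 128))
query /= np.linalg.norm(query, axis=1, keepdims=True)
pages = []
for _ in range(3):
    p = rng.normal(size=(1024, 128))
    pages.append(p / np.linalg.norm(p, axis=1, keepdims=True))

best_page = max(range(len(pages)), key=lambda i: late_interaction_score(query, pages[i]))
print("best matching page:", best_page)
```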
Bullet Points
-
The Visual Document Retrieval Benchmark ViDoRe is introduced to benchmark systems on visually rich document retrieval, with page-level retrieval tasks spanning multiple domains, languages, and settings
-
ColPali, a new retrieval model architecture, leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages; combined with a late interaction matching mechanism, it largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
-
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas, Xin Chan,Xiaoyang Wang,Dian Yu,Haitao Mi,Dong Yu, 28-06-2024
Categories
Computation and Language, Machine Learning
Abstract
We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
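A minimal sketch of persona-driven synthesis, assuming a hypothetical `call_llm` helper and an illustrative prompt template and personas (not the paper's):

```python
# Persona-driven synthesis: the same task template is paired with many personas so
# the LLM produces diverse samples. `call_llm` and the personas are illustrative.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

TEMPLATE = (
    "Create a challenging math word problem that {persona} might encounter, "
    "then provide a step-by-step solution."
)

personas = [
    "a structural engineer checking load limits on a footbridge",
    "a nurse scheduling shifts across three hospital wards",
    "a game designer balancing an in-game economy",
]

# synthetic_samples = [call_llm(TEMPLATE.format(persona=p)) for p in personas]
```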
Bullet Points
-
We propose a persona-driven data synthesis methodology that utilizes various perspectives within a large language model (LLM) to create diverse synthetic data
-
We introduce Persona Hub, a collection of 1 billion diverse personas automatically curated from web data, which can tap into almost every perspective encapsulated within the LLM, facilitating the creation of diverse synthetic data at scale for various scenarios
-
Use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs, and tools (functions) at scale demonstrate that persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation with a profound impact on LLM research and development.
-
-
Exploring Advanced Large Language Models with LLMsuite, Giorgio Roffo, 01-07-2024
Categories
Computation and Language, Computer Vision
Abstract
-
Searching for Best Practices in Retrieval-Augmented Generation, Xiaohua Wang,Zhenghua Wang,Xuan Gao,Feiran Zhang,Yixin Wu,Zhibo Xu,Tianyuan Shi,Zhengyuan Wang,Shizheng Li,Qi Qian,Ruicheng Yin,Changze Lv,Xiaoqing Zheng,Xuanjing Huang, 01-07-2024
Categories
Computation and Language
Abstract
Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a "retrieval as generation" strategy.
Bullet Points
-
Retrieval-augmented generation (RAG) techniques have been effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality in specialized domains
-
However, RAG approaches have complex implementation and prolonged response times
-
To identify optimal RAG practices, we investigate existing approaches and their potential combinations
-
We suggest strategies for deploying RAG that balance both performance and efficiency, and demonstrate that multimodal retrieval techniques can enhance question-answering capabilities and accelerate the generation of multimodal content using a "retrieval as generation" strategy.
-
-
Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models, Haritz Puerto,Tilek Chubakov,Xiaodan Zhu,Harish Tayyar Madabushi,Iryna Gurevych, 03-07-2024
Categories
Computation and Language
Abstract
-
AgentInstruct: Toward Generative Teaching with Agentic Flows, Arindam Mitra,Luciano Del Corro,Guoqing Zheng,Shweti Mahajan,Dany Rouhana,Andres Codas,Yadong Lu,Wei-ge Chen,Olga Vrousgos,Corby Rosset,Fillipe Silva,Hamed Khanpour,Yash Lara,Ahmed Awadallah, 03-07-2024
Categories
Artificial Intelligence, Computation and Language, Machine Learning
Abstract
Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. Despite several successful use cases, researchers also raised concerns around model collapse and drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically creating data by powerful models to teach a new skill or behavior to another model, we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and responses, using only raw data sources like text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post training dataset of 25M pairs to teach language models different skills, such as text editing, creative writing, tool usage, coding, reading comprehension, etc. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model Orca-3 to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks. For example, 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH and 45% improvement on AlpacaEval. Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.
Bullet Points
-
Synthetic data is important for accelerating language models, but it also raises concerns about model collapse and drawbacks of imitating other models
-
The use of synthetic data for post-training, specifically creating data by powerful models to teach a new skill or behavior to another model, is called Generative Teaching
-
AgentInstruct is an extensible agentic framework that automatically creates large amounts of diverse and high-quality synthetic data
-
It can create both prompts and responses, using only raw data sources like text documents and code files as seeds
-
We post-train Mistral-7b with the data and observe significant improvements across many benchmarks, such as 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH, and 45% improvement on AlpacaEval
-
Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.
-
-
Mixture of A Million Experts, Xu Owen He, 04-07-2024
Categories
Machine Learning, Artificial Intelligence
Abstract
The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
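The product key technique can be sketched as follows: the query is split in half, each half is scored against its own set of sub-keys, and the top candidates from the Cartesian product index into the pool of N = n * n tiny experts. A NumPy sketch under those assumptions (dimensions and scoring details are illustrative, not the paper's configuration):

```python
# Product-key expert retrieval: split the query in half, score each half against
# sqrt(N) sub-keys, and let the top candidates from the Cartesian product index into
# a pool of N = n * n tiny experts. Dimensions and scoring are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 64, 32, 4                  # half query dim, sub-keys per half, top-k
subkeys1 = rng.normal(size=(n, d))   # keys matched against the first query half
subkeys2 = rng.normal(size=(n, d))   # keys matched against the second query half

def retrieve_experts(query: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    q1, q2 = query[:d], query[d:]
    s1, s2 = subkeys1 @ q1, subkeys2 @ q2
    top1, top2 = np.argsort(s1)[-k:], np.argsort(s2)[-k:]
    # a candidate expert's score is the sum of its two sub-key scores
    cand_scores = s1[top1][:, None] + s2[top2][None, :]
    flat = np.argsort(cand_scores, axis=None)[-k:]
    i, j = np.unravel_index(flat, (k, k))
    expert_ids = top1[i] * n + top2[j]            # index into the n * n expert pool
    weights = np.exp(cand_scores[i, j])
    weights /= weights.sum()                      # softmax over the selected experts
    return expert_ids, weights

ids, weights = retrieve_experts(rng.normal(size=2 * d))
print("selected experts:", ids, "router weights:", weights.round(3))
```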
Bullet Points
-
The feedforward layers in transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows
-
Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost
-
The fine-grained MoE scaling law shows that higher granularity leads to better performance
-
Existing MoE models are limited to a small number of experts due to computational and optimization challenges
-
This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million)
-
Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off
-
-
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence, Weize Chen,Ziming You,Ran Li,Yitong Guan,Chen Qian,Chenyang Zhao,Cheng Yang,Ruobing Xie,Zhiyuan Liu,Maosong Sun, 09-07-2024
Categories
Computation and Language
Abstract
-
Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning, Yulong Wang,Tianhao Shen,Lifeng Liu,Jian Xie, 15-07-2024
Categories
Artificial Intelligence, Computation and Language
Abstract
Existing agents based on large language models (LLMs) demonstrate robust problem-solving capabilities by integrating LLMs' inherent knowledge, strong in-context learning and zero-shot capabilities, and the use of tools combined with intricately designed LLM invocation workflows by humans. However, these agents still exhibit shortcomings in long-term reasoning and under-use the potential of existing tools, leading to noticeable deficiencies in complex real-world reasoning scenarios. To address these limitations, we introduce Sibyl, a simple yet powerful LLM-based agent framework designed to tackle complex reasoning tasks by efficiently leveraging a minimal set of tools. Drawing inspiration from Global Workspace Theory, Sibyl incorporates a global workspace to enhance the management and sharing of knowledge and conversation history throughout the system. Furthermore, guided by Society of Mind Theory, Sibyl implements a multi-agent debate-based jury to self-refine the final answers, ensuring a comprehensive and balanced approach. This approach aims to reduce system complexity while expanding the scope of problems solvable-from matters typically resolved by humans in minutes to those requiring hours or even days, thus facilitating a shift from System-1 to System-2 thinking. Sibyl has been designed with a focus on scalability and ease of debugging by incorporating the concept of reentrancy from functional programming from its inception, with the aim of seamless and low effort integration in other LLM applications to improve capabilities. Our experimental results on the GAIA benchmark test set reveal that the Sibyl agent instantiated with GPT-4 achieves state-of-the-art performance with an average score of 34.55%, compared to other agents based on GPT-4. We hope that Sibyl can inspire more reliable and reusable LLM-based agent solutions to address complex real-world reasoning tasks.
Bullet Points
-
Existing agents based on large language models (LLMs) have robust problem-solving capabilities, but they exhibit shortcomings in long-term reasoning and under-use of existing tools, leading to noticeable deficiencies in complex real-world reasoning scenarios
-
To address these limitations, we introduce Sibyl, a simple yet powerful LLM-based agent framework designed to tackle complex reasoning tasks by efficiently leveraging a minimal set of tools
-
The framework incorporates a global workspace, inspired by Global Workspace Theory, to manage and share knowledge and conversation history across the system, and a multi-agent debate-based jury, guided by Society of Mind Theory, to self-refine the final answers for a comprehensive and balanced approach
-
This design aims to reduce system complexity while expanding the scope of solvable problems from matters typically resolved by humans in minutes to those requiring hours or even days, facilitating a shift from System-1 to System-2 thinking; Sibyl is also built for scalability and ease of debugging by incorporating reentrancy from functional programming, and instantiated with GPT-4 it achieves state-of-the-art performance on the GAIA benchmark test set with an average score of 34.55%.
-
-
A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks, Shubham Vatsal,Harsh Dubey, 17-07-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Large language models (LLMs) have shown remarkable performance on many different Natural Language Processing (NLP) tasks. Prompt engineering plays a key role in adding more to the already existing abilities of LLMs to achieve significant performance gains on various NLP tasks. Prompt engineering requires composing natural language instructions called prompts to elicit knowledge from LLMs in a structured way. Unlike previous state-of-the-art (SoTA) models, prompt engineering does not require extensive parameter re-training or fine-tuning based on the given NLP task and thus solely operates on the embedded knowledge of LLMs. Additionally, LLM enthusiasts can intelligently extract LLMs' knowledge through a basic natural language conversational exchange or prompt engineering, allowing more and more people even without deep mathematical machine learning background to experiment with LLMs. With prompt engineering gaining popularity in the last two years, researchers have come up with numerous engineering techniques around designing prompts to improve accuracy of information extraction from the LLMs. In this paper, we summarize different prompting techniques and club them together based on different NLP tasks that they have been used for. We further granularly highlight the performance of these prompting strategies on various datasets belonging to that NLP task, talk about the corresponding LLMs used, present a taxonomy diagram and discuss the possible SoTA for specific datasets. In total, we read and present a survey of 44 research papers which talk about 39 different prompting methods on 29 different NLP tasks of which most of them have been published in the last two years.
Bullet Points
-
Prompt engineering composes natural language instructions, called prompts, to elicit knowledge from large language models (LLMs) in a structured way
-
Unlike previous state-of-the-art models, it does not require extensive parameter re-training or fine-tuning for the given NLP task and operates solely on the embedded knowledge of LLMs
-
LLM enthusiasts can extract knowledge through a basic natural language conversational exchange or prompt engineering, allowing more people even without deep mathematical machine learning background to experiment with them
-
The paper summarizes different prompting techniques, groups them by the NLP tasks they have been used for, and presents a survey of 44 research papers covering 39 different prompting methods on 29 different NLP tasks, most of them published in the last two years
-
It highlights the performance of these prompting strategies on various datasets for each NLP task, discusses the corresponding LLMs used, presents a taxonomy diagram, and discusses the possible SoTA for specific datasets.
-
-
Open Artificial Knowledge, Vadim Borisov,Richard H. Schreiber, 19-07-2024
Categories
Computation and Language, Machine Learning
Abstract
-
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering, Rujun Han,Yuhao Zhang,Peng Qi,Yumo Xu,Jenyuan Wang,Lan Liu,William Yang Wang,Bonan Min,Vittorio Castelli, 19-07-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
Bullet Points
-
The article proposes Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains
-
It proposes RAG-QA Arena, which directly compares model-generated answers against LFRQA's answers using LLMs as evaluators; RAG-QA Arena and human judgments on answer quality are highly correlated, and only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, making it a challenging evaluation platform for future research.
-
-
A Survey on Employing Large Language Models for Text-to-SQL Tasks, Liang Shi,Zhengju Tang,Zhi Yang, 21-07-2024
Categories
Computation and Language
Abstract
The increasing volume of data stored in relational databases has led to the need for efficient querying and utilization of this data in various sectors. However, writing SQL queries requires specialized knowledge, which poses a challenge for non-professional users trying to access and query databases. Text-to-SQL parsing solves this issue by converting natural language queries into SQL queries, thus making database access more accessible for non-expert users. To take advantage of the recent developments in Large Language Models (LLMs), a range of new methods have emerged, with a primary focus on prompt engineering and fine-tuning. This survey provides a comprehensive overview of LLMs in text-to-SQL tasks, discussing benchmark datasets, prompt engineering, fine-tuning methods, and future research directions. We hope this review will enable readers to gain a broader understanding of the recent advances in this field and offer some insights into its future trajectory.
Bullet Points
-
This survey provides a comprehensive overview of LLMs in text-to-SQL tasks, discussing benchmark datasets, prompt engineering, fine-tuning methods, and future research directions
-
It aims to provide readers with a broader understanding of recent advances in this field and offer insights into its future trajectory.
-
-
TaskGen: A Task-Based, Memory-Infused Agentic Framework using StrictJSON, John Chong Min Tan,Prince Saroj,Bharat Runwal,Hardik Maheshwari,Brian Lim Yi Sheng,Richard Cottrill,Alankrit Chona,Ambuj Kumar,Mehul Motani, 22-07-2024
Categories
Artificial Intelligence, Multiagent Systems
Abstract
TaskGen is an open-sourced agentic framework which uses an Agent to solve an arbitrary task by breaking them down into subtasks. Each subtask is mapped to an Equipped Function or another Agent to execute. In order to reduce verbosity (and hence token usage), TaskGen uses StrictJSON that ensures JSON output from the Large Language Model (LLM), along with additional features such as type checking and iterative error correction. Key to the philosophy of TaskGen is the management of information/memory on a need-to-know basis. We empirically evaluate TaskGen on various environments such as 40x40 dynamic maze navigation with changing obstacle locations (100% solve rate), TextWorld escape room solving with dense rewards and detailed goals (96% solve rate), web browsing (69% of actions successful), solving the MATH dataset (71% solve rate over 100 Level-5 problems), Retrieval Augmented Generation on NaturalQuestions dataset (F1 score of 47.03%)
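A minimal sketch of the StrictJSON idea as described: request JSON against a fixed schema, type-check the reply, and feed validation errors back for iterative correction. `call_llm` and the schema are hypothetical; this is not the TaskGen library's actual API:

```python
# StrictJSON-style constrained output: request JSON with a fixed schema, type-check
# the reply, and feed validation errors back for iterative correction.
# `call_llm` and the schema are hypothetical; this is not the TaskGen library's API.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

SCHEMA = {"subtask": str, "equipped_function": str, "arguments": dict}

def strict_json(task: str, max_retries: int = 3) -> dict:
    schema_desc = {key: typ.__name__ for key, typ in SCHEMA.items()}
    prompt = (
        f"Task: {task}\n"
        f"Reply with JSON only, using exactly these keys and types: {schema_desc}"
    )
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
            for key, typ in SCHEMA.items():
                if not isinstance(parsed.get(key), typ):
                    raise TypeError(f"key '{key}' must be of type {typ.__name__}")
            return parsed
        except (json.JSONDecodeError, TypeError) as err:
            prompt += f"\nYour previous output was invalid ({err}). Please try again."
    raise RuntimeError("could not obtain schema-conformant JSON")
```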
Bullet Points
-
TaskGen is an open-source agentic framework that uses an Agent to solve an arbitrary task by breaking it down into subtasks
-
Each subtask is mapped to an Equipped Function or another Agent to execute
-
StrictJSON ensures JSON output from the Large Language Model (LLM), along with additional features such as type checking and iterative error correction
-
The philosophy of TaskGen emphasizes the management of information/memory on a need-to-know basis
-
We have evaluated TaskGen on various environments such as 40x40 dynamic maze navigation, TextWorld escape room solving with dense rewards and detailed goals, web browsing (69% of actions successful), solving the MATH dataset (71% solve rate over 100 Level-5 problems), and Retrieval Augmented Generation on NaturalQuestions dataset (F1 score of 47.03%).
-
-
PersonaGym: Evaluating Persona Agents and LLMs, Vinay Samuel,Henry Peng Zou,Yue Zhou,Shreyas Chaudhari,Ashwin Kalyan,Tanmay Rajpurohit,Ameet Deshpande,Karthik Narasimhan,Vishvak Murahari, 25-07-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
Persona agents, which are LLM agents that act according to an assigned persona, have demonstrated impressive contextual response capabilities across various applications. These persona agents offer significant enhancements across diverse sectors, such as education, healthcare, and entertainment, where model developers can align agent responses to different user requirements thereby broadening the scope of agent applications. However, evaluating persona agent performance is incredibly challenging due to the complexity of assessing persona adherence in free-form interactions across various environments that are relevant to each persona agent. We introduce PersonaGym, the first dynamic evaluation framework for assessing persona agents, and PersonaScore, the first automated human-aligned metric grounded in decision theory for comprehensive large-scale evaluation of persona agents. Our evaluation of 6 open and closed-source LLMs, using a benchmark encompassing 200 personas and 10,000 questions, reveals significant opportunities for advancement in persona agent capabilities across state-of-the-art models. For example, Claude 3.5 Sonnet only has a 2.97% relative improvement in PersonaScore than GPT 3.5 despite being a much more advanced model. Importantly, we find that increased model size and complexity do not necessarily imply enhanced persona agent capabilities thereby highlighting the pressing need for algorithmic and architectural invention towards faithful and performant persona agents.
Bullet Points
-
Persona agents, LLM agents that act according to an assigned persona, have demonstrated contextual response capabilities across various applications
-
They offer enhancements across different sectors, allowing model developers to align agent responses to different user requirements
-
However, evaluating persona agent performance is challenging due to the complexity of assessing persona adherence in free-form interactions across various environments
-
We introduce PersonaGym, the first dynamic evaluation framework for assessing persona agents, and PersonaScore, the first automated human-aligned metric grounded in decision theory for comprehensive large-scale evaluation of persona agents
-
Our evaluation of 6 open and closed-source LLMs, using a benchmark encompassing 200 personas and 10,000 questions, reveals significant room for improvement and highlights the need for algorithmic and architectural invention towards faithful and performant persona agents.
-
-
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher, Zehui Chen,Kuikun Liu,Qiuchen Wang,Jiangning Liu,Wenwei Zhang,Kai Chen,Feng Zhao, 29-07-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
applications, which implies that MindSearch can already deliver a competitive solution to the proprietary AI search engine.
Bullet Points
-
MindSearch can already deliver a competitive solution to the proprietary AI search engine.
-
-
Adaptive Retrieval-Augmented Generation for Conversational Systems, Xi Wang,Procheta Sen,Ruizhe Li,Emine Yilmaz, 31-07-2024
Categories
Computation and Language, Information Retrieval
Abstract
Despite the success of integrating large language models into the development of conversational systems, many studies have shown the effectiveness of retrieving and augmenting external knowledge for informative responses. Hence, many existing studies commonly assume the always need for Retrieval Augmented Generation (RAG) in a conversational system without explicit control. This raises a research question about such a necessity. In this study, we propose to investigate the need for each turn of system response to be augmented with external knowledge. In particular, by leveraging human judgements on the binary choice of adaptive augmentation, we develop RAGate, a gating model, which models conversation context and relevant inputs to predict if a conversational system requires RAG for improved responses. We conduct extensive experiments on devising and applying RAGate to conversational models and well-rounded analyses of different conversational scenarios. Our experimental results and analysis indicate the effective application of RAGate in RAG-based conversational systems in identifying system responses for appropriate RAG with high-quality responses and a high generation confidence. This study also identifies the correlation between the generation's confidence level and the relevance of the augmented knowledge.
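A toy sketch of an adaptive-retrieval gate in this spirit: a binary classifier over an embedding of the conversation context decides, per turn, whether to invoke retrieval at all. The embedding function and the training labels below are stand-ins, not the paper's RAGate model:

```python
# Adaptive-retrieval gate: a binary classifier over an embedding of the conversation
# context decides, per turn, whether to call the retriever at all. The embedding
# function and the toy training labels are stand-ins, not the paper's RAGate model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(text: str) -> np.ndarray:
    # stand-in embedding; replace with a real sentence encoder
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

contexts = [
    "Who won the 2022 World Cup?",            # needs external knowledge
    "Thanks, that was really helpful!",       # chit-chat, no retrieval needed
    "What are the side effects of ibuprofen?",
    "Good morning! How are you today?",
]
needs_rag = [1, 0, 1, 0]

gate = LogisticRegression().fit(np.stack([embed(c) for c in contexts]), needs_rag)

def respond(context: str) -> str:
    if gate.predict(embed(context)[None])[0] == 1:
        return "generate with retrieved passages"   # RAG path
    return "generate from parametric knowledge"     # plain LLM path

print(respond("When was the Eiffel Tower built?"))
```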
Bullet Points
-
The study questions the common assumption that every turn of a conversational system's response needs Retrieval Augmented Generation (RAG), and investigates when such augmentation is actually necessary
-
We develop RAGate, a gating model that models conversation context and relevant inputs to predict if a conversational system requires RAG for improved responses
-
We conduct experiments on RAG-based conversational models and well-rounded analyses of different conversational scenarios to identify system responses for appropriate RAG with high-quality responses and a high generation confidence level
-
The study also identifies the correlation between the generation confidence level and the relevance of the augmented knowledge.
-
-
Agentic LLM Workflows for Generating Patient-Friendly Medical Reports, Malavikha Sudarshan,Sophie Shih,Estella Yee,Alina Yang,John Zou,Cathy Chen,Quan Zhou,Leon Chen,Chinmay Singhal,George Shih, 02-08-2024
Categories
Multiagent Systems
-
A Survey of Mamba, Haohao Qu,Liangbo Ning,Rui An,Wenqi Fan,Tyler Derr,Xin Xu,Qing Li, 02-08-2024
Categories
Machine Learning, Artificial Intelligence
Abstract
Deep learning, as a vital technique, has sparked a notable revolution in artificial intelligence. As the most representative architecture, Transformers have empowered numerous advanced models, especially the large language models that comprise billions of parameters, becoming a cornerstone in deep learning. Despite the impressive achievements, Transformers still face inherent limitations, particularly the time-consuming inference resulting from the quadratic computation complexity of attention calculation. Recently, a novel architecture named Mamba, drawing inspiration from classical state space models, has emerged as a promising alternative for building foundation models, delivering comparable modeling abilities to Transformers while preserving near-linear scalability concerning sequence length. This has sparked an increasing number of studies actively exploring Mamba's potential to achieve impressive performance across diverse domains. Given such rapid evolution, there is a critical need for a systematic review that consolidates existing Mamba-empowered models, offering a comprehensive understanding of this emerging model architecture. In this survey, we therefore conduct an in-depth investigation of recent Mamba-associated studies, covering from three main aspects: the advancements of Mamba-based models, the techniques of adapting Mamba to diverse data, and the applications where Mamba can excel. Specifically, we first recall the foundational knowledge of various representative deep learning models and the details of Mamba as preliminaries. Then, to showcase the significance of Mamba, we comprehensively review the related studies focusing on Mamba models' architecture design, data adaptability, and applications. Finally, we present an discussion of current limitations and explore various promising research directions to provide deeper insights for future investigations.
Bullet Points
-
The survey consolidates existing Mamba-empowered models to provide a comprehensive understanding of this emerging architecture, covering the advancements of Mamba-based models, their architecture design, data adaptability, and applications
-
The study covers the foundational knowledge of representative deep learning models and Mamba models as preliminaries, and explores the techniques of adapting Mamba to diverse data to showcase its potential to achieve impressive performance across diverse domains
-
It also presents a discussion of current limitations and promising research directions.
-
-
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework, Kunlun Zhu,Yifan Luo,Dingling Xu,Ruobing Wang,Shi Yu,Shuo Wang,Yukun Yan,Zhenghao Liu,Xu Han,Zhiyuan Liu,Maosong Sun, 02-08-2024
Categories
Computation and Language, Information Retrieval
Abstract
Retrieval-Augmented Generation (RAG) systems have demonstrated their advantages in alleviating the hallucination of Large Language Models (LLMs). Existing RAG benchmarks mainly focus on evaluating whether LLMs can correctly answer the general knowledge. However, they are unable to evaluate the effectiveness of the RAG system in dealing with the data from different vertical domains. This paper introduces RAGEval, a framework for automatically generating evaluation datasets to evaluate the knowledge usage ability of different LLMs in different scenarios. Specifically, RAGEval summarizes a schema from seed documents, applies the configurations to generate diverse documents, and constructs question-answering pairs according to both articles and configurations. We propose three novel metrics, Completeness, Hallucination, and Irrelevance, to carefully evaluate the responses generated by LLMs. By benchmarking RAG models in vertical domains, RAGEval has the ability to better evaluate the knowledge usage ability of LLMs, which avoids the confusion regarding the source of knowledge in answering question in existing QA datasets--whether it comes from parameterized memory or retrieval.
Bullet Points
-
RAGEval is a framework for automatically generating evaluation datasets to evaluate the knowledge usage ability of different LLMs in different scenarios
-
It uses a schema from seed documents, applies the configurations to generate diverse documents, and constructs question-answering pairs according to both articles and configurations
-
Three novel metrics, Completeness, Hallucination, and Irrelevance, are proposed to carefully evaluate the responses generated by LLMs
-
By benchmarking RAG models in vertical domains, it can better evaluate their knowledge usage ability, avoiding the confusion in existing QA datasets over whether an answer comes from parameterized memory or retrieval.
-
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone, Yuan Yao,Tianyu Yu,Ao Zhang,Chongyi Wang,Junbo Cui,Hongji Zhu,Tianchi Cai,Haoyu Li,Weilin Zhao,Zhihui He,Qianyu Chen,Huarong Zhou,Zhensheng Zou,Haoye Zhang,Shengding Hu,Zhi Zheng,Jie Zhou,Jie Cai,Xu Han,Guoyang Zeng,Dahai Li,Zhiyuan Liu,Maosong Sun, 03-08-2024
Categories
Computer Vision
Abstract
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
Bullet Points
-
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped AI research and industry, but significant challenges still prevent MLLMs from being practical in real-world applications, most notably the huge cost of running a model with a massive number of parameters and extensive computation
-
The latest MiniCPM-Llama3-V 2.5 offers strong performance (outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass), strong OCR capability with 1.8M-pixel high-resolution image perception at any aspect ratio, trustworthy behavior with low hallucination rates, multilingual support for 30+ languages, and efficient deployment on mobile phones; it also exemplifies a promising trend in which the model sizes needed for usable (e.g., GPT-4V) level performance are rapidly decreasing while end-side computation capacity grows.
-
-
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future, Haolin Jin,Linghan Huang,Haipeng Cai,Jun Yan,Bo Li,Huaming Chen, 05-08-2024
Categories
Software Engineering, Artificial Intelligence, Computation and Language
Abstract
With the rise of large language models (LLMs), researchers are increasingly exploring their applications in various vertical domains, such as software engineering. LLMs have achieved remarkable success in areas including code generation and vulnerability detection. However, they also exhibit numerous limitations and shortcomings. LLM-based agents, a novel technology with the potential for Artificial General Intelligence (AGI), combine LLMs as the core for decision-making and action-taking, addressing some of the inherent limitations of LLMs such as lack of autonomy and self-improvement. Despite numerous studies and surveys exploring the possibility of using LLMs in software engineering, the literature lacks a clear distinction between LLMs and LLM-based agents. It is still in its early stage for a unified standard and benchmarking to qualify an LLM solution as an LLM-based agent in its domain. In this survey, we broadly investigate the current practice and solutions for LLMs and LLM-based agents for software engineering. In particular, we summarise six key topics: requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance. We review and differentiate the work of LLMs and LLM-based agents from these six topics, examining their differences and similarities in tasks, benchmarks, and evaluation metrics. Finally, we discuss the models and benchmarks used, providing a comprehensive analysis of their applications and effectiveness in software engineering. We anticipate this work will shed some light on pushing the boundaries of LLM-based agents in software engineering for future research.
Bullet Points
-
The survey investigates the current practice and solutions for LLMs and LLM-based agents for software engineering, examining their similarities and differences in tasks, benchmarks, and evaluation metrics
-
It identifies six key topics, including requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance, and discusses the models and benchmarks used, providing a comprehensive analysis of their applications and effectiveness in software engineering.
-
-
Self-Taught Evaluators, Tianlu Wang,Ilia Kulikov,Olga Golovneva,Ping Yu,Weizhe Yuan,Jane Dwivedi-Yu,Richard Yuanzhe Pang,Maryam Fazel-Zarandi,Jason Weston,Xian Li, 05-08-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
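A rough sketch of one iteration of the scheme, assuming hypothetical `call_llm` and `finetune_judge` helpers and illustrative prompts (the paper's exact prompting and filtering may differ):

```python
# One iteration of the self-improvement loop: generate a preferred response and a
# deliberately degraded one, have the current judge produce a reasoning trace plus
# verdict, and keep only judgments whose verdict matches the known ordering as
# training data for the next judge. All helpers and prompts are illustrative.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in the current judge/generator model here")

def finetune_judge(examples: list[dict]) -> None:
    raise NotImplementedError("plug in your fine-tuning pipeline here")

def one_iteration(instructions: list[str]) -> list[dict]:
    training_examples = []
    for instruction in instructions:
        good = call_llm(instruction)
        bad = call_llm(f"Answer the following, but introduce a subtle flaw:\n{instruction}")
        judgment = call_llm(
            "Compare Response A and Response B to the instruction below. Explain your "
            "reasoning, then end with exactly 'Winner: A' or 'Winner: B'.\n"
            f"Instruction: {instruction}\nResponse A: {good}\nResponse B: {bad}"
        )
        if judgment.strip().endswith("Winner: A"):  # verdict agrees with construction
            training_examples.append({"instruction": instruction, "judgment": judgment})
    return training_examples

# for each iteration: finetune_judge(one_iteration(unlabeled_instructions))
```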
Bullet Points
-
The work proposes an approach to improve evaluators without human annotations using only synthetic training data: starting from unlabeled instructions, an iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions
-
This approach outperforms commonly used LLM judges such as GPT-4 and matches the performance of top-performing reward models trained with labeled examples.
-
-
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation, Daniel Fleischer,Moshe Berchansky,Moshe Wasserblat,Peter Izsak, 05-08-2024
Categories
Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning
-
Transformer Explainer: Interactive Learning of Text-Generative Models, Aeree Cho,Grace C. Kim,Alexander Karpekov,Alec Helbling,Zijie J. Wang,Seongmin Lee,Benjamin Hoover,Duen Horng Chau, 08-08-2024
Categories
Machine Learning, Artificial Intelligence, Computation and Language, Human-Computer Interaction
-
AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems, Victor Dibia,Jingya Chen,Gagan Bansal,Suff Syed,Adam Fourney,Erkang Zhu,Chi Wang,Saleema Amershi, 09-08-2024
Categories
Software Engineering, Artificial Intelligence, Computation and Language, Human-Computer Interaction, Machine Learning
-
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers, Zhenting Qi,Mingyuan Ma,Jiahang Xu,Li Lyna Zhang,Fan Yang,Mao Yang, 12-08-2024
Categories
Computation and Language
Abstract
-
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, Chris Lu,Cong Lu,Robert Tjarko Lange,Jakob Foerster,Jeff Clune,David Ha, 12-08-2024
Categories
Artificial Intelligence, Computation and Language, Machine Learning
-
OpenResearcher: Unleashing AI for Accelerated Scientific Research, Yuxiang Zheng,Shichao Sun,Lin Qiu,Dongyu Ru,Cheng Jiayang,Xuefeng Li,Jifan Lin,Binjie Wang,Yun Luo,Renjie Pan,Yang Xu,Qingkai Min,Zizhao Zhang,Yiwen Wang,Wenjie Li,Pengfei Liu, 13-08-2024
Categories
Information Retrieval
-
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation, Dongyu Ru,Lin Qiu,Xiangkun Hu,Tianhang Zhang,Peng Shi,Shuaichen Chang,Cheng Jiayang,Cunxiang Wang,Shichao Sun,Huanyu Li,Zizhao Zhang,Binjie Wang,Jiarong Jiang,Tong He,Zhiguo Wang,Pengfei Liu,Yue Zhang,Zheng Zhang, 15-08-2024
Categories
Computation and Language, Artificial Intelligence
-
FuseChat: Knowledge Fusion of Chat Models, Fanqi Wan,Longguang Zhong,Ziyi Yang,Ruijun Chen,Xiaojun Quan, 15-08-2024
Categories
Computation and Language
-
Automated Design of Agentic Systems, Shengran Hu,Cong Lu,Jeff Clune, 15-08-2024
Categories
Artificial Intelligence
Abstract
Researchers are investing substantial effort in developing powerful general-purpose agents, wherein Foundation Models are used as modules within agentic systems (e.g. Chain-of-Thought, Self-Reflection, Toolformer). However, the history of machine learning teaches us that hand-designed solutions are eventually replaced by learned solutions. We formulate a new research area, Automated Design of Agentic Systems (ADAS), which aims to automatically create powerful agentic system designs, including inventing novel building blocks and/or combining them in new ways. We further demonstrate that there is an unexplored yet promising approach within ADAS where agents can be defined in code and new agents can be automatically discovered by a meta agent programming ever better ones in code. Given that programming languages are Turing Complete, this approach theoretically enables the learning of any possible agentic system: including novel prompts, tool use, control flows, and combinations thereof. We present a simple yet effective algorithm named Meta Agent Search to demonstrate this idea, where a meta agent iteratively programs interesting new agents based on an ever-growing archive of previous discoveries. Through extensive experiments across multiple domains including coding, science, and math, we show that our algorithm can progressively invent agents with novel designs that greatly outperform state-of-the-art hand-designed agents. Importantly, we consistently observe the surprising result that agents invented by Meta Agent Search maintain superior performance even when transferred across domains and models, demonstrating their robustness and generality. Provided we develop it safely, our work illustrates the potential of an exciting new research direction toward automatically designing ever-more powerful agentic systems to benefit humanity.
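A minimal sketch of an archive-driven meta-agent loop in this spirit; `call_llm` and `evaluate_agent` are hypothetical stand-ins, not the authors' implementation:

```python
# Archive-driven meta-agent loop: the meta agent writes a new agent as code,
# conditioned on previously discovered agents and their scores; the archive grows
# and the best designs are kept. `call_llm` and `evaluate_agent` are stand-ins.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("meta agent: plug in your LLM client here")

def evaluate_agent(agent_code: str) -> float:
    raise NotImplementedError("run the candidate agent on a validation task suite")

def meta_agent_search(task_description: str, iterations: int = 20) -> list[dict]:
    archive: list[dict] = []
    for _ in range(iterations):
        context = "\n\n".join(
            f"# score={entry['score']:.3f}\n{entry['code']}" for entry in archive[-5:]
        )
        agent_code = call_llm(
            f"Task domain: {task_description}\n"
            f"Previously discovered agents and their scores:\n{context}\n"
            "Propose a novel, improved agent as a single Python function forward(task)."
        )
        archive.append({"code": agent_code, "score": evaluate_agent(agent_code)})
    return sorted(archive, key=lambda entry: entry["score"], reverse=True)
```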
Bullet Points
-
Researchers are investing substantial effort in powerful general-purpose agents that use Foundation Models as modules, but the history of machine learning teaches that hand-designed solutions are eventually replaced by learned ones
-
A new research area is Automated Design of Agentic Systems (ADAS), which aims to automatically create powerful agentic system designs, including inventing novel building blocks and/or combining them in new ways
-
The approach is unexplored yet promising, where agents can be defined in code and new agents are automatically discovered by a meta agent programming ever better ones in code
-
We present a simple yet effective algorithm named Meta Agent Search, which can progressively invent agents with novel designs that greatly outperform state-of-the-art hand-designed agents
-
Agents invented by Meta Agent Search maintain superior performance even when transferred across domains and models, demonstrating their robustness and generality and the potential of automatically designing ever-more powerful agentic systems to benefit humanity.
-
-
Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information, Ming Jiang,Tingting Huang,Biao Guo,Yao Lu,Feng Zhang, 20-08-2024
Categories
Computation and Language
Abstract
In recent years, large language models (LLMs) have garnered significant attention due to their superior performance in complex reasoning tasks. However, recent studies show that their reasoning capabilities may diminish markedly when problem descriptions contain irrelevant information, even with the use of advanced prompting techniques. To further investigate this issue, a dataset of primary school mathematics problems containing irrelevant information, named GSMIR, was constructed. Testing prominent LLMs and prompting techniques on this dataset revealed that while LLMs can identify irrelevant information, they do not effectively mitigate the interference it causes once identified. A novel automatic construction method, ATF, which enhances the ability of LLMs to identify and self-mitigate the influence of irrelevant information, is proposed to address this shortcoming. This method operates in two steps: first, analysis of irrelevant information, followed by its filtering. The ATF method, as demonstrated by experimental results, significantly improves the reasoning performance of LLMs and prompting techniques, even in the presence of irrelevant information on the GSMIR dataset.
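A minimal sketch of the two-step analyze-then-filter idea, with a hypothetical `call_llm` helper and illustrative prompts rather than the paper's exact wording:

```python
# Two-step analyse-then-filter: the model first flags sentences irrelevant to the
# question, the flagged sentences are removed, and the cleaned problem is solved.
# `call_llm` and the prompts are illustrative, not the paper's exact wording.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def solve_with_atf(problem: str) -> str:
    analysis = call_llm(
        "List any sentences in the following math problem that are irrelevant to "
        f"answering the question, one per line, or write NONE:\n{problem}"
    )
    filtered = problem
    for line in analysis.splitlines():
        sentence = line.strip("- ").strip()
        if sentence and sentence.upper() != "NONE":
            filtered = filtered.replace(sentence, "")  # drop the irrelevant sentence
    return call_llm(f"Solve the problem step by step:\n{filtered}")
```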
Bullet Points
-
Large language models (LLMs) have gained attention due to their superior performance in complex reasoning tasks
-
However, recent studies show that their reasoning capabilities can diminish markedly when problem descriptions contain irrelevant information, even with advanced prompting techniques
-
To address this issue, a dataset of primary school mathematics problems containing irrelevant information named GSMIR was constructed
-
Testing prominent LLMs and prompting techniques on this dataset revealed that while LLMs can identify irrelevant information, they do not effectively mitigate the interference it causes once identified
-
A novel automatic construction method, ATF, is proposed to address this shortcoming
-
This method operates in two steps: analysis of irrelevant information followed by its filtering
-
Experimental results demonstrate that the ATF method significantly improves the reasoning performance of LLMs and prompting techniques on the GSMIR dataset, even in the presence of irrelevant information.
-
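As a rough illustration of the two-step analyze-then-filter idea described above, the Python sketch below wires three calls to a hypothetical `call_llm` function: one to flag irrelevant sentences, one to rewrite the problem without them, and one to solve the cleaned problem. The prompt texts are assumptions; the paper's ATF prompts are not reproduced here.
    # Two-step sketch in the spirit of ATF: analysis of irrelevant information, then filtering.
    def solve_with_atf(call_llm, problem: str) -> str:
        # Step 1: ask the model to identify sentences irrelevant to the question.
        analysis = call_llm(
            "Read the problem and list any sentences that are irrelevant to "
            "answering the question:\n" + problem
        )
        # Step 2: ask the model to restate the problem without the irrelevant parts.
        filtered = call_llm(
            "Rewrite the problem, removing the irrelevant sentences listed below.\n"
            f"Problem:\n{problem}\nIrrelevant sentences:\n{analysis}"
        )
        # Finally, solve the filtered problem with an ordinary reasoning prompt.
        return call_llm("Solve step by step:\n" + filtered)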
-
Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution, Yucheng Ruan,Xiang Lan,Jingying Ma,Yizhi Dong,Kai He,Mengling Feng, 20-08-2024
Categories
Computation and Language
-
Automating Thought of Search: A Journey Towards Soundness and Completeness, Daniel Cao,Michael Katz,Harsha Kokel,Kavitha Srinivas,Shirin Sohrabi, 21-08-2024
Categories
Artificial Intelligence
Abstract
Planning remains one of the last standing bastions for large language models (LLMs), which now turn their attention to search. Most of the literature uses the language models as world models to define the search space, forgoing soundness for the sake of flexibility. A recent work, Thought of Search (ToS), proposed defining the search space with code, having the language models produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test. The result, however, is worth the effort: all the tested datasets were solved with 100% accuracy. At the same time LLMs have demonstrated significant progress in code generation and refinement for complex reasoning tasks. In this work, we automate ToS (AutoToS), completely taking the human out of the loop of solving planning problems. AutoToS guides the language model step by step towards the generation of sound and complete search components, through feedback from both generic and domain specific unit tests. We achieve 100% accuracy, with minimal feedback iterations, using LLMs of various sizes on all evaluated domains.
Bullet Points
-
Thought of Search (ToS) proposes to define the search space with code, requiring a human in the loop to produce a sound successor function and goal test
-
The test datasets were solved with 100% accuracy, and LLMs have demonstrated progress in code generation and refinement for complex reasoning tasks
-
AutoToS guides the language model step by step towards the generation of sound and complete search components through feedback from both generic and domain specific unit tests, achieving 100% accuracy.
-
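The following Python sketch illustrates, under stated assumptions, the kind of feedback loop AutoToS describes: an LLM (via a hypothetical `call_llm`) drafts `successor` and `is_goal` functions, caller-supplied unit tests check them, and failure messages are fed back until the tests pass. It is a schematic, not the authors' code.
    # Each unit test returns an error message on failure, or an empty string / None on success.
    def auto_tos(call_llm, unit_tests, domain_description, max_rounds=5):
        feedback = ""
        for _ in range(max_rounds):
            code = call_llm(
                "Write Python functions `successor(state)` and `is_goal(state)` for "
                f"this search problem:\n{domain_description}\n{feedback}"
            )
            namespace = {}
            try:
                exec(code, namespace)
                failures = [msg for test in unit_tests
                            if (msg := test(namespace["successor"], namespace["is_goal"]))]
            except Exception as err:
                failures = [f"Code failed to run: {err}"]
            if not failures:
                return namespace["successor"], namespace["is_goal"]
            feedback = "Previous attempt failed these checks:\n" + "\n".join(failures)
        raise RuntimeError("No sound search components found within the round budget")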
-
Efficient Detection of Toxic Prompts in Large Language Models, Yi Liu,Junzhe Yu,Huijia Sun,Ling Shi,Gelei Deng,Yuqi Chen,Yang Liu, 21-08-2024
Categories
Cryptography and Security, Artificial Intelligence, Computation and Language, Software Engineering
Abstract
Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both blackbox and whitebox, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. Additionally, ToxicDetector's processing time of 0.0780 seconds per prompt makes it highly suitable for real-time applications. ToxicDetector achieves high accuracy, efficiency, and scalability, making it a practical method for toxic prompt detection in LLMs.
Bullet Points
-
ToxicDetector is a lightweight greybox method designed to efficiently detect toxic prompts in LLMs
-
It leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification
-
It achieves high accuracy, efficiency, and scalability, outperforming state-of-the-art methods
-
Its processing time is 0.0780 seconds per prompt and is suitable for real-time applications.
-
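A conceptual sketch of the greybox pipeline described above follows: prompt embeddings are compared against embeddings of toxic concept prompts to form a feature vector, and a small MLP classifies the result. The `embed` function and the similarity-based feature construction are assumptions made for illustration; the paper's feature extraction differs in detail.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def build_features(prompt, concept_prompts, embed):
        p = embed(prompt)                                # vector for the incoming prompt
        concepts = [embed(c) for c in concept_prompts]   # vectors for toxic concept prompts
        # Use similarity to each toxic concept as the feature vector.
        return np.array([
            float(np.dot(p, c) / (np.linalg.norm(p) * np.linalg.norm(c)))
            for c in concepts
        ])

    def train_detector(prompts, labels, concept_prompts, embed):
        X = np.stack([build_features(p, concept_prompts, embed) for p in prompts])
        clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
        clf.fit(X, labels)                               # labels: 1 = toxic, 0 = benign
        return clf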
-
MEDCO: Medical Education Copilots Based on A Multi-Agent Framework, Hao Wei,Jianing Qiu,Haibao Yu,Wu Yuan, 22-08-2024
Categories
Artificial Intelligence, Multiagent Systems
Abstract
Large language models (LLMs) have had a significant impact on diverse research domains, including medicine and healthcare. However, the potential of LLMs as copilots in medical education remains underexplored. Current AI-assisted educational tools are limited by their solitary learning approach and inability to simulate the multi-disciplinary and interactive nature of actual medical training. To address these limitations, we propose MEDCO (Medical EDucation COpilots), a novel multi-agent-based copilot system specially developed to emulate real-world medical training environments. MEDCO incorporates three primary agents: an agentic patient, an expert doctor, and a radiologist, facilitating a multi-modal and interactive learning environment. Our framework emphasizes the learning of proficient question-asking skills, multi-disciplinary collaboration, and peer discussions between students. Our experiments show that simulated virtual students who underwent training with MEDCO not only achieved substantial performance enhancements comparable to those of advanced models, but also demonstrated human-like learning behaviors and improvements, coupled with an increase in the number of learning samples. This work contributes to medical education by introducing a copilot that implements an interactive and collaborative learning approach. It also provides valuable insights into the effectiveness of AI-integrated training paradigms.
Bullet Points
-
MEDCO is a multi-agent-based copilot system developed to emulate real-world medical training environments
-
It incorporates three primary agents: an agentic patient, an expert doctor, and a radiologist, facilitating multi-modal and interactive learning environments
-
The system emphasizes the learning of proficient question-asking skills, multi-disciplinary collaboration, and peer discussions between students
-
Simulated virtual students trained with the system achieved substantial performance enhancements and demonstrated human-like learning behaviors and improvements as the number of learning samples increased
-
This work contributes to medical education by introducing a copilot that implements an interactive and collaborative learning approach and provides valuable insights into the effectiveness of AI-integrated training paradigms.
-
-
Persuasion Games using Large Language Models, Ganesh Prasath Ramani,Shirish Karande,Santhosh V,Yash Bhatia, 28-08-2024
Categories
Artificial Intelligence, Computation and Language
Abstract
We employ simulated personas and generate conversations in insurance, banking, and retail domains to evaluate the proficiency of large language models (LLMs) in recognizing, adjusting to, and influencing various personality types. Concurrently, we examine the resistance mechanisms employed by LLM simulated personas. Persuasion is quantified via measurable surveys before and after interaction, LLM-generated scores on conversation, and user decisions (purchase or non-purchase).
Bullet Points
-
Simulated personas are used to evaluate LLMs' proficiency in recognizing, adjusting to, and influencing personality types in insurance, banking, and retail domains
-
The resistance mechanisms employed by LLM-simulated personas are also examined, and persuasion is quantified via surveys before and after interaction, LLM-generated scores on conversation, and user decisions (purchase or non-purchase).
-
-
OLMoE: Open Mixture-of-Experts Language Models, Niklas Muennighoff,Luca Soldaini,Dirk Groeneveld,Kyle Lo,Jacob Morrison,Sewon Min,Weijia Shi,Pete Walsh,Oyvind Tafjord,Nathan Lambert,Yuling Gu,Shane Arora,Akshita Bhagia,Dustin Schwenk,David Wadden,Alexander Wettig,Binyuan Hui,Tim Dettmers,Douwe Kiela,Ali Farhadi,Noah A. Smith,Pang Wei Koh,Amanpreet Singh,Hannaneh Hajishirzi, 03-09-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
Bullet Points
-
OLMoE is an open language model that uses sparse Mixture-of-Experts (MoE)
-
It has 7 billion parameters but uses only 1B per input token
-
We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct
-
Our models outperform all available models with similar active parameters, surpassing larger ones
-
We present experiments on MoE training, analyze routing, and open-source all aspects of our work including model weights, training data, code, and logs.
-
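The toy PyTorch module below illustrates why a sparse Mixture-of-Experts model can hold many total parameters while activating only a fraction per token: a router picks the top-k experts for each token and only those experts run. The layer sizes and top-k value are arbitrary and are not OLMoE's actual configuration.
    import torch
    import torch.nn as nn

    class TopKMoE(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])
            self.top_k = top_k

        def forward(self, x):                              # x: (num_tokens, d_model)
            scores = self.router(x)                        # (num_tokens, num_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = weights.softmax(dim=-1)              # mixing weights for chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):                 # only the chosen experts run per token
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] = out[mask] + weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out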
-
Large Language Model-Based Agents for Software Engineering: A Survey, Junwei Liu,Kaixin Wang,Yixuan Chen,Xin Peng,Zhenpeng Chen,Lingming Zhang,Yiling Lou, 04-09-2024
Categories
Software Engineering, Artificial Intelligence
Abstract
-
Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities, Wei Lu,Rachel K. Luu,Markus J. Buehler, 05-09-2024
Categories
Computation and Language, Material Science, Artificial Intelligence
Abstract
The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, leading to improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring whether the results hold also for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and show that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.
Bullet Points
-
The study explores how fine-tuning strategies such as Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and preference-based optimization affect LLM performance, and shows that merging multiple fine-tuned models can produce new functionalities beyond the individual contributions of the parent models
-
Experiments with different model architectures, including Llama 3.1 8B and Mistral 7B, show similar merging behaviors, while a much smaller 1.7B-parameter model does not exhibit emergent capabilities under merging, suggesting that model scaling may be a key component
-
The study also explores the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design.
-
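As a minimal sketch of weight-space model merging, assuming two checkpoints with identical architectures, the snippet below simply interpolates parameters; real merging recipes studied in this line of work can be considerably more sophisticated.
    import torch

    def average_merge(state_dict_a, state_dict_b, alpha=0.5):
        """Interpolate two compatible model checkpoints parameter by parameter."""
        merged = {}
        for name, tensor_a in state_dict_a.items():
            tensor_b = state_dict_b[name]          # assumes identical architectures
            merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
        return merged

    # Usage sketch: model.load_state_dict(average_merge(sd_cpt, sd_sft, alpha=0.5))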
-
Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation, Yu Wang,Shiwan Zhao,Zhihu Wang,Heyuan Huang,Ming Fan,Yubo Zhang,Zhixing Wang,Haijun Wang,Ting Liu, 05-09-2024
Categories
Artificial Intelligence, Computation and Language, Human-Computer Interaction
Abstract
The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs). However, despite their widespread adoption and success, CoT methods often exhibit instability due to their inability to consistently ensure the quality of generated reasoning paths, leading to sub-optimal reasoning performance. To address this challenge, we propose the Strategic Chain-of-Thought (SCoT), a novel methodology designed to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps. SCoT employs a two-stage approach within a single prompt: first eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers. Our experiments across eight challenging reasoning datasets demonstrate significant improvements, including a 21.05% increase on the GSM8K dataset and 24.13% on the Tracking_Objects dataset, respectively, using the Llama3-8b model. Additionally, we extend the SCoT framework to develop a few-shot method with automatically matched demonstrations, yielding even stronger results. These findings underscore the efficacy of SCoT, highlighting its potential to substantially enhance LLM performance in complex reasoning tasks.
Bullet Points
-
Strategic Chain-of-Thought (SCoT) is a novel methodology designed to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps
-
It employs a two-stage approach that involves eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers
-
Experiments across eight challenging reasoning datasets demonstrate significant improvements, including a 21.05% increase on the GSM8K dataset and 24.13% on the Tracking_Objects dataset, respectively using the Llama3-8b model
-
Additionally, we extend the SCoT framework to develop a few-shot method with automatically matched demonstrations, yielding even stronger results
-
These findings underscore the efficacy of SCoT, highlighting its potential to substantially enhance LLM performance in complex reasoning tasks.
-
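The helper below is a hedged illustration of the two-stage structure the abstract describes: a single prompt that asks the model to first name a problem-solving strategy and only then apply it. The wording is an assumption, not the paper's actual SCoT template.
    # Sketch of a strategy-first prompt in the spirit of SCoT.
    def build_scot_prompt(question: str) -> str:
        return (
            "You will solve the problem in two stages within one response.\n"
            "Stage 1: State the most effective problem-solving strategy for this problem.\n"
            "Stage 2: Apply that strategy step by step and give the final answer.\n\n"
            f"Problem: {question}\n"
        )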
-
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers, Chenglei Si,Diyi Yang,Tatsunori Hashimoto, 06-09-2024
Categories
Computation and Language, Artificial Intelligence, Computers and Society, Human-Computer Interaction, Machine Learning
Abstract
Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome.
Bullet Points
-
LLMs have the potential to accelerate scientific discovery, but no evaluations have shown that they can produce novel, expert-level ideas, or perform the entire research process
-
An experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent results in the first statistically significant conclusion on current LLM capabilities for research ideation
-
LLM-generated ideas are judged as more novel than human expert ideas while being judged slightly weaker on feasibility
-
We identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and lack of diversity in generation
-
We propose a study design that recruits researchers to execute these ideas into full projects, allowing us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome.
-
-
Can Large Language Models Unlock Novel Scientific Research Ideas?, Sandeep Kumar,Tirthankar Ghosal,Vinayak Goyal,Asif Ekbal, 10-09-2024
Categories
Computation and Language, Artificial Intelligence, Computers and Society, Human-Computer Interaction, Machine Learning
Abstract
"An idea is nothing more nor less than a new combination of old elements" (Young, J.W.). The widespread adoption of Large Language Models (LLMs) and publicly available ChatGPT have marked a significant turning point in the integration of Artificial Intelligence (AI) into people's everyday lives. This study explores the capability of LLMs in generating novel research ideas based on information from research papers. We conduct a thorough examination of 4 LLMs in five domains (e.g., Chemistry, Computer, Economics, Medical, and Physics). We found that the future research ideas generated by Claude-2 and GPT-4 are more aligned with the author's perspective than GPT-3.5 and Gemini. We also found that Claude-2 generates more diverse future research ideas than GPT-4, GPT-3.5, and Gemini 1.0. We further performed a human evaluation of the novelty, relevancy, and feasibility of the generated future research ideas. This investigation offers insights into the evolving role of LLMs in idea generation, highlighting both its capability and limitations. Our work contributes to the ongoing efforts in evaluating and utilizing language models for generating future research ideas. We make our datasets and codes publicly available.
Bullet Points
-
The study explores the capability of LLMs in generating novel research ideas based on information from research papers
-
It examines 4 LLMs across five domains and finds that the future research ideas generated by Claude-2 and GPT-4 are more aligned with the authors' perspective than those of GPT-3.5 and Gemini, and that Claude-2 generates more diverse ideas than GPT-4, GPT-3.5, and Gemini 1.0
-
Human evaluation of novelty, relevancy, and feasibility of the generated research ideas is also conducted
-
The study contributes to ongoing efforts in evaluating and utilizing language models for generating future research ideas, and the datasets and code are made publicly available.
-
-
What is the Role of Small Models in the LLM Era: A Survey, Lihu Chen,Gaël Varoquaux, 10-09-2024
Categories
Computation and Language
Abstract
-
On the Diagram of Thought, Yifan Zhang,Yang Yuan,Andrew Chi-Chih Yao, 16-09-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
-
Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent, Fatemeh Haji,Mazal Bethany,Maryam Tabar,Jason Chiang,Anthony Rios,Peyman Najafirad, 17-09-2024
Categories
Artificial Intelligence
Abstract
Multi-agent strategies have emerged as a promising approach to enhance the reasoning abilities of Large Language Models (LLMs) by assigning specialized roles in the problem-solving process. Concurrently, Tree of Thoughts (ToT) methods have shown potential in improving reasoning for complex question-answering tasks by exploring diverse reasoning paths. A critical limitation in multi-agent reasoning is the 'Reasoner' agent's shallow exploration of reasoning paths. While ToT strategies could help mitigate this problem, they may generate flawed reasoning branches, which could harm the trustworthiness of the final answer. To leverage the strengths of both multi-agent reasoning and ToT strategies, we introduce a novel approach combining ToT-based Reasoner agents with a Thought Validator agent. Multiple Reasoner agents operate in parallel, employing ToT to explore diverse reasoning paths. The Thought Validator then scrutinizes these paths, considering a Reasoner's conclusion only if its reasoning is valid. This method enables a more robust voting strategy by discarding faulty reasoning paths, enhancing the system's ability to tackle tasks requiring systematic and trustworthy reasoning. Our method demonstrates superior performance compared to existing techniques when evaluated on the GSM8K dataset, outperforming the standard ToT strategy by an average 5.6% across four LLMs.
Bullet Points
-
Multi-agent strategies and Tree of Thoughts (ToT) methods can enhance reasoning abilities of LLMs by assigning specialized roles in the problem-solving process
-
However, a critical limitation of multi-agent reasoning is the 'Reasoner' agent's shallow exploration of reasoning paths
-
While ToT strategies could help mitigate this problem, they may generate flawed reasoning branches that harm the trustworthiness of the final answer
-
To leverage these strengths, we propose a novel approach combining ToT-based Reasoner agents with a Thought Validator agent
-
This method enables a more robust voting strategy by discarding faulty reasoning paths, enhancing the system's ability to tackle tasks requiring systematic and trustworthy reasoning
-
Our method outperforms the standard ToT strategy by an average of 5.6% on GSM8K across four LLMs.
-
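A small sketch of the reason-then-validate voting scheme described above: several Reasoner calls produce candidate answers, a Thought Validator accepts or rejects each reasoning path, and only validated answers are counted in the vote. `run_reasoner` and `validate` stand in for ToT-based and validation prompting calls and are assumptions here.
    from collections import Counter

    def validated_vote(run_reasoner, validate, question, num_reasoners=3):
        votes = []
        for i in range(num_reasoners):
            reasoning, answer = run_reasoner(question, seed=i)   # explore a reasoning path
            if validate(question, reasoning):                    # keep only sound paths
                votes.append(answer)
        if not votes:
            return None                                          # no trustworthy answer found
        return Counter(votes).most_common(1)[0][0]               # majority vote over valid answers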
-
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B, Jemin Lee,Sihyeong Park,Jinse Kwon,Jihun Oh,Yongin Kwon, 17-09-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Prior research works have evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks and old datasets. Additionally, recent large-scale models such as Llama 3.1 with up to 405B have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
Bullet Points
-
The paper evaluates the performance of instruction-tuned LLMs across various quantization methods, including GPTQ, AWQ, SmoothQuant, and FP8 on models ranging from 7B to 405B
-
The results show that quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following
-
Performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models
-
Task difficulty does not significantly impact accuracy degradation due to quantization, and the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
-
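To make "weight-only" quantization concrete, the snippet below performs toy per-channel symmetric int8 quantization of a weight matrix in PyTorch; the methods evaluated in the paper (GPTQ, AWQ, SmoothQuant, FP8) use far more careful calibration, so this is only an illustration of the general idea.
    import torch

    def quantize_per_channel_int8(weight: torch.Tensor):
        """weight: (out_features, in_features) float tensor -> (int8 weights, per-row scales)."""
        max_abs = weight.abs().amax(dim=1, keepdim=True)       # one scale per output channel
        scales = max_abs.clamp(min=1e-8) / 127.0
        q = torch.round(weight / scales).clamp(-127, 127).to(torch.int8)
        return q, scales

    def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
        return q.to(torch.float32) * scales                    # approximate original weights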
-
Jailbreaking Large Language Models with Symbolic Mathematics, Emet Bethany,Mazal Bethany,Juan Arturo Nolazco Flores,Sumit Kumar Jha,Peyman Najafirad, 17-09-2024
Categories
Cryptography and Security, Artificial Intelligence, Computation and Language, Machine Learning
Abstract
Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack's success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.
Bullet Points
-
The paper introduces MathPrompt, a jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms, demonstrating a critical vulnerability in current AI safety measures
-
The experiment shows an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs
-
The study emphasizes the importance of a holistic approach to AI safety and calls for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.
-
-
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning, Zayne Sprague,Fangcong Yin,Juan Diego Rodriguez,Dongwei Jiang,Manya Wadhwa,Prasann Singhal,Xinyu Zhao,Xi Ye,Kyle Mahowald,Greg Durrett, 18-09-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra ``thinking'' really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
Bullet Points
-
Chain-of-thought (CoT) prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs)
-
However, the extra "thinking" helps mainly on tasks involving math or logic, with much smaller gains on other types of tasks
-
A quantitative meta-analysis of over 100 papers using CoT and evaluations of 20 datasets across 14 models found that CoT gives strong performance benefits primarily on tasks related to mathematics or logic
-
On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning
-
The behavior of CoT on these problems was analyzed by separating planning and execution and comparing against tool-augmented LLMs
-
Much of its gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver
-
CoT can be applied selectively, maintaining performance while saving inference costs, and the results suggest moving beyond prompt-based CoT to paradigms that better leverage intermediate computation
-
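A deliberately simple illustration of applying CoT selectively, as the findings above suggest: spend chain-of-thought tokens only when the question looks mathematical or symbolic. The keyword heuristic and the `call_llm` function are assumptions, not the paper's criterion.
    import re

    # Crude indicators of symbolic or mathematical content (an illustrative heuristic).
    MATH_HINTS = re.compile(r"[=+\-*/%]|\d|\b(sum|solve|compute|equation|prove)\b", re.I)

    def answer(call_llm, question: str) -> str:
        if MATH_HINTS.search(question):
            return call_llm("Think step by step, then answer:\n" + question)
        # For non-symbolic questions, direct answering was close to CoT in accuracy.
        return call_llm("Answer concisely:\n" + question)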
-
LLMs + Persona-Plug = Personalized LLMs, Jiongnan Liu,Yutao Zhu,Shuting Wang,Xiaochi Wei,Erxue Min,Yu Lu,Shuaiqiang Wang,Dawei Yin,Zhicheng Dou, 18-09-2024
Categories
Computation and Language
Abstract
Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their individual interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user's relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user's overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, \ours{}. It constructs a user-specific embedding for each individual by modeling all her historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their own parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.
Bullet Points
-
Personalization plays a critical role in language tasks and applications, as users prefer diverse outputs based on their individual interests
-
Existing personalized approaches either fine-tune a unique LLM per user, which is too expensive for widespread application, or retrieve the user's relevant historical texts as demonstrations, which may break the continuity of the user history and fail to capture the user's overall styles and patterns
-
To address these challenges, the proposed personalized LLM approach constructs a user-specific embedding for each individual by modeling all of her historical contexts through a lightweight plug-in user embedder module
-
The proposed model significantly outperforms existing personalized LLM approaches on the LaMP benchmark.
-
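The module below is a schematic of a lightweight plug-in user embedder in the spirit of the approach summarized above: a user's historical texts (already encoded as vectors) are pooled into a single persona embedding that can be attached to the frozen LLM's input. The dimensions and the mean-pooling choice are illustrative assumptions.
    import torch
    import torch.nn as nn

    class UserEmbedder(nn.Module):
        """Pools a user's encoded history into one persona vector for the LLM."""
        def __init__(self, text_dim=768, model_dim=4096):
            super().__init__()
            self.project = nn.Linear(text_dim, model_dim)

        def forward(self, history_embeddings):             # (num_history_texts, text_dim)
            pooled = history_embeddings.mean(dim=0)         # summarize the whole history
            return self.project(pooled)                     # (model_dim,) persona embedding

    # Usage sketch: prepend this vector to the frozen LLM's input embeddings, so that
    # personalization requires training only this small module, not the LLM itself.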
-
Training Language Models to Self-Correct via Reinforcement Learning, Aviral Kumar,Vincent Zhuang,Rishabh Agarwal,Yi Su,John D Co-Reyes,Avi Singh,Kate Baumli,Shariq Iqbal,Colton Bishop,Rebecca Roelofs,Lei M Zhang,Kay McKinney,Disha Shrivastava,Cosmin Paduraru,George Tucker,Doina Precup,Feryal Behbahani,Aleksandra Faust, 19-09-2024
Categories
Machine Learning
Abstract
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
Bullet Points
-
SCoRe is a multi-turn online reinforcement learning (RL) approach that improves an LLM's self-correction ability using entirely self-generated data
-
It addresses the challenges of supervised fine-tuning (SFT) on offline model-generated correction traces by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process toward a self-correction strategy that is effective at test time, rather than simply fitting high-reward responses for a given prompt
-
Applied to Gemini 1.0 Pro and 1.5 Flash, the approach achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% on the MATH and HumanEval benchmarks, respectively.
-
-
Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts, Ming Wang,Yuanzhong Liu,Xiaoyu Liang,Yijie Huang,Daling Wang,Xiaocui Yang,Sijia Shen,Shi Feng,Xiaoming Zhang,Chaofeng Guan,Yifei Zhang, 20-09-2024
Categories
Computation and Language
Abstract
LLMs have demonstrated commendable performance across diverse domains. Nevertheless, formulating high-quality prompts to assist them in their work poses a challenge for non-AI experts. Existing research in prompt engineering suggests somewhat scattered optimization principles and designs empirically dependent prompt optimizers. Unfortunately, these endeavors lack a structural design, incurring high learning costs and it is not conducive to the iterative updating of prompts, especially for non-AI experts. Inspired by structured reusable programming languages, we propose LangGPT, a structural prompt design framework. Furthermore, we introduce Minstrel, a multi-generative agent system with reflection to automate the generation of structural prompts. Experiments and the case study illustrate that structural prompts generated by Minstrel or written manually significantly enhance the performance of LLMs. Furthermore, we analyze the ease of use of structural prompts through a user survey in our online community.
Bullet Points
-
LLMs have excellent performance across diverse domains, but formulating high-quality prompts for them poses a challenge for non-AI experts
-
Existing research in prompt engineering suggests scattered optimization principles and designs empirically dependent prompt optimizers
-
These endeavors lack a structural design, incurring high learning costs, and are not conducive to iterative updating of prompts
-
We propose LangGPT, a structural prompt design framework, and Minstrel, a multi-agent system with reflection that automates the generation of structural prompts, and analyze their ease of use through a user survey in our online community.
-
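As a guessed-at example of what a structural prompt looks like in this spirit, the string below organizes a prompt into reusable named sections; the section names are illustrative and not the framework's mandated schema.
    # Illustrative structural prompt with named, reusable sections (assumed layout).
    STRUCTURAL_PROMPT = """
    # Role
    You are a senior data analyst.

    # Profile
    - Language: English
    - Expertise: SQL, statistics, clear explanations

    # Constraints
    - Cite the columns you use.
    - Say "insufficient data" rather than guessing.

    # Workflow
    1. Restate the user's question.
    2. Outline the analysis plan.
    3. Produce the answer with a short justification.
    """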
-
Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, Siyun Zhao,Yuqing Yang,Zilong Wang,Zhiyuan He,Luna K. Qiu,Lili Qiu, 23-09-2024
Categories
Computation and Language, Artificial Intelligence
Abstract
Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.
Bullet Points
-
Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks
-
Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application
-
However, effective deployment of data-augmented LLM applications presents significant challenges, including retrieving relevant data and accurately interpreting user intent
-
There is no one-size-fits-all solution, and underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution
-
The survey proposes a RAG task categorization method, classifying user queries into four levels based on the type of external data required and the primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries.
-
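The sketch below shows one way the four-level categorization could be used operationally: classify an incoming query into a level and dispatch it to a level-specific handler. The classification prompt and handler interface are assumptions for illustration only.
    # Route queries by the survey's four query levels; `call_llm` and the handlers are placeholders.
    LEVELS = ["explicit_fact", "implicit_fact", "interpretable_rationale", "hidden_rationale"]

    def route_query(call_llm, query: str, handlers: dict) -> str:
        label = call_llm(
            "Classify this query into one of: " + ", ".join(LEVELS) +
            ". Reply with the label only.\nQuery: " + query
        ).strip()
        handler = handlers.get(label, handlers["explicit_fact"])  # default to plain RAG
        return handler(query)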
-
Small Language Models: Survey, Measurements, and Insights, Zhenyan Lu,Xiang Li,Dongqi Cai,Rongjie Yi,Fangming Liu,Xiwen Zhang,Nicholas D. Lane,Mengwei Xu, 24-09-2024
Categories
Computation and Language, Artificial Intelligence, Machine Learning
Abstract
Small language models (SLMs), despite their widespread adoption in modern smart devices, have received significantly less academic attention compared to their large language model (LLM) counterparts, which are predominantly deployed in data centers and cloud environments. While researchers continue to improve the capabilities of LLMs in the pursuit of artificial general intelligence, SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks. Focusing on transformer-based, decoder-only language models with 100M-5B parameters, we survey 59 state-of-the-art open-source SLMs, analyzing their technical innovations across three axes: architectures, training datasets, and training algorithms. In addition, we evaluate their capabilities in various domains, including commonsense reasoning, in-context learning, mathematics, and coding. To gain further insight into their on-device runtime costs, we benchmark their inference latency and memory footprints. Through in-depth analysis of our benchmarking data, we offer valuable insights to advance research in this field.
Bullet Points
-
Small language models (SLMs) have received less academic attention compared to LLMs, which are predominantly deployed in data centers and cloud environments
-
SLM research aims to make machine intelligence more accessible, affordable, and efficient for everyday tasks
-
We survey 59 open-source SLMs and analyze their technical innovations across three axes: architectures, training datasets, and training algorithms
-
We evaluate their capabilities in various domains, including commonsense reasoning, in-context learning, mathematics, and coding
-
To gain insight into their on-device runtime costs, we benchmark their inference latency and memory footprints.
-
-
A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, Ruihao Gong,Yifu Ding,Zining Wang,Chengtao Lv,Xingyu Zheng,Jinyang Du,Haotong Qin,Jinyang Guo,Michele Magno,Xianglong Liu, 25-09-2024
Categories
Artificial Intelligence, Computation and Language, Machine Learning
Abstract
Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant challenges for their practical deployment. Low-bit quantization has emerged as a critical approach to mitigate these challenges by reducing the bit-width of model parameters, activations, and gradients, thus decreasing memory usage and computational demands. This paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs, covering the fundamental principles, system implementations, and algorithmic strategies. An overview of basic concepts and new data formats specific to low-bit LLMs is first introduced, followed by a review of frameworks and systems that facilitate low-bit LLMs across various hardware platforms. Then, we categorize and analyze techniques and toolkits for efficient low-bit training and inference of LLMs. Finally, we conclude with a discussion of future trends and potential advancements of low-bit LLMs. Our systematic overview from basic, system, and algorithm perspectives can offer valuable insights and guidelines for future works to enhance the efficiency and applicability of LLMs through low-bit quantization.
Bullet Points
-
The paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs, covering the fundamental principles, system implementations, and algorithmic strategies
-
It covers basic concepts and new data formats, categorizes and analyzes techniques and toolkits for efficient low-bit training and inference, and concludes with a discussion of future trends and potential advancements.
-
-
Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models, Tongxuan Liu,Wenjiang Xu,Weizhe Huang,Xingyu Wang,Jiaxing Wang,Hailong Yang,Jing Li, 26-09-2024
Categories
Computation and Language
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks but their performance in complex logical reasoning tasks remains unsatisfactory. Although some prompting methods, such as Chain-of-Thought, can improve the reasoning ability of LLMs to some extent, they suffer from an unfaithful issue where derived conclusions may not align with the generated reasoning chain. To address this issue, some studies employ the approach of propositional logic to further enhance logical reasoning abilities of LLMs. However, the potential omissions in the extraction of logical expressions in these methods can cause information loss in the logical reasoning process, thereby generating incorrect results. To this end, we propose Logic-of-Thought (LoT) prompting which employs propositional logic to generate expanded logical information from input context, and utilizes the generated logical information as an additional augmentation to the input prompts, thereby enhancing the capability of logical reasoning. The LoT is orthogonal to existing prompting methods and can be seamlessly integrated with them. Extensive experiments demonstrate that LoT boosts the performance of various prompting methods with a striking margin across five logical reasoning tasks. In particular, the LoT enhances Chain-of-Thought's performance on the ReClor dataset by +4.35%; moreover, it improves Chain-of-Thought with Self-Consistency's performance on LogiQA by +5%; additionally, it boosts performance of Tree-of-Thoughts on ProofWriter dataset by +8%.
Bullet Points
-
Logic-of-Thought (LoT) prompting enhances the performance of LLMs in complex logical reasoning tasks by using propositional logic to generate expanded logical information from input context
-
It can be seamlessly integrated with existing prompting methods
-
Extensive experiments demonstrate that LoT boosts the performance of various prompting methods by a striking margin across five logical reasoning tasks.
-
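A very rough sketch of the augmentation flow described above: extract propositional logic from the context, derive additional implications, and append them to the prompt. Both helper prompts and the hypothetical `call_llm` function are assumptions about the method's phases, not the paper's implementation.
    # Logic-of-Thought-style augmentation sketch: extract, expand, then augment the prompt.
    def logic_of_thought_prompt(call_llm, context: str, question: str) -> str:
        expressions = call_llm(
            "Extract the propositions and logical relations (e.g., A -> B) implied by "
            "this passage:\n" + context
        )
        expanded = call_llm(
            "Given these logical expressions, list additional conclusions that follow "
            "by propositional logic:\n" + expressions
        )
        return (
            f"Context: {context}\n"
            f"Additional logical information: {expanded}\n"
            f"Question: {question}\nAnswer with reasoning."
        )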